Recognising actions in videos typically requires the start and end times of each action to be known. This labelling is not only subjective, but also expensive to acquire. We propose a method that uses single timestamps located around each action instance, in untrimmed videos, as opposed to the standard start/end times. We replace expensive action bounds with sampling distributions initialised from these timestamps. We then use the classifier's response to iteratively update the sampling distributions. We demonstrate that these distributions converge to the location and extent of discriminative action segments. We evaluate our method on three datasets for fine-grained recognition and show that single timestamps offer a reasonable compromise between recognition performance and labelling effort, performing comparably to full temporal supervision.
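The core idea above (sampling distributions initialised at the timestamps and refined by the classifier's response) can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it uses a Gaussian in place of the paper's plateau-shaped distribution, and the update rule, function names and hyper-parameters (`sigma`, `lr`, `num_samples`) are hypothetical choices made for this sketch.

```python
import numpy as np

def init_distribution(timestamp, sigma=30.0):
    # One sampling distribution per annotated timestamp (frame index).
    # A Gaussian stands in for the paper's plateau-shaped distribution.
    return {"mu": float(timestamp), "sigma": sigma}

def sample_frames(dist, n_frames, num_samples=16, rng=None):
    # Draw candidate training frames around the current distribution.
    rng = rng or np.random.default_rng()
    idx = rng.normal(dist["mu"], dist["sigma"], size=num_samples)
    return np.clip(np.round(idx), 0, n_frames - 1).astype(int)

def update_distribution(dist, frame_idx, class_scores, lr=0.5):
    # Shift the distribution towards frames the classifier scores highly
    # for the narrated label (soft selection via score weighting).
    # This update rule is an illustrative stand-in, not the paper's.
    w = class_scores / (class_scores.sum() + 1e-8)
    mu_hat = float(np.sum(w * frame_idx))
    sigma_hat = float(np.sqrt(np.sum(w * (frame_idx - mu_hat) ** 2)) + 1e-3)
    dist["mu"] += lr * (mu_hat - dist["mu"])
    dist["sigma"] += lr * (sigma_hat - dist["sigma"])
    return dist
```

Training would then alternate between sampling frames from each distribution, updating the classifier on those frames with the narrated label, and calling `update_distribution` with the classifier's scores for that label, so that each distribution drifts towards, and tightens around, the discriminative part of its action.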
For the first version of EPIC-KITCHENS we used a live-commentary approach, where people narrated their actions without pausing the video, to annotate the untrimmed videos. The narration timestamps served as an initial basis for collecting action boundaries and object bounding boxes. As we collected the dataset, we wondered whether the rough narration timestamps could themselves be used to supervise action recognition models; this paper shows that they can. While working on the paper we realised that many actions in the videos were missing: because annotators did not pause the videos while speaking, they naturally could not narrate actions that occurred while they spoke. Moreover, the timestamps were often poorly aligned with the videos. For the EPIC-KITCHENS extension we wanted to fix these issues, i.e. to obtain dense and precise narration timestamps. The EPIC Narrator was thus born! The tool makes it easy to annotate actions in video through narration, and it dramatically increased both annotation density and precision. You can download the EPIC Narrator here. You can find more about the benefits of using the narrator in our arXiv paper.
Research supported by: EPSRC LOCATE (EP/N033779/1) and the EPSRC Doctoral Training Programme at the University of Bristol.
Part of this work was carried out during Davide's summer internship at the University of Toronto, working with Dr. Fidler.