- The paper presents EPIC-KITCHENS, a large-scale egocentric dataset featuring 11.5 million frames and extensive, detailed annotations.
- It employs models such as Faster R-CNN and TSN, reporting roughly 35–40% mAP for object detection and top-1 action recognition accuracy of 20.5% in seen kitchens, dropping to 10.9% in unseen ones.
- The dataset’s natural, unscripted recordings pave the way for advancements in assistive robotics, augmented reality, and personalized AI systems.
An Overview of "Scaling Egocentric Vision: The EPIC-KITCHENS Dataset"
The paper "Scaling Egocentric Vision: The EPIC-KITCHENS Dataset" presents a comprehensive effort to advance egocentric video understanding. Recognizing the limitations imposed by the scarcity of large first-person datasets, the authors introduce EPIC-KITCHENS, a large-scale benchmark curated specifically for first-person vision research.
Dataset Collection
EPIC-KITCHENS was compiled by recording 55 hours of egocentric video from 32 participants in their own kitchens across four cities in North America and Europe. The dataset, comprising 11.5 million frames, captures unscripted daily activities and therefore reflects natural, diverse cooking styles and interaction patterns. A distinctive aspect of the collection is that participants narrated their activities after recording, providing valuable insight into the intent and context of the observed actions.
Annotations and Challenges
The dataset is meticulously annotated, featuring:
- 39,596 action segments
- 454,255 object bounding boxes
The dense labelling of EPIC-KITCHENS combines participant narrations with crowdsourced annotations. These annotations underpin three benchmark challenges:
- Object Detection: Identifying and localizing objects within the egocentric video frames.
- Action Recognition: Classifying observed action segments into predefined classes based on participants' narrations.
- Action Anticipation: Predicting future actions given current observations.
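To make the annotation structure concrete, below is an illustrative sketch of the two record types described above. The field names are hypothetical stand-ins chosen for readability, not the column names of the released annotation files.

```python
# Illustrative records for the two annotation types: action segments and
# object bounding boxes. Field names are hypothetical, not the released schema.
from dataclasses import dataclass

@dataclass
class ActionSegment:
    video_id: str        # identifies a participant's recording
    start_s: float       # segment start time in seconds
    stop_s: float        # segment end time in seconds
    verb: str            # e.g. "open"
    noun: str            # e.g. "fridge"
    narration: str       # participant's spoken description

@dataclass
class ObjectBox:
    video_id: str
    frame: int           # frame index within the video
    noun: str            # object class
    box_xyxy: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
```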
Object Detection
For the object detection challenge, the authors train Faster R-CNN with a ResNet-101 backbone on frames sampled at 2 fps. The task is particularly demanding because of the heavy occlusion and viewpoint variability common in egocentric video. Evaluation distinguishes many-shot from few-shot classes to assess how well models generalize across the long-tailed object distribution.
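As a rough illustration of this setup (not the authors' released code), the sketch below fine-tunes a torchvision Faster R-CNN on EPIC-style frame annotations. Note that torchvision ships a ResNet-50-FPN backbone rather than the ResNet-101 used in the paper, and the class count and data format here are placeholders.

```python
# Minimal fine-tuning sketch for an EPIC-style object detector.
# Assumptions: torchvision's Faster R-CNN with a ResNet-50-FPN backbone
# (the paper uses ResNet-101); NUM_CLASSES and the batch format are placeholders.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 1 + 331  # placeholder: background + noun classes; adjust to your split

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

def train_step(model, images, targets, optimizer):
    """One training step on a batch of frames sampled at 2 fps.

    `images` is a list of CxHxW float tensors; `targets` is a list of dicts
    with "boxes" (N x 4, xyxy) and "labels" (N,) per frame.
    """
    model.train()
    loss_dict = model(images, targets)   # returns classification + box regression losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```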
Action Recognition
In action recognition, the objective is to classify each action segment into a compound class composed of a verb and a noun. Temporal Segment Networks (TSN), a strong action recognition architecture at the time, was trained separately on RGB frames and on optical flow, and the predictions from the two modalities were fused to improve accuracy on complex action sequences.
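A minimal sketch of this two-stream late fusion is shown below, assuming each stream has already produced segment-level logits; the tensor shapes, class counts, and equal weighting are illustrative rather than taken from the paper.

```python
# Illustrative late fusion of two-stream TSN scores, not the authors' released code.
# rgb_logits / flow_logits are assumed to be (num_segments, num_classes) tensors
# produced by separately trained RGB and optical-flow networks.
import torch
import torch.nn.functional as F

def fuse_two_stream(rgb_logits: torch.Tensor, flow_logits: torch.Tensor,
                    rgb_weight: float = 1.0, flow_weight: float = 1.0) -> torch.Tensor:
    """Average segment-level scores per stream, then weight and sum the modalities."""
    rgb_score = F.softmax(rgb_logits.mean(dim=0), dim=-1)
    flow_score = F.softmax(flow_logits.mean(dim=0), dim=-1)
    return rgb_weight * rgb_score + flow_weight * flow_score

# Usage: verb and noun heads are scored independently; the (verb, noun) pair with
# the highest combined score gives the compound action prediction.
rgb = torch.randn(3, 125)    # e.g. 3 segments, 125 verb classes (illustrative)
flow = torch.randn(3, 125)
verb_scores = fuse_two_stream(rgb, flow)
print(verb_scores.argmax().item())  # predicted verb class index
```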
Action Anticipation
The action anticipation challenge goes beyond recognition to forecasting future actions from an observed segment. Here, the authors train TSN to predict each action from a segment that ends 1 second before the action begins. This anticipatory capability is pivotal for applications such as assistive technology and human-computer interaction.
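The sketch below shows one way such an anticipation sample can be cut from the video timeline, assuming the 1-second anticipation gap described above and an illustrative 2-second observation window (the paper's exact segment lengths are not reproduced here).

```python
# Sketch of constructing an anticipation training sample: observe tau_o seconds
# that end tau_a = 1 s before the action starts, then predict the action's
# (verb, noun) label. Variable names and tau_o are illustrative, not from the paper.
def anticipation_window(action_start_s: float,
                        tau_a: float = 1.0,
                        tau_o: float = 2.0) -> tuple[float, float]:
    """Return (obs_start, obs_end) in seconds for one anticipation sample."""
    obs_end = action_start_s - tau_a        # stop observing 1 s before the action
    obs_start = max(0.0, obs_end - tau_o)   # observe tau_o seconds of context
    return obs_start, obs_end

# Example: an action annotated to start at t = 12.4 s yields an observed
# segment of approximately [9.4 s, 11.4 s].
print(anticipation_window(12.4))
```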
Baseline Results
Baseline evaluation results across these challenges show that current methods exhibit substantial scope for improvement:
- Object Detection: Mean Average Precision (mAP) at IoU > 0.5 hovers around 35-40%, indicating considerable room for improvement in handling the intricacies of egocentric data (see the IoU sketch after this list).
- Action Recognition: Top-1 accuracy for combined verb-noun classes is reported at 20.5% in familiar environments, dropping to 10.9% in unseen environments, emphasizing the challenge of action recognition in novel contexts.
- Action Anticipation: This challenge yields even lower accuracy, reflecting the inherent difficulty of predicting future actions in dynamic, first-person scenarios.
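For reference, the mAP at IoU > 0.5 criterion counts a detection as correct only when its intersection-over-union with a matching ground-truth box of the same class exceeds 0.5. The toy function below illustrates that check; it is not the official evaluation script.

```python
# Toy IoU check behind the mAP@0.5 criterion. Boxes are (x1, y1, x2, y2).
def iou(box_a, box_b) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)) > 0.5)  # False: IoU is 1/3
```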
Implications and Future Directions
The introduction of EPIC-KITCHENS marks a significant step forward in egocentric vision research, offering an extensive, richly annotated dataset that captures the complexity of everyday human interaction with the environment. Beyond object detection and action recognition, the dataset opens avenues for higher-level understanding tasks such as visual dialogue, goal-driven interaction modeling, and skill assessment.
Moving forward, this dataset is expected to drive innovation in developing more sophisticated algorithms capable of real-time inference and better generalization to unseen scenarios. Such improvements have profound implications for fields like assistive robotics, augmented reality, and personal AI assistants, pushing the boundaries of how machines perceive and interact with the human world.
Conclusion
The EPIC-KITCHENS dataset is a valuable resource for the computer vision community, addressing critical gaps in dataset scale and diversity for egocentric video understanding. It lays a robust foundation for subsequent research, encouraging the community to tackle these challenges with novel, efficient, and effective methods suited to real-world applications.