- The paper presents HD-EPIC, a highly-detailed egocentric video dataset featuring 41 hours of footage, 59,000 actions, and dense multi-modal annotations to challenge video understanding models.
- HD-EPIC includes a challenging Visual Question Answering benchmark where even top models like Gemini Pro achieved only 38.5% accuracy, indicating limitations of current video-language understanding.
- The dataset serves as a crucial asset for research by providing a realistic testbed that necessitates advancements in handling long-term dependencies, overlapping tasks, and real-world unpredictability in egocentric video, moving beyond controlled environments.
An Overview of the HD-EPIC Dataset for Egocentric Video Understanding
The paper presents HD-EPIC, a highly-detailed egocentric video dataset targeting complex video understanding tasks. It captures kitchen-based activities in unscripted, in-the-wild environments, challenging traditional video recognition models with dense annotations and realistic settings.
HD-EPIC encompasses 41 hours of unscripted video recorded across nine diverse kitchens over multiple days. It includes annotations for 69 distinct recipes, 59,000 fine-grained actions, 51,000 audio events, 20,000 object movements, and 37,000 hand and object masks, making it one of the most richly annotated egocentric-vision datasets to date.
The dataset details multiple modalities:
- Recipe Steps and Ingredients: Recipes are temporally annotated, with preparation steps linked to their execution in the video; the nutritional values of added ingredients are documented, so a dish's nutritional content can be tracked as the recipe unfolds (see the sketch after this list).
- Fine-Grained Actions: Narrated actions are transcribed and parsed into verbs, nouns, and hand interactions, supporting tasks such as closed-vocabulary action recognition.
- 3D Grounding: Digital twins of the kitchen environments ground annotations in three-dimensional space, tracking objects, their movements, and gaze interactions so that annotations accurately reflect real-world dynamics.
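As a rough illustration of how the timestamped ingredient annotations could be consumed, the sketch below accumulates nutritional values over a recipe timeline. The record layout (`timestamp_s`, `ingredient`, per-ingredient `kcal` and `protein_g` fields) is a hypothetical simplification for this example, not HD-EPIC's actual schema.

```python
from dataclasses import dataclass

@dataclass
class IngredientEvent:
    """Hypothetical record: an ingredient added at a given video time."""
    timestamp_s: float  # seconds from the start of the recording
    ingredient: str
    kcal: float         # nutritional values of the amount actually added
    protein_g: float

def nutrition_up_to(events: list[IngredientEvent], t: float) -> dict[str, float]:
    """Cumulative nutrition of everything added on or before time t."""
    added = [e for e in events if e.timestamp_s <= t]
    return {
        "kcal": sum(e.kcal for e in added),
        "protein_g": sum(e.protein_g for e in added),
    }

# Example: querying the mid-recipe nutritional state at t = 120 s.
events = [
    IngredientEvent(12.4, "rolled oats", kcal=190, protein_g=5.0),
    IngredientEvent(95.1, "whole milk", kcal=150, protein_g=8.0),
    IngredientEvent(310.7, "honey", kcal=64, protein_g=0.1),
]
print(nutrition_up_to(events, t=120.0))  # {'kcal': 340, 'protein_g': 13.0}
```

Because every addition carries its own timestamp, the same pattern extends to plotting nutrition as a function of video time or comparing partially completed recipes.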
Notably, HD-EPIC supports various benchmarks to probe the limitations of current video-LLMs, such as a challenging Visual Question Answering (VQA) benchmark comprising 26,000 questions. In this benchmark, the best-performing model, Gemini Pro, achieved an accuracy of only 38.5%, highlighting the dataset's complexity and potential to expose weaknesses in existing models.
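To make the reported accuracy concrete, a minimal scoring loop for a multiple-choice VQA benchmark might look like the following. The JSON-lines layout and the field names (`prediction`, `answer`) are assumptions for illustration, not HD-EPIC's released format.

```python
import json

def vqa_accuracy(path: str) -> float:
    """Fraction of questions where the model's chosen option matches the key.

    Assumes each line of the file holds a JSON object with "prediction"
    and "answer" fields containing option letters such as "A".."E".
    This schema is hypothetical, for illustration only.
    """
    correct = total = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            total += 1
            correct += record["prediction"] == record["answer"]
    return correct / total if total else 0.0

# On five-option questions, 38.5% is well above the 20% chance level
# but far from ceiling, which is the gap the benchmark highlights.
print(f"accuracy: {vqa_accuracy('predictions.jsonl'):.1%}")
```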
The implications of HD-EPIC are significant, offering a more comprehensive testbed for evaluating how AI models perceive and interpret video data. Practically, evaluators can expect more realistic and nuanced interactions reflective of daily life rather than controlled or synthetic environments. Theoretically, the dataset demands advances in handling long-term dependencies and overlapping tasks, since real-world activities often interleave and span multiple categories simultaneously.
Future developments could include leveraging HD-EPIC for training models capable of understanding cross-domain contexts, improving model robustness in handling occlusions and dynamic scene changes, and fostering advances in anticipating future actions from current gaze and object trajectories.
In summary, HD-EPIC stands as a crucial asset for the research community, primed to propel advancements in the comprehensive understanding of egocentric video data across various contexts and modalities. It signifies a pivotal shift from synthetic, controlled datasets to those capturing the intricacies and unpredictability of real-world environments.