- The paper presents HD-EPIC, a highly-detailed egocentric video dataset featuring 41 hours of footage, 59,000 actions, and dense multi-modal annotations to challenge video understanding models.
- HD-EPIC includes a challenging Visual Question Answering benchmark where even top models like Gemini Pro achieved only 38.5% accuracy, indicating limitations of current video-language understanding.
- The dataset serves as a crucial asset for research by providing a realistic testbed that necessitates advancements in handling long-term dependencies, overlapping tasks, and real-world unpredictability in egocentric video, moving beyond controlled environments.
An Overview of the HD-EPIC Dataset for Egocentric Video Understanding
The paper presents HD-EPIC, a highly-detailed egocentric video dataset targeting complex video understanding tasks. It captures kitchen-based activities in unscripted, in-the-wild environments, challenging traditional video recognition models with dense annotations and realistic settings.
HD-EPIC encompasses 41 hours of unscripted video recorded across nine diverse kitchens over multiple days. It includes annotations for 69 distinct recipes, 59,000 fine-grained actions, 51,000 audio events, 20,000 object movements, and 37,000 hand and object masks, making it one of the most richly annotated egocentric-vision datasets to date.
The dataset details multiple modalities:
- Recipe Steps and Ingredients: Recipes are temporally annotated, with preparation steps linked to their execution in the video; the nutritional values of added ingredients are documented, so a dish's nutritional content can be tracked as the recipe unfolds (see the sketch after this list).
- Fine-Grained Actions: Narrated actions are transcribed and parsed into verbs, nouns, and hand interactions, supporting tasks such as closed-vocabulary action recognition.
- 3D Grounding: Digital twins of the kitchen environments ground annotations in three-dimensional space, tracking objects, their movements, and gaze interactions so that annotations accurately reflect real-world dynamics.
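As a rough illustration of how the timestamped ingredient annotations could be consumed, the sketch below accumulates nutritional values over a recipe timeline. The record layout (`timestamp_s`, `ingredient`, per-ingredient `kcal` and `protein_g` fields) is a hypothetical simplification for this example, not HD-EPIC's actual schema.

```python
from dataclasses import dataclass

@dataclass
class IngredientEvent:
    """Hypothetical record: an ingredient added at a given video time."""
    timestamp_s: float  # seconds from the start of the recording
    ingredient: str
    kcal: float         # nutritional values of the amount actually added
    protein_g: float

def nutrition_up_to(events: list[IngredientEvent], t: float) -> dict[str, float]:
    """Cumulative nutrition of everything added on or before time t."""
    added = [e for e in events if e.timestamp_s <= t]
    return {
        "kcal": sum(e.kcal for e in added),
        "protein_g": sum(e.protein_g for e in added),
    }

# Example: querying the mid-recipe nutritional state at t = 120 s.
events = [
    IngredientEvent(12.4, "rolled oats", kcal=190, protein_g=5.0),
    IngredientEvent(95.1, "whole milk", kcal=150, protein_g=8.0),
    IngredientEvent(310.7, "honey", kcal=64, protein_g=0.1),
]
print(nutrition_up_to(events, t=120.0))  # {'kcal': 340, 'protein_g': 13.0}
```

Because every addition carries its own timestamp, the same pattern extends to plotting nutrition as a function of video time or comparing partially completed recipes.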
Notably, HD-EPIC supports various benchmarks to probe the limitations of current video-LLMs, such as a challenging Visual Question Answering (VQA) benchmark comprising 26,000 questions. In this benchmark, the best-performing model, Gemini Pro, achieved an accuracy of only 38.5%, highlighting the dataset's complexity and potential to expose weaknesses in existing models.
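To make the reported accuracy concrete, a minimal scoring loop for a multiple-choice VQA benchmark might look like the following. The JSON-lines layout and the field names (`prediction`, `answer`) are assumptions for illustration, not HD-EPIC's released format.

```python
import json

def vqa_accuracy(path: str) -> float:
    """Fraction of questions where the model's chosen option matches the key.

    Assumes each line of the file holds a JSON object with "prediction"
    and "answer" fields containing option letters such as "A".."E".
    This schema is hypothetical, for illustration only.
    """
    correct = total = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            total += 1
            correct += record["prediction"] == record["answer"]
    return correct / total if total else 0.0

# On five-option questions, 38.5% is well above the 20% chance level
# but far from ceiling, which is the gap the benchmark highlights.
print(f"accuracy: {vqa_accuracy('predictions.jsonl'):.1%}")
```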
The implications of HD-EPIC are significant, offering a more comprehensive testbed for evaluating how AI models perceive and interpret video data. Practically, evaluators can expect more realistic and nuanced interactions reflective of daily life rather than controlled or synthetic environments. Theoretically, the dataset demands advances in handling long-term dependencies and overlapping tasks, since real-world activities often interleave and span multiple categories simultaneously.
Future developments could include leveraging HD-EPIC for training models capable of understanding cross-domain contexts, improving model robustness in handling occlusions and dynamic scene changes, and fostering advances in anticipating future actions from current gaze and object trajectories.
In summary, HD-EPIC stands as a crucial asset for the research community, primed to propel advancements in the comprehensive understanding of egocentric video data across various contexts and modalities. It signifies a pivotal shift from synthetic, controlled datasets to those capturing the intricacies and unpredictability of real-world environments.