
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation (2411.08380v1)

Published 13 Nov 2024 in cs.CV

Abstract: Video generation has emerged as a promising tool for world simulation, leveraging visual data to replicate real-world environments. Within this context, egocentric video generation, which centers on the human perspective, holds significant potential for enhancing applications in virtual reality, augmented reality, and gaming. However, the generation of egocentric videos presents substantial challenges due to the dynamic nature of egocentric viewpoints, the intricate diversity of actions, and the complex variety of scenes encountered. Existing datasets are inadequate for addressing these challenges effectively. To bridge this gap, we present EgoVid-5M, the first high-quality dataset specifically curated for egocentric video generation. EgoVid-5M encompasses 5 million egocentric video clips and is enriched with detailed action annotations, including fine-grained kinematic control and high-level textual descriptions. To ensure the integrity and usability of the dataset, we implement a sophisticated data cleaning pipeline designed to maintain frame consistency, action coherence, and motion smoothness under egocentric conditions. Furthermore, we introduce EgoDreamer, which is capable of generating egocentric videos driven simultaneously by action descriptions and kinematic control signals. The EgoVid-5M dataset, associated action annotations, and all data cleansing metadata will be released for the advancement of research in egocentric video generation.

Summary

  • The paper introduces EgoVid-5M, a dataset of 5 million 1080p egocentric videos with detailed kinematic and text annotations to advance generation research.
  • This large-scale dataset combines 1080p resolution, VIO-derived kinematic annotations, LLM-generated text descriptions, rigorous cleaning, and broad scene coverage across diverse activities.
  • Experiments demonstrate that utilizing EgoVid-5M significantly improves egocentric video generation models' performance on metrics like semantic consistency and action alignment.

Overview of EgoVid-5M: A Dataset for Egocentric Video Generation

The paper introduces EgoVid-5M, a comprehensive dataset curated specifically for egocentric video generation tasks. Existing datasets cannot adequately handle the dynamic viewpoints, diverse actions, and varied scenes of egocentric footage, and EgoVid-5M is positioned as an innovative resource to fill this gap. The dataset comprises five million high-quality, annotated video clips, with a focus on enabling detailed kinematic control and offering rich textual descriptions.

Key Features of EgoVid-5M

EgoVid-5M distinguishes itself through several critical features:

  1. Resolution and Quality: The dataset offers videos at a full 1080p resolution, ensuring high visual fidelity, which is crucial for training effective video generation models.
  2. Detailed Annotations: A significant strength of EgoVid-5M is its extensive action annotations. This dataset incorporates both low-level kinematic data, provided through Visual-Inertial Odometry (VIO), and high-level textual descriptions generated via LLMs. This dual-layer annotation strategy enhances the utility of the dataset for training models to generate coherent and semantically rich egocentric videos.
  3. Robust Data Cleaning: The paper emphasizes a rigorous data cleaning strategy that maintains frame consistency, action coherence, and motion smoothness. This meticulous curation optimizes the dataset for generative training and distinguishes it from datasets burdened with noisy data (a minimal filtering sketch follows this list).
  4. Comprehensive Scene Coverage: EgoVid-5M captures a wide array of scenarios, from household activities to sports and skilled operations, thereby covering a broad spectrum of potential egocentric experiences.
  5. Comparison with Other Datasets: A comparison table highlights EgoVid-5M's tailored design for egocentric video generation, in contrast to sources such as Ego4D and general-purpose video datasets that may lack the resolution, annotations, or curation this task requires.
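
As a rough illustration of what such per-clip filtering might look like, the sketch below screens one clip with classical proxies: SSIM between consecutive frames for frame consistency and dense optical flow for motion smoothness. The thresholds, the choice of SSIM and optical flow, and the function name are illustrative assumptions, not the paper's actual pipeline.

```python
# A minimal sketch of per-clip quality filtering in the spirit of the
# cleaning pipeline described above. Thresholds and the SSIM/optical-flow
# proxies are assumptions; the paper's actual criteria may differ.
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim

def passes_quality_filters(frames, min_consistency=0.35, max_motion_jerk=4.0):
    """frames: list of HxW uint8 grayscale frames from one clip."""
    flow_mags, ssims = [], []
    for prev, nxt in zip(frames, frames[1:]):
        # Frame consistency: consecutive frames should stay visually coherent.
        ssims.append(ssim(prev, nxt))
        # Motion smoothness: mean dense optical-flow magnitude per frame pair.
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flow_mags.append(np.linalg.norm(flow, axis=-1).mean())
    # Jerk proxy: how abruptly motion magnitude changes between steps.
    jerk = np.abs(np.diff(flow_mags)).max() if len(flow_mags) > 1 else 0.0
    return np.mean(ssims) >= min_consistency and jerk <= max_motion_jerk
```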

Methodology and Innovations

The dataset is accompanied by a new model, EgoDreamer, which leverages EgoVid-5M's annotations to drive egocentric video generation. EgoDreamer employs two components (a speculative sketch follows the list):

  • Unified Action Encoder (UAE): This encoder processes both textual descriptions and kinematic signals, enabling a more nuanced representation of actions.
  • Adaptive Alignment: This mechanism integrates action conditions into the video generation process, enhancing control over the generated outputs.
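
This summary does not spell out the modules' internals, so the following PyTorch sketch is speculative: it fuses per-frame kinematic signals with text embeddings via cross-attention for the UAE, and applies an AdaLN-style scale/shift for adaptive alignment. All dimensions, the cross-attention fusion, and the modulation form are assumptions rather than the paper's verified architecture.

```python
# Speculative PyTorch sketch of the two EgoDreamer components named above.
import torch
import torch.nn as nn

class UnifiedActionEncoder(nn.Module):
    """Fuses a text-description embedding with low-level kinematic signals
    (e.g., per-frame 6-DoF pose deltas from VIO) into one action embedding."""
    def __init__(self, text_dim=768, kin_dim=6, hidden=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.kin_proj = nn.Linear(kin_dim, hidden)
        self.fuse = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)

    def forward(self, text_emb, kin_seq):
        # text_emb: (B, T_text, text_dim); kin_seq: (B, T_frames, kin_dim)
        q = self.kin_proj(kin_seq)          # kinematic tokens query the text
        kv = self.text_proj(text_emb)
        fused, _ = self.fuse(q, kv, kv)
        return fused                        # (B, T_frames, hidden)

class AdaptiveAlignment(nn.Module):
    """Injects the action embedding into backbone features via a learned
    scale/shift (an AdaLN-style modulation, assumed here)."""
    def __init__(self, feat_dim=320, cond_dim=512):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * feat_dim)

    def forward(self, feats, action_emb):
        # feats: (B, T, feat_dim); action_emb: (B, T, cond_dim)
        scale, shift = self.to_scale_shift(action_emb).chunk(2, dim=-1)
        return feats * (1 + scale) + shift
```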

Experimental Validation

Experiments with multiple video generation backbones, including U-Net- and DiT-based architectures, demonstrate that training on EgoVid-5M significantly improves performance across evaluation metrics such as semantic consistency, action alignment, and visual clarity; a hedged sketch of one such metric follows.
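
The exact metric definitions are not given in this summary. As one plausible stand-in, the sketch below scores action alignment as the average CLIP similarity between frames sampled from a generated clip and its conditioning action text; the model choice and scoring recipe are assumptions, not the paper's protocol.

```python
# A hedged CLIP-based action-alignment proxy: how well do generated frames
# match the conditioning action text? Higher scores mean better alignment.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def action_alignment(frames, action_text):
    """frames: list of PIL.Image sampled from one generated clip."""
    inputs = processor(text=[action_text], images=frames,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    # Cosine similarity between each frame and the action description,
    # averaged over the sampled frames.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```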

Implications and Future Directions

The introduction of EgoVid-5M has significant implications for the fields of virtual reality, augmented reality, and interactive gaming, where the demand for realistic and contextually accurate egocentric video generation is growing. The dataset’s detailed annotations and comprehensive cleaning protocols provide a solid foundation for advancing research in world simulations that require human-centric perspectives.

In future work, expanding the dataset to include even more diverse scenarios or integrating additional sensory data could further enhance its applicability. Additionally, the research community now has the opportunity to explore different cleaning strategies and how they impact generative performance, thanks to the publicly available annotations and metadata.

EgoVid-5M sets a new standard for datasets in this domain, offering a vital resource for unlocking the potential of egocentric video generation across various applications. As such, it is likely to spur further advancements in the generation of contextually relevant and high-quality egocentric videos.