Ego4D: Around the World in 3,000 Hours of Egocentric Video
Introduction
The Ego4D dataset is a large-scale egocentric video collection created to drive advancements in understanding first-person visual experiences. It aims to provide a rich resource for researchers and to catalyze innovations in computer vision, robotics, and augmented reality.
Dataset Overview
Volume and Diversity:
The dataset comprises 3,670 hours of video captured by 931 unique participants across 74 locations in 9 countries, spanning scenarios such as household activities, social interactions, outdoor events, and workplace settings. Collection emphasized diversity and realism, aiming to capture unscripted, real-world daily activity.
Data Modalities:
While the core of the dataset is video, portions of it are also accompanied by:
- Audio: For capturing conversations and ambient sounds.
- 3D Meshes: Scans of environments to contextualize interactions.
- Eye Gaze: The camera wearer's gaze direction during capture.
- Stereo Video and Multi-camera: Stereo footage for depth cues and synchronized views of the same event from multiple wearers.
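As a rough illustration of working with these modalities, the sketch below loads clip metadata and a gaze track for a single capture. The directory layout, file names, and field names are hypothetical placeholders for exposition, not the official Ego4D schema, which is defined by the dataset's download tooling.

```python
import csv
import json
from pathlib import Path

# Hypothetical layout for one capture; real Ego4D file names and schemas
# come from the official tooling and may differ.
CLIP_DIR = Path("ego4d_sample/clip_0001")

def load_gaze(csv_path: Path):
    """Read a gaze track as (timestamp_sec, x_norm, y_norm) tuples.

    Assumes one gaze sample per row with normalized image coordinates;
    the real gaze format depends on the capture rig.
    """
    samples = []
    with csv_path.open(newline="") as f:
        for row in csv.DictReader(f):
            samples.append((float(row["t"]), float(row["x"]), float(row["y"])))
    return samples

def load_clip_metadata(json_path: Path):
    """Read per-clip metadata (duration, scenario label, available modalities)."""
    with json_path.open() as f:
        return json.load(f)

if __name__ == "__main__":
    meta = load_clip_metadata(CLIP_DIR / "metadata.json")
    gaze = load_gaze(CLIP_DIR / "gaze.csv")
    print(meta.get("scenario"), f"{len(gaze)} gaze samples")
```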
Privacy and Ethics:
To ensure ethical compliance, the dataset follows rigorous privacy standards: participants provided informed consent, and videos were reviewed so that personally identifiable information could be de-identified where required.
Benchmark Suite
Ego4D introduces a benchmark suite centered on understanding and leveraging first-person visual data, organized into five core benchmarks, each comprising several tasks:
Episodic Memory
Goal: Answer queries about past events captured in first-person video.
Tasks:
- Natural Language Queries (NLQ): Localize the moment in past video that answers a question posed in natural language.
- Visual Queries (VQ): Given an image of an object, localize when (and where) it was last seen in the past video.
- Moment Queries (MQ): Identify all instances of a specific activity in the video.
Implications: Advances in these tasks will enhance capabilities in personal assistance technologies, allowing systems to act as an augmented memory for users.
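Systems for these query tasks are typically scored by how well predicted temporal windows overlap the annotated ones. The sketch below shows a minimal temporal-IoU and recall@k computation of the kind used for language and moment queries; the thresholds, ranking protocol, and toy numbers are assumptions for illustration, not the official evaluation code.

```python
from typing import List, Tuple

Window = Tuple[float, float]  # (start_sec, end_sec)

def temporal_iou(pred: Window, gt: Window) -> float:
    """Intersection-over-union of two temporal windows, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(preds: List[List[Window]], gts: List[Window],
                k: int = 5, iou_thresh: float = 0.3) -> float:
    """Fraction of queries where any of the top-k predicted windows
    overlaps the ground-truth window above the IoU threshold."""
    hits = 0
    for ranked, gt in zip(preds, gts):
        if any(temporal_iou(p, gt) >= iou_thresh for p in ranked[:k]):
            hits += 1
    return hits / len(gts) if gts else 0.0

# Example: one query, top-2 candidate windows vs. a 4-second ground truth.
print(recall_at_k([[(12.0, 18.0), (40.0, 44.0)]], [(41.0, 45.0)], k=2))
```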
Hands and Objects
Goal: Understand how users interact with objects, focusing on changes in their state.
Tasks:
- Temporal Localization: Identify the keyframe at which an object's state change begins.
- Object Detection: Detect objects undergoing changes.
- State Change Classification: Determine whether a state change is occurring.
Implications: This is vital for applications in instructional robots and augmented reality, where understanding object interaction is crucial.
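A minimal sketch of how such predictions might be scored: absolute keyframe timing error for temporal localization and binary accuracy for state change classification. The specific metrics and toy numbers below are illustrative assumptions, not the benchmark's official evaluation.

```python
from statistics import mean
from typing import List

def keyframe_error(pred_t: float, gt_t: float) -> float:
    """Absolute temporal distance (seconds) between the predicted and
    annotated keyframe at which the state change begins."""
    return abs(pred_t - gt_t)

def state_change_accuracy(pred_labels: List[bool], gt_labels: List[bool]) -> float:
    """Binary accuracy for 'does a state change occur in this clip?'."""
    correct = sum(p == g for p, g in zip(pred_labels, gt_labels))
    return correct / len(gt_labels) if gt_labels else 0.0

# Toy example: two clips with annotated state changes, three classification clips.
print(mean([keyframe_error(3.2, 3.5), keyframe_error(7.9, 8.4)]))  # mean error (s)
print(state_change_accuracy([True, True, False], [True, False, False]))
```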
Audio-Visual Diarization
Goal: Analyze conversations to determine who is speaking and when.
Tasks:
- Speaker Localization and Tracking: Identify and track speakers in the visual field.
- Active Speaker Detection: Detect which tracked speakers are currently speaking.
- Speech Diarization: Segment and label speech for each speaker.
- Speech Transcription: Transcribe spoken content.
Implications: Improves meeting transcription tools and human-computer interaction in social settings.
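As a rough illustration of the diarization sub-task, the sketch below computes a simplified frame-level diarization error, counting missed and spurious speakers per frame. The standard diarization error rate additionally handles optimal speaker mapping and forgiveness collars; this simplified variant is an assumption for exposition.

```python
from typing import List, Set

def frame_level_diarization_error(pred: List[Set[str]], ref: List[Set[str]]) -> float:
    """Simplified per-frame diarization error: at each frame, count missed
    speakers and spurious speakers, then divide by the total amount of
    reference speaker activity."""
    errors = 0
    ref_total = 0
    for p, r in zip(pred, ref):
        ref_total += len(r)
        errors += len(r - p)  # missed speakers
        errors += len(p - r)  # spurious / confused speakers
    return errors / ref_total if ref_total else 0.0

# Two frames: frame 0 is correct, frame 1 misses speaker "B".
pred = [{"A"}, {"A"}]
ref = [{"A"}, {"A", "B"}]
print(frame_level_diarization_error(pred, ref))  # 1 error / 3 reference units
```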
Social Interactions
Goal: Identify social cues in conversations, such as attention and communication direction.
Tasks:
- Looking at Me (LAM): Detect when people are looking at the camera wearer.
- Talking to Me (TTM): Detect when people are talking to the camera wearer.
Implications: Supports the development of socially aware AI, aiding in communication assistance and social robots.
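Both LAM and TTM can be framed as binary classification over (person, frame) pairs. The sketch below computes precision and recall for such per-frame decisions; this framing and the toy data are illustrative assumptions, not the benchmark's official evaluation protocol.

```python
from typing import List, Tuple

def precision_recall(preds: List[bool], labels: List[bool]) -> Tuple[float, float]:
    """Precision and recall for per-frame binary 'looking at me' decisions.

    Each element corresponds to one (face track, frame) pair; True means the
    person is looking at the camera wearer."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy track: the detector fires on 3 frames, 2 of which are truly "looking at me".
print(precision_recall([True, True, True, False], [True, True, False, True]))
```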
Forecasting
Goal: Predict future movements and interactions of the camera wearer.
Tasks:
- Locomotion Prediction: Predict the wearer's future paths.
- Hand Movement Prediction: Predict future hand positions.
- Short-term Object Interaction Anticipation: Predict future interactions with objects.
- Long-term Action Anticipation: Predict sequences of future actions.
Implications: Enables anticipatory functions in augmented reality systems and robots, improving their ability to assist proactively.
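Trajectory-style forecasts such as locomotion and hand movement prediction are commonly scored by displacement errors over the prediction horizon. The sketch below computes average and final displacement error for a toy 2D trajectory; the coordinate frame, horizon, and units are assumptions for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in some common reference frame

def average_displacement_error(pred: List[Point], gt: List[Point]) -> float:
    """Mean Euclidean distance between predicted and ground-truth positions
    over the forecast horizon."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists) if dists else 0.0

def final_displacement_error(pred: List[Point], gt: List[Point]) -> float:
    """Distance between the last predicted and last ground-truth position."""
    return math.dist(pred[-1], gt[-1])

# Toy 3-step forecast of the wearer's future position (arbitrary units).
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.1)]
gt = [(0.0, 0.0), (1.0, 0.2), (2.0, 0.0)]
print(average_displacement_error(pred, gt), final_displacement_error(pred, gt))
```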
Implications and Future Directions
Practical Applications:
- Augmented Reality (AR): Enhancing user experiences by anticipating their needs and actions.
- Service Robots: Enabling robots to better understand and predict human actions for more seamless assistance.
- Personal Assistants: Developing more intuitive and helpful personal assistant technologies that can recall and predict user needs.
Theoretical Developments:
- Vision and Language Integration: Deepen the integration of visual inputs with natural language for more context-aware systems.
- Interactive Learning: Improve learning algorithms to handle long-term dependencies and complex interactions.
Conclusion
Ego4D represents a significant step forward in providing the data and benchmarks necessary to advance first-person visual understanding. It presents opportunities for breakthroughs across computer vision, robotics, and augmented reality, enabling more intelligent and responsive systems that integrate deeply with human daily life. Researchers leveraging this dataset can push the boundaries of AI in interpreting and responding to the subtleties of human experiences.