Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives (2311.18259v4)

Published 30 Nov 2023 in cs.CV and cs.AI

Abstract: We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g., sports, music, dance, bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera poses, IMU, and multiple paired language descriptions -- including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity, we also present a suite of benchmark tasks and their annotations, including fine-grained activity understanding, proficiency estimation, cross-view translation, and 3D hand/body pose. All resources are open sourced to fuel new research in the community. Project page: http://ego-exo4d-data.org/

Authors (101)
  1. Kristen Grauman (136 papers)
  2. Andrew Westbury (4 papers)
  3. Lorenzo Torresani (73 papers)
  4. Kris Kitani (96 papers)
  5. Jitendra Malik (211 papers)
  6. Triantafyllos Afouras (29 papers)
  7. Kumar Ashutosh (17 papers)
  8. Vijay Baiyya (4 papers)
  9. Siddhant Bansal (11 papers)
  10. Bikram Boote (7 papers)
  11. Eugene Byrne (2 papers)
  12. Zach Chavis (2 papers)
  13. Joya Chen (18 papers)
  14. Feng Cheng (37 papers)
  15. Fu-Jen Chu (16 papers)
  16. Sean Crane (2 papers)
  17. Avijit Dasgupta (4 papers)
  18. Jing Dong (125 papers)
  19. Maria Escobar (8 papers)
  20. Cristhian Forigua (4 papers)
Citations (80)

Summary

Overview of "Ego-Exo4D: Understanding Skilled Human Activity"

This paper introduces Ego-Exo4D, a large-scale dataset and benchmark aimed at improving AI understanding of skilled human activity through both egocentric (first-person) and exocentric (third-person) perspectives. The dataset comprises 1,286 hours of video covering a wide array of skilled activities, including sports, music, dance, and procedural tasks such as bike repair. These activities were captured across 123 natural scene contexts in 13 cities worldwide, involving 740 diverse participants.

Dataset Composition

Ego-Exo4D is distinguished by its synchronized egocentric and exocentric video captures, enriched with multimodal data such as multichannel audio, eye gaze, 3D point clouds, camera poses, and IMU readings. It also provides multiple paired language descriptions, including a novel expert commentary written by coaches and teachers, which adds qualitative insight into each performance.
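
To make the per-take structure concrete, the sketch below groups the modalities listed above into a single record. It is purely illustrative: the field names, types, and array shapes are assumptions for this summary, not the official Ego-Exo4D loader API.

```python
from dataclasses import dataclass
from typing import Dict, List

import numpy as np


@dataclass
class EgoExoTake:
    """Illustrative container for one synchronized Ego-Exo4D capture ("take").

    Field names and shapes are hypothetical; consult the official tooling at
    http://ego-exo4d-data.org/ for the actual data schema.
    """
    take_id: str
    activity: str                        # e.g. "bike repair", "basketball"
    ego_video_path: str                  # first-person video from the head-mounted camera
    exo_video_paths: List[str]           # one path per stationary third-person camera
    audio_path: str                      # multichannel audio
    gaze: np.ndarray                     # (T, 2) eye-gaze coordinates (hypothetical shape)
    point_cloud_path: str                # 3D point cloud of the scene
    camera_poses: Dict[str, np.ndarray]  # per-camera pose trajectories
    imu: np.ndarray                      # (T, 6) accelerometer + gyroscope samples
    expert_commentary: List[str]         # commentary written by coaches and teachers
```

A loader built on the released files could populate records of this kind and feed them to multiview or cross-modal models.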

Benchmark and Annotations

The dataset introduces a suite of benchmark tasks, including proficiency estimation, cross-view translation, fine-grained activity understanding, and 3D hand/body pose estimation. These tasks are supported by comprehensive annotations, such as keystep segments and skill-level ratings, enabling rigorous evaluation of models on the nuances of skilled activity.
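
As a rough illustration of how one of these benchmarks can be framed, the snippet below treats demonstrator proficiency estimation as a simple per-take classification problem. The label set and accuracy metric are assumptions for illustration, not the paper's baseline or the official evaluation code.

```python
import numpy as np

# Hypothetical label set; the paper defines its own proficiency rating scheme.
PROFICIENCY_LEVELS = ["novice", "intermediate", "expert"]


def top1_accuracy(pred: np.ndarray, truth: np.ndarray) -> float:
    """Fraction of takes whose predicted proficiency matches the annotated rating."""
    return float(np.mean(pred == truth))


# Toy example: predicted vs. annotated proficiency for five takes.
pred = np.array([0, 2, 1, 2, 1])
truth = np.array([0, 2, 2, 2, 1])
print(f"top-1 accuracy: {top1_accuracy(pred, truth):.2f}")  # 0.80
```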

Numerical Insights

Ego-Exo4D comprises 5,035 individual capture instances, each ranging from 1 to 42 minutes in length. It offers exceptional scale and detail, with annotations representing over 200,000 hours of human effort, underscoring the dataset's depth and research value.
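
As a quick sanity check, the average take length implied by these totals is consistent with the quoted 1 to 42 minute range:

```python
# Average take length implied by the dataset totals reported above.
total_hours = 1286   # total video in the dataset
num_takes = 5035     # individual capture instances
avg_minutes = total_hours * 60 / num_takes
print(f"average take length ≈ {avg_minutes:.1f} minutes")  # ≈ 15.3 minutes
```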

Implications and Future Directions

Ego-Exo4D has significant implications for advancing AI in areas such as augmented reality, where real-time guidance could benefit from a better understanding of human intention and skill execution. In robotics, the dataset could fuel advances in learning from human demonstration, further narrowing the gap between human skill and machine capability.

Moreover, the dataset's open-sourced nature supports continuous, community-driven research. Its cross-modal learning opportunities open a path toward more adaptive, perceptive AI systems in both practical and theoretical contexts.

Conclusion

Ego-Exo4D stands out as a meaningful contribution to AI research, particularly in video understanding. By merging first- and third-person perspectives with detailed multimodal annotations, it provides a robust framework for exploring the complexities of human skill and teaching AI systems to perceive and assist in real-world tasks more effectively. As AI continues to evolve, datasets like Ego-Exo4D will be pivotal in enabling machines to comprehend and emulate the intricacies of human expertise across various domains.