EgoGen: An Egocentric Synthetic Data Generator (2401.08739v2)

Published 16 Jan 2024 in cs.CV and cs.AI

Abstract: Understanding the world in first-person view is fundamental in Augmented Reality (AR). This immersive perspective brings dramatic visual changes and unique challenges compared to third-person views. Synthetic data has empowered third-person-view vision models, but its application to embodied egocentric perception tasks remains largely unexplored. A critical challenge lies in simulating natural human movements and behaviors that effectively steer the embodied cameras to capture a faithful egocentric representation of the 3D world. To address this challenge, we introduce EgoGen, a new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks. At the heart of EgoGen is a novel human motion synthesis model that directly leverages egocentric visual inputs of a virtual human to sense the 3D environment. Combined with collision-avoiding motion primitives and a two-stage reinforcement learning approach, our motion synthesis model offers a closed-loop solution where the embodied perception and movement of the virtual human are seamlessly coupled. Compared to previous works, our model eliminates the need for a pre-defined global path, and is directly applicable to dynamic environments. Combined with our easy-to-use and scalable data generation pipeline, we demonstrate EgoGen's efficacy in three tasks: mapping and localization for head-mounted cameras, egocentric camera tracking, and human mesh recovery from egocentric views. EgoGen will be fully open-sourced, offering a practical solution for creating realistic egocentric training data and aiming to serve as a useful tool for egocentric computer vision research. Refer to our project page: https://ego-gen.github.io/.
