EgoGen: An Egocentric Synthetic Data Generator (2401.08739v2)
Abstract: Understanding the world in first-person view is fundamental in Augmented Reality (AR). This immersive perspective brings dramatic visual changes and unique challenges compared to third-person views. Synthetic data has empowered third-person-view vision models, but its application to embodied egocentric perception tasks remains largely unexplored. A critical challenge lies in simulating natural human movements and behaviors that effectively steer the embodied cameras to capture a faithful egocentric representation of the 3D world. To address this challenge, we introduce EgoGen, a new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks. At the heart of EgoGen is a novel human motion synthesis model that directly leverages egocentric visual inputs of a virtual human to sense the 3D environment. Combined with collision-avoiding motion primitives and a two-stage reinforcement learning approach, our motion synthesis model offers a closed-loop solution where the embodied perception and movement of the virtual human are seamlessly coupled. Compared to previous works, our model eliminates the need for a pre-defined global path, and is directly applicable to dynamic environments. Combined with our easy-to-use and scalable data generation pipeline, we demonstrate EgoGen's efficacy in three tasks: mapping and localization for head-mounted cameras, egocentric camera tracking, and human mesh recovery from egocentric views. EgoGen will be fully open-sourced, offering a practical solution for creating realistic egocentric training data and aiming to serve as a useful tool for egocentric computer vision research. Refer to our project page: https://ego-gen.github.io/.