WHAC: World-grounded Humans and Cameras (2403.12959v1)
Abstract: Estimating human and camera trajectories with accurate scale in the world coordinate system from a monocular video is a highly desirable yet challenging and ill-posed problem. In this study, we aim to recover expressive parametric human models (i.e., SMPL-X) and corresponding camera poses jointly, by leveraging the synergy between three critical players: the world, the human, and the camera. Our approach is founded on two key observations. Firstly, camera-frame SMPL-X estimation methods readily recover absolute human depth. Secondly, human motions inherently provide absolute spatial cues. By integrating these insights, we introduce a novel framework, referred to as WHAC, to facilitate world-grounded expressive human pose and shape estimation (EHPS) alongside camera pose estimation, without relying on traditional optimization techniques. Additionally, we present a new synthetic dataset, WHAC-A-Mole, which includes accurately annotated humans and cameras, and features diverse interactive human motions as well as realistic camera trajectories. Extensive experiments on both standard and newly established benchmarks highlight the superiority and efficacy of our framework. We will make the code and dataset publicly available.
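To make the scale argument concrete, here is a minimal sketch, not the WHAC pipeline itself, of how a metric cue can fix the unknown scale of a monocular camera trajectory. It assumes that up-to-scale camera positions are available from some visual odometry system (`slam_cam`, hypothetical) and that metric camera positions have been derived from human-based cues (`metric_cam`, hypothetical; synthesized here). The two are related by the closed-form least-squares similarity alignment of Umeyama (1991). All names and numbers below are illustrative assumptions.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Closed-form similarity transform (s, R, t) such that dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    # Reflection guard: force a proper rotation (det(R) = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Synthetic stand-in for a metric camera trajectory (assumed to come from human cues).
rng = np.random.default_rng(0)
metric_cam = np.cumsum(rng.normal(size=(50, 3)) * 0.05, axis=0)

# Simulate an up-to-scale trajectory: same path, but in a rotated, shifted,
# arbitrarily scaled frame, as a monocular odometry system would report it.
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
s_true = 2.5
t_true = np.array([0.1, -0.2, 0.3])
slam_cam = (metric_cam - t_true) @ R_true / s_true

# Recover the scale and bring the odometry trajectory into the metric frame.
s_est, R_est, t_est = umeyama_alignment(slam_cam, metric_cam)
aligned = s_est * slam_cam @ R_est.T + t_est
```

In this noise-free toy setting the recovered scale matches the simulated one and the aligned trajectory coincides with the metric one. With real estimates both inputs are noisy, so the alignment would only be a least-squares fit.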