Scaling Up Dynamic Human-Scene Interaction Modeling (2403.08629v2)
Abstract: Confronting the challenges of data scarcity and advanced motion synthesis in human-scene interaction modeling, we introduce the TRUMANS dataset alongside a novel HSI motion synthesis method. TRUMANS stands as the most comprehensive motion-captured HSI dataset currently available, encompassing over 15 hours of human interactions across 100 indoor scenes. It intricately captures whole-body human motions and part-level object dynamics, focusing on the realism of contact. This dataset is further scaled up by transforming physical environments into exact virtual models and applying extensive augmentations to appearance and motion for both humans and objects while maintaining interaction fidelity. Utilizing TRUMANS, we devise a diffusion-based autoregressive model that efficiently generates HSI sequences of any length, taking into account both scene context and intended actions. In experiments, our approach shows remarkable zero-shot generalizability on a range of 3D scene datasets (e.g., PROX, Replica, ScanNet, ScanNet++), producing motions that closely mimic original motion-captured sequences, as confirmed by quantitative experiments and human studies.
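The abstract describes an autoregressive model that produces sequences of any length. A minimal sketch of that general idea follows, assuming generation proceeds in fixed-length episodes where each episode is conditioned on the last few frames of the previous one. Everything here is hypothetical and for illustration only: the names `sample_episode` and `generate`, the constants, the one-dimensional "pose", and the toy denoising loop are stand-ins, not the paper's network, representation, or sampler.

```python
import random

EPISODE_LEN = 8  # frames per episode (assumed value)
OVERLAP = 2      # frames carried over as conditioning (assumed value)
STEPS = 10       # toy denoising steps (assumed value)


def sample_episode(prefix, speed, rng):
    """Toy stand-in for a conditional diffusion sampler (hypothetical).

    Starts from Gaussian noise and blends step by step toward a
    straight-line motion that continues from the prefix. A trained model
    would instead predict the clean sample with a network conditioned on
    scene context and the intended action.
    """
    n_new = EPISODE_LEN - len(prefix)
    target = list(prefix) + [prefix[-1] + speed * (i + 1) for i in range(n_new)]
    x = [rng.gauss(0.0, 1.0) for _ in range(EPISODE_LEN)]
    for step in range(STEPS):
        w = (step + 1) / STEPS  # blend weight; equals 1.0 on the final step
        x = [(1.0 - w) * xi + w * ti for xi, ti in zip(x, target)]
        # Keep the overlapping frames fixed so episodes join without a jump.
        x[:len(prefix)] = prefix
    return x


def generate(num_frames, actions, seed=0):
    """Roll out episodes until num_frames frames exist, then truncate."""
    rng = random.Random(seed)
    motion = [0.0] * OVERLAP  # assumed initial rest frames
    episode = 0
    while len(motion) < num_frames:
        # The per-episode "action" is reduced to a scalar speed in this toy.
        speed = actions[episode % len(actions)]
        frames = sample_episode(motion[-OVERLAP:], speed, rng)
        motion.extend(frames[OVERLAP:])  # append only the newly generated frames
        episode += 1
    return motion[:num_frames]


if __name__ == "__main__":
    # Alternate between a "walk" episode (speed 1.0) and a "stand" episode (speed 0.0).
    print(generate(30, [1.0, 0.0]))
```

Because each episode only needs the tail of the previous one, the rollout cost grows linearly with the requested length, and the output length is not bounded by the episode length.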