
Scaling Up Dynamic Human-Scene Interaction Modeling (2403.08629v2)

Published 13 Mar 2024 in cs.CV

Abstract: Confronting the challenges of data scarcity and advanced motion synthesis in human-scene interaction modeling, we introduce the TRUMANS dataset alongside a novel HSI motion synthesis method. TRUMANS stands as the most comprehensive motion-captured HSI dataset currently available, encompassing over 15 hours of human interactions across 100 indoor scenes. It intricately captures whole-body human motions and part-level object dynamics, focusing on the realism of contact. This dataset is further scaled up by transforming physical environments into exact virtual models and applying extensive augmentations to appearance and motion for both humans and objects while maintaining interaction fidelity. Utilizing TRUMANS, we devise a diffusion-based autoregressive model that efficiently generates HSI sequences of any length, taking into account both scene context and intended actions. In experiments, our approach shows remarkable zero-shot generalizability on a range of 3D scene datasets (e.g., PROX, Replica, ScanNet, ScanNet++), producing motions that closely mimic original motion-captured sequences, as confirmed by quantitative experiments and human studies.


Summary

  • The paper introduces TRUMANS—a large-scale, motion-captured dataset—and a novel autoregressive model to synthesize realistic human-scene interactions.
  • It employs a diffusion-based autoregressive algorithm conditioned on scene context and action intentions to generate dynamic sequences.
  • Quantitative experiments and human studies validate the method's performance in closely replicating authentic motion-capture data.

Scaling Up Dynamic Human-Scene Interaction Modeling

This paper addresses the study of Human-Scene Interaction (HSI) modeling, particularly the challenges of data scarcity and advanced motion synthesis. It introduces an extensive dataset called TRUMANS and a novel motion synthesis method, both of which contribute significantly to the field of HSI.

The TRUMANS dataset is described as the most comprehensive motion-captured HSI dataset currently available. It encompasses over 15 hours of human interactions across 100 diverse indoor scenes. Motion capture in this dataset includes whole-body human motions and part-level object dynamics, emphasizing the realism of contact. The dataset is further scaled by transforming physical environments into accurate virtual models and applying augmentations to both human and object appearance and motion, while maintaining interaction fidelity.
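The paper does not spell out its augmentation pipeline in this summary, but the core idea, varying objects while keeping interactions faithful, can be sketched. The helper below is a hypothetical illustration: it shifts an end-effector trajectory so that frames marked as in-contact land on the contact point of an augmented (e.g. rescaled) object, blending the offset in and out for smoothness. The function name, the blending scheme, and all shapes are assumptions, not the paper's implementation.

```python
import numpy as np

def retarget_contact(traj, contact, old_pt, new_pt, ramp=5):
    """Shift an end-effector trajectory so in-contact frames land on the
    corresponding point of an augmented object.

    traj:    (T, 3) positions; contact: (T,) boolean mask
    old_pt / new_pt: (3,) contact point on the original vs. augmented object
    The full offset is applied during contact and blended in linearly over
    roughly `ramp` frames on either side so the motion stays smooth.
    """
    offset = np.asarray(new_pt, dtype=float) - np.asarray(old_pt, dtype=float)
    # Moving-average the 0/1 contact mask to create entry/exit ramps,
    # then force full weight on the contact frames themselves.
    kernel = np.ones(2 * ramp + 1) / (2 * ramp + 1)
    w = np.maximum(contact.astype(float),
                   np.convolve(contact.astype(float), kernel, mode="same"))
    return traj + w[:, None] * offset
```

A real pipeline would additionally re-solve limb poses with inverse kinematics after shifting the end effector; this sketch only shows the trajectory-level correction.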

The TRUMANS dataset serves as a foundation for a novel computational model employing a diffusion-based autoregressive algorithm for generating HSI sequences of any length. This model is conditioned on scene context and action intentions, demonstrating remarkable zero-shot generalizability on various 3D scene datasets such as PROX, Replica, ScanNet, and ScanNet++. Quantitative experiments and human studies confirm the model's efficacy: its output closely mimics motion-captured sequences.
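At a high level, an autoregressive diffusion sampler of this kind denoises one fixed-length episode at a time, conditioning each episode on the scene, the action label, and the tail of the previous episode, then stitches the episodes into a sequence of arbitrary length. The sketch below illustrates only this control flow; the denoiser is a trivial placeholder, and all names, shapes, and the overlap scheme are assumptions rather than the paper's architecture.

```python
import numpy as np

def denoise_step(x_t, t, context):
    """Placeholder for a learned denoiser; a real model would predict the
    noise from x_t, the timestep, and the conditioning context."""
    return x_t * (1.0 - 1.0 / (t + 1))

def sample_episode(context, horizon=16, dim=69, steps=50, seed=0):
    """Reverse-diffusion sampling of one episode of `horizon` frames,
    each a `dim`-dimensional pose vector."""
    x = np.random.default_rng(seed).standard_normal((horizon, dim))
    for t in reversed(range(steps)):
        x = denoise_step(x, t, context)
    return x

def generate_motion(scene_feat, action_label, n_episodes=4, overlap=4):
    """Autoregressive rollout: each episode is conditioned on the scene,
    the action, and the tail of the previous episode; the overlapping
    frames are dropped when stitching, yielding arbitrary-length motion."""
    motion, prev_tail = [], None
    for _ in range(n_episodes):
        episode = sample_episode((scene_feat, action_label, prev_tail))
        prev_tail = episode[-overlap:]  # conditioning for the next episode
        motion.append(episode if not motion else episode[overlap:])
    return np.concatenate(motion, axis=0)
```

Because each rollout step only ever samples one short episode, this structure is what lets such a model stream motion of unbounded length.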

A structured review of related work reveals the field's limitations, notably the scarcity of high-quality datasets with realistic HSI. Previous datasets like PiGraphs and PROX initiated exploration but faced constraints in scalability and data quality. MoCap datasets, though high in quality, often lacked environmental interaction diversity. Recent developments in synthetic datasets offered cost efficiencies and adaptability but struggled to fully capture realistic 3D HSIs, especially in dynamic contacts and object tracking.

The introduction of the TRUMANS dataset marks a substantial advance. It provides accurate HSI modeling through extensive motion capture and photorealistic rendering, promising enhancements in human pose and contact estimation tasks. Moreover, a diffusion-based autoregressive motion synthesis method is proposed, generating HSIs conditioned on both 3D scene and action labels. This method excels in producing physically plausible and controllable HSI, achieving arbitrary sequence lengths in real time.

Empirical evaluations highlight the effectiveness of both the dataset and the proposed synthesis method. In static settings, the model trained on TRUMANS surpasses baselines with significant improvements in motion synthesis metrics such as contact and penetration. The dynamic setting exhibits notable performance in handling human-object interactions. A human study further corroborates the superiority of the model, as participants often failed to distinguish the synthesized data from actual motion capture.
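Contact and penetration metrics of this kind are commonly computed against a signed-distance field (SDF) of the scene; the exact definitions used in the paper may differ, so the following is a generic sketch with assumed shapes, thresholds, and a toy floor-plane SDF.

```python
import numpy as np

def penetration_and_contact(joints, scene_sdf, contact_eps=0.05):
    """Evaluate a motion clip against a signed-distance field of the scene.

    joints:    (T, J, 3) joint positions over T frames
    scene_sdf: callable mapping (N, 3) points to signed distances,
               negative inside scene geometry
    Returns the mean penetration depth and the fraction of frames in
    which at least one joint lies within `contact_eps` of a surface.
    """
    T = joints.shape[0]
    d = scene_sdf(joints.reshape(-1, 3)).reshape(T, -1)
    penetration = np.clip(-d, 0.0, None).mean()   # depth inside geometry
    contact = (np.abs(d).min(axis=1) < contact_eps).mean()
    return penetration, contact

# Toy scene: a floor plane at z = 0, whose SDF is simply the z coordinate.
floor_sdf = lambda p: p[:, 2]
```

Lower penetration with a high contact ratio is the desired pattern: the body touches surfaces without sinking into them.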

The implications of this research are notable both theoretically and practically. The introduction of TRUMANS and the corresponding motion synthesis method significantly elevates the quality of HSI modeling. This advancement fosters better generalization in novel environments and suggests potential improvements in related fields such as vision-based perception and interaction modeling. Future developments could see expanded applicability of this fundamental research in various AI-driven tasks requiring sophisticated interaction understanding and prediction.

In conclusion, this paper makes valuable contributions to the field of HSI, offering a robust dataset and an innovative method that mark a step forward in modeling complex human-scene interactions. The research promises to serve as a foundation for continued exploration and enhancement of human interaction capabilities within virtual environments.