
HOI-M3: Capture Multiple Humans and Objects Interaction within Contextual Environment (2404.00299v2)

Published 30 Mar 2024 in cs.CV

Abstract: Humans naturally interact both with others and with multiple surrounding objects, engaging in various social activities. However, recent advances in modeling human-object interactions mostly focus on perceiving isolated individuals and objects, due to fundamental data scarcity. In this paper, we introduce HOI-M3, a novel large-scale dataset for modeling the interactions of Multiple huMans and Multiple objects. Notably, it provides accurate 3D tracking for both humans and objects from dense RGB and object-mounted IMU inputs, covering 199 sequences and 181M frames of diverse humans and objects under rich activities. With the unique HOI-M3 dataset, we introduce two novel data-driven tasks with companion strong baselines: monocular capture and unstructured generation of multiple human-object interactions. Extensive experiments demonstrate that our dataset is challenging and worthy of further research on multiple human-object interactions and behavior analysis. Our HOI-M3 dataset, corresponding code, and pre-trained models will be disseminated to the community for future research.

Authors (9)
  1. Juze Zhang (12 papers)
  2. Jingyan Zhang (4 papers)
  3. Zining Song (2 papers)
  4. Zhanhe Shi (1 paper)
  5. Chengfeng Zhao (6 papers)
  6. Ye Shi (51 papers)
  7. Jingyi Yu (171 papers)
  8. Lan Xu (102 papers)
  9. Jingya Wang (68 papers)
Citations (5)
