HOI-M3: Capture Multiple Humans and Objects Interaction within Contextual Environment (2404.00299v2)
Abstract: Humans naturally interact with one another and with multiple surrounding objects while engaging in various social activities. However, recent advances in modeling human-object interactions mostly focus on perceiving isolated individuals and objects, due to fundamental data scarcity. In this paper, we introduce HOI-M3, a novel large-scale dataset for modeling the interactions of Multiple huMans and Multiple objects. Notably, it provides accurate 3D tracking for both humans and objects from dense RGB and object-mounted IMU inputs, covering 199 sequences and 181M frames of diverse humans and objects under rich activities. With the unique HOI-M3 dataset, we introduce two novel data-driven tasks with strong companion baselines: monocular capture and unstructured generation of multiple human-object interactions. Extensive experiments demonstrate that our dataset is challenging and worthy of further research on multiple human-object interactions and behavior analysis. Our HOI-M3 dataset, corresponding code, and pre-trained models will be released to the community for future research.
Authors: Juze Zhang, Jingyan Zhang, Zining Song, Zhanhe Shi, Chengfeng Zhao, Ye Shi, Jingyi Yu, Lan Xu, Jingya Wang