CG-HOI: Contact-Guided 3D Human-Object Interaction Generation (2311.16097v2)
Abstract: We propose CG-HOI, the first method to address the task of generating dynamic 3D human-object interactions (HOIs) from text. We model the motion of both human and object in an interdependent fashion, as semantically rich human motion rarely happens in isolation without any interactions. Our key insight is that explicitly modeling contact between the human body surface and object geometry can be used as strong proxy guidance, both during training and inference. Using this guidance to bridge human and object motion enables generating more realistic and physically plausible interaction sequences, where the human body and corresponding object move in a coherent manner. Our method first learns to model human motion, object motion, and contact in a joint diffusion process, inter-correlated through cross-attention. We then leverage this learned contact for guidance during inference to synthesize realistic and coherent HOIs. Extensive evaluation shows that our joint contact-based human-object interaction approach generates realistic and physically plausible sequences, and we show two applications highlighting the capabilities of our method. Conditioned on a given object trajectory, we can generate the corresponding human motion without re-training, demonstrating strong human-object interdependency learning. Our approach is also flexible, and can be applied to static real-world 3D scene scans.
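As a rough illustration of how contact can act as guidance, the toy sketch below nudges an object so that its distance to the body matches a predicted contact distance. It is not the paper's implementation: the point-set representation, the function names `nearest_distance` and `guidance_step`, the rigid-translation update, and the `step` parameter are all assumptions made for this example, and the diffusion model that would produce `pred_contact` is omitted.

```python
import math

def nearest_distance(p, obj_pts):
    """Return (distance, nearest object point) for one body point p."""
    best = min(obj_pts, key=lambda q: math.dist(p, q))
    return math.dist(p, best), best

def guidance_step(body_pts, obj_pts, pred_contact, step=0.5):
    """Translate the object to reduce the gap between actual and predicted
    body-to-object distances (hypothetical, simplified guidance rule).

    body_pts:     list of (x, y, z) points sampled on the body surface
    obj_pts:      list of (x, y, z) points sampled on the object
    pred_contact: predicted distance to the object for each body point
    """
    shift = [0.0, 0.0, 0.0]
    for p, target in zip(body_pts, pred_contact):
        d, q = nearest_distance(p, obj_pts)
        if d < 1e-8:
            continue  # already touching; direction is undefined
        residual = d - target  # positive: object is farther than predicted
        for k in range(3):
            # Move along the unit vector from the object point to the body point.
            shift[k] += step * residual * (p[k] - q[k]) / d / len(body_pts)
    return [tuple(q[k] + shift[k] for k in range(3)) for q in obj_pts]

body = [(0.0, 0.0, 0.0)]
obj = [(1.0, 0.0, 0.0)]
moved = guidance_step(body, obj, pred_contact=[0.0])
print(moved)  # [(0.5, 0.0, 0.0)]
```

With a predicted contact distance of zero, each step pulls the object halfway toward the body point. In a diffusion setting, a correction of this kind would presumably be applied across denoising steps rather than once.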
Authors: Christian Diller, Angela Dai