Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance (2403.18036v1)
Abstract: Despite significant advancements in text-to-motion synthesis, generating language-guided human motion within 3D environments poses substantial challenges. These challenges stem primarily from (i) the absence of powerful generative models capable of jointly modeling natural language, 3D scenes, and human motion, and (ii) the generative models' intensive data requirements contrasted with the scarcity of comprehensive, high-quality language-scene-motion datasets. To tackle these issues, we introduce a novel two-stage framework that employs scene affordance as an intermediate representation, effectively linking 3D scene grounding and conditional motion generation. Our framework comprises an Affordance Diffusion Model (ADM) for predicting an explicit affordance map and an Affordance-to-Motion Diffusion Model (AMDM) for generating plausible human motions. By leveraging scene affordance maps, our method overcomes the difficulty of generating human motion under multimodal condition signals, especially when training with limited data lacking extensive language-scene-motion pairs. Our extensive experiments demonstrate that our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE. Additionally, we validate our model's exceptional generalization capabilities on a specially curated evaluation set featuring previously unseen descriptions and scenes.
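The two-stage structure described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names (`sample`, `toy_denoiser`, `generate_motion`), the distance-based affordance rule, and the 1D "motion" are all hypothetical stand-ins chosen so the example runs with the standard library only. It shows only the data flow, in which a first sampler produces a per-point affordance map from the scene and a second sampler produces a motion conditioned on that map.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Point = Tuple[float, float, float]
# A denoiser maps (noisy sample, timestep) to a slightly cleaner sample.
Denoiser = Callable[[Vector, int], Vector]


def sample(denoiser: Denoiser, dim: int, steps: int, rng: random.Random) -> Vector:
    """Generic reverse process: start from Gaussian noise, denoise step by step."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(steps)):
        x = denoiser(x, t)
    return x


def toy_denoiser(target: Sequence[float]) -> Denoiser:
    """Stand-in for a trained network: moves the sample halfway toward `target`.

    A real denoiser would be learned and would use the timestep `t`.
    """
    def step(x: Vector, t: int) -> Vector:
        return [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]
    return step


def generate_motion(
    scene_points: Sequence[Point],
    text: str,
    n_frames: int,
    steps: int = 20,
    seed: int = 0,
) -> Tuple[Vector, Vector]:
    """Two-stage sampling: affordance map first, then motion (n_frames >= 2)."""
    rng = random.Random(seed)

    # Stage 1 (plays the ADM role): scene + language -> one value per scene point.
    # Assumption: the toy ignores `text` and scores points by closeness to the origin.
    affordance_target = [1.0 / (1.0 + sum(c * c for c in p)) for p in scene_points]
    affordance = sample(toy_denoiser(affordance_target), len(scene_points), steps, rng)

    # Stage 2 (plays the AMDM role): affordance map + language -> motion.
    # Assumption: the "motion" is a 1D path from x = 0 to the x-coordinate of the
    # point with the highest affordance value.
    best = max(range(len(scene_points)), key=lambda i: affordance[i])
    goal_x = scene_points[best][0]
    motion_target = [goal_x * f / (n_frames - 1) for f in range(n_frames)]
    motion = sample(toy_denoiser(motion_target), n_frames, steps, rng)
    return affordance, motion
```

Calling `generate_motion(points, "walk to the chair", n_frames=5)` returns the affordance values and the frame-by-frame path. The design point the sketch is meant to convey is that the second stage never sees the raw scene, only the intermediate affordance map.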
Authors:
- Zan Wang
- Yixin Chen
- Baoxiong Jia
- Puhao Li
- Jinlu Zhang
- Jingze Zhang
- Tengyu Liu
- Yixin Zhu
- Wei Liang
- Siyuan Huang