LGTM: Local-to-Global Text-Driven Human Motion Diffusion Model (2405.03485v1)
Abstract: In this paper, we introduce LGTM, a novel Local-to-Global pipeline for Text-to-Motion generation. LGTM uses a diffusion-based architecture to address the challenge of accurately translating textual descriptions into semantically coherent human motion in computer animation. Traditional methods often suffer from semantic discrepancies, particularly in aligning specific motions to the correct body parts. To overcome this, we propose a two-stage pipeline: it first employs large language models (LLMs) to decompose global motion descriptions into part-specific narratives, which are then processed by independent body-part motion encoders to ensure precise local semantic alignment; finally, an attention-based full-body optimizer refines the generated motion and guarantees overall coherence. Our experiments demonstrate that LGTM achieves significant improvements in generating locally accurate, semantically aligned human motion, marking a notable advancement in text-to-motion applications. Code and data for this paper are available at https://github.com/L-Sun/LGTM
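The local-to-global flow described above can be sketched as plain pseudocode-style Python. This is a minimal toy illustration, not the authors' implementation: every function name here is hypothetical, the "LLM" stage is replaced by a fixed mapping, and the encoders and optimizer are stand-ins for the paper's diffusion-based part encoders and attention-based full-body optimizer.

```python
# Toy sketch of LGTM's local-to-global pipeline (all names hypothetical).

def decompose_description(text):
    """Stand-in for the LLM stage: split a global motion description
    into per-body-part narratives (here, a fixed toy mapping)."""
    parts = ["torso", "left_arm", "right_arm", "left_leg", "right_leg"]
    return {part: f"{part}: {text}" for part in parts}

def encode_part(narrative):
    """Stand-in for an independent body-part motion encoder:
    maps a narrative to a toy 1-D latent (its character count)."""
    return [float(len(narrative))]

def full_body_optimize(part_latents):
    """Stand-in for the attention-based full-body optimizer:
    blends part latents with uniform weights for overall coherence."""
    n = len(part_latents)
    return [sum(v[0] for v in part_latents.values()) / n]

def generate_motion(text):
    # Stage 1: local — decompose, then encode each body part independently.
    narratives = decompose_description(text)
    latents = {p: encode_part(n) for p, n in narratives.items()}
    # Stage 2: global — refine into one coherent full-body result.
    return full_body_optimize(latents)

motion = generate_motion("a person waves with the right hand while walking")
```

The point of the structure, per the abstract, is that part-level semantic alignment happens before any full-body reasoning, so each encoder only has to match text to its own body part.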