Motion Generation from Fine-grained Textual Descriptions (2403.13518v2)

Published 20 Mar 2024 in cs.AI, cs.CL, cs.CV, and cs.RO

Abstract: The task of text2motion is to generate human motion sequences from given textual descriptions, where the model explores diverse mappings from natural language instructions to human body movements. While most existing works are confined to coarse-grained motion descriptions, e.g., "A man squats.", fine-grained descriptions specifying the movements of relevant body parts are barely explored. Models trained on coarse-grained texts may fail to learn mappings from fine-grained motion-related words to motion primitives, and therefore fail to generate motions from unseen descriptions. In this paper, we build FineHumanML3D, a large-scale language-motion dataset specializing in fine-grained textual descriptions, by feeding GPT-3.5-turbo step-by-step instructions with compulsory pseudo-code checks. Accordingly, we design a new text2motion model, FineMotionDiffuse, which makes full use of fine-grained textual information. Our quantitative evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38 compared with competitive baselines. According to our qualitative evaluation and case study, the model outperforms MotionDiffuse in generating spatially or chronologically composite motions, by learning the implicit mappings from fine-grained descriptions to the corresponding basic motions. We release our data at https://github.com/KunhangL/finemotiondiffuse.
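To make the data-construction idea concrete, below is a minimal sketch of the kind of pipeline the abstract describes: prompting GPT-3.5-turbo to expand a coarse caption into body-part-level steps, then rejecting outputs that fail a programmatic check. It assumes the OpenAI Python client (v1.x) with an OPENAI_API_KEY in the environment; the prompt wording and the check_steps() validator are hypothetical stand-ins, not the paper's actual instructions or pseudo-code checks.

```python
# Illustrative sketch of a fine-grained caption-expansion pipeline.
# The prompt and validator below are assumptions for illustration only,
# not the prompts or checks used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "Rewrite the following motion description as numbered steps. "
    "Each step must state how the torso, arms, and legs move.\n"
    "Description: {caption}"
)

def expand_caption(caption: str) -> str:
    """Ask GPT-3.5-turbo for a fine-grained, step-by-step rewrite."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(caption=caption)}],
        temperature=0.2,
    )
    return response.choices[0].message.content

def check_steps(text: str) -> bool:
    """Toy stand-in for a compulsory check: every non-empty line of the
    rewrite must mention at least one body part."""
    parts = ("torso", "arm", "leg", "hand", "foot", "knee", "elbow", "hip")
    steps = [line for line in text.splitlines() if line.strip()]
    return bool(steps) and all(
        any(p in line.lower() for p in parts) for line in steps
    )

def generate_fine_grained(caption: str, max_tries: int = 3) -> str | None:
    """Regenerate until the output passes the check, up to max_tries."""
    for _ in range(max_tries):
        fine = expand_caption(caption)
        if check_steps(fine):
            return fine
    return None

if __name__ == "__main__":
    result = generate_fine_grained("A man squats.")
    print(result or "No valid fine-grained description produced.")
```

The retry-until-valid loop mirrors the role of the paper's compulsory checks: machine-verifiable constraints filter out LLM outputs that omit body-part detail before they enter the dataset.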

References (25)
  1. Chaitanya Ahuja and Louis-Philippe Morency. 2019. Language2Pose: Natural language grounded pose forecasting. 2019 International Conference on 3D Vision (3DV), pages 719–728.
  2. TEACH: Temporal action composition for 3D humans. 2022 International Conference on 3D Vision (3DV), pages 414–423.
  3. SINC: Spatial composition of 3D human motions for simultaneous action generation. ArXiv, abs/2304.10417.
  4. Language models are few-shot learners. ArXiv, abs/2005.14165.
  5. How robust is GPT-3.5 to predecessors? A comprehensive study on language understanding tasks. ArXiv, abs/2303.00293.
  6. Synthesis of compositional animations from textual descriptions. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1376–1386.
  7. TM2T: Stochastic and tokenized modeling for the reciprocal generation of 3D human motions and texts. ArXiv, abs/2207.01696.
  8. Generating diverse and natural 3D human motions from text. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5142–5151.
  9. Action2Motion: Conditioned generation of 3D human motions. Proceedings of the 28th ACM International Conference on Multimedia.
  10. A motion matching-based framework for controllable gesture synthesis from speech. ACM SIGGRAPH 2022 Conference Proceedings.
  11. Stochastic scene-aware motion prediction. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11354–11364.
  12. Action-GPT: Leveraging large-scale language models for improved and generalized zero-shot action generation. ArXiv, abs/2211.15603.
  13. FLAME: Free-form language-based motion synthesis & editing. ArXiv, abs/2209.00349.
  14. AI Choreographer: Music conditioned 3D dance generation with AIST++. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 13381–13392.
  15. OpenAI. 2023. GPT-4 technical report. ArXiv, abs/2303.08774.
  16. TEMOS: Generating diverse human motions from textual descriptions. ArXiv, abs/2204.14109.
  17. The KIT motion-language dataset. Big Data, 4(4):236–252.
  18. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
  19. Human motion diffusion model. ArXiv, abs/2209.14916.
  20. Efficient diffusion models for vision: A survey. ArXiv, abs/2210.09292.
  21. Attention is all you need. Advances in Neural Information Processing Systems, 30.
  22. Chain-of-thought prompting elicits reasoning in large language models. ArXiv, abs/2201.11903.
  23. Diffusion models: A comprehensive survey of methods and applications. ArXiv, abs/2209.00796.
  24. T2M-GPT: Generating human motion from textual descriptions with discrete representations. ArXiv, abs/2301.06052.
  25. MotionDiffuse: Text-driven human motion generation with diffusion model. ArXiv, abs/2208.15001.
Authors (2)
  1. Kunhang Li (3 papers)
  2. Yansong Feng (81 papers)
