
GUESS:GradUally Enriching SyntheSis for Text-Driven Human Motion Generation

Published 4 Jan 2024 in cs.CV (arXiv:2401.02142v2)

Abstract: In this paper, we propose a novel cascaded diffusion-based generative framework for text-driven human motion synthesis, which exploits a strategy named GradUally Enriching SyntheSis (GUESS for short). The strategy sets up generation objectives by grouping body joints of detailed skeletons in close semantic proximity and then replacing each such joint group with a single body-part node. This operation recursively abstracts a human pose into coarser and coarser skeletons at multiple granularity levels. As the abstraction level increases, human motion becomes more concise and stable, significantly benefiting the cross-modal motion synthesis task. The whole text-driven human motion synthesis problem is then divided into multiple abstraction levels and solved with a multi-stage generation framework built on a cascaded latent diffusion model: an initial generator first produces the coarsest human motion guess from a given text description; then, a series of successive generators gradually enrich the motion details based on the textual description and the previously synthesized results. Notably, we further integrate GUESS with a proposed dynamic multi-condition fusion mechanism to dynamically balance the cooperative effects of the given textual condition and the synthesized coarse motion prompt at different generation stages. Extensive experiments on large-scale datasets verify that GUESS outperforms existing state-of-the-art methods by large margins in terms of accuracy, realism, and diversity. Code is available at https://github.com/Xuehao-Gao/GUESS.


Summary

  • The paper presents a cascaded diffusion-based framework that incrementally refines human motion synthesis from coarse abstractions to detailed actions based on text.
  • It employs a latent conditional diffusion model combined with dynamic multi-condition fusion to balance textual cues with synthesized motion prompts at each refinement stage.
  • Experimental results on HumanML3D and KIT-ML datasets show notable improvements in R-Precision and FID scores, enhancing accuracy, realism, and output diversity.

Exploration of GradUally Enriching SyntheSis for Text-Driven Human Motion Generation

The paper "GUESS: GradUally Enriching SyntheSis for Text-Driven Human Motion Generation" presents a novel approach to human motion synthesis driven by textual descriptions. Its primary contribution is a cascaded diffusion-based generative framework, termed GUESS, that enhances the quality of text-driven motion generation by gradually enriching the synthesis across granularity levels.

GUESS introduces a structured approach to human motion synthesis by breaking the problem into multiple abstraction levels. Instead of generating detailed joint-based motion directly from text, a practice hampered by the disparity between textual and motion modalities, GUESS abstracts the skeleton into progressively coarser representations and then generates motion from coarse to fine. It begins with a coarse generation capturing key motion characteristics and then progressively enriches the motion's details through a multi-stage process. This gradual coarse-to-fine refinement stabilizes the motion synthesis process and yields better alignment with the cross-modal (text-to-motion) objective.
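The recursive abstraction described above can be sketched as pooling semantically related joints into single body-part nodes. The joint groupings and the mean-pooling rule here are illustrative assumptions, not the paper's exact partition or abstraction operator:

```python
def abstract_pose(pose, groups):
    """Replace each semantic joint group with a single body-part node,
    here taken as the mean 3D position of the group's joints.
    `pose` maps joint name -> (x, y, z); `groups` maps part name -> joints."""
    coarse = {}
    for part, joints in groups.items():
        pts = [pose[j] for j in joints]
        coarse[part] = tuple(sum(p[i] for p in pts) / len(pts) for i in range(3))
    return coarse

# A toy 6-joint lower-body skeleton abstracted to 3 body-part nodes.
pose = {
    "hip": (0.0, 1.0, 0.0), "spine": (0.0, 1.3, 0.0),
    "l_knee": (-0.2, 0.5, 0.0), "l_foot": (-0.2, 0.0, 0.0),
    "r_knee": (0.2, 0.5, 0.0), "r_foot": (0.2, 0.0, 0.0),
}
groups = {
    "torso": ["hip", "spine"],
    "l_leg": ["l_knee", "l_foot"],
    "r_leg": ["r_knee", "r_foot"],
}
coarse = abstract_pose(pose, groups)
```

Applying the same operation to the coarse output (e.g. merging all parts into one "body" node) yields the next, even coarser abstraction level, which is how a multi-level hierarchy is built.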

The proposed method integrates a latent conditional diffusion model that sequentially processes human motion at various levels using a multi-stage generation framework. Initially, the system generates a "coarse motion guess" from textual input, which serves as a baseline upon which further refinements are built. This refinement is achieved via successive generators, each responsible for adding more detail to the motion while being guided by both textual descriptions and previous synthetic results.
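The control flow of that cascade can be sketched as follows; the generator internals are stand-ins (the paper's generators are latent diffusion models), and only the conditioning pattern, text alone first, then text plus the coarser result, reflects the description above:

```python
def run_cascade(text_emb, generators):
    """Coarse-to-fine cascade: the first generator produces the coarsest
    motion from text alone; each later stage refines the previous stage's
    output conditioned on both the text and the coarser motion."""
    motion = generators[0](text_emb, None)
    for gen in generators[1:]:
        motion = gen(text_emb, motion)
    return motion

# Toy stand-in generators: each stage appends its level of detail.
g0 = lambda text, prev: [f"coarse({text})"]
g1 = lambda text, prev: prev + [f"mid({text})"]
g2 = lambda text, prev: prev + [f"fine({text})"]
result = run_cascade("walk", [g0, g1, g2])
```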

Further enhancing the GUESS framework is a dynamic multi-condition fusion mechanism, which adjusts how strongly the textual condition and the synthesized coarse motion prompt influence each generation stage. This fusion is critical: it lets the model give contextually appropriate weight to each input, optimizing synthesis across stages.

Experimental validation on large-scale datasets demonstrates the effectiveness of GUESS, showing that it outperforms current state-of-the-art methods in accuracy, realism, and diversity by notable margins. This suggests that the structured processing and multi-level abstraction strategy provide a significant advantage in tackling the complexities of cross-modal motion synthesis.

Numerical Results and Key Observations

The paper reports comprehensive evaluations, with marked improvements in R-Precision and FID scores indicating better text-motion alignment and realism, respectively. On the HumanML3D and KIT-ML datasets, the method achieves both higher retrieval accuracy and higher motion fidelity. Additionally, diversity and multimodality metrics underscore its capacity to produce varied outputs from identical textual descriptions, a critical facet of realistic motion generation.
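For context, R-Precision in text-to-motion benchmarks is commonly computed by checking whether each generated motion retrieves its ground-truth text among the top-k matches in a shared embedding space. The sketch below assumes a precomputed similarity matrix where the diagonal entries are the true pairs:

```python
def r_precision_at_k(sim, k):
    """Fraction of motions whose ground-truth text (the matching index i)
    ranks among the top-k most similar texts. `sim[i][j]` is the similarity
    between motion i and text j; pair (i, i) is the true match."""
    hits = 0
    for i, row in enumerate(sim):
        topk = sorted(range(len(row)), key=lambda j: -row[j])[:k]
        hits += i in topk
    return hits / len(sim)

# Toy 3x3 similarity matrix: two motions retrieve their text at rank 1,
# the third only at rank 2.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.2, 0.4],
]
top1 = r_precision_at_k(sim, k=1)
top2 = r_precision_at_k(sim, k=2)
```

FID, by contrast, compares the feature-space distributions of generated and real motions, so the two metrics together capture both alignment (retrieval) and realism (distribution match).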

Implications and Future Directions

The implications of GUESS extend into various domains such as virtual reality, gaming, and animation, where the need for high-quality, text-driven motion synthesis is paramount. Practically, GUESS could pave the way for more interactive and responsive virtual environments where user interaction can dynamically influence character motions.

Theoretically, the research sets the stage for further exploration into multi-stage generation strategies. Future work may explore adaptive strategies for selecting abstraction levels based on input specifics, or even broaden its applicability to incorporate other modalities like audio or real-time response systems.

In summary, GUESS embodies a sophisticated and structured approach to human motion synthesis, offering a fresh perspective and substantial results. It sets a solid foundation for advancing interactive AI, ensuring that the procedural relationship between input and generated outputs is both coherent and scalable. The paper's insights can propel the development of more nuanced and user-aligned generative models across various fields in artificial intelligence.
