GUESS: GradUally Enriching SyntheSis for Text-Driven Human Motion Generation
Abstract: In this paper, we propose a novel cascaded diffusion-based generative framework for text-driven human motion synthesis, built on a strategy named GradUally Enriching SyntheSis (abbreviated GUESS). The strategy sets up generation objectives by grouping body joints of a detailed skeleton that lie in close semantic proximity and replacing each such joint group with a single body-part node. Applied recursively, this operation abstracts a human pose into coarser and coarser skeletons at multiple granularity levels. As the abstraction level increases, human motion becomes more concise and stable, which significantly benefits the cross-modal motion synthesis task. The text-driven human motion synthesis problem is then divided across these abstraction levels and solved with a multi-stage generation framework built on a cascaded latent diffusion model: an initial generator first produces the coarsest motion guess from a given text description; a series of successive generators then gradually enrich the motion details based on the textual description and the previously synthesized results. Notably, we further integrate GUESS with a dynamic multi-condition fusion mechanism that adaptively balances the cooperative effects of the given textual condition and the synthesized coarse motion prompt at different generation stages. Extensive experiments on large-scale datasets verify that GUESS outperforms existing state-of-the-art methods by large margins in terms of accuracy, realism, and diversity. Code is available at https://github.com/Xuehao-Gao/GUESS.
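The core abstraction step described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes each body-part node is the centroid of its joint group, and the joint indices and groupings shown are hypothetical examples (the paper's actual skeleton layout and pooling may differ).

```python
import numpy as np

def abstract_pose(joints: np.ndarray, groups: list[list[int]]) -> np.ndarray:
    """Abstract a pose to a coarser skeleton.

    joints: (J, 3) array of 3D joint positions.
    groups: each inner list holds the joint indices of one semantically
            close group; the group is replaced by a single body-part node
            (here, its centroid).
    Returns an (len(groups), 3) array of coarse body-part nodes.
    """
    return np.stack([joints[g].mean(axis=0) for g in groups])

# Toy 6-joint pose; values are arbitrary, for illustration only.
pose = np.arange(18, dtype=float).reshape(6, 3)

# Level 1: hypothetical grouping of 6 joints into 3 body parts.
level1 = abstract_pose(pose, [[0, 1], [2, 3], [4, 5]])   # shape (3, 3)

# Level 2: applying the same operation recursively yields an even
# coarser, single-node "whole body" abstraction.
level2 = abstract_pose(level1, [[0, 1, 2]])              # shape (1, 3)
```

Repeating this pooling per frame over a motion sequence yields the multi-granularity motion representations that each stage of the cascade generates and refines.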