Large Motion Model: A Unified Framework for Multi-Modal Motion Generation
Introduction to Large Motion Model (LMM)
The generation of human motion plays a pivotal role in applications ranging from animation to human-computer interaction. Current approaches often specialize in a single task, such as text-to-motion or music-to-dance generation, limiting their scope and scalability. Addressing this limitation, the Large Motion Model (LMM) introduces a unified, multi-modal framework that combines various motion generation paradigms, including text-to-motion, music-to-dance, action-to-motion, and more, into a single, cohesive model. This unification allows the model to leverage extensive motion data, promoting broad generalization and scalability across tasks. However, it must also contend with heterogeneous motion data formats and distinct requirements across tasks.
MotionVerse: The Unified Dataset
A crucial step toward tackling these challenges is the creation of MotionVerse, a mega-scale motion generation dataset. MotionVerse amalgamates data from 16 datasets, totaling over 320,000 sequences and 100 million frames, covering ten tasks. It consolidates disparate motion formats and evaluation metrics into a single standard, enabling straightforward model training and evaluation across tasks. This standardization is achieved via the TOMATO representation, which acts as a bridge between various forms of motion data, together with dedicated representation translators that convert task-specific motion representations into the unified format; a simplified sketch of such a translator follows below.
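To make the idea of a representation translator concrete, the sketch below maps a dataset-specific skeleton onto a shared canonical layout. This is not the paper's actual TOMATO format: the canonical frame rate, joint count, the `UnifiedMotion` container, and the `translate` helper are all illustrative assumptions.

```python
import numpy as np
from dataclasses import dataclass, field

# Hypothetical canonical settings for the unified format (assumptions, not the paper's values).
CANONICAL_FPS = 30
CANONICAL_JOINTS = 22  # e.g., an SMPL-like body joint count

@dataclass
class UnifiedMotion:
    """A single sequence in a shared, dataset-agnostic layout."""
    joints: np.ndarray          # (num_frames, CANONICAL_JOINTS, 3) joint positions
    valid_mask: np.ndarray      # (num_frames, CANONICAL_JOINTS), 1 = observed, 0 = missing
    fps: int = CANONICAL_FPS
    condition: dict = field(default_factory=dict)  # e.g. {"text": "...", "audio": ...}

def translate(raw_joints: np.ndarray, raw_fps: int, joint_map: list) -> UnifiedMotion:
    """Map a dataset-specific skeleton onto the canonical one.

    raw_joints: (T, J_raw, 3) positions in the source dataset's skeleton.
    joint_map:  for each canonical joint, the source index or None if unavailable.
    """
    # 1) Resample in time to the canonical frame rate (nearest-frame resampling).
    T = raw_joints.shape[0]
    new_T = max(1, int(round(T * CANONICAL_FPS / raw_fps)))
    src_idx = np.clip(np.round(np.linspace(0, T - 1, new_T)).astype(int), 0, T - 1)
    resampled = raw_joints[src_idx]

    # 2) Re-index joints onto the canonical skeleton, masking joints the
    #    source dataset does not provide.
    joints = np.zeros((new_T, CANONICAL_JOINTS, 3), dtype=np.float32)
    mask = np.zeros((new_T, CANONICAL_JOINTS), dtype=np.float32)
    for canon_j, src_j in enumerate(joint_map):
        if src_j is not None:
            joints[:, canon_j] = resampled[:, src_j]
            mask[:, canon_j] = 1.0
    return UnifiedMotion(joints=joints, valid_mask=mask)
```

The per-joint validity mask is what lets sequences from datasets with incompatible skeletons or missing modalities be trained on together without discarding data.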
Architectural Highlights
At the heart of LMM lies the ArtAttention mechanism, integrated within a transformer-based diffusion model. This mechanism enables detailed, body part-aware motion generation, significantly refining the model's generative capabilities. It handles the inherent heterogeneity of motion data through:
- Body Part-aware Modeling: Decomposing motion data into separate body-part segments that are processed independently (see the sketch after this list).
- Flexible Conditioning: Supporting multi-condition inputs, facilitating robust generation capabilities even for unseen tasks.
- Pre-Training Strategies: Utilizing diverse motion data through innovative training techniques like variable frame rates and masking, enhancing the model's generalization ability.
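As a rough illustration of body part-aware modeling, the following sketch splits a motion feature sequence into per-part streams and runs temporal self-attention on each stream independently. The part list, the channel split, and the `BodyPartAttention` module are assumptions made for illustration, not the paper's ArtAttention implementation.

```python
import torch
import torch.nn as nn

class BodyPartAttention(nn.Module):
    """Simplified sketch of body part-aware attention.

    The motion feature sequence is split along the channel axis into
    per-part streams, each stream attends over time on its own, and the
    results are concatenated back together.
    """
    # Hypothetical partition of the feature channels into body parts.
    PARTS = ("torso", "left_arm", "right_arm", "left_leg", "right_leg")

    def __init__(self, dim_per_part: int = 64, num_heads: int = 4):
        super().__init__()
        self.dim_per_part = dim_per_part
        self.attn = nn.ModuleDict({
            part: nn.MultiheadAttention(dim_per_part, num_heads, batch_first=True)
            for part in self.PARTS
        })

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim_per_part * num_parts)
        chunks = torch.split(x, self.dim_per_part, dim=-1)
        out = []
        for part, h in zip(self.PARTS, chunks):
            attended, _ = self.attn[part](h, h, h)  # temporal self-attention per part
            out.append(attended)
        return torch.cat(out, dim=-1)

# Example: a batch of 2 sequences, 60 frames, 5 parts x 64 channels each.
x = torch.randn(2, 60, 64 * 5)
y = BodyPartAttention()(x)
print(y.shape)  # torch.Size([2, 60, 320])
```

Keeping the parts in separate attention streams means a dataset that only annotates, say, the upper body can still contribute training signal for those streams without corrupting the others.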
Empirical Evaluations
Extensive experiments across various benchmarks demonstrate LMM's strong performance, particularly in text-to-motion and music-to-dance generation. Notably, LMM achieves results competitive with specialized single-task models while handling a far broader range of motion generation tasks, underscoring its generalization ability.
Theoretical and Practical Implications
LMM's comprehensive approach yields several key insights for future research in large motion models. For instance, the effectiveness of body part-aware attention underscores the importance of modeling motion data at a granular level. Similarly, the proposed pre-training strategy illustrates the potential of unsupervised learning for exploiting large-scale, diverse datasets; a minimal sketch of such a masking-based objective follows below.
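As a minimal sketch of masking-based pre-training of the kind described above, the snippet below hides a random subset of frames and would ask the model to reconstruct them. The `random_frame_mask` helper and the masking ratio are illustrative assumptions, not the paper's exact scheme.

```python
import torch

def random_frame_mask(motion: torch.Tensor, mask_ratio: float = 0.4):
    """Mask a random subset of frames for reconstruction-style pre-training.

    motion: (batch, time, dim). Returns the corrupted sequence and the
    boolean mask (True where a frame was hidden) used as the training target.
    """
    batch, time, _ = motion.shape
    mask = torch.rand(batch, time) < mask_ratio   # True = frame is masked
    corrupted = motion.clone()
    corrupted[mask] = 0.0                         # replace masked frames with zeros
    return corrupted, mask

# During pre-training, the model would be asked to reconstruct the original
# frames at the masked positions, e.g.:
#   corrupted, mask = random_frame_mask(batch_motion)
#   loss = ((model(corrupted) - batch_motion)[mask] ** 2).mean()
```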
Future Perspectives in AI and Motion Generation
The development and success of LMM hint at a promising direction for future work in generative AI, particularly in the domain of motion generation. By demonstrating the feasibility and effectiveness of a unified approach to multi-modal motion generation, LMM paves the way for more sophisticated, generalist models capable of tackling a wide variety of motion generation tasks with unprecedented flexibility and power.
Conclusion
The Large Motion Model marks a significant milestone in the journey toward unified, multi-modal motion generation, offering both a novel architectural blueprint and a comprehensive dataset in MotionVerse. Its implications stretch far beyond current applications, promising exciting developments in the realms of AI and human-computer interaction.