GENMO: A GENeralist Model for Human MOtion (2505.01425v1)

Published 2 May 2025 in cs.GR, cs.AI, cs.CV, cs.LG, and cs.RO

Abstract: Human motion modeling traditionally separates motion generation and estimation into distinct tasks with specialized models. Motion generation models focus on creating diverse, realistic motions from inputs like text, audio, or keyframes, while motion estimation models aim to reconstruct accurate motion trajectories from observations like videos. Despite sharing underlying representations of temporal dynamics and kinematics, this separation limits knowledge transfer between tasks and requires maintaining separate models. We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Leveraging the synergy between regression and diffusion, GENMO achieves accurate global motion estimation while enabling diverse motion generation. We also introduce an estimation-guided training objective that exploits in-the-wild videos with 2D annotations and text descriptions to enhance generative diversity. Furthermore, our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control. This unified approach creates synergistic benefits: generative priors improve estimated motions under challenging conditions like occlusions, while diverse video data enhances generation capabilities. Extensive experiments demonstrate GENMO's effectiveness as a generalist framework that successfully handles multiple human motion tasks within a single model.

Summary

GENMO: A Unified Framework for Human Motion Estimation and Generation

The paper introduces GENMO, a Generalist Model for Human Motion that handles both motion estimation and motion generation within a single framework. Human motion modeling has traditionally treated these as distinct tasks, each with its own specialized models. Motion generation creates diverse, realistic motions from inputs such as text, audio, or keyframes, while motion estimation reconstructs accurate motion trajectories from observations such as videos. GENMO unifies the two by reformulating motion estimation as constrained motion generation, in which the output must precisely satisfy the observed conditioning signals. It does so by combining regression and diffusion within one model.

Architectural Overview

The architecture of GENMO is built on diffusion models and uses a dual-mode training paradigm with an estimation mode and a generation mode. In estimation mode, the model is trained with maximum likelihood estimation so that its output reproduces the observed motion precisely. This keeps the result aligned with the input video, as the near-deterministic nature of estimation requires. In generation mode, a standard diffusion objective is used instead, allowing diverse motions to be generated from abstract conditions such as text or music. The key contribution is balancing these two objectives so that knowledge transfers between the tasks through shared representations.
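The two training modes can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: `toy_denoiser`, the weight matrix `W`, the cosine schedule, the clean-motion (x0) prediction target, and the choice to feed pure noise at the final timestep in estimation mode are all assumptions made for illustration. Under a fixed-variance Gaussian likelihood, maximum likelihood estimation reduces to a mean squared error, which is what the estimation loss below computes.

```python
import numpy as np

def cosine_alpha_bar(t, T):
    """Cumulative signal level for an assumed cosine noise schedule."""
    return np.cos(0.5 * np.pi * t / T) ** 2

def toy_denoiser(x_t, t_frac, cond, W):
    """Hypothetical stand-in for the network: a linear map that predicts
    the clean motion x0 from the noisy motion, the condition, and time."""
    t_col = np.full((x_t.shape[0], 1), t_frac)
    feats = np.concatenate([x_t, cond, t_col], axis=1)
    return feats @ W

def generation_loss(x0, cond, W, T, rng):
    """Standard diffusion objective: noise x0 to a random timestep and
    regress the clean motion from the noisy input."""
    t = rng.integers(1, T)  # random timestep in 1..T-1
    a = cosine_alpha_bar(t, T)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    pred = toy_denoiser(x_t, t / T, cond, W)
    return float(np.mean((pred - x0) ** 2))

def estimation_loss(x0, cond, W, rng):
    """Estimation-mode objective (assumed form): start from pure noise at
    the final timestep, so the prediction is driven by the condition, and
    score it with MSE (Gaussian maximum likelihood with fixed variance)."""
    x_T = rng.standard_normal(x0.shape)
    pred = toy_denoiser(x_T, 1.0, cond, W)
    return float(np.mean((pred - x0) ** 2))

rng = np.random.default_rng(0)
frames, motion_dim, cond_dim, T = 16, 6, 4, 1000
x0 = rng.standard_normal((frames, motion_dim))    # toy ground-truth motion
cond = rng.standard_normal((frames, cond_dim))    # toy per-frame condition
W = 0.01 * rng.standard_normal((motion_dim + cond_dim + 1, motion_dim))

# A training step would pick the loss matching the batch's condition type,
# e.g. estimation mode for video batches, generation mode for text or music.
loss_est = estimation_loss(x0, cond, W, rng)
loss_gen = generation_loss(x0, cond, W, T, rng)
```

Both losses share the same network, which is the point of the dual-mode design: the weights trained for precise reconstruction are the same ones used for diverse sampling.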

GENMO's architecture integrates various modalities, effectively handling videos, text descriptions, music, and both 2D and 3D keyframes by employing a rotary positional embedding (RoPE) and multi-text attention mechanisms. This enables variable-length sequence processing, accommodating diverse inputs without complex post-processing steps. Such architectural innovations provide seamless integration of multimodal conditions, allowing for fine-grained control over generated motion sequences.
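To show why rotary positional embeddings suit variable-length sequences, here is a minimal NumPy sketch of the standard RoPE rotation. The function name `rope`, the interleaved pairing of feature dimensions, and the base of 10000 are common conventions assumed here, not details taken from the paper.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles.

    x:         (seq_len, dim) array with even dim
    positions: (seq_len,) array of frame indices
    """
    x = np.asarray(x, dtype=float)
    positions = np.asarray(positions, dtype=float)
    dim = x.shape[1]
    assert dim % 2 == 0, "feature dimension must be even"
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # one frequency per pair
    angles = positions[:, None] * inv_freq[None, :]   # (seq_len, dim / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin         # 2D rotation per pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))  # toy query vector
k = rng.standard_normal((1, 8))  # toy key vector

# Attention scores depend only on the offset between positions (here 4),
# not on where the pair sits in the sequence.
score_near_start = (rope(q, [3]) @ rope(k, [7]).T).item()
score_shifted = (rope(q, [13]) @ rope(k, [17]).T).item()
```

Because the score between two frames depends only on their relative offset, no fixed maximum length is built into the encoding, and the same attention weights can be applied to sequences of different lengths.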

Evaluation and Results

Extensive empirical evaluation shows that GENMO handles multiple human motion tasks within a single model, achieving state-of-the-art performance across diverse benchmarks. In global motion estimation, GENMO outperforms specialized models, and its generative priors yield more plausible estimates, particularly under occlusion and in dynamic scenes. In local motion estimation, GENMO compares favorably with leading models, remaining robust and producing fewer artifacts even under challenging conditions.

For motion generation, GENMO excels in music-to-dance synthesis and text-conditioned motion generation, displaying enhanced diversity and physical plausibility. In particular, the dual-mode training paradigm and the estimation-guided objective contribute significantly to generation quality that surpasses earlier diffusion models.

Implications and Future Directions

GENMO's unified framework opens promising avenues in human motion modeling, blending accurate estimation with expressive generation capabilities. The research underscores a shift towards models that handle multimodal inputs in a generalized manner, a crucial requirement in applications such as animation, virtual reality, and autonomous systems where human-like motion is vital.

However, there are limitations and areas for further exploration. While GENMO successfully integrates motion estimation and generation, extending it to facial expressions and hand articulation remains an open challenge. Additionally, GENMO relies on external SLAM methods to extract camera parameters. Estimating these within GENMO itself would broaden its applicability across varied environments and remove the dependence on a separate preprocessing step.

Overall, GENMO represents a significant advancement in unified human motion modeling, offering a versatile approach that synergizes the benefits of both motion generation and estimation tasks. Its success lies in effectively leveraging shared representations within a diffusion-based framework, paving the way for more comprehensive and adaptable models in the future.