GENMO: A Unified Framework for Human Motion Estimation and Generation
The paper introduces GENMO, a Generalist Model for Human Motion that bridges motion estimation and motion generation within a single framework. Human motion modeling has traditionally treated these as distinct tasks requiring specialized models: generation produces diverse, realistic motions from inputs such as text, audio, or keyframes, while estimation reconstructs accurate motion trajectories from observations such as video. GENMO unifies the two by recasting motion estimation as constrained motion generation, where the output must precisely satisfy the observed conditioning signals. This is achieved by combining regression and diffusion mechanisms within one model.
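To make the core idea concrete, here is a minimal sketch of how a single conditional denoiser could serve both regimes. The DDIM-style sampler, the x0-predicting parameterization, and the condition dictionary are illustrative assumptions, not GENMO's actual implementation.

```python
import torch

def ddim_step(model, x_t, t, cond, alphas_bar):
    """One deterministic DDIM-style reverse step. `model(x_t, t, cond)`
    is assumed to predict the clean motion x0; this parameterization is
    an assumption for illustration, not GENMO's exact formulation."""
    a_t = alphas_bar[t]
    x0_pred = model(x_t, torch.tensor([t]), cond)
    if t == 0:
        return x0_pred
    # Recover the noise direction implied by the x0 prediction, then
    # move one step along the deterministic sampling trajectory.
    eps = (x_t - a_t.sqrt() * x0_pred) / (1.0 - a_t).sqrt()
    a_prev = alphas_bar[t - 1]
    return a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps

# One sampler, two uses of the same network (hypothetical conditions):
#   cond = {"text": text_feats}    -> free generation: diverse motions
#   cond = {"video": video_feats}  -> estimation: the condition pins the
#                                     output to the observed motion
```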
Architectural Overview
The architecture of GENMO is rooted in diffusion models and incorporates a novel dual-mode training paradigm with an estimation mode and a generation mode. In the estimation mode, the model is trained with a maximum-likelihood (regression) objective that prioritizes accurate reproduction of observed motions, keeping outputs tightly aligned with the input video and reflecting the deterministic nature of estimation. In the generation mode, a standard diffusion objective is used, enabling diverse motion outputs conditioned on abstract inputs such as text or music. The key innovation lies in balancing these distinct objectives so that knowledge transfers across tasks through shared representations.
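To illustrate how the two objectives might coexist in one network, below is a minimal sketch of a dual-mode training loss. It assumes an x0-predicting denoiser, a linear noise schedule, and an estimation mode that treats the network as a one-step regressor from pure noise; all of these are stand-in assumptions for illustration, not GENMO's published formulation.

```python
import torch
import torch.nn.functional as F

def dual_mode_loss(model, x0, cond, mode, T=1000):
    """Sketch of a dual-mode objective. `model(x_t, t, cond)` is assumed
    to predict the clean motion x0; schedule and parameterization are
    stand-ins, not GENMO's."""
    B = x0.size(0)
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    if mode == "generation":
        # Standard diffusion objective: denoise a randomly noised sample.
        t = torch.randint(0, T, (B,))
        a = alphas_bar[t].view(B, 1, 1)
        noise = torch.randn_like(x0)
        x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
        return F.mse_loss(model(x_t, t, cond), x0)

    # Estimation mode: deterministic regression. The network maps noise
    # at the final step straight to the motion implied by the condition.
    t = torch.full((B,), T - 1)
    x_T = torch.randn_like(x0)
    return F.mse_loss(model(x_T, t, cond), x0)
```

Under a Gaussian noise model, minimizing this MSE in the estimation branch is equivalent to maximum-likelihood training, which matches the deterministic character the paper attributes to estimation.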
GENMO's architecture integrates multiple modalities, handling video, text descriptions, music, and both 2D and 3D keyframes through rotary positional embeddings (RoPE) and a multi-text attention mechanism. This supports variable-length sequence processing and accommodates diverse inputs without complex post-processing. These architectural choices allow multimodal conditions to be combined seamlessly, giving fine-grained control over the generated motion sequences.
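As a concrete illustration of why rotary embeddings suit variable-length processing, here is a minimal generic RoPE implementation (the split-half variant). GENMO applies RoPE inside its attention layers; the exact variant and where it is applied (typically to queries and keys) may differ from this sketch.

```python
import torch

def apply_rope(x, base=10000.0):
    """Minimal rotary positional embedding (RoPE), split-half variant.
    x: (B, T, D) with even D. Positions are encoded by rotating feature
    pairs rather than by learned per-index embeddings, so the same
    weights apply to any sequence length T."""
    B, T, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(0, half).float() / half)      # (D/2,)
    angles = torch.arange(T).float()[:, None] * freqs[None, :]   # (T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

Because position enters only through these rotations, attention depends on relative offsets between tokens, which helps a single model handle input sequences of varying length without fixed-size padding.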
Evaluation and Results
Extensive empirical evaluation demonstrates GENMO's ability to handle multiple human motion tasks within a single model, achieving state-of-the-art performance across diverse benchmarks. In global motion estimation, GENMO outperforms specialized models, producing more plausible estimates by exploiting its generative prior, particularly under occlusion and in dynamic scenes. In local motion estimation, it compares favorably with leading models, remaining robust and producing fewer artifacts even under challenging conditions.
For motion generation, GENMO excels at music-to-dance synthesis and text-conditioned motion generation, showing improved diversity and physical plausibility. In particular, the dual-mode training paradigm and the estimation-guided objective contribute significantly to generation quality beyond earlier diffusion models.
Implications and Future Directions
GENMO's unified framework opens promising avenues in human motion modeling, blending accurate estimation with expressive generation. The work underscores a shift toward models that handle multimodal inputs in a generalized way, a capability needed in animation, virtual reality, and autonomous systems where realistic human motion is essential.
However, there are limitations and avenues for further work. While GENMO successfully unifies motion estimation and generation, extending it to facial expressions and hand articulation remains an open challenge. Its reliance on external SLAM methods for camera parameters also suggests an opportunity to fold camera estimation into GENMO itself, broadening its applicability across environments without a separate preprocessing stage.
Overall, GENMO represents a significant advance in unified human motion modeling, offering a versatile approach that combines the strengths of motion generation and estimation. Its success lies in effectively sharing representations within a diffusion-based framework, paving the way for more comprehensive and adaptable models.