The paper introduces the Motion Diffusion Model (MDM), a diffusion-based generative model tailored for human motion generation. MDM addresses the core challenges of this domain, such as the wide variety of plausible motions, human perceptual sensitivity to motion artifacts, and the difficulty of describing motion accurately. The model is designed to be lightweight and controllable, leveraging the many-to-many expressiveness of diffusion models without their typical resource demands.
MDM is transformer-based, foregoing the U-Net backbone typical of image diffusion models. The model predicts the clean sample x̂_0 rather than the noise at each diffusion step, which makes it straightforward to apply geometric losses to joint locations and velocities, including a foot contact loss. The framework supports several conditioning modes: text-to-motion, action-to-motion, and unconditioned generation.
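A minimal sketch of what this sample-prediction training step could look like in PyTorch; the denoiser G, the batch layout [batch, frames, pose features], and the noise-schedule tensor alphas_cumprod are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def diffusion_training_step(G, x0, cond, alphas_cumprod):
    """One training step of a sample-predicting diffusion model (sketch).

    x0:             clean motion batch, shape [B, num_frames, pose_dim] (assumed layout)
    cond:           conditioning code c (e.g., a text or action embedding), shape [B, cond_dim]
    alphas_cumprod: cumulative noise-schedule products, shape [T]
    """
    B, T = x0.shape[0], alphas_cumprod.shape[0]

    # Sample a diffusion step t and noise the clean sample:
    #   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    t = torch.randint(0, T, (B,), device=x0.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

    # The network predicts the clean signal x̂_0 = G(x_t, t, c), not the noise.
    x0_hat = G(x_t, t, cond)

    # "Simple" reconstruction objective on the signal itself; because the
    # prediction lives in pose space, geometric losses can be added on top.
    loss_simple = ((x0 - x0_hat) ** 2).mean()
    return loss_simple, x0_hat
```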
Key aspects of the methodology include:
- Diffusion Framework: Motion synthesis models the distribution p(x_0 | c) as the reverse diffusion process that gradually denoises x_T. The model predicts the signal itself, i.e., x̂_0 = G(x_t, t, c).
- Geometric Losses: The model is regularized with geometric losses that enforce physical plausibility and prevent artifacts such as foot sliding. These losses include a position loss L_pos, a foot contact loss L_foot, and a velocity loss L_vel (see the sketch after this list).
- Model Architecture: The model uses an encoder-only transformer. The noise time-step t and the condition code c are projected to the transformer dimension by separate feed-forward networks and summed to yield the token z_tk. Each frame of the noised input x_t is linearly projected into the transformer dimension and summed with a standard positional embedding (see the architecture sketch after this list).
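The geometric terms named above can be sketched as follows; the forward-kinematics callable fk, the foot-joint indices, and the binary foot-contact mask are assumed interfaces standing in for the paper's exact definitions.

```python
import torch

def geometric_losses(x0, x0_hat, foot_contact, fk, foot_joints):
    """Sketch of the position, foot-contact, and velocity losses.

    x0, x0_hat:   ground-truth / predicted motion, [B, N, pose_dim]
    foot_contact: binary mask, [B, N-1, len(foot_joints)], 1 = foot planted
    fk:           forward kinematics mapping poses to joint positions [B, N, J, 3]
    foot_joints:  indices of the heel/toe joints used for the contact term
    """
    joints_gt = fk(x0)        # [B, N, J, 3]
    joints_pred = fk(x0_hat)  # [B, N, J, 3]

    # L_pos: joint positions should match after forward kinematics.
    l_pos = ((joints_gt - joints_pred) ** 2).mean()

    # L_vel: frame-to-frame velocities should match.
    vel_gt = joints_gt[:, 1:] - joints_gt[:, :-1]
    vel_pred = joints_pred[:, 1:] - joints_pred[:, :-1]
    l_vel = ((vel_gt - vel_pred) ** 2).mean()

    # L_foot: when a foot is labeled as in contact, its predicted velocity
    # should be (near) zero, which mitigates foot-sliding artifacts.
    foot_vel_pred = vel_pred[:, :, foot_joints, :]               # [B, N-1, F, 3]
    l_foot = ((foot_vel_pred * foot_contact.unsqueeze(-1)) ** 2).mean()

    return l_pos, l_vel, l_foot
```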
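And a compact sketch of how such an encoder-only denoiser could be wired; the hidden size, the time-step projection, and the module layout are illustrative assumptions rather than the paper's exact hyper-parameters.

```python
import math
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Encoder-only transformer G(x_t, t, c) that predicts the clean motion (sketch)."""

    def __init__(self, pose_dim, cond_dim, d_model=512, n_heads=4, n_layers=8):
        super().__init__()
        self.input_proj = nn.Linear(pose_dim, d_model)   # per-frame projection
        self.t_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                     nn.Linear(d_model, d_model))
        self.c_embed = nn.Linear(cond_dim, d_model)      # e.g., a text/action feature
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.output_proj = nn.Linear(d_model, pose_dim)  # back to pose space

    @staticmethod
    def positional_embedding(n, d, device):
        # Standard sinusoidal positional embedding, shape [1, n, d].
        pos = torch.arange(n, device=device).unsqueeze(1)
        div = torch.exp(torch.arange(0, d, 2, device=device) * (-math.log(10000.0) / d))
        pe = torch.zeros(n, d, device=device)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe.unsqueeze(0)

    def forward(self, x_t, t, c):
        # x_t: [B, N, pose_dim], t: [B], c: [B, cond_dim]
        B, N, _ = x_t.shape
        d = self.input_proj.out_features

        # z_tk: the time step and the condition are projected separately and summed.
        z_tk = self.t_embed(t.float().unsqueeze(-1)) + self.c_embed(c)   # [B, d]

        # Each noised frame is projected and summed with a positional embedding.
        frames = self.input_proj(x_t) + self.positional_embedding(N, d, x_t.device)

        # Prepend z_tk as an extra token and run the encoder.
        tokens = torch.cat([z_tk.unsqueeze(1), frames], dim=1)           # [B, N+1, d]
        out = self.encoder(tokens)[:, 1:]                                # drop z_tk

        # Predict the clean sample x̂_0 for every frame.
        return self.output_proj(out)
```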
The model was evaluated on text-to-motion, action-to-motion, and unconditioned motion generation tasks. For text-to-motion, the model was tested on the HumanML3D and KIT datasets, achieving state-of-the-art results in terms of Fréchet Inception Distance (FID), diversity, and multimodality. A user study indicated that human evaluators preferred MDM-generated motions over real motions 42% of the time. For action-to-motion, MDM outperformed existing methods on the HumanAct12 and UESTC datasets. The model also demonstrated capabilities in motion completion and editing by adapting diffusion image-inpainting techniques.
Experiments included:
- Text-to-Motion: Evaluated on HumanML3D and KIT datasets using metrics such as R-precision, FID, multimodal distance, diversity, and multimodality. The model was compared against JL2P, Text2Gesture, and T2M.
- Action-to-Motion: Evaluated on HumanAct12 and UESTC datasets using metrics such as FID, action recognition accuracy, diversity, and multimodality. The model was compared against Action2Motion, ACTOR, and INR.
- Motion Editing: Demonstrated motion in-betweening and body-part editing by adapting diffusion inpainting techniques (see the sketch after this list).
- Unconstrained Synthesis: Evaluated on an unconstrained version of the HumanAct12 dataset using metrics such as FID, Kernel Inception Distance (KID), precision/recall, and multimodality. The model was compared against ACTOR and MoDi.
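As a rough illustration of the inpainting-style editing referenced above, sampling can overwrite the fixed portion of the predicted clean sample at every denoising step; the posterior_sample helper, the mask convention, and the denoiser signature are assumptions rather than the paper's exact code.

```python
import torch

@torch.no_grad()
def edit_motion(G, x_input, mask, cond, T, posterior_sample):
    """Motion in-betweening / body-part editing via diffusion inpainting (sketch).

    x_input: motion holding the content to keep, [B, N, pose_dim]
    mask:    1 where x_input is fixed (prefix/suffix frames for in-betweening,
             or selected joint dimensions for body-part editing), 0 elsewhere
    G:       sample-predicting denoiser, G(x_t, t, c) -> x̂_0
    posterior_sample(x0_hat, x_t, t) -> x_{t-1}: draws the next latent from the
             diffusion posterior given the clean estimate (assumed helper)
    """
    x_t = torch.randn_like(x_input)   # start the reverse process from pure noise
    for step in reversed(range(T)):
        t = torch.full((x_input.shape[0],), step, device=x_input.device)
        x0_hat = G(x_t, t, cond)
        # Overwrite the fixed region of the clean prediction with the input, so
        # the free region is denoised to stay coherent with the fixed part.
        x0_hat = mask * x_input + (1.0 - mask) * x0_hat
        x_t = posterior_sample(x0_hat, x_t, t)
    return x_t
```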
The results indicate that MDM achieves state-of-the-art performance across several motion generation tasks while requiring only about three days of training on a single mid-range GPU. Predicting the clean sample allows the geometric losses to be applied directly, combining the generative power of diffusion models with motion-domain knowledge. A limitation of the diffusion approach is its long inference time, since sampling requires many sequential forward passes through the model.