The paper introduces the Motion Diffusion Model (MDM), a diffusion-based generative model tailored for human motion generation. MDM addresses the core challenges of this domain, such as the wide variety of plausible motions, human perceptual sensitivity to motion artifacts, and the difficulty of describing motion accurately. The model is designed to be lightweight and controllable, leveraging the many-to-many expressiveness of diffusion models without their typical resource demands.
MDM is transformer-based, foregoing the U-Net backbone typical of image diffusion models. The model predicts the clean sample x̂_0 rather than the noise at each diffusion step, which makes it straightforward to apply geometric losses to joint locations and velocities, including a foot contact loss. The framework supports several conditioning modes: text-to-motion, action-to-motion, and unconditioned generation.
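A minimal sketch of what this sample-prediction training step could look like in PyTorch; the denoiser G, the batch layout [batch, frames, pose features], and the noise-schedule tensor alphas_cumprod are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def diffusion_training_step(G, x0, cond, alphas_cumprod):
    """One training step of a sample-predicting diffusion model (sketch).

    x0:             clean motion batch, shape [B, num_frames, pose_dim] (assumed layout)
    cond:           conditioning code c (e.g., a text or action embedding), shape [B, cond_dim]
    alphas_cumprod: cumulative noise-schedule products, shape [T]
    """
    B, T = x0.shape[0], alphas_cumprod.shape[0]

    # Sample a diffusion step t and noise the clean sample:
    #   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    t = torch.randint(0, T, (B,), device=x0.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

    # The network predicts the clean signal x̂_0 = G(x_t, t, c), not the noise.
    x0_hat = G(x_t, t, cond)

    # "Simple" reconstruction objective on the signal itself; because the
    # prediction lives in pose space, geometric losses can be added on top.
    loss_simple = ((x0 - x0_hat) ** 2).mean()
    return loss_simple, x0_hat
```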
Key aspects of the methodology include:
- Diffusion Framework: Motion synthesis models the distribution p(x_0 | c) as the reverse diffusion process that gradually denoises x_T. The model predicts the signal itself, i.e., x̂_0 = G(x_t, t, c).
- Geometric Losses: The model is regularized with geometric losses that enforce physical plausibility and prevent artifacts such as foot sliding. These losses include a position loss L_pos, a foot contact loss L_foot, and a velocity loss L_vel (see the sketch after this list).
- Model Architecture: The model uses an encoder-only transformer. The noise time-step t and the condition code c are projected to the transformer dimension by separate feed-forward networks and summed to yield the token z_tk. Each frame of the noised input x_t is linearly projected into the transformer dimension and summed with a standard positional embedding (see the architecture sketch after this list).
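The geometric terms named above can be sketched as follows; the forward-kinematics callable fk, the foot-joint indices, and the binary foot-contact mask are assumed interfaces standing in for the paper's exact definitions.

```python
import torch

def geometric_losses(x0, x0_hat, foot_contact, fk, foot_joints):
    """Sketch of the position, foot-contact, and velocity losses.

    x0, x0_hat:   ground-truth / predicted motion, [B, N, pose_dim]
    foot_contact: binary mask, [B, N-1, len(foot_joints)], 1 = foot planted
    fk:           forward kinematics mapping poses to joint positions [B, N, J, 3]
    foot_joints:  indices of the heel/toe joints used for the contact term
    """
    joints_gt = fk(x0)        # [B, N, J, 3]
    joints_pred = fk(x0_hat)  # [B, N, J, 3]

    # L_pos: joint positions should match after forward kinematics.
    l_pos = ((joints_gt - joints_pred) ** 2).mean()

    # L_vel: frame-to-frame velocities should match.
    vel_gt = joints_gt[:, 1:] - joints_gt[:, :-1]
    vel_pred = joints_pred[:, 1:] - joints_pred[:, :-1]
    l_vel = ((vel_gt - vel_pred) ** 2).mean()

    # L_foot: when a foot is labeled as in contact, its predicted velocity
    # should be (near) zero, which mitigates foot-sliding artifacts.
    foot_vel_pred = vel_pred[:, :, foot_joints, :]               # [B, N-1, F, 3]
    l_foot = ((foot_vel_pred * foot_contact.unsqueeze(-1)) ** 2).mean()

    return l_pos, l_vel, l_foot
```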
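And a compact sketch of how such an encoder-only denoiser could be wired; the hidden size, the time-step projection, and the module layout are illustrative assumptions rather than the paper's exact hyper-parameters.

```python
import math
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Encoder-only transformer G(x_t, t, c) that predicts the clean motion (sketch)."""

    def __init__(self, pose_dim, cond_dim, d_model=512, n_heads=4, n_layers=8):
        super().__init__()
        self.input_proj = nn.Linear(pose_dim, d_model)   # per-frame projection
        self.t_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                     nn.Linear(d_model, d_model))
        self.c_embed = nn.Linear(cond_dim, d_model)      # e.g., a text/action feature
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.output_proj = nn.Linear(d_model, pose_dim)  # back to pose space

    @staticmethod
    def positional_embedding(n, d, device):
        # Standard sinusoidal positional embedding, shape [1, n, d].
        pos = torch.arange(n, device=device).unsqueeze(1)
        div = torch.exp(torch.arange(0, d, 2, device=device) * (-math.log(10000.0) / d))
        pe = torch.zeros(n, d, device=device)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe.unsqueeze(0)

    def forward(self, x_t, t, c):
        # x_t: [B, N, pose_dim], t: [B], c: [B, cond_dim]
        B, N, _ = x_t.shape
        d = self.input_proj.out_features

        # z_tk: the time step and the condition are projected separately and summed.
        z_tk = self.t_embed(t.float().unsqueeze(-1)) + self.c_embed(c)   # [B, d]

        # Each noised frame is projected and summed with a positional embedding.
        frames = self.input_proj(x_t) + self.positional_embedding(N, d, x_t.device)

        # Prepend z_tk as an extra token and run the encoder.
        tokens = torch.cat([z_tk.unsqueeze(1), frames], dim=1)           # [B, N+1, d]
        out = self.encoder(tokens)[:, 1:]                                # drop z_tk

        # Predict the clean sample x̂_0 for every frame.
        return self.output_proj(out)
```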
The model was evaluated on text-to-motion, action-to-motion, and unconditioned motion generation tasks. For text-to-motion, the model was tested on the HumanML3D and KIT datasets, achieving state-of-the-art results in terms of Fréchet Inception Distance (FID), diversity, and multimodality. A user study indicated that human evaluators preferred MDM-generated motions over real motions 42% of the time. For action-to-motion, MDM outperformed existing methods on the HumanAct12 and UESTC datasets. The model also demonstrated capabilities in motion completion and editing by adapting diffusion image-inpainting techniques.
Experiments included:
- Text-to-Motion: Evaluated on HumanML3D and KIT datasets using metrics such as R-precision, FID, multimodal distance, diversity, and multimodality. The model was compared against JL2P, Text2Gesture, and T2M.
- Action-to-Motion: Evaluated on HumanAct12 and UESTC datasets using metrics such as FID, action recognition accuracy, diversity, and multimodality. The model was compared against Action2Motion, ACTOR, and INR.
- Motion Editing: Demonstrated motion in-betweening and body-part editing by adapting diffusion inpainting techniques (see the sketch after this list).
- Unconstrained Synthesis: Evaluated on an unconstrained version of the HumanAct12 dataset using metrics such as FID, Kernel Inception Distance (KID), precision/recall, and multimodality. The model was compared against ACTOR and MoDi.
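As a rough illustration of the inpainting-style editing referenced above, sampling can overwrite the fixed portion of the predicted clean sample at every denoising step; the posterior_sample helper, the mask convention, and the denoiser signature are assumptions rather than the paper's exact code.

```python
import torch

@torch.no_grad()
def edit_motion(G, x_input, mask, cond, T, posterior_sample):
    """Motion in-betweening / body-part editing via diffusion inpainting (sketch).

    x_input: motion holding the content to keep, [B, N, pose_dim]
    mask:    1 where x_input is fixed (prefix/suffix frames for in-betweening,
             or selected joint dimensions for body-part editing), 0 elsewhere
    G:       sample-predicting denoiser, G(x_t, t, c) -> x̂_0
    posterior_sample(x0_hat, x_t, t) -> x_{t-1}: draws the next latent from the
             diffusion posterior given the clean estimate (assumed helper)
    """
    x_t = torch.randn_like(x_input)   # start the reverse process from pure noise
    for step in reversed(range(T)):
        t = torch.full((x_input.shape[0],), step, device=x_input.device)
        x0_hat = G(x_t, t, cond)
        # Overwrite the fixed region of the clean prediction with the input, so
        # the free region is denoised to stay coherent with the fixed part.
        x0_hat = mask * x_input + (1.0 - mask) * x0_hat
        x_t = posterior_sample(x0_hat, x_t, t)
    return x_t
```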
The results indicate that MDM achieves state-of-the-art performance across several motion generation tasks while requiring only about three days of training on a single mid-range GPU. Predicting the clean sample allows the geometric losses to be applied directly, combining the generative power of diffusion models with motion-domain knowledge. A limitation of the diffusion approach is its long inference time, since sampling requires many sequential forward passes through the model.