
MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model (2208.15001v1)

Published 31 Aug 2022 in cs.CV

Abstract: Human motion modeling is important for many modern graphics applications, which typically require professional skills. In order to remove the skill barriers for laymen, recent motion generation methods can directly generate human motions conditioned on natural languages. However, it remains challenging to achieve diverse and fine-grained motion generation with various text inputs. To address this problem, we propose MotionDiffuse, the first diffusion model-based text-driven motion generation framework, which demonstrates several desired properties over existing methods. 1) Probabilistic Mapping. Instead of a deterministic language-motion mapping, MotionDiffuse generates motions through a series of denoising steps in which variations are injected. 2) Realistic Synthesis. MotionDiffuse excels at modeling complicated data distribution and generating vivid motion sequences. 3) Multi-Level Manipulation. MotionDiffuse responds to fine-grained instructions on body parts, and arbitrary-length motion synthesis with time-varied text prompts. Our experiments show MotionDiffuse outperforms existing SoTA methods by convincing margins on text-driven motion generation and action-conditioned motion generation. A qualitative analysis further demonstrates MotionDiffuse's controllability for comprehensive motion generation. Homepage: https://mingyuan-zhang.github.io/projects/MotionDiffuse.html

Citations (424)

Summary

  • The paper introduces MotionDiffuse, a diffusion model framework that enhances text-driven human motion generation through probabilistic mapping and fine-grained control.
  • It employs a cross-modality linear transformer and noise interpolation to achieve independent control over body parts and synthesize arbitrary-length motion sequences.
  • Experimental results on multiple benchmarks demonstrate superior precision, lower FID, and higher diversity compared to state-of-the-art methods.

The paper introduces MotionDiffuse, a diffusion model-based framework for generating high-fidelity and controllable human motions from text descriptions. MotionDiffuse leverages a Denoising Diffusion Probabilistic Model (DDPM) to map text to motion through a series of denoising steps. The framework incorporates a cross-modality linear transformer to synthesize motions of arbitrary length conditioned on text prompts. The core idea is to softly guide the generation pipeline with the input text, which increases the diversity of the generated results, rather than to learn a direct, deterministic mapping between the text and motion spaces.

The paper addresses the limitations of existing motion generation methods, which often struggle to achieve diverse and fine-grained motion generation with various text inputs. MotionDiffuse aims to tackle these challenges through three key properties: probabilistic mapping, realistic synthesis, and multi-level manipulation.

The methodology incorporates DDPM into motion generation, uses a cross-modality linear transformer for motion synthesis of arbitrary length, and softly guides the generation pipeline with the input text to increase diversity. To preserve uncertainty in the denoising process, the noise terms at each denoising step are processed by several transformer decoder layers conditioned on the input text.
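A minimal sketch of such a text-conditioned denoising loop is shown below, assuming standard DDPM ancestral sampling with precomputed noise schedules; the `model(x_t, t, text_emb)` interface and all names here are illustrative, not the authors' implementation.

```python
import torch

@torch.no_grad()
def sample_motion(model, text_emb, num_frames, pose_dim, betas):
    """Standard DDPM ancestral sampling conditioned on a text embedding.

    `model(x_t, t, text_emb)` is assumed to return the predicted noise
    eps_theta(x_t, t, text) with the same shape as x_t (illustrative interface).
    """
    alphas = 1.0 - betas                        # (T,)
    alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative products \bar{alpha}_t

    x_t = torch.randn(1, num_frames, pose_dim)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = model(x_t, torch.tensor([t]), text_emb)     # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x_t - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            # re-injected noise: this is where the variations of the
            # probabilistic text-to-motion mapping come from
            x_t = mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)
        else:
            x_t = mean
    return x_t  # denoised motion sequence
```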

A key aspect of MotionDiffuse is its ability to achieve body-part-independent control from fine-grained texts. The whole-body motion is divided into several near-independent parts (e.g., upper body and lower body), and a "noise interpolation" technique is proposed to control the different body parts separately while still accounting for their correlations. Additionally, to synthesize arbitrary-length motion sequences, a new sampling method is introduced that denoises several overlapping sub-sequences simultaneously.
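The noise-interpolation idea can be pictured roughly as follows: predict a noise term under each body part's prompt and combine the predictions with binary part masks before taking the denoising step. This is a simplified sketch under the assumption of hard, non-overlapping masks; the smoothing the paper uses to preserve correlations between parts is omitted here.

```python
import torch

def part_interpolated_noise(model, x_t, t, text_embs, part_masks):
    """Combine per-part noise predictions so each body part follows its own prompt.

    x_t        : (1, num_frames, pose_dim) noisy motion at step t
    text_embs  : list of text embeddings, one per body part
    part_masks : list of {0,1} tensors of shape (pose_dim,) selecting the pose
                 dimensions of each part (illustrative, assumed non-overlapping)
    """
    eps = torch.zeros_like(x_t)
    for text_emb, mask in zip(text_embs, part_masks):
        eps_part = model(x_t, t, text_emb)          # noise predicted under this part's prompt
        eps = eps + mask.view(1, 1, -1) * eps_part  # keep only this part's dimensions
    return eps  # replaces the single noise estimate in the denoising step
```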

The architecture of MotionDiffuse consists of a text encoder and a motion decoder. The text encoder uses a classical transformer to extract text features, while the motion decoder employs linear self-attention and cross-attention mechanisms. Efficient Attention is adopted to speed up the self-attention module, reducing the time complexity from $\mathcal{O}(n^2 d)$ to $\mathcal{O}(d d_k n)$, where $n$ is the number of elements in the sequences, $d$ is the dimension of each element in $\mathbf{X}$, and $k$ is the number of self-attention heads. The denoising process uses a neural network $\epsilon_{\theta}(\mathbf{x}_t, t, \text{text})$ to estimate the noise added at each step $t$. The model is trained by minimizing the mean squared error between the predicted noise and the actual noise.
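The complexity reduction comes from applying the softmax to queries and keys separately, so that the small $d_k \times d_v$ context matrix $K^{\top}V$ can be formed before multiplying by $Q$ and the $n \times n$ attention map is never built. A single-head sketch of this efficient-attention trick (simplified relative to the paper's multi-head, cross-modality decoder):

```python
import torch
import torch.nn.functional as F

def efficient_attention(q, k, v):
    """Linear-complexity attention: normalize Q over the feature dimension and
    K over the sequence dimension, then contract K with V first.

    q, k : (batch, n, d_k)   v : (batch, n, d_v)
    Cost is O(n * d_k * d_v) instead of O(n^2 * d).
    """
    q = F.softmax(q, dim=-1)                          # per-query softmax over features
    k = F.softmax(k, dim=1)                           # per-feature softmax over the sequence
    context = torch.einsum('bnk,bnv->bkv', k, v)      # global context, shape (batch, d_k, d_v)
    return torch.einsum('bnk,bkv->bnv', q, context)   # attend each query to the context
```

Because each attention call touches the sequence only linearly, denoising long motion sequences over many diffusion steps remains tractable.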

The paper also explores two types of additional signals: part-aware text controlling and time-varied controlling. Part-aware text controlling assigns a different text condition to each body part, enabling accurate control of every body part. Time-varied controlling divides the whole sequence into several intervals and assigns an independent text condition to each interval, so that arbitrary-length motion sequences incorporating several actions can be synthesized.

The paper presents extensive experimental results on popular benchmarks, including HumanML3D, KIT-ML, HumanAct12, and UESTC datasets. MotionDiffuse demonstrates significant improvements over existing state-of-the-art methods in text-driven motion generation and action-conditioned motion generation tasks. The quantitative results show that MotionDiffuse achieves higher precision, lower Fréchet Inception Distance (FID), lower MultiModal Distance, and higher Diversity compared to other methods. Ablation studies are conducted to evaluate the impact of different components, such as the pretrained CLIP model and the efficient attention mechanism.

For text-driven motion generation, the pose states contain seven parts: $(r^{va}, r^{vx}, r^{vz}, r^{h}, \mathbf{j}^{p}, \mathbf{j}^{v}, \mathbf{j}^{r})$, where $r^{va}, r^{vx}, r^{vz} \in \mathbb{R}$ denote the root joint's angular velocity along the Y-axis and its linear velocities along the X-axis and Z-axis, respectively; $r^{h} \in \mathbb{R}$ is the height of the root joint; $\mathbf{j}^{p}, \mathbf{j}^{v} \in \mathbb{R}^{J \times 3}$ are the position and linear velocity of each joint, where $J$ is the number of joints; and $\mathbf{j}^{r} \in \mathbb{R}^{J \times 6}$ is the 6D rotation of each joint.
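Concretely, each frame's pose state is the concatenation of these blocks into a single vector of length $4 + 3J + 3J + 6J = 4 + 12J$; a small NumPy sketch follows (the packing order mirrors the formula above and is otherwise an assumption, not a dataset file format):

```python
import numpy as np

def pack_pose_state(r_va, r_vx, r_vz, r_h, j_pos, j_vel, j_rot6d):
    """Concatenate the per-frame pose features described above.

    r_va, r_vx, r_vz, r_h : scalars (root angular/linear velocities, root height)
    j_pos, j_vel          : (J, 3) joint positions and linear velocities
    j_rot6d               : (J, 6) joint rotations in the 6D representation
    Returns a vector of length 4 + 3J + 3J + 6J = 4 + 12J.
    """
    root = np.array([r_va, r_vx, r_vz, r_h], dtype=np.float32)
    return np.concatenate([root, j_pos.ravel(), j_vel.ravel(), j_rot6d.ravel()])
```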

For action-conditioned motion generation on the HumanAct12 dataset, each pose state is represented as $(\mathbf{j}^{x}, \mathbf{j}^{y}, \mathbf{j}^{z})$, where $\mathbf{j}^{x}, \mathbf{j}^{y}, \mathbf{j}^{z} \in \mathbb{R}^{24 \times 3}$ are the coordinates of the 24 joints. For the UESTC dataset, the pose representation is $(r^{x}, r^{y}, r^{z}, \mathbf{j}^{r})$, where $r^{x}, r^{y}, r^{z} \in \mathbb{R}$ are the coordinates of the root joint, and $\mathbf{j}^{r} \in \mathbb{R}^{24 \times 6}$ is the rotation of each joint in the 6D representation.

The paper also introduces two task variants: the Spatially-diverse T2M task (T2M-S) and the Temporally-diverse T2M task (T2M-T). T2M-S requires the generated motion sequence to contain multiple actions on different body parts, specified by a set of text-mask pairs $\{(\text{text}_{i,j}, \mathrm{M}_{i,j})\}$, where $\mathrm{M}_{i,j} \in \{0,1\}^{D}$ is a binary vector indicating which body part to focus on. T2M-T expects models to generate a long motion sequence including multiple actions in a specific order over different time intervals, specified by text-duration pairs $\{(\text{text}_{i,j}, [l_{i,j}, r_{i,j}])\}$, where the motion clip from the $l_{i,j}$-th frame to the $r_{i,j}$-th frame contains the action $\text{text}_{i,j}$.
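These two specifications map directly onto simple container types, roughly as sketched below (the field names, joint count, placeholder masks, and example prompts are illustrative assumptions, not the paper's interface):

```python
from dataclasses import dataclass
from typing import List
import numpy as np

J = 22              # example joint count
D = 4 + 12 * J      # length of one pose state, following the formula above

@dataclass
class PartPrompt:       # T2M-S: one text prompt per body part
    text: str
    mask: np.ndarray    # M_{i,j} in {0,1}^D, marks the pose dims this prompt controls

@dataclass
class IntervalPrompt:   # T2M-T: one text prompt per time interval
    text: str
    start_frame: int    # l_{i,j}
    end_frame: int      # r_{i,j}; frames in [start_frame, end_frame] show this action

# Placeholder masks: in practice the 1-entries would cover the pose dimensions
# of the upper-body / lower-body joints in the chosen skeleton.
upper_body_mask = np.zeros(D, dtype=np.uint8)
lower_body_mask = np.zeros(D, dtype=np.uint8)

spatial_spec: List[PartPrompt] = [
    PartPrompt("wave the right arm", upper_body_mask),
    PartPrompt("walk forward", lower_body_mask),
]
temporal_spec: List[IntervalPrompt] = [
    IntervalPrompt("walk forward", 0, 119),
    IntervalPrompt("sit down on a chair", 120, 199),
]
```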
