PhysDiff: Physics-Guided Human Motion Diffusion Model
(2212.02500v3)
Published 5 Dec 2022 in cs.CV, cs.AI, cs.GR, and cs.LG
Abstract: Denoising diffusion models hold great promise for generating diverse and realistic human motions. However, existing motion diffusion models largely disregard the laws of physics in the diffusion process and often generate physically-implausible motions with pronounced artifacts such as floating, foot sliding, and ground penetration. This seriously impacts the quality of generated motions and limits their real-world application. To address this issue, we present a novel physics-guided motion diffusion model (PhysDiff), which incorporates physical constraints into the diffusion process. Specifically, we propose a physics-based motion projection module that uses motion imitation in a physics simulator to project the denoised motion of a diffusion step to a physically-plausible motion. The projected motion is further used in the next diffusion step to guide the denoising diffusion process. Intuitively, the use of physics in our model iteratively pulls the motion toward a physically-plausible space, which cannot be achieved by simple post-processing. Experiments on large-scale human motion datasets show that our approach achieves state-of-the-art motion quality and improves physical plausibility drastically (>78% for all datasets).
The paper introduces PhysDiff, a physics-guided motion diffusion model that embeds physical constraints into the diffusion process to eliminate common motion artifacts.
It utilizes a physics-based motion projection module that employs a motion imitation policy in a simulator to ensure generated motions adhere to physical laws.
Experiments on text-to-motion and action-to-motion tasks show significant improvements, reducing physical errors by up to 94% and enhancing motion quality.
The paper introduces PhysDiff, a physics-guided motion diffusion model designed to generate human motions that adhere to the laws of physics, addressing the common issue of physically implausible motions generated by existing motion diffusion models. These models often produce motions with artifacts like floating, foot sliding, and ground penetration, which limits their applicability in real-world scenarios. PhysDiff incorporates physical constraints into the diffusion process through a physics-based motion projection module. This module projects the denoised motion of a diffusion step onto a physically plausible space using motion imitation within a physics simulator. The projected motion then guides subsequent denoising steps, iteratively refining the motion towards physical realism.
The core idea is to embed physics constraints directly into the diffusion process, rather than applying them as a post-processing step. The authors argue that post-processing can be ineffective because the final denoised kinematic motion may deviate too significantly from physical plausibility to be corrected adequately. By iteratively applying physics and diffusion, PhysDiff maintains proximity to the data distribution while converging towards physically plausible motions.
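As a rough illustration of this idea, the sketch below alternates a standard denoising step with the physics projection inside the sampling loop. The `denoiser`, `physics_project`, and `ddim_step` callables are placeholders for the trained diffusion denoiser, the simulator-based projection, and the sampler update; they illustrate the control flow described in the paper rather than the authors' actual interfaces.

```python
import numpy as np

def physdiff_sample(denoiser, physics_project, ddim_step,
                    sigmas, motion_shape, project_steps, rng=np.random):
    """Sketch of physics-guided diffusion sampling (hypothetical interfaces).

    denoiser(x_t, sigma)        -> denoised motion estimate x_hat
    physics_project(x_hat)      -> physically-plausible motion (simulator imitation)
    ddim_step(x_t, x_hat, s, t) -> sample at the next (lower) noise level
    sigmas                      -> decreasing noise levels sigma_T > ... > sigma_0
    project_steps               -> indices of diffusion steps that apply projection
    """
    x = rng.standard_normal(motion_shape) * sigmas[0]  # start from pure noise
    for i in range(len(sigmas) - 1):
        x_hat = denoiser(x, sigmas[i])                 # kinematic denoising
        if i in project_steps:                         # physics-guided steps only
            x_hat = physics_project(x_hat)             # pull toward plausible space
        x = ddim_step(x, x_hat, sigmas[i + 1], sigmas[i])
    return x
```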
The physics-based motion projection module at the heart of PhysDiff enforces physical constraints through motion imitation in a physics simulator. A motion imitation policy, trained on large-scale motion capture data, controls a character agent within the simulator to mimic a wide range of input motions. This process ensures that the resulting simulated motion adheres to physical laws, eliminating artifacts like floating, foot sliding, and ground penetration.
The authors evaluate PhysDiff on text-to-motion generation and action-to-motion generation tasks. The model's denoiser can be any motion-denoising network; the authors test two state-of-the-art motion diffusion models, MDM (Motion Diffusion Model) and MotionDiffuse, as denoisers within PhysDiff. On the HumanML3D benchmark for text-to-motion generation, PhysDiff demonstrates significant improvements over existing motion diffusion models, reducing physical errors by over 86% while enhancing motion quality by more than 20%, as measured by Fréchet Inception Distance (FID). For action-to-motion generation, PhysDiff achieves substantial reductions in physical error metrics on the HumanAct12 (78% improvement) and UESTC (94% improvement) datasets, along with competitive FID scores.
Further experiments explore various schedules for the physics-based projection, revealing a trade-off between physical plausibility and motion quality as the number of projection steps varies. While increasing projection steps consistently improves physical plausibility, motion quality initially improves but then declines beyond a certain number of steps. This observation suggests a need to balance the number of physics-based projection steps to achieve both high physical plausibility and motion quality. The paper also finds that incorporating the physics-based projection in later diffusion steps yields better performance than applying it in earlier steps. The authors hypothesize that motions from early diffusion steps tend towards the mean motion of the training data, and physics-based projection could inadvertently push the motion away from the data distribution.
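To make the scheduling discussion concrete, here is a small helper that places a fixed budget of projection steps either uniformly over the process or consecutively at the end. The schedule names mirror the paper's comparison, but the exact step indices used by the authors are not reproduced here.

```python
import numpy as np

def projection_schedule(num_diffusion_steps, num_projection_steps, mode="end"):
    """Return the diffusion-step indices at which physics projection runs.

    mode="uniform": spread the projection steps evenly over the whole process.
    mode="end":     place them consecutively at the final (low-noise) steps,
                    which the paper reports works better than early placement.
    """
    if mode == "uniform":
        idx = np.linspace(0, num_diffusion_steps - 1, num_projection_steps)
        return set(idx.round().astype(int).tolist())
    if mode == "end":
        return set(range(num_diffusion_steps - num_projection_steps,
                         num_diffusion_steps))
    raise ValueError(f"unknown mode: {mode}")

# Example: 50 diffusion steps with 4 physics projections placed at the end.
steps = projection_schedule(50, 4, mode="end")   # {46, 47, 48, 49}
```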
The contributions of this work include:
The introduction of PhysDiff, a physics-guided motion diffusion model that integrates the laws of physics into the diffusion process for generating physically plausible motions. The model is designed with a plug-and-play architecture, enabling it to be used with different kinematic diffusion models.
The use of human motion imitation in a physics simulator as a motion projection module to enforce physical constraints.
The demonstration of state-of-the-art performance in motion quality and a significant improvement in physical plausibility on large-scale motion datasets. The analysis provides insights into the schedules and trade-offs involved and demonstrates improvements over physics-based post-processing techniques.
The paper details related work in several key areas:
Denoising Diffusion Models: The authors cite a range of works on score-based denoising diffusion models and their applications in image generation, text-to-speech synthesis, 3D shape generation, machine learning security, and human motion generation. They also mention techniques for conditional generation, such as classifier-free guidance, and methods for solving linear inverse problems by injecting known information into the diffusion process.
Human Motion Generation: The paper reviews early work on deterministic human motion modeling and more recent work using deep generative models like GANs and VAEs to generate motions from various conditions, including past motions, key frames, music, text, and action labels. It also discusses the emergence of motion diffusion models and their state-of-the-art motion generation performance.
Physics-Based Human Motion Modeling: The authors discuss the application of physics-based human motion imitation to learning locomotion skills with deep reinforcement learning (RL). They also mention the use of RL-based motion imitation for user-controllable character animation and physics-based trajectory optimization and motion imitation for 3D human pose estimation.
The PhysDiff method leverages a physics-guided motion diffusion process, incorporating a physics-based motion projection $\mathcal{P}_\pi$ to map motions to a physically plausible space. This projection uses a motion imitation policy $\pi$, trained to control a simulated character to mimic denoised motions $\hat{x}^{1:H}$ within a physics simulator.
The motion diffusion process begins with a data distribution $p_0(x)$ and defines time-dependent distributions $p_t(x_t)$ through the injection of Gaussian noise. The sampling process involves solving the stochastic differential equation (SDE):
$\mathrm{d} x = - (\beta_t + \dot{\sigma}_t)\, \sigma_t \nabla_{x} \log p_t(x)\, \mathrm{d} t + \sqrt{2 \beta_t}\, \sigma_t\, \mathrm{d} \omega_t$
where:
$x$ is the motion sample.
$\beta_t$ controls the amount of stochastic noise injected in the process.
$\sigma_t$ defines a series of noise levels that increase over time.
$\nabla_{x} \log p_t(x)$ is the score function.
$\omega_t$ is the standard Wiener process.
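A minimal Euler–Maruyama discretization of this reverse-time SDE is sketched below, assuming access to a score estimate `score_fn(x, sigma)`. PhysDiff itself uses the DDIM-style update described later, so this is only meant to make the SDE concrete.

```python
import numpy as np

def reverse_sde_step(x, score_fn, sigma, sigma_dot, beta, dt, rng=np.random):
    """One Euler-Maruyama step of
    dx = -(beta_t + sigma_dot_t) * sigma_t * score(x) dt + sqrt(2 beta_t) * sigma_t dW,
    with dt < 0 when integrating backward in time (from high to low noise)."""
    drift = -(beta + sigma_dot) * sigma * score_fn(x, sigma)
    noise = np.sqrt(2.0 * beta) * sigma * np.sqrt(abs(dt)) * rng.standard_normal(x.shape)
    return x + drift * dt + noise
```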
The score function $\nabla_{x_t} \log p_t(x_t)$ recovers the minimum mean squared error (MMSE) estimator of $x$ given $x_t$:
$\hat{x} := \mathbb{E}[x \mid x_t] = x_t + \sigma_t^2 \nabla_{x_t} \log p_t(x_t)$
where:
$\hat{x}$ is a denoised version of $x_t$.
In practice, the score function is approximated by a denoiser network trained with a denoising autoencoder objective, so the denoised estimate $\hat{x}$ can be obtained directly from the network.
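As a sketch of how this works in practice, assuming a network `denoiser(x_t, sigma)` that predicts the clean motion (the paper's exact parameterization may differ), the denoising objective and the resulting score estimate look roughly like this:

```python
import torch

def denoising_loss(denoiser, x0, sigma):
    """Denoising autoencoder objective: predict the clean motion from a noised copy."""
    noise = torch.randn_like(x0)
    x_t = x0 + sigma * noise            # corrupt the clean motion at noise level sigma
    x_hat = denoiser(x_t, sigma)        # network's denoised estimate
    return ((x_hat - x0) ** 2).mean()   # mean squared reconstruction error

def score_from_denoiser(denoiser, x_t, sigma):
    """Rearranged MMSE relation: score = (x_hat - x_t) / sigma^2."""
    x_hat = denoiser(x_t, sigma)
    return (x_hat - x_t) / sigma ** 2
```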
The DDIM sampling algorithm is used to perform a one-step update from time $t$ to time $s$ ($s < t$), drawing $x_s \sim \mathcal{N}(\mu_s, v_s I)$ with
$\mu_s := \hat{x} + \frac{\sqrt{\sigma_s^2 - v_s}}{\sigma_t}\,(x_t - \hat{x})$
where:
$\mu_s$ is the mean.
$v_s$ is the variance.
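A small numerical version of this update is sketched below; $v_s$ is treated as the transition variance, consistent with the description above, and the interface is illustrative rather than the authors' code.

```python
import numpy as np

def ddim_step(x_t, x_hat, sigma_s, sigma_t, v_s, rng=np.random):
    """One DDIM-style update from noise level sigma_t to sigma_s (sigma_s < sigma_t).

    mu_s = x_hat + sqrt(sigma_s^2 - v_s) / sigma_t * (x_t - x_hat)
    x_s ~ N(mu_s, v_s I); setting v_s = 0 gives the deterministic DDIM update.
    """
    mu_s = x_hat + np.sqrt(max(sigma_s ** 2 - v_s, 0.0)) / sigma_t * (x_t - x_hat)
    return mu_s + np.sqrt(v_s) * rng.standard_normal(x_t.shape)
```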
To incorporate physical constraints, the physics-based motion projection $\mathcal{P}_\pi$ maps the denoised motion $\hat{x}^{1:H}$ to a physically-plausible motion $\widetilde{x}^{1:H} = \mathcal{P}_\pi(\hat{x}^{1:H})$, which is then used in the subsequent diffusion step.
The authors also discuss scheduling the physics-based projection, suggesting that it should not be performed when the diffusion noise level is high, as it can push the motion away from the data distribution.
The physics-based motion projection $\mathcal{P}_\pi$ is achieved by learning a motion imitation policy $\pi$ that controls a simulated character to mimic the denoised motion $\hat{x}^{1:H}$ in a physics simulator. This is formulated as a Markov decision process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma)$, where a character agent acts according to a policy $\pi(a_h \mid s_h)$.
The state $s_h$ consists of the character's physical state, the input motion's next pose $\hat{x}_{h+1}$, and a character attribute vector $\psi$. The agent iteratively samples an action $a_h$ from the policy $\pi$, the simulator generates the next state $s_{h+1}$, and the simulated pose $\widetilde{x}_{h+1}$ is extracted from it.
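The projection therefore amounts to rolling the imitation policy out in the simulator one frame at a time and reading the simulated poses back out. The sketch below assumes hypothetical `policy`, `simulator`, and `build_state` interfaces; it illustrates the control flow, not any specific physics engine's API.

```python
def project_motion(x_hat, policy, simulator, build_state, psi):
    """Physics-based motion projection via imitation (hypothetical interfaces).

    x_hat:       denoised kinematic motion, frames 1..H (sequence of poses)
    policy:      imitation policy pi(a_h | s_h), returns an action
    simulator:   physics simulator exposing reset()/step(action) -> physical state
    build_state: assembles s_h from the physical state, next target pose, and psi
    psi:         character attributes (e.g., gender and SMPL shape parameters)
    """
    sim_state = simulator.reset(x_hat[0])                 # start the character at frame 1
    projected = [x_hat[0]]
    for h in range(len(x_hat) - 1):
        s_h = build_state(sim_state, x_hat[h + 1], psi)   # include the next target pose
        a_h = policy(s_h)                                 # sample an action from pi
        sim_state = simulator.step(a_h)                   # advance the physics
        projected.append(sim_state.pose)                  # simulated pose for frame h+1
    return projected
```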
During training, a reward $r_h$ is assigned based on the alignment between the simulated motion $\widetilde{x}^{1:H}$ and the ground-truth motion $x^{1:H}$. Reinforcement learning (RL) is used to learn the policy $\pi$, maximizing the expected discounted return $J(\pi) = \mathbb{E}_\pi\!\left[\sum_h \gamma^h r_h\right]$.
The reward function consists of four sub-rewards:
$r_h = w_p\, r^p_h + w_v\, r^v_h + w_j\, r^j_h + w_q\, r^q_h$
where:
$r^p_h$ is the pose reward.
$r^v_h$ is the velocity reward.
$r^j_h$ is the joint position reward.
$r^q_h$ is the joint rotation reward.
$w_p, w_v, w_j, w_q$ are weighting factors.
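Concretely, each sub-reward measures how closely the simulated frame tracks the reference. The sketch below uses exponentiated negative tracking errors, a common choice in motion-imitation work; the paper's exact error terms and weights are not reproduced here.

```python
import numpy as np

def imitation_reward(sim, ref, w=(0.3, 0.1, 0.3, 0.3), k=(2.0, 0.1, 5.0, 2.0)):
    """Weighted sum of pose, velocity, joint-position, and joint-rotation rewards.

    sim / ref: dicts with matching arrays for one frame, e.g.
               {'pose': ..., 'vel': ..., 'jpos': ..., 'jrot': ...}.
    Each sub-reward is exp(-k * squared tracking error), so it lies in [0, 1].
    """
    keys = ("pose", "vel", "jpos", "jrot")
    total = 0.0
    for w_i, k_i, key in zip(w, k, keys):
        err = np.sum((sim[key] - ref[key]) ** 2)
        total += w_i * np.exp(-k_i * err)
    return total
```

The per-frame rewards are then accumulated into the discounted return $J(\pi)$ that RL maximizes.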
The agent state $s_h$ includes the character's joint angles, joint velocities, and rigid-body positions, rotations, and linear and angular velocities, as well as the difference of the target pose $\hat{x}_{h+1}$ relative to the agent's current pose. The character attribute $\psi$ includes the gender and SMPL (Skinned Multi-Person Linear Model) shape parameters.
The action representation uses target joint angles of proportional derivative (PD) controllers and residual forces. A Gaussian policy $\pi(a_h \mid s_h) = \mathcal{N}(\mu_\theta(s_h), \Sigma)$ is used, where the mean action $\mu_\theta(s_h)$ is output by a multi-layer perceptron (MLP) network.
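A minimal version of such a policy is sketched below: an MLP maps the state to the mean action (PD-controller targets plus residual forces), and a fixed diagonal covariance provides exploration noise. The dimensions and layer sizes are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class GaussianImitationPolicy(nn.Module):
    """pi(a_h | s_h) = N(mu_theta(s_h), Sigma) with a fixed diagonal covariance."""

    def __init__(self, state_dim, action_dim, hidden=(512, 256), log_std=-1.0):
        super().__init__()
        layers, in_dim = [], state_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, action_dim))   # PD targets + residual forces
        self.mean_net = nn.Sequential(*layers)
        self.log_std = nn.Parameter(torch.full((action_dim,), log_std),
                                    requires_grad=False)  # fixed Sigma

    def forward(self, state):
        mean = self.mean_net(state)
        return torch.distributions.Normal(mean, self.log_std.exp())

# Example (illustrative dimensions):
# policy = GaussianImitationPolicy(state_dim=512, action_dim=75)
# action = policy(torch.randn(1, 512)).sample()
```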
The experiments involved two standard human motion generation tasks: text-to-motion and action-to-motion generation. Evaluation metrics included FID, R-Precision, Accuracy, Penetrate, Float, Skate, and Phys-Err.
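For context on the physics-based metrics, the sketch below computes per-frame penetration, float, and skate statistics from foot-joint trajectories. The thresholds and exact definitions are illustrative assumptions, since the paper's precise metric implementations are not reproduced here.

```python
import numpy as np

def physics_metrics(foot_pos, contact_eps=0.005):
    """Rough physics-plausibility statistics from foot-joint positions.

    foot_pos: array of shape (frames, feet, 3); z is height above the ground plane.
    Penetrate: mean depth below the ground when a foot is under it.
    Float:     mean height of the lowest foot joint when it is above the ground.
    Skate:     mean horizontal foot displacement between frames while in contact.
    """
    z = foot_pos[..., 2]
    penetrate = float(np.mean(np.clip(-z, 0.0, None)))
    lowest = z.min(axis=1)                               # lowest foot joint per frame
    floating = float(np.mean(np.clip(lowest, 0.0, None)))
    in_contact = z[:-1] < contact_eps                    # foot near the ground
    horiz_disp = np.linalg.norm(foot_pos[1:, :, :2] - foot_pos[:-1, :, :2], axis=-1)
    skate = float(horiz_disp[in_contact].mean()) if in_contact.any() else 0.0
    return {"Penetrate": penetrate, "Float": floating, "Skate": skate}
```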
The authors compared PhysDiff against state-of-the-art methods on the HumanML3D, HumanAct12, and UESTC datasets. The results demonstrated that PhysDiff achieves state-of-the-art FID and reduces Phys-Err significantly. The authors also analyzed the schedule of the physics-based projection, varying the number and placement of projection steps. They compared different schedules, including uniform, start-end, and end-spaced schedules. The results indicated that it is better to schedule the physics-based projection steps consecutively towards the end of the diffusion process. Finally, PhysDiff was compared against a post-processing baseline, demonstrating that iterative application of diffusion and physics is more effective than post-processing.