Enhanced Fine-grained Motion Diffusion for Text-driven Human Motion Synthesis (2305.13773v2)

Published 23 May 2023 in cs.CV

Abstract: The emergence of text-driven motion synthesis techniques offers animators great potential for efficient creation. In most cases, however, textual expressions contain only general, qualitative motion descriptions and lack fine-grained depiction and sufficient intensity, so the synthesized motions are either (a) semantically compliant but uncontrollable at the level of specific pose details, or (b) deviate from the provided descriptions altogether, leaving animators with undesired results. In this paper, we propose DiffKFC, a conditional diffusion model for text-driven motion synthesis with KeyFrames Collaborated, enabling realistic generation with collaborative and efficient dual-level control: coarse guidance at the semantic level, plus only a few keyframes for direct, fine-grained depiction down to the level of body posture. Unlike existing inference-editing diffusion models that incorporate conditions without training, our conditional diffusion model is explicitly trained and can fully exploit correlations among the text, the keyframes, and the diffused target frames. To preserve the control capability of discrete and sparse keyframes, we customize dilated mask attention modules in which only the valid tokens, indicated by the dilated keyframe mask, participate in local-to-global attention. Additionally, we develop a simple yet effective smoothness prior that steers the generated frames toward seamless keyframe transitions at inference. Extensive experiments show that our model not only achieves state-of-the-art semantic fidelity but, more importantly, satisfies animator requirements through fine-grained guidance without tedious labor.
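To make the abstract's two mechanisms concrete, below is a minimal PyTorch sketch of (a) a dilated keyframe mask that restricts attention to valid tokens and (b) an inference-time smoothness prior penalizing acceleration around pinned keyframes. All names here (`dilate_keyframe_mask`, `masked_attention`, `smoothness_guidance`) and design details such as the per-layer dilation radii are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only; not the authors' DiffKFC code.
import torch
import torch.nn.functional as F

def dilate_keyframe_mask(keyframe_idx, seq_len, radius):
    """Mark keyframes as valid, then dilate validity by `radius` frames,
    so deeper layers see progressively more (local-to-global) context."""
    mask = torch.zeros(seq_len, dtype=torch.bool)
    mask[keyframe_idx] = True
    if radius > 0:
        # 1-D max-pool acts as a dilation over the boolean mask.
        m = mask.float().view(1, 1, -1)
        m = F.max_pool1d(m, kernel_size=2 * radius + 1, stride=1, padding=radius)
        mask = m.view(-1).bool()
    return mask

def masked_attention(q, k, v, valid_mask):
    """Scaled dot-product attention where only valid tokens serve as keys."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5        # (T, T)
    scores = scores.masked_fill(~valid_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def smoothness_guidance(x0_hat, keyframe_idx, keyframe_pose, weight=0.1):
    """Inference-time nudge toward seamless keyframe transitions: pin the
    keyframe poses, penalize second-order differences (acceleration), and
    take one gradient step on the denoised estimate."""
    x = x0_hat.detach().requires_grad_(True)
    pinned = x.clone()
    pinned[keyframe_idx] = keyframe_pose                # enforce keyframe poses
    accel = pinned[2:] - 2 * pinned[1:-1] + pinned[:-2]
    loss = (accel ** 2).mean()
    (grad,) = torch.autograd.grad(loss, x)
    return x0_hat - weight * grad

# Toy usage: 60 frames, 32-dim pose tokens, 4 keyframes.
T, D = 60, 32
keyframes = torch.tensor([0, 20, 40, 59])
x = torch.randn(T, D)
for radius in (1, 3, 7):                               # dilation grows with depth
    valid = dilate_keyframe_mask(keyframes, T, radius)
    x = masked_attention(x, x, x, valid.unsqueeze(0))  # broadcast over queries
x = smoothness_guidance(x, keyframes, torch.randn(4, D))
```

Growing the dilation radius with depth is one plausible reading of "local-to-global attention": early layers attend only near keyframes, later layers over most of the sequence. Pinning the keyframes before applying the finite-difference penalty makes the gradient pull neighboring frames toward smooth transitions into the keyframes rather than moving the keyframes themselves.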

