
AMD: Autoregressive Motion Diffusion (2305.09381v8)

Published 16 May 2023 in cs.MM

Abstract: Human motion generation aims to produce plausible human motion sequences from various conditional inputs, such as text or audio. Although existing methods can generate motion from short prompts and simple motion patterns, they struggle with long prompts or complex motions. The challenges are twofold: 1) the scarcity of motion-capture data paired with long prompts and complex motions, and 2) the high temporal diversity of human motion and the substantial divergence between the distributions of the conditioning modalities and the motion, which leads to a many-to-many mapping problem when generating motion from complex, long texts. In this work, we address these gaps by 1) building HumanLong3D, the first dataset pairing long textual descriptions with complex 3D motions, and 2) proposing an autoregressive motion diffusion model (AMD). Specifically, AMD combines the text prompt at the current timestep with the text prompt and action sequence from the previous timestep as conditioning information to predict the current action sequence in an iterative manner. Furthermore, we present its generalization to X-to-Motion with "No Modality Left Behind", enabling the generation of high-definition, high-fidelity human motions from user-defined modality inputs.
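The autoregressive conditioning described in the abstract can be sketched as a simple generation loop. This is an illustrative reconstruction, not the authors' code: `denoise` stands in for a learned diffusion reverse process, and all names (`amd_generate`, `seg_len`, the string-based conditioning) are hypothetical placeholders for the model's actual conditioning mechanism.

```python
# Hypothetical sketch of AMD's autoregressive generation loop.
# At each segment t, generation is conditioned on the current text prompt
# plus the previous segment's prompt and previously generated motion.

def denoise(noisy_motion, condition, steps=4):
    """Stand-in for a learned diffusion reverse process: repeatedly pulls
    the noisy sample toward a dummy value derived from the condition."""
    target = float(len(condition) % 7)  # toy signal; a real model is learned
    x = noisy_motion
    for _ in range(steps):
        x = [0.5 * v + 0.5 * target for v in x]
    return x

def amd_generate(prompts, seg_len=3):
    """Generate one motion segment per prompt, autoregressively: each
    segment conditions on the previous prompt and previous motion."""
    motions = []
    prev_prompt, prev_motion = "", [0.0] * seg_len
    for prompt in prompts:
        # Conditioning combines current prompt, previous prompt, and
        # previous motion (here crudely serialized into one string).
        condition = "|".join(
            [prev_prompt, prompt, ",".join(f"{v:.2f}" for v in prev_motion)]
        )
        noisy = [1.0] * seg_len  # in practice: Gaussian noise per frame
        segment = denoise(noisy, condition)
        motions.append(segment)
        prev_prompt, prev_motion = prompt, segment
    return motions

segments = amd_generate(["walk forward", "turn left", "sit down"])
```

Each iteration depends on the previous segment's output, which is what lets the model stitch long, multi-clause prompts into a coherent long motion instead of generating the whole sequence in one shot.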

Authors (6)
  1. Bo Han
  2. Hao Peng
  3. Minjing Dong
  4. Yi Ren
  5. Yixuan Shen
  6. Chang Xu
Citations (13)
