
Seamless Human Motion Composition with Blended Positional Encodings (2402.15509v1)

Published 23 Feb 2024 in cs.CV

Abstract: Conditional human motion generation is an important topic with many applications in virtual reality, gaming, and robotics. While prior works have focused on generating motion guided by text, music, or scenes, these typically result in isolated motions confined to short durations. Instead, we address the generation of long, continuous sequences guided by a series of varying textual descriptions. In this context, we introduce FlowMDM, the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without any postprocessing or redundant denoising steps. For this, we introduce the Blended Positional Encodings, a technique that leverages both absolute and relative positional encodings in the denoising chain. More specifically, global motion coherence is recovered at the absolute stage, whereas smooth and realistic transitions are built at the relative stage. As a result, we achieve state-of-the-art results in terms of accuracy, realism, and smoothness on the Babel and HumanML3D datasets. FlowMDM excels when trained with only a single description per motion sequence thanks to its Pose-Centric Cross-ATtention, which makes it robust against varying text descriptions at inference time. Finally, to address the limitations of existing HMC metrics, we propose two new metrics: the Peak Jerk and the Area Under the Jerk, to detect abrupt transitions.

Summary

  • The paper introduces FlowMDM, a model that uses blended positional encodings to enable seamless integration of human motion sequences.
  • It employs pose-centric cross-attention to handle varying textual descriptions and maintain consistent motion quality.
  • New metrics like Peak Jerk and Area Under the Jerk provide a detailed evaluation of motion smoothness and realism.

Enhancing Human Motion Generation with FlowMDM: A Deep Dive into Seamless Transitions and Realism

Introduction to FlowMDM

In human motion generation, with applications spanning virtual reality, gaming, and robotics, producing long, continuous motion guided by a series of varying textual descriptions remains a significant challenge: prior approaches typically yield isolated motions confined to short durations. FlowMDM tackles this challenge directly. It is the first diffusion-based model to generate seamless human motion compositions without the post-processing or redundant denoising steps on which previous methods rely.

Key Contributions

FlowMDM introduces a set of novel concepts that refine human motion composition (HMC). At its core, the model uses Blended Positional Encodings (BPE), a technique that leverages both absolute and relative positional encodings along the denoising chain: global motion coherence is recovered at the absolute stage, while smooth, realistic transitions between actions are built at the relative stage. On the Babel and HumanML3D datasets, FlowMDM achieves state-of-the-art accuracy, realism, and smoothness.
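
This summary does not pin down the exact schedule, but a natural reading is that absolute encodings dominate the early, high-noise denoising steps and relative encodings the later ones. The minimal PyTorch sketch below illustrates that idea with a hard switch; `BlendedPELayer`, `switch_frac`, and the learned relative bias are illustrative assumptions, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(length: int, dim: int) -> torch.Tensor:
    """Standard absolute sinusoidal positional encoding, shape (length, dim)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class BlendedPELayer(nn.Module):
    """Toy self-attention layer whose positional scheme depends on the
    denoising step: absolute encodings in the early, high-noise steps
    (global coherence), a relative attention bias in the late steps
    (smooth local transitions). The hard switch is an assumption, not
    the paper's exact schedule."""
    def __init__(self, dim=256, heads=4, max_len=512, switch_frac=0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.register_buffer("abs_pe", sinusoidal_pe(max_len, dim))
        # One learned additive attention bias per relative offset.
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len
        self.switch_frac = switch_frac

    def forward(self, x: torch.Tensor, t: int, T: int) -> torch.Tensor:
        # x: (B, L, D) noisy motion tokens; t: current denoising step out of T.
        L = x.shape[1]
        if t > self.switch_frac * T:
            x = x + self.abs_pe[:L]          # absolute stage: add positional signal
            bias = None
        else:
            # Relative stage: bias attention scores by pairwise frame offset.
            offsets = torch.arange(L).unsqueeze(1) - torch.arange(L).unsqueeze(0)
            bias = self.rel_bias[offsets + self.max_len - 1]   # (L, L) float mask
        out, _ = self.attn(x, x, x, attn_mask=bias)
        return out
```

During sampling, t runs from T down to 1, so the layer begins in the absolute regime (recovering global layout) and finishes in the relative one (refining local smoothness).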

Conventional models trained with a single text description per sequence face a stark domain shift when asked to compose several descriptions at inference. FlowMDM counters this with Pose-Centric Cross-Attention (PCCAT), which makes the model robust to varying text descriptions and keeps motion quality consistent even for description sequences unseen during training.
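
A plausible reading of PCCAT is that each pose (frame) token attends only to the textual condition governing its own frame, rather than to one global condition, so the number and ordering of descriptions never changes what an individual pose sees. The sketch below encodes that reading; the class name, tensor shapes, and masking scheme are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PoseCentricCrossAttention(nn.Module):
    """Sketch of per-frame conditioning: every pose token attends only to the
    text embedding assigned to its own frame, never to neighbors' conditions.
    Names, shapes, and masking are illustrative assumptions."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, poses, text_emb, frame_to_desc):
        # poses: (B, L, D) pose tokens; text_emb: (B, K, D), one row per description;
        # frame_to_desc: (L,) long tensor mapping each frame to its description index.
        L, K = poses.shape[1], text_emb.shape[1]
        mask = torch.full((L, K), float("-inf"))
        mask[torch.arange(L), frame_to_desc] = 0.0   # frame i sees only its own text
        out, _ = self.attn(poses, text_emb, text_emb, attn_mask=mask)
        return poses + out                           # residual conditioning
```

Under this reading, swapping in a new list of descriptions at inference only changes the frame-to-description map, not the attention mechanics, which is consistent with the robustness the paper reports.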

Furthermore, recognizing that existing HMC metrics fail to capture abrupt transitions, the authors propose two new metrics, Peak Jerk and Area Under the Jerk. Both focus on motion smoothness and the detection of sudden discontinuities, giving a more granular assessment of motion quality.
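
Both metrics build on jerk, the third time derivative of position, which spikes at abrupt transitions. A minimal finite-difference sketch follows, assuming joint positions sampled at a fixed frame rate; the paper's exact normalization, joint aggregation, and transition windows are not reproduced here.

```python
import numpy as np

def jerk_profile(positions: np.ndarray, fps: float) -> np.ndarray:
    """Per-frame jerk magnitude from joint positions via finite differences.
    positions: (T, J, 3) joint coordinates. Returns a (T-3,) profile."""
    dt = 1.0 / fps
    vel = np.diff(positions, axis=0) / dt      # velocity      (T-1, J, 3)
    acc = np.diff(vel, axis=0) / dt            # acceleration  (T-2, J, 3)
    jerk = np.diff(acc, axis=0) / dt           # jerk          (T-3, J, 3)
    return np.linalg.norm(jerk, axis=-1).max(axis=-1)  # max over joints per frame

def peak_jerk(profile: np.ndarray) -> float:
    """PJ: the largest jerk value; a spike flags an abrupt transition."""
    return float(profile.max())

def area_under_jerk(profile: np.ndarray, fps: float) -> float:
    """AUJ: area under the jerk curve; sustained roughness accumulates."""
    return float(np.trapz(profile, dx=1.0 / fps))
```

The two are complementary by construction: Peak Jerk flags the single worst discontinuity, while Area Under the Jerk penalizes roughness sustained over time.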

Practical Applications and Theoretical Implications

FlowMDM's advancements have significant implications for both theoretical research and practical applications. By eliminating the need for post-processing and redundant denoising, the model streamlines the generation process, enhancing efficiency and reducing computational overhead. This has direct benefits for real-time applications in VR, gaming, and interactive robotics, where seamless and realistic human motion is crucial.

From a theoretical standpoint, BPE and PCCAT deepen our understanding of how diffusion models can be adapted and optimized for tasks like HMC. The proposed metrics further enrich the researcher's toolkit, enabling finer scrutiny of model performance.

Looking Ahead: Future Directions

While FlowMDM sets a new benchmark in HMC, it also opens avenues for further research. One potential direction is an intention planning module that models relationships between subsequences at the absolute stage, addressing one of FlowMDM's noted limitations. Additionally, extending FlowMDM's techniques to other control signals and datasets could reveal principles that generalize across conditional human motion generation.

Conclusion

FlowMDM represents a significant advance in the generation of seamless human motion compositions. By addressing key challenges with targeted architectural changes and sharper evaluation metrics, it achieves state-of-the-art results and paves the way for future research in the field. Its contributions stand to benefit a wide range of applications, narrowing the gap between synthesized and real human motion.