
Pay Attention and Move Better: Harnessing Attention for Interactive Motion Generation and Training-free Editing (2410.18977v2)

Published 24 Oct 2024 in cs.CV

Abstract: This research delves into the problem of interactive editing of human motion generation. Previous motion diffusion models lack explicit modeling of the word-level text-motion correspondence and good explainability, hence restricting their fine-grained editing ability. To address this issue, we propose an attention-based motion diffusion model, namely MotionCLR, with CLeaR modeling of attention mechanisms. Technically, MotionCLR models the in-modality and cross-modality interactions with self-attention and cross-attention, respectively. More specifically, the self-attention mechanism aims to measure the sequential similarity between frames and impacts the order of motion features. By contrast, the cross-attention mechanism works to find the fine-grained word-sequence correspondence and activate the corresponding timesteps in the motion sequence. Based on these key properties, we develop a versatile set of simple yet effective motion editing methods via manipulating attention maps, such as motion (de-)emphasizing, in-place motion replacement, and example-based motion generation, etc. For further verification of the explainability of the attention mechanism, we additionally explore the potential of action-counting and grounded motion generation ability via attention maps. Our experimental results show that our method enjoys good generation and editing ability with good explainability.


Summary

  • The paper introduces MotionCLR, a diffusion-based model that leverages attention mechanisms for improved text-driven motion generation and editing without retraining.
  • It achieves superior R-Precision (0.827 top-3) and FID scores on the HumanML3D dataset, underscoring enhanced text-motion alignment and generation quality.
  • Its attention-map editing techniques, such as motion (de-)emphasizing and sequence shifting, offer flexible, training-free customization of generated motion.

Analysis of MotionCLR: Attention-Based Motion Generation and Editing

The paper "MotionCLR: Motion Generation and Training-free Editing via Understanding Attention Mechanisms" introduces a novel approach to human motion generation along with versatile editing capabilities. This analysis dissects the paper's contributions, technical framework, numerical results, and potential implications on the field of AI-driven animation.

Technical Overview

The authors present MotionCLR, a motion diffusion model built around explicit, interpretable modeling of attention mechanisms, addressing limitations of previous motion diffusion models. Existing models often lack explicit word-level text-motion correspondence, which restricts their fine-grained editing ability. MotionCLR overcomes this with a U-Net-like, attention-based architecture that models in-modality and cross-modality interactions through self-attention and cross-attention, respectively.

  • Self-attention focuses on measuring the similarity between frames, thereby capturing sequence coherence within motion features.
  • Cross-attention establishes fine-grained word-sequence correspondence, activating specific timesteps relevant to motions depicted by textual prompts.

The MotionCLR architecture is built primarily on CLR blocks comprising convolution, self-attention, cross-attention, and feed-forward networks. Each block decouples the text condition from the diffusion-timestep embedding, improving control over text-driven motion generation.
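As a rough illustration of this design, the sketch below outlines how a CLR-style block might be organized in PyTorch: a temporal convolution, frame-level self-attention, word-level cross-attention, and a feed-forward network. The module layout, dimensions, and normalization choices are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CLRBlock(nn.Module):
    """Illustrative CLR-style block: convolution, self-attention over frames,
    cross-attention to word embeddings, and a feed-forward network.
    Layer names and dimensions are assumptions, not the paper's code."""

    def __init__(self, dim: int, text_dim: int, heads: int = 8):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) motion features; words: (batch, tokens, text_dim)
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)   # local temporal mixing
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]  # frame-frame similarity
        h = self.norm2(x)
        x = x + self.cross_attn(h, words, words, need_weights=False)[0]  # word-frame correspondence
        return x + self.ffn(self.norm3(x))
```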

Experimental Results

On the HumanML3D dataset, MotionCLR demonstrates notable improvements on metrics such as R-Precision and FID, indicating superior text-motion alignment and generation quality. For instance, the model achieves a top-3 R-Precision of 0.827, surpassing existing frameworks such as MoMask and MotionDiffuse. These metrics underscore the model's efficacy in generating realistic motion synchronized with textual descriptions.

MotionCLR also excels in motion diversity and multi-modality assessments, further reinforcing its capability to produce varied yet coherent motion sequences from identical text prompts. These outcomes suggest that MotionCLR sets a strong benchmark for generating nuanced human motion with fine-grained detail, a result directly attributable to its use of self- and cross-attention.
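For context, top-k R-Precision on HumanML3D is conventionally computed by checking whether a motion's ground-truth caption ranks in the top k among a pool of candidate captions (typically 32) under a learned joint embedding. The sketch below assumes precomputed, L2-normalized motion and text embeddings and follows that common protocol; it is illustrative rather than the paper's exact evaluation code.

```python
import torch

def r_precision_top_k(motion_emb: torch.Tensor, text_emb: torch.Tensor,
                      k: int = 3, pool: int = 32) -> float:
    """Fraction of motions whose ground-truth caption ranks in the top-k
    among `pool` candidates (itself plus pool-1 mismatched captions).
    Assumes row-aligned, L2-normalized embeddings and len(dataset) >= pool."""
    n = motion_emb.shape[0]
    hits = 0
    for i in range(n):
        # Candidate pool: the true caption plus (pool - 1) random distractors.
        distractors = torch.randperm(n)
        distractors = distractors[distractors != i][: pool - 1]
        candidates = torch.cat([text_emb[i:i + 1], text_emb[distractors]])
        sims = candidates @ motion_emb[i]        # cosine similarity (normalized inputs)
        rank = (sims > sims[0]).sum().item()     # distractors scoring above the true caption
        hits += rank < k
    return hits / n
```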

Innovative Motion Editing Capabilities

Beyond generation, MotionCLR redefines motion editing through training-free techniques such as:

  • Motion Emphasizing/De-emphasizing: Scaling the cross-attention weight of a word such as "jump" strengthens or weakens the corresponding action, adjusting its magnitude based on the textual input (see the sketch after this list).
  • In-place Motion Replacement: By swapping cross-attention maps, one can seamlessly substitute motion sequences without retraining, which is efficient for customizing animations.
  • Motion Sequence Shifting: Permuting the self-attention map reorders segments of the generated sequence, offering flexible temporal rearrangement for creative demands.
  • Example-based Motion Generation and Style Transfer: Attention manipulation produces diverse outputs that share the texture of an example motion, or transfers stylistic elements while preserving the content of a reference motion.
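The sketch below illustrates the general idea behind two of these edits: scaling one word's cross-attention column to (de-)emphasize an action, and shifting the self-attention map to reorder the sequence. The tensor shapes, function names, and the assumption that attention weights can be intercepted and replaced during sampling (e.g. via forward hooks at selected denoising steps) are illustrative, not taken from the paper's code.

```python
import torch

def reweight_word_attention(cross_attn: torch.Tensor, word_index: int,
                            scale: float) -> torch.Tensor:
    """(De-)emphasize an action by scaling one word's cross-attention column.

    cross_attn : (batch, heads, frames, words) weights captured from a
                 cross-attention layer during sampling (illustrative shape).
    word_index : position of the target word (e.g. "jump") in the prompt.
    scale      : >1 emphasizes the action, <1 de-emphasizes it.
    """
    edited = cross_attn.clone()
    edited[..., word_index] = edited[..., word_index] * scale
    # Re-normalize so each frame's weights over words still sum to 1.
    return edited / edited.sum(dim=-1, keepdim=True)

def shift_self_attention(self_attn: torch.Tensor, offset: int) -> torch.Tensor:
    """Motion-sequence shifting: circularly shift the frame-to-frame
    self-attention map so generated segments land at new positions.
    self_attn : (batch, heads, frames, frames)."""
    return torch.roll(self_attn, shifts=(offset, offset), dims=(-2, -1))
```

In practice, which layers and which denoising steps to edit matters; the paper studies those choices, while the snippet above only shows the map-level manipulation itself.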

Future Directions and Implications

The implications of MotionCLR are substantial in fields like animation, games, and virtual reality, where tailored, high-quality, text-driven animations can transform content creation. Future developments could focus on expanding grounded motion generation and on resolving the generative hallucinations noted in the authors' analyses.

Understanding attention mechanisms at this nuanced level and applying them to motion generation and editing unlocks potential for further refinement and expansion of AI capabilities in creative industries. By continuing to address existing limitations and pushing theoretical boundaries, MotionCLR and subsequent iterations may enable even more sophisticated and semantically aware AI systems.

In summary, MotionCLR represents a significant step forward in AI animation, underpinned by attention mechanisms for nuanced text-to-motion translation and a novel suite of editing capabilities.