Pay Attention and Move Better: Harnessing Attention for Interactive Motion Generation and Training-free Editing (2410.18977v2)
Abstract: This work addresses the problem of interactive editing in human motion generation. Previous motion diffusion models lack explicit modeling of word-level text-motion correspondence and offer limited explainability, which restricts their fine-grained editing ability. To address this issue, we propose MotionCLR, an attention-based motion diffusion model with CLeaR modeling of attention mechanisms. Technically, MotionCLR models in-modality and cross-modality interactions with self-attention and cross-attention, respectively. Specifically, the self-attention mechanism measures the sequential similarity between frames and influences the ordering of motion features, whereas the cross-attention mechanism finds fine-grained word-sequence correspondences and activates the corresponding timesteps in the motion sequence. Based on these properties, we develop a versatile set of simple yet effective, training-free motion editing methods that manipulate attention maps, such as motion (de-)emphasizing, in-place motion replacement, and example-based motion generation. To further verify the explainability of the attention mechanism, we also explore action counting and grounded motion generation via attention maps. Experimental results show that our method achieves strong generation and editing performance with good explainability.
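To make the attention-map editing idea concrete, here is a minimal PyTorch sketch of cross-attention between motion frames (queries) and text tokens (keys/values), with a per-word re-weighting factor for motion (de-)emphasizing. The function name, tensor shapes, and the row re-normalization step are illustrative assumptions, not the authors' implementation.

```python
import torch

def cross_attention_with_emphasis(q, k, v, word_idx=None, scale=1.0):
    """Cross-attention with optional per-word re-weighting (illustrative sketch).

    q: (T, d) motion-frame queries, one per timestep.
    k, v: (N, d) text-token keys/values, one per word.
    word_idx: index of the word to (de-)emphasize; scale > 1 emphasizes,
              0 < scale < 1 de-emphasizes.
    """
    # Standard scaled dot-product attention map: (T, N).
    attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)
    if word_idx is not None:
        attn = attn.clone()
        attn[:, word_idx] *= scale                 # re-weight the target word's column
        attn = attn / attn.sum(-1, keepdim=True)   # re-normalize each row
    return attn @ v                                # (T, d) attended motion features

# Example (hypothetical shapes): emphasize the 3rd word by 1.5x.
T, N, d = 60, 8, 64
q, k, v = torch.randn(T, d), torch.randn(N, d), torch.randn(N, d)
out = cross_attention_with_emphasis(q, k, v, word_idx=2, scale=1.5)
```

Applied at every denoising step, scaling up the attention column of a verb such as "jump" would presumably strengthen the corresponding motion segment, while scaling it down would suppress it; this mirrors the (de-)emphasizing operation the abstract describes.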