Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion (2404.11375v1)

Published 17 Apr 2024 in cs.CV and cs.MM

Abstract: Human motion understanding is a fundamental task with diverse practical applications, facilitated by the availability of large-scale motion capture datasets. Recent studies focus on text-motion tasks, such as text-based motion generation, editing, and question answering. In this study, we introduce the novel task of text-based human motion grounding (THMG), aimed at precisely localizing temporal segments corresponding to given textual descriptions within untrimmed motion sequences. Capturing global temporal information is crucial for the THMG task. However, transformer-based models that rely on global temporal self-attention face challenges when handling long untrimmed sequences due to the quadratic computational cost. We address these challenges by proposing Text-controlled Motion Mamba (TM-Mamba), a unified model that integrates temporal global context, language query control, and spatial graph topology with only linear memory cost. The core of the model is a text-controlled selection mechanism which dynamically incorporates global temporal information based on the text query. The model is further enhanced to be topology-aware through the integration of relational embeddings. For evaluation, we introduce BABEL-Grounding, the first text-motion dataset that provides detailed textual descriptions of human actions along with their corresponding temporal segments. Extensive evaluations demonstrate the effectiveness of TM-Mamba on BABEL-Grounding.
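To make the text-controlled selection idea concrete, below is a minimal PyTorch sketch of a selective state-space recurrence whose input-dependent parameters (delta, B, C) are additionally conditioned on a pooled text-query embedding. This illustrates the general mechanism only and is not the authors' implementation: all names, shapes, and the simple concatenation-based fusion (TextControlledSelectiveSSM, d_text, proj_delta, etc.) are assumptions, and the sequential loop stands in for Mamba's optimized parallel scan.

```python
# Hypothetical sketch of a text-conditioned selective SSM (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextControlledSelectiveSSM(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16, d_text: int = 512):
        super().__init__()
        self.d_state = d_state
        # Diagonal state matrix A, log-parameterized for stability (d_model, d_state).
        self.A_log = nn.Parameter(
            torch.log(torch.arange(1, d_state + 1).float()).repeat(d_model, 1)
        )
        # Selection projections: unlike plain Mamba, they see motion + text features,
        # so the text query gates what global temporal context is accumulated.
        self.proj_delta = nn.Linear(d_model + d_text, d_model)
        self.proj_B = nn.Linear(d_model + d_text, d_state)
        self.proj_C = nn.Linear(d_model + d_text, d_state)
        self.D = nn.Parameter(torch.ones(d_model))  # skip connection

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x:    (batch, seq_len, d_model) per-frame motion features
        # text: (batch, d_text) pooled text-query embedding
        b, l, d = x.shape
        t = text.unsqueeze(1).expand(b, l, -1)
        xt = torch.cat([x, t], dim=-1)            # fuse motion + text per timestep
        delta = F.softplus(self.proj_delta(xt))   # (b, l, d), positive step sizes
        B = self.proj_B(xt)                       # (b, l, n)
        C = self.proj_C(xt)                       # (b, l, n)
        A = -torch.exp(self.A_log)                # (d, n), negative => stable decay
        # Run the discretized recurrence sequentially: O(L) time, O(1) state memory.
        h = x.new_zeros(b, d, self.d_state)
        ys = []
        for i in range(l):
            dA = torch.exp(delta[:, i].unsqueeze(-1) * A)           # (b, d, n)
            dB = delta[:, i].unsqueeze(-1) * B[:, i].unsqueeze(1)   # (b, d, n)
            h = dA * h + dB * x[:, i].unsqueeze(-1)                 # update state
            ys.append((h * C[:, i].unsqueeze(1)).sum(-1))           # readout (b, d)
        return torch.stack(ys, dim=1) + x * self.D

# Example: 2 untrimmed sequences of 300 frames with 128-dim motion features.
ssm = TextControlledSelectiveSSM(d_model=128)
y = ssm(torch.randn(2, 300, 128), torch.randn(2, 512))
print(y.shape)  # torch.Size([2, 300, 128])
```

In the full model, a frame-level grounding head would then score each timestep of these text-conditioned features against the query to localize the temporal segment; the sketch stops at the conditioned sequence features and omits the relational (topology-aware) embeddings.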

Authors (3)
  1. Xinghan Wang
  2. Zixi Kang
  3. Yadong Mu
Citations (6)
