MoMask: Generative Masked Modeling of 3D Human Motions (2312.00063v1)

Published 29 Nov 2023 in cs.CV

Abstract: We introduce MoMask, a novel masked modeling framework for text-driven 3D human motion generation. In MoMask, a hierarchical quantization scheme is employed to represent human motion as multi-layer discrete motion tokens with high-fidelity details. Starting at the base layer, with a sequence of motion tokens obtained by vector quantization, the residual tokens of increasing orders are derived and stored at the subsequent layers of the hierarchy. This is consequently followed by two distinct bidirectional transformers. For the base-layer motion tokens, a Masked Transformer is designated to predict randomly masked motion tokens conditioned on text input at training stage. During generation (i.e. inference) stage, starting from an empty sequence, our Masked Transformer iteratively fills up the missing tokens; Subsequently, a Residual Transformer learns to progressively predict the next-layer tokens based on the results from current layer. Extensive experiments demonstrate that MoMask outperforms the state-of-art methods on the text-to-motion generation task, with an FID of 0.045 (vs e.g. 0.141 of T2M-GPT) on the HumanML3D dataset, and 0.228 (vs 0.514) on KIT-ML, respectively. MoMask can also be seamlessly applied in related tasks without further model fine-tuning, such as text-guided temporal inpainting.


Summary

  • The paper introduces a novel framework using hierarchical RVQ to iteratively refine motion tokens and reduce quantization errors.
  • The paper employs a Masked Transformer mechanism to predict missing tokens from text, enabling semantically aligned motion synthesis.
  • The approach achieves state-of-the-art FID scores on HumanML3D and KIT-ML, demonstrating its effectiveness in generating diverse 3D human motions.

Overview of "MoMask: Generative Masked Modeling of 3D Human Motions"

This paper presents MoMask, a masked modeling framework designed to generate 3D human motions from textual descriptions, a task gaining traction in applications such as virtual reality and digital content creation. The authors identify two key weaknesses of existing methods: the reconstruction errors introduced by single-layer vector quantization and the limitations of unidirectional (autoregressive) decoding. MoMask addresses both by combining hierarchical residual vector quantization (RVQ) with bidirectional masked transformers, improving the fidelity and diversity of generated motions.

Key Contributions

MoMask distinguishes itself through several novel components:

  1. Hierarchical Residual Vector Quantization: By organizing motion tokens hierarchically, MoMask addresses the quantization errors typical of single-layer quantization. The RVQ model iteratively refines the discrete representation by quantizing the residuals of the encoded motion sequence, producing multiple token layers in which each subsequent layer captures finer motion detail (see the first sketch after this list).
  2. Masked Transformer Mechanism: MoMask employs a bidirectional Masked Transformer inspired by BERT. During training, base-layer motion tokens are randomly masked and the transformer predicts them conditioned on the textual description; at inference, generation starts from a fully masked sequence and the missing tokens are filled in iteratively (see the second sketch after this list).
  3. Residual Transformer: A second transformer generates the residual-layer tokens once the base-layer sequence is complete, progressively predicting each layer's tokens from the results of the layers below it, so that fine motion detail is recovered in a small, fixed number of additional passes.
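
To make the residual quantization concrete, the following is a minimal sketch of how an encoded motion sequence could be split into base-layer and residual tokens. The codebook sizes, latent dimension, and helper names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def residual_vector_quantize(latents, codebooks):
    """Residual VQ: quantize a latent sequence layer by layer.

    latents:   (T, D) encoder outputs for T motion segments.
    codebooks: list of (K, D) arrays, one codebook per quantization layer.
    Returns per-layer token indices and the summed reconstruction.
    """
    residual = latents.copy()
    tokens, reconstruction = [], np.zeros_like(latents)
    for codebook in codebooks:
        # Nearest-codeword lookup for the current residual.
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)        # (T,) discrete tokens for this layer
        quantized = codebook[idx]         # (T, D) dequantized codes
        tokens.append(idx)
        reconstruction += quantized       # layers sum to the final approximation
        residual = residual - quantized   # next layer models what is still missing
    return tokens, reconstruction

# Toy usage: 3 layers, 512-entry codebooks, 16-dim latents.
rng = np.random.default_rng(0)
latents = rng.normal(size=(60, 16))
codebooks = [rng.normal(size=(512, 16)) for _ in range(3)]
tokens, recon = residual_vector_quantize(latents, codebooks)
# tokens[0] are the base-layer tokens; tokens[1:] are the residual tokens.
# With trained codebooks, the reconstruction error shrinks layer by layer.
print(len(tokens), np.linalg.norm(latents - recon))
```

The masked generation stage can likewise be sketched as a MaskGIT-style fill-in loop: start from a fully masked token sequence, predict every masked position, keep the most confident predictions, and re-mask the rest according to a shrinking schedule. The cosine schedule and the `predict_logits` stub below are assumptions for illustration; MoMask's exact schedule and sampling details may differ:

```python
import numpy as np

MASK_ID = -1  # sentinel for a masked position (a real implementation uses a codebook index)

def iterative_masked_decode(predict_logits, seq_len, steps=10):
    """MaskGIT-style fill-in with a bidirectional, text-conditioned transformer.

    predict_logits(tokens) -> (T, V) logits; text conditioning is omitted for brevity.
    """
    tokens = np.full(seq_len, MASK_ID)
    for step in range(1, steps + 1):
        logits = predict_logits(tokens)                      # (T, V)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred = probs.argmax(-1)                              # candidate token per position
        conf = probs.max(-1)                                 # prediction confidence
        conf[tokens != MASK_ID] = np.inf                     # never re-mask already fixed tokens
        tokens = np.where(tokens == MASK_ID, pred, tokens)   # fill every masked slot
        # Cosine schedule: the number of re-masked positions shrinks to zero by the last step.
        n_mask = int(np.floor(np.cos(step / steps * np.pi / 2) * seq_len))
        if n_mask > 0:
            remask = np.argsort(conf)[:n_mask]               # least confident positions
            tokens[remask] = MASK_ID
    return tokens

# Toy usage with random logits standing in for the text-conditioned transformer.
rng = np.random.default_rng(0)
dummy = lambda toks: rng.normal(size=(toks.shape[0], 512))
print(iterative_masked_decode(dummy, seq_len=49, steps=10)[:10])
```

Because every pass decodes all positions in parallel, the number of forward passes is fixed by the schedule rather than by the sequence length, which is where the efficiency gain over token-by-token autoregressive decoding comes from.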

Numerical Results and Claims

MoMask delivers strong numerical results against established benchmarks, achieving an FID of 0.045 on the HumanML3D dataset (vs. 0.141 for T2M-GPT) and 0.228 on KIT-ML (vs. 0.514). These results indicate a superior capability to produce semantically aligned and diverse 3D motions relative to prior state-of-the-art methods.
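
For reference, FID here is the standard Fréchet distance between Gaussian fits of real and generated motion features, computed on embeddings from a pretrained motion feature extractor (typically the evaluator networks released with each benchmark; those networks are not reproduced here). A minimal sketch of the metric itself:

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to real and generated feature sets.

    feats_*: (N, D) feature arrays from a pretrained motion feature extractor.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```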

Implications and Future Directions

The implications of MoMask are twofold:

  • Theoretical: the combination of hierarchical RVQ and generative masked modeling offers a reusable recipe for models that require high-fidelity reconstruction and generation through incremental refinement of quantized tokens.
  • Practical: Given its adaptable design and minimal requirement for model fine-tuning in related tasks like motion inpainting, MoMask is particularly well-suited for deployment in text-to-motion applications across various industry domains, including gaming and interactive media.

Looking forward, the integration of MoMask into broader AI systems could lead to more sophisticated interactive content generation solutions. Further research might explore enhancements in model efficiency and adaptability to even more diverse and complex forms of human motion, ultimately expanding the scope of potential applications.
