MoMask: Generative Masked Modeling of 3D Human Motions (2312.00063v1)

Published 29 Nov 2023 in cs.CV

Abstract: We introduce MoMask, a novel masked modeling framework for text-driven 3D human motion generation. In MoMask, a hierarchical quantization scheme is employed to represent human motion as multi-layer discrete motion tokens with high-fidelity details. Starting at the base layer, with a sequence of motion tokens obtained by vector quantization, the residual tokens of increasing orders are derived and stored at the subsequent layers of the hierarchy. This is consequently followed by two distinct bidirectional transformers. For the base-layer motion tokens, a Masked Transformer is designated to predict randomly masked motion tokens conditioned on text input at training stage. During generation (i.e. inference) stage, starting from an empty sequence, our Masked Transformer iteratively fills up the missing tokens; Subsequently, a Residual Transformer learns to progressively predict the next-layer tokens based on the results from current layer. Extensive experiments demonstrate that MoMask outperforms the state-of-art methods on the text-to-motion generation task, with an FID of 0.045 (vs e.g. 0.141 of T2M-GPT) on the HumanML3D dataset, and 0.228 (vs 0.514) on KIT-ML, respectively. MoMask can also be seamlessly applied in related tasks without further model fine-tuning, such as text-guided temporal inpainting.


Summary

  • The paper introduces a novel framework using hierarchical RVQ to iteratively refine motion tokens and reduce quantization errors.
  • The paper employs a Masked Transformer mechanism to predict missing tokens from text, enabling semantically aligned motion synthesis.
  • The approach achieves state-of-the-art FID scores on HumanML3D and KIT-ML, demonstrating its effectiveness in generating diverse 3D human motions.

Overview of "MoMask: Generative Masked Modeling of 3D Human Motions"

This paper presents MoMask, a masked modeling framework designed to generate 3D human motions from textual descriptions, a task gaining traction in applications such as virtual reality and digital content creation. The authors identify two key weaknesses of existing methods: the reconstruction errors introduced by single-layer vector quantization and the limitations of unidirectional (autoregressive) decoding. MoMask addresses both by combining hierarchical residual vector quantization (RVQ) with bidirectional masked transformers, improving the fidelity and diversity of generated motions.

Key Contributions

MoMask distinguishes itself through several novel components:

  1. Hierarchical Residual Vector Quantization: By organizing motion tokens hierarchically, MoMask addresses the quantization errors typical of single-layer quantization. The RVQ model iteratively refines the discrete representation by quantizing the residuals of the encoded motion sequence, producing multiple token layers in which each subsequent layer captures finer motion detail (see the first sketch after this list).
  2. Masked Transformer Mechanism: MoMask employs a bidirectional Masked Transformer inspired by BERT. During training, base-layer motion tokens are randomly masked and the transformer predicts them conditioned on the textual description; at inference, generation starts from a fully masked sequence and the missing tokens are filled in iteratively (see the second sketch after this list).
  3. Residual Transformer: A second transformer generates the residual-layer tokens once the base-layer sequence is complete, progressively predicting each layer's tokens from the results of the layers below it, so that fine motion detail is recovered in a small, fixed number of additional passes.
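
To make the residual quantization concrete, the following is a minimal sketch of how an encoded motion sequence could be split into base-layer and residual tokens. The codebook sizes, latent dimension, and helper names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def residual_vector_quantize(latents, codebooks):
    """Residual VQ: quantize a latent sequence layer by layer.

    latents:   (T, D) encoder outputs for T motion segments.
    codebooks: list of (K, D) arrays, one codebook per quantization layer.
    Returns per-layer token indices and the summed reconstruction.
    """
    residual = latents.copy()
    tokens, reconstruction = [], np.zeros_like(latents)
    for codebook in codebooks:
        # Nearest-codeword lookup for the current residual.
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)        # (T,) discrete tokens for this layer
        quantized = codebook[idx]         # (T, D) dequantized codes
        tokens.append(idx)
        reconstruction += quantized       # layers sum to the final approximation
        residual = residual - quantized   # next layer models what is still missing
    return tokens, reconstruction

# Toy usage: 3 layers, 512-entry codebooks, 16-dim latents.
rng = np.random.default_rng(0)
latents = rng.normal(size=(60, 16))
codebooks = [rng.normal(size=(512, 16)) for _ in range(3)]
tokens, recon = residual_vector_quantize(latents, codebooks)
# tokens[0] are the base-layer tokens; tokens[1:] are the residual tokens.
# With trained codebooks, the reconstruction error shrinks layer by layer.
print(len(tokens), np.linalg.norm(latents - recon))
```

The masked generation stage can likewise be sketched as a MaskGIT-style fill-in loop: start from a fully masked token sequence, predict every masked position, keep the most confident predictions, and re-mask the rest according to a shrinking schedule. The cosine schedule and the `predict_logits` stub below are assumptions for illustration; MoMask's exact schedule and sampling details may differ:

```python
import numpy as np

MASK_ID = -1  # sentinel for a masked position (a real implementation uses a codebook index)

def iterative_masked_decode(predict_logits, seq_len, steps=10):
    """MaskGIT-style fill-in with a bidirectional, text-conditioned transformer.

    predict_logits(tokens) -> (T, V) logits; text conditioning is omitted for brevity.
    """
    tokens = np.full(seq_len, MASK_ID)
    for step in range(1, steps + 1):
        logits = predict_logits(tokens)                      # (T, V)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred = probs.argmax(-1)                              # candidate token per position
        conf = probs.max(-1)                                 # prediction confidence
        conf[tokens != MASK_ID] = np.inf                     # never re-mask already fixed tokens
        tokens = np.where(tokens == MASK_ID, pred, tokens)   # fill every masked slot
        # Cosine schedule: the number of re-masked positions shrinks to zero by the last step.
        n_mask = int(np.floor(np.cos(step / steps * np.pi / 2) * seq_len))
        if n_mask > 0:
            remask = np.argsort(conf)[:n_mask]               # least confident positions
            tokens[remask] = MASK_ID
    return tokens

# Toy usage with random logits standing in for the text-conditioned transformer.
rng = np.random.default_rng(0)
dummy = lambda toks: rng.normal(size=(toks.shape[0], 512))
print(iterative_masked_decode(dummy, seq_len=49, steps=10)[:10])
```

Because every pass decodes all positions in parallel, the number of forward passes is fixed by the schedule rather than by the sequence length, which is where the efficiency gain over token-by-token autoregressive decoding comes from.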

Numerical Results and Claims

MoMask delivers strong numerical results against established benchmarks, achieving an FID of 0.045 on the HumanML3D dataset (vs. 0.141 for T2M-GPT) and 0.228 on KIT-ML (vs. 0.514). These results indicate a superior capability to produce semantically aligned and diverse 3D motions relative to prior state-of-the-art methods.
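
For reference, FID here is the standard Fréchet distance between Gaussian fits of real and generated motion features, computed on embeddings from a pretrained motion feature extractor (typically the evaluator networks released with each benchmark; those networks are not reproduced here). A minimal sketch of the metric itself:

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to real and generated feature sets.

    feats_*: (N, D) feature arrays from a pretrained motion feature extractor.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```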

Implications and Future Directions

The implications of MoMask are twofold:

  • Theoretical: the combination of hierarchical RVQ and generative masked modeling offers a reusable recipe for models that require high-fidelity reconstruction and generation through incremental refinement of quantized tokens.
  • Practical: Given its adaptable design and minimal requirement for model fine-tuning in related tasks like motion inpainting, MoMask is particularly well-suited for deployment in text-to-motion applications across various industry domains, including gaming and interactive media.

Looking forward, the integration of MoMask into broader AI systems could lead to more sophisticated interactive content generation solutions. Further research might explore enhancements in model efficiency and adaptability to even more diverse and complex forms of human motion, ultimately expanding the scope of potential applications.
