TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts
The paper "TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts" addresses the dual problem of generating 3D human motions from text descriptions and vice versa, replacing the deterministic schemes of earlier work with a stochastic generation process. It tackles two challenges inherent to these tasks: producing diverse motions from the same textual input, and avoiding the degenerate, near-static pose sequences that deterministic models tend to produce.
Methodology
Central to the proposed method is the introduction of motion tokens: discrete, compact representations of 3D human motion. Motion tokens put text and motion on equal footing, so that both modalities become token sequences and the two generation directions reduce to translation between them. Building on neural machine translation models, the mappings are learned as sequence-to-sequence tasks whose decoding is non-deterministic. In addition, the motion2text model is integrated into the text2motion training pipeline, penalizing generated motions whose back-translated captions deviate from the input text and thereby improving the fidelity of the generated output.
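To make the tokenization step concrete, below is a minimal PyTorch sketch of VQ-style quantization of encoder features into discrete motion tokens. The module name, codebook size, and loss weighting are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionQuantizer(nn.Module):
    """Maps continuous motion features to discrete motion tokens (a sketch)."""

    def __init__(self, num_codes: int = 1024, code_dim: int = 512):
        super().__init__()
        # Learnable codebook: each row is the embedding of one motion token.
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z_e: torch.Tensor):
        # z_e: encoder features of shape (batch, time, code_dim).
        w = self.codebook.weight                          # (num_codes, code_dim)
        # Squared L2 distance from every feature vector to every code.
        dist = (z_e.pow(2).sum(-1, keepdim=True)
                - 2.0 * z_e @ w.t()
                + w.pow(2).sum(-1))
        tokens = dist.argmin(dim=-1)                      # (batch, time) token ids
        z_q = self.codebook(tokens)                       # quantized features
        # VQ-VAE losses: pull codes toward features, commit features to codes.
        vq_loss = F.mse_loss(z_q, z_e.detach()) + 0.25 * F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator: gradients skip the discrete lookup.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, tokens, vq_loss
```

A paired decoder (not shown) maps quantized features back to pose sequences, so a motion is fully represented by its token ids.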
Moreover, generation is modeled autoregressively over the discrete motion tokens, which lets the model produce pose sequences of varying lengths and in distinct styles for the same text description. Deep vector quantization is used to learn the token codebook, yielding semantically rich motion representations that improve both the performance and the flexibility of TM2T on the text2motion and motion2text tasks.
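The following hedged sketch shows how such non-deterministic decoding can be realized: motion tokens are sampled one at a time, conditioned on a text embedding, until an end token is drawn. Here `model`, its interface, and the special `end_token` id are hypothetical stand-ins, not the paper's actual code.

```python
import torch

@torch.no_grad()
def sample_motion_tokens(model, text_emb, max_len=200, end_token=1024, temperature=1.0):
    """Autoregressively sample motion tokens until an end token is drawn.

    `model(text_emb, prev_tokens)` is assumed to return logits of shape
    (vocab_size,) for the next token; this interface is illustrative.
    """
    tokens = []
    for _ in range(max_len):
        prev = torch.tensor(tokens, dtype=torch.long).unsqueeze(0)  # (1, t)
        logits = model(text_emb, prev)
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1).item()   # stochastic draw
        if next_tok == end_token:   # the model itself decides the motion length
            break
        tokens.append(next_tok)
    return tokens  # decode to poses with the learned motion-token decoder
```

Sampling from the softmax rather than taking the argmax is what allows repeated queries with the same text to yield distinct motions of different lengths.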
Results
The paper reports notable performance improvements over existing methods in empirical evaluations on two benchmark datasets, HumanML3D and KIT Motion-Language. Evaluation covers linguistic metrics for motion captioning (e.g., BLEU, ROUGE, CIDEr, BERTScore) and multimodal metrics for motion generation (e.g., R-Precision, FID, multimodal distance, diversity). TM2T demonstrates superior capability in both motion captioning and text-driven motion generation.
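For reference, a linguistic metric such as BLEU can be computed on a generated caption with NLTK as below; the sentences are invented for illustration, and the paper's remaining metrics are computed analogously with their respective tools.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a person walks forward and then turns around".split()]
candidate = "a person walks forward then turns back".split()

# Smoothing avoids zero scores when higher-order n-grams have no match.
smooth = SmoothingFunction().method1
print(f"BLEU: {sentence_bleu(reference, candidate, smoothing_function=smooth):.3f}")
```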
Implications and Future Directions
TM2T's findings are significant for improving the fidelity and diversity of machine-generated human motions and language descriptions, particularly in fields that require interaction between humans and robots or virtual agents. The stochastic, token-based formulation is a marked improvement over previous deterministic frameworks and suggests a shift in how reciprocal generation tasks can be approached.
Future work could refine the quality of motion token representations, for instance through more advanced vector quantization techniques that better capture local context. Joint optimization of the two directional mappings could further exploit the reciprocal nature of 3D human motion-language tasks. Finally, handling longer and more complex narrative text inputs, possibly with more sophisticated neural architectures, would open the approach to broader real-world applications.
In summary, TM2T offers a promising framework for the reciprocal generation of human motions and text, making both empirical and conceptual advances over existing constraints and laying groundwork for future exploration and application in AI-driven motion synthesis.