
TEMOS: Generating diverse human motions from textual descriptions (2204.14109v2)

Published 25 Apr 2022 in cs.CV and cs.CL

Abstract: We address the problem of generating diverse 3D human motions from textual descriptions. This challenging task requires joint modeling of both modalities: understanding and extracting useful human-centric information from the text, and then generating plausible and realistic sequences of human poses. In contrast to most previous work which focuses on generating a single, deterministic, motion from a textual description, we design a variational approach that can produce multiple diverse human motions. We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data, in combination with a text encoder that produces distribution parameters compatible with the VAE latent space. We show the TEMOS framework can produce both skeleton-based animations as in prior work, as well more expressive SMPL body motions. We evaluate our approach on the KIT Motion-Language benchmark and, despite being relatively straightforward, demonstrate significant improvements over the state of the art. Code and models are available on our webpage.

An Analysis of TEMOS: Generating Diverse Human Motions from Textual Descriptions

In the field of generative modeling, the synthesis of 3D human motion from textual descriptions presents a particularly intriguing challenge due to the inherent complexities associated with natural language and motion semantics. The paper introduces TEMOS (Text-to-Motions), a generative framework that significantly advances this domain by enabling the production of diverse human motions conditioned on textual input.

Unique Approach and Methodology

TEMOS distinguishes itself from earlier approaches by its capacity to generate a variety of plausible motions for a single textual description rather than a single deterministic outcome. This is achieved through a text-conditioned generative model that leverages a variational autoencoder (VAE). The VAE is trained on human motion data and is paired with a text encoder that outputs distribution parameters compatible with the VAE's latent space. By sampling probabilistically from this latent space instead of regressing a single output, the approach yields multiple diverse animations for the same description.
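To make the sampling mechanism concrete, the following is a minimal sketch, not the authors' implementation, of how a text encoder can produce Gaussian parameters that are sampled with the reparameterization trick and decoded into a pose sequence; all module names, dimensions, and the simple MLP decoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextToMotionVAE(nn.Module):
    """Illustrative sketch: text features -> latent distribution -> sampled motion."""

    def __init__(self, text_dim=768, latent_dim=256, pose_dim=64, num_frames=60):
        super().__init__()
        self.num_frames = num_frames
        self.pose_dim = pose_dim
        # Text encoder head mapping pooled text features to Gaussian parameters.
        self.to_mu = nn.Linear(text_dim, latent_dim)
        self.to_logvar = nn.Linear(text_dim, latent_dim)
        # Motion decoder mapping a latent vector to a pose sequence.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, num_frames * pose_dim),
        )

    def forward(self, text_features):
        mu = self.to_mu(text_features)
        logvar = self.to_logvar(text_features)
        # Reparameterization trick: each call draws a different latent,
        # hence a different plausible motion for the same description.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        motion = self.decoder(z).view(-1, self.num_frames, self.pose_dim)
        return motion, mu, logvar

# Sampling several diverse motions from one placeholder text embedding.
model = TextToMotionVAE()
text_features = torch.randn(1, 768)  # stand-in for a real text encoding
motions = [model(text_features)[0] for _ in range(3)]  # three distinct samples
```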

The paper also strengthens the handling of text-to-motion tasks by employing Transformer models to encode both motion sequences and textual descriptions. This sequence-level modeling allows for deeper contextual encoding and addresses common limitations of earlier autoregressive models, such as temporal drift that collapses outputs into static poses. By encoding motion and text into a joint latent space, the model can reconstruct 3D poses that closely follow the textual semantics.
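A rough sketch of how such a joint latent space can be trained is given below. The reconstruction term, the cross-modal KL alignment between the text and motion distributions, and the priors are simplified stand-ins under assumed shapes, not the paper's exact objective or weighting.

```python
import torch
import torch.nn.functional as F

def kl_normal(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over latent dims."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1).mean()

def joint_space_loss(motion_stats, text_stats, recon_motion, gt_motion, beta=1e-4):
    """Illustrative objective: reconstruct the motion while aligning the two
    modality distributions with each other and with a standard normal prior."""
    mu_m, logvar_m = motion_stats   # from the motion Transformer encoder
    mu_t, logvar_t = text_stats     # from the text Transformer encoder
    zeros = torch.zeros_like(mu_m)  # standard normal prior parameters

    recon = F.smooth_l1_loss(recon_motion, gt_motion)
    # Cross-modal alignment (both directions) plus a prior on each encoder.
    align = kl_normal(mu_t, logvar_t, mu_m, logvar_m) + \
            kl_normal(mu_m, logvar_m, mu_t, logvar_t)
    prior = kl_normal(mu_m, logvar_m, zeros, zeros) + \
            kl_normal(mu_t, logvar_t, zeros, zeros)
    return recon + beta * (align + prior)
```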

Results and Evaluations

In terms of evaluation, the paper reports results on the KIT Motion-Language benchmark, showcasing superior performance over existing methods. TEMOS is shown to significantly reduce the Average Positional Error (APE) of the root joint, demonstrating its efficacy in producing realistic motion trajectories. Importantly, the framework achieves this with a design that avoids complex architectures, making it more efficient to train and deploy. A notable aspect of the evaluations is the model's ability to outperform prior methods in perceptual studies, where human judgments validate the quality of the generated motions.
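For reference, APE can be understood as the mean Euclidean distance between generated and ground-truth joint positions. A minimal sketch of the root-joint variant, with assumed array shapes and the root at index 0, is shown here.

```python
import numpy as np

def root_joint_ape(pred_joints, gt_joints):
    """Average Positional Error of the root joint.

    pred_joints, gt_joints: arrays of shape (frames, joints, 3);
    the root joint is assumed to sit at index 0.
    """
    root_err = np.linalg.norm(pred_joints[:, 0] - gt_joints[:, 0], axis=-1)
    return root_err.mean()

# Example with random placeholder data (60 frames, 21 joints).
pred = np.random.randn(60, 21, 3)
gt = np.random.randn(60, 21, 3)
print(f"root-joint APE: {root_joint_ape(pred, gt):.3f}")
```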

Implications and Future Directions

The implications of this research are particularly relevant for industries reliant on motion synthesis, such as gaming and film production, where the automated and realistic generation of human body movements can lead to significant cost reductions. Moreover, the capacity to explore multiple motion outcomes from a text description broadens usability in applications that demand nuanced motion variations, like virtual reality and robotics.

Looking forward, the research suggests potential explorations in enhancing text-to-motion generation by incorporating domain-specific adaptations like contact dynamics, which would provide richer interactions between generated avatars and their environments. Additionally, the research opens up a pathway for advancing duration estimation and further improving aspects of temporal coherence and motion continuity.

In conclusion, TEMOS introduces a potent method for text-conditioned motion synthesis, shifting the paradigm toward diverse, non-autoregressive, and semantically coherent 3D motion generation. As with any evolving field, future work will likely address limitations such as handling out-of-distribution text and scaling the model to longer sequences efficiently. The approach stands to contribute significantly to the broader discourse on machine learning applications in human-centric AI, potentially leading to more expressive and contextually rich virtual environments.

Authors (3)
  1. Mathis Petrovich (10 papers)
  2. Michael J. Black (163 papers)
  3. Gül Varol (39 papers)
Citations (304)