An Analysis of Action-Conditioned 3D Human Motion Synthesis with Transformer VAE
The paper "Action-Conditioned 3D Human Motion Synthesis with Transformer VAE" presents an approach for generating realistic and diverse human motion sequences conditioned on categorical action labels and target durations, using a Transformer-based variational autoencoder (VAE). The authors introduce a model named ACTOR, which departs from prior autoregressive methods by encoding an entire sequence into a single latent vector, thus enabling the synthesis of variable-length motions with consistent realism and diversity.
Technical Contribution
The key contribution of the paper is the integration of a Transformer architecture with a VAE framework for 3D human motion synthesis. Unlike traditional methods that must be primed with an initial pose or motion prefix, the proposed model requires no prior motion input, which broadens its applicability in domains such as virtual reality and character animation, where diverse and context-specific motion sequences are desired.
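The core VAE mechanics can be made concrete with a minimal NumPy sketch of the sequence-level reparameterization step: the encoder summarizes a whole motion into one (mu, logvar) pair, and a single latent vector is sampled from it. All names and dimensions here are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample a sequence-level latent z ~ N(mu, sigma^2) with the
    standard VAE reparameterization trick."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)

# Hypothetical encoder outputs: one (mu, logvar) pair per *sequence*,
# not per frame -- the whole motion is compressed into a single vector.
latent_dim = 256
mu = rng.standard_normal(latent_dim)
logvar = np.full(latent_dim, -2.0)  # small, fixed variance for illustration

z = reparameterize(mu, logvar, rng)
print(z.shape)  # (256,)
```

Because `z` describes the entire sequence rather than a single frame, the decoder can expand it into any number of output frames, which is the property the paper exploits for variable-length generation.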
The model leverages the SMPL parametric body model to synthesize body poses and root translations, producing output as either joint locations or a full body surface. This flexibility supports a variety of loss functions, including losses on joint locations, surface points, and kinematic part rotations. The paper finds that a combination of these losses yields more realistic motion sequences.
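The combined objective described above can be sketched as a weighted sum of per-term L2 reconstruction errors. The term names, shapes, and weights below are illustrative assumptions (SMPL has 24 body parts and 6890 surface vertices; the 6D rotation size is one common choice), not the paper's exact configuration.

```python
import numpy as np

def motion_loss(pred, gt, weights):
    """Weighted sum of per-term mean-squared reconstruction losses.
    `pred` and `gt` map term names to arrays of shape (T, ...);
    `weights` balances the individual terms."""
    total = 0.0
    for name, w in weights.items():
        diff = pred[name] - gt[name]
        total += w * np.mean(diff ** 2)
    return total

T = 60  # number of frames
rng = np.random.default_rng(0)
gt = {
    "rotations": rng.standard_normal((T, 24, 6)),    # per-part rotations (6D, illustrative)
    "joints":    rng.standard_normal((T, 24, 3)),    # 3D joint locations
    "vertices":  rng.standard_normal((T, 6890, 3)),  # SMPL surface vertices
}
# A prediction that is close to ground truth, for demonstration.
pred = {k: v + 0.01 * rng.standard_normal(v.shape) for k, v in gt.items()}
weights = {"rotations": 1.0, "joints": 1.0, "vertices": 1.0}

loss = motion_loss(pred, gt, weights)
print(loss >= 0.0)  # always non-negative
```

In a full VAE training loop this reconstruction term would be added to a KL-divergence term on the latent distribution.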
A significant innovation is the introduction of positional encodings within the Transformer architecture to generate variable-length sequences. This resolves issues common in autoregressive models, such as pose regression to the mean and motion drift. Additionally, the model’s design facilitates a sequence-level latent space, distinguishing it from other frame-level approaches like Action2Motion, leading to superior performance on diverse datasets including NTU RGB+D, HumanAct12, and UESTC.
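The role of positional encodings in variable-length decoding can be illustrated with the standard sinusoidal formulation from the original Transformer: the decoder can take T such encodings as its queries, so changing the requested duration simply changes how many frames are produced, with no autoregressive rollout. This is a generic sketch of the mechanism, not the paper's implementation.

```python
import numpy as np

def positional_encoding(T, d_model):
    """Standard sinusoidal positional encodings (Vaswani et al., 2017):
    one d_model-dimensional vector per time step t = 0..T-1."""
    pos = np.arange(T)[:, None]            # (T, 1)
    i = np.arange(0, d_model, 2)[None, :]  # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

# Two different target durations reuse the same function -- the decoder
# simply receives a different number of time-indexed query vectors.
print(positional_encoding(60, 256).shape)   # (60, 256)
print(positional_encoding(100, 256).shape)  # (100, 256)
```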
Evaluation and Results
The model's efficacy is illustrated through extensive evaluation on the aforementioned datasets. Metrics such as Fréchet Inception Distance (FID) and action recognition accuracy indicate improvements over state-of-the-art methods. Ablation studies reinforce the importance of positional encoding and sequence-level modeling in achieving both realism and diversity in generated motion sequences. Notably, the model demonstrates practical utility in augmenting datasets for action recognition tasks, emphasizing the significance of synthesized data in low-data regimes.
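For readers unfamiliar with the FID metric, it measures the Fréchet distance between two Gaussians fitted to feature sets extracted from real and generated samples (in motion-synthesis papers, features typically come from a pretrained action-recognition network). A minimal sketch of the computation itself, assuming SciPy is available and using random features purely for illustration:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets of
    shape (N, D): ||mu_a - mu_b||^2 + Tr(Sa + Sb - 2 (Sa Sb)^{1/2})."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b).real  # drop tiny imaginary residue
    return float(((mu_a - mu_b) ** 2).sum()
                 + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 8))    # stand-in "real" features
same = real.copy()                      # identical distribution
shifted = real + 2.0                    # clearly different distribution

print(fid(real, same) < 1e-6)   # identical sets give (near-)zero distance
print(fid(real, shifted) > 1.0)  # shifted sets give a large distance
```

Lower FID indicates generated motions whose feature statistics are closer to those of real data.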
Challenges specific to action-conditioned generation, such as limited labeled motion capture data, are addressed through the use of monocular motion estimation to generate 3D sequences. While these estimates are noisy, the encoder-decoder architecture of the VAE inherently denoises the input, offering potential applications in motion prediction and augmentation for action recognition systems.
Implications and Future Directions
The implications of this work are extensive, with direct applications in fields requiring synthetic human-like motion, from gaming and film to human-robot interaction and immersive virtual environments. The ability to generate sequences from simple action labels makes the model a valuable tool for animation and simulation, reducing dependence on costly motion capture processes.
The paper proposes future exploration in leveraging the compact latent space as a prior in motion estimation tasks, potentially improving the accuracy and speed of pose detection systems. Furthermore, open-vocabulary action generation represents a promising frontier, which could be enabled by further advancements in video motion estimation technologies.
In conclusion, the fusion of a Transformer architecture with a VAE in ACTOR provides a compelling framework for action-conditioned human motion synthesis, marking a significant advancement in the field of computer-generated motion modeling. The thorough experimentation and promising outcomes suggest that this approach will be influential in the ongoing development of systems for realistic human motion synthesis.