- The paper presents a novel model that decouples temporal and spatial attention to better capture joint dynamics in 3D motion.
- It employs a fully autoregressive framework that generates predictions frame by frame while mitigating the error accumulation that typically afflicts such models.
- Empirical results on AMASS and H3.6M datasets demonstrate superior short-term accuracy and sustained long-term motion plausibility.
A Spatio-Temporal Transformer for 3D Human Motion Prediction: An Academic Overview
The paper "A Spatio-temporal Transformer for 3D Human Motion Prediction" presents a novel approach to generative modeling of 3D human motion using a Transformer-based architecture. Existing methodologies predominantly rely on Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs) within sequence-to-sequence frameworks. In contrast, this research adapts the Transformer architecture to address key challenges inherent in 3D human motion prediction.
Core Contributions
The proposed architecture, termed the Spatio-Temporal Transformer (ST-Transformer), introduces several innovations:
- Decoupled Spatio-Temporal Attention: Unlike traditional methods that treat spatial and temporal structure with a single unified mechanism, the ST-Transformer models the two dimensions independently. Temporal attention attends over the history of each individual joint, while spatial attention captures inter-joint dependencies within a given timestep. This separation lets the model learn motion dynamics without the compression bottleneck imposed by RNN hidden states.
- Fully Auto-Regressive Model: The model generates predictions in a frame-by-frame manner, contrasting with some recent models that employ predetermined horizons through frequency-domain representations like Discrete Cosine Transform (DCT). This allows for flexible sequence generation without being bound to fixed-length windows.
- Efficient Representation Learning: By leveraging the Transformer's self-attention mechanism, the model reasons explicitly about dependencies across time and joints, reducing the cumulative error that plagues traditional autoregressive models.
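The decoupled attention described above can be sketched in plain NumPy. The tensor layout, the toy dimensions, and the summation used to fuse the two streams are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; mixes the second-to-last axis of v.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

# Toy motion tensor: (frames T, joints J, feature dim D).
T, J, D = 8, 4, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((T, J, D))

# Temporal attention: rearrange to (J, T, D) so each joint
# attends over its own history along the time axis.
x_t = np.swapaxes(x, 0, 1)
temporal = np.swapaxes(attention(x_t, x_t, x_t), 0, 1)  # back to (T, J, D)

# Spatial attention: each frame attends over its joints.
spatial = attention(x, x, x)  # mixes the joint axis per frame

# Fuse the two streams (summation here is an assumption).
fused = temporal + spatial
print(fused.shape)  # (8, 4, 16)
```

The key point is that neither attention pass ever mixes time and joints in a single score matrix; the two dependency structures stay separately inspectable.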
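The frame-by-frame generation loop can be made concrete with a short sketch; `step_fn` here is a hypothetical stand-in for a trained model, included only to show the feedback structure of autoregressive rollout:

```python
import numpy as np

def rollout(step_fn, seed_frames, horizon):
    """Autoregressive generation: each predicted frame is appended to the
    context and fed back as input, one frame at a time."""
    context = list(seed_frames)
    preds = []
    for _ in range(horizon):
        nxt = step_fn(np.stack(context))  # model sees the full history
        preds.append(nxt)
        context.append(nxt)               # prediction becomes input
    return np.stack(preds)

# Stand-in "model": predicts the mean of the context (purely illustrative).
step = lambda ctx: ctx.mean(axis=0)
seed = [np.ones(3), np.zeros(3)]
out = rollout(step, seed, horizon=4)
print(out.shape)  # (4, 3)
```

Because the loop is unbounded, the sequence length is not tied to any fixed window, unlike DCT-style models that predict a predetermined horizon in one shot.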
Numerical Results and Performance
The paper rigorously evaluates the ST-Transformer using datasets such as AMASS and H3.6M. Key findings include:
- Strong Short-Term Performance: The ST-Transformer achieves superior short-term prediction metrics on AMASS, outperforming RNN-based and DCT-based models across Euler, joint angle, and positional error measures.
- Enhanced Long-Term Plausibility: It maintains plausible motion sequences over extended horizons (up to 20 seconds), particularly for periodic actions. The Power Spectrum (PS) metrics demonstrate a favorable alignment with ground-truth distributions over long durations, indicating effective mitigation of error accumulation.
- Comparative Efficacy: Experiments show that the architecture performs consistently better than or on par with state-of-the-art methods on H3.6M, although the dataset's small test set introduces variability in the reported metrics.
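A power-spectrum comparison of the kind the PS metrics perform can be sketched as follows; the KL-divergence scoring and the toy signals are illustrative assumptions, not the paper's exact metric definition:

```python
import numpy as np

def power_spectrum(seq):
    """Per-feature power spectrum of a (T, D) motion sequence, normalized
    to a distribution over frequencies (phase information is discarded)."""
    ps = np.abs(np.fft.rfft(seq, axis=0)) ** 2
    return ps / ps.sum(axis=0, keepdims=True)

def ps_kld(p, q, eps=1e-8):
    """KL divergence between normalized spectra, averaged over features;
    one common way spectrum-based metrics are scored (an assumption here)."""
    p, q = p + eps, q + eps
    return float(np.mean(np.sum(p * np.log(p / q), axis=0)))

n = np.arange(128)[:, None]
gt = np.sin(2 * np.pi * 4 * n / 128)             # periodic "ground truth"
shifted = np.sin(2 * np.pi * 4 * n / 128 + 0.3)  # same frequencies, new phase
drift = np.cumsum(np.random.default_rng(1).standard_normal((128, 1)), axis=0)

print(ps_kld(power_spectrum(gt), power_spectrum(shifted)))  # near zero
print(ps_kld(power_spectrum(gt), power_spectrum(drift)))
```

This is why such metrics suit long horizons: a prediction that stays in phase-shifted lockstep with a periodic motion scores well, while drifting or collapsing motion diverges in frequency content even when per-frame errors are uninformative.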
Practical and Theoretical Implications
The ST-Transformer's design offers several implications for the field:
- Improved Model Interpretability: By decoupling attention, the architecture provides insight into the temporal and spatial dependencies being learned, offering a clearer understanding of the decision-making process within the model.
- Application Versatility: The architecture's adaptability hints at potential applications beyond traditional motion prediction, possibly enhancing tasks like human-robot interaction and virtual character animation where dynamic motion modeling is crucial.
- Implications for Autoregressive Modeling: The model's results push back on the view that autoregressive prediction is inherently prone to drift, encouraging further exploration of hybrid models that combine frequency- and time-domain signals.
Future Directions
The research presents several avenues for future exploration:
- Scalability and Complexity: Exploring methods to optimize the computational complexity of attention mechanisms in high-dimensional spaces, particularly for real-time applications, can enhance model deployment capabilities.
- Multi-Modal Extensions: Integrating additional data modalities, such as audio or environmental context, could improve prediction accuracy and broaden the range of scenarios the model can handle.
- Diversity of Prediction: Handling aperiodic motions and generating diverse plausible futures remain open challenges, potentially calling for techniques that explicitly encourage sequence diversity.
In summary, the paper provides a sophisticated framework for 3D human motion prediction, contributing significant architectural advancements and setting a robust foundation for further research in generative human motion modeling.