- The paper proposes a novel MST-GNN that adaptively models multiscale spatial and temporal relationships in 3D skeleton data.
- It combines cross-scale feature fusion with graph-based attention, reducing prediction errors on standard motion benchmarks.
- The approach demonstrates robust short- and long-term motion predictions, paving the way for advanced applications in human-computer interaction and robotics.
Overview of "Multiscale Spatio-Temporal Graph Neural Networks for 3D Skeleton-Based Motion Prediction"
The paper introduces Multiscale Spatio-Temporal Graph Neural Networks (MST-GNN) to tackle the challenge of predicting 3D skeleton-based human poses across various action categories. The authors build a data-adaptive multiscale spatio-temporal graph that models both spatial and temporal relations within motion sequences, departing from fixed, hand-crafted hierarchical structures. The trainable graph structures capture non-physical, motion-based relationships between joints, improving both short-term and long-term predictions.
Methodology
The core of the MST-GNN is the Multiscale Spatio-Temporal Graph Computational Unit (MST-GCU), which is designed to leverage trainable graph structures for feature learning and integration:
- Multiscale Spatio-Temporal Graph Construction: Unlike fixed or pre-defined structures, the graphs are constructed adaptively: spatial and temporal edge weights are adjusted during training, making them responsive to the varying motion patterns of different actions.
- Single-Scale and Cross-Scale Feature Processing: The MST-GCU performs graph convolutions within each individual scale and fuses features across scales. This dual-level processing preserves detailed local information while integrating multi-scale context (a minimal sketch of both operations follows this list).
- Graph-Based Decoder: To generate future poses, the model employs a Graph-Based Attention Gated Recurrent Unit (GA-GRU) that focuses on the most informative joint features through attention, which is crucial for capturing salient motion dynamics during prediction (see the second sketch after this list).
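To make the trainable graph construction and cross-scale fusion concrete, here is a minimal PyTorch sketch. It is not the authors' implementation: the learned adjacency matrix, the softmax normalization, the joint/part counts, and all module names are illustrative assumptions.

```python
# A minimal sketch (PyTorch, not the authors' code) of a trainable adjacency per
# scale, a graph convolution at each scale, and a learned cross-scale fusion.
import torch
import torch.nn as nn


class TrainableGraphConv(nn.Module):
    """Graph convolution over one scale with a data-adaptive adjacency."""

    def __init__(self, num_nodes: int, in_dim: int, out_dim: int):
        super().__init__()
        # Adjacency is a free parameter, learned jointly with the features.
        self.adj = nn.Parameter(torch.randn(num_nodes, num_nodes) * 0.01)
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_nodes, in_dim)
        a = torch.softmax(self.adj, dim=-1)   # normalize learned edges
        return torch.relu(self.proj(a @ x))   # aggregate neighbors, then transform


class CrossScaleFusion(nn.Module):
    """Fuse features from a fine scale (joints) and a coarse scale (body parts)."""

    def __init__(self, fine_nodes: int, coarse_nodes: int, dim: int):
        super().__init__()
        # Trainable mapping from coarse nodes to fine nodes (e.g. part -> joints).
        self.up = nn.Parameter(torch.randn(fine_nodes, coarse_nodes) * 0.01)
        self.mix = nn.Linear(2 * dim, dim)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        lifted = torch.softmax(self.up, dim=-1) @ coarse   # coarse -> fine layout
        return self.mix(torch.cat([fine, lifted], dim=-1))


# Toy usage: 22 joints at the fine scale, 10 body parts at the coarse scale.
fine_gc = TrainableGraphConv(num_nodes=22, in_dim=3, out_dim=64)
coarse_gc = TrainableGraphConv(num_nodes=10, in_dim=3, out_dim=64)
fusion = CrossScaleFusion(fine_nodes=22, coarse_nodes=10, dim=64)

joints = torch.randn(8, 22, 3)   # (batch, joints, xyz)
parts = torch.randn(8, 10, 3)    # (batch, parts, pooled xyz)
fused = fusion(fine_gc(joints), coarse_gc(parts))
print(fused.shape)  # torch.Size([8, 22, 64])
```

The point of the sketch is that `self.adj` and `self.up` are learned end-to-end, so the graph edges adapt to the motion data rather than being fixed by the skeleton topology.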
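The decoder's attention gating can be sketched in the same spirit. The snippet below illustrates an attention-gated GRU step rather than the paper's exact GA-GRU: the scoring function, dimensions, and the way attention reweights joint features before the recurrent update are assumptions made for clarity.

```python
# A minimal sketch of an attention-gated recurrent step: per-joint attention
# scores reweight joint features before they update the recurrent hidden state.
import torch
import torch.nn as nn


class AttentionGatedGRUStep(nn.Module):
    def __init__(self, num_joints: int, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)   # per-joint attention score
        self.gru = nn.GRUCell(num_joints * feat_dim, hidden_dim)

    def forward(self, joint_feats: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # joint_feats: (batch, joints, feat_dim); h: (batch, hidden_dim)
        b, j, d = joint_feats.shape
        h_exp = h.unsqueeze(1).expand(b, j, -1)            # share state with every joint
        attn = torch.softmax(self.score(torch.cat([joint_feats, h_exp], dim=-1)), dim=1)
        gated = (attn * joint_feats).reshape(b, j * d)     # emphasize salient joints
        return self.gru(gated, h)                          # one decoding step


# Toy usage: decode one step for 22 joints with 64-d features.
step = AttentionGatedGRUStep(num_joints=22, feat_dim=64, hidden_dim=256)
h = torch.zeros(8, 256)
h = step(torch.randn(8, 22, 64), h)
print(h.shape)  # torch.Size([8, 256])
```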
Experimental Results
The MST-GNN was evaluated extensively on the Human 3.6M, CMU Mocap, and 3DPW datasets:
- On Human 3.6M, the MST-GNN reduced the mean angle error by 5.33% for short-term predictions and 3.67% for long-term predictions compared to prior models.
- On CMU Mocap, improvements of 11.84% for short-term and 4.71% for long-term predictions were reported, surpassing state-of-the-art methods.
- The approach demonstrated its adaptability and robustness on 3DPW, achieving a mean angle error reduction of 1.13%.
Implications and Future Directions
The MST-GNN offers significant contributions to the fields of computer vision and robotics, particularly in applications requiring nuanced human motion prediction such as human-computer interaction, autonomous driving, and surveillance systems. The adaptive graph-based approach paves the way for further explorations into non-linear and multi-scale data relationships. Future work could involve expanding this architecture to handle even more complex motion dynamics and exploring its application in other high-dimensional time series data beyond human motion.
In summary, the proposal of MST-GNN represents a substantial advancement in understanding and forecasting human motion. The model's ability to effectively capture multiscale spatio-temporal relationships through adaptive graph structures highlights a promising direction for research and application in dynamic pattern prediction.