Spatiotemporal Co-attention Recurrent Neural Networks for Human-Skeleton Motion Prediction (1909.13245v2)

Published 29 Sep 2019 in cs.CV

Abstract: Human motion prediction aims to generate future motions based on the observed human motions. Witnessing the success of Recurrent Neural Networks (RNN) in modeling the sequential data, recent works utilize RNN to model human-skeleton motion on the observed motion sequence and predict future human motions. However, these methods did not consider the existence of the spatial coherence among joints and the temporal evolution among skeletons, which reflects the crucial characteristics of human motion in spatiotemporal space. To this end, we propose a novel Skeleton-joint Co-attention Recurrent Neural Networks (SC-RNN) to capture the spatial coherence among joints, and the temporal evolution among skeletons simultaneously on a skeleton-joint co-attention feature map in spatiotemporal space. First, a skeleton-joint feature map is constructed as the representation of the observed motion sequence. Second, we design a new Skeleton-joint Co-Attention (SCA) mechanism to dynamically learn a skeleton-joint co-attention feature map of this skeleton-joint feature map, which can refine the useful observed motion information to predict one future motion. Third, a variant of GRU embedded with SCA collaboratively models the human-skeleton motion and human-joint motion in spatiotemporal space by regarding the skeleton-joint co-attention feature map as the motion context. Experimental results on human motion prediction demonstrate the proposed method outperforms the related methods.
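The first two steps described in the abstract (building a skeleton-joint feature map, then re-weighting it with co-attention) can be illustrated with a minimal NumPy sketch. This is not the paper's formulation: the tensor layout (frames x joints x channels), the mean-pooled summaries, the projection matrices `w_t` and `w_j`, and the function name `co_attention` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(feature_map, hidden, w_t, w_j):
    """Hypothetical co-attention over a skeleton-joint feature map.

    feature_map: (T, J, D) observed frames x joints x coordinates
    hidden:      (H,) current recurrent state used as the query
    w_t, w_j:    (H, D) assumed projections for the temporal/spatial scores
    """
    # Temporal (skeleton) attention: one weight per observed frame.
    frame_summary = feature_map.mean(axis=1)                       # (T, D)
    temporal = softmax(frame_summary @ (w_t.T @ hidden), axis=0)   # (T,)
    # Spatial (joint) attention: one weight per joint.
    joint_summary = feature_map.mean(axis=0)                       # (J, D)
    spatial = softmax(joint_summary @ (w_j.T @ hidden), axis=0)    # (J,)
    # Joint weight for every (frame, joint) cell of the feature map.
    weights = np.outer(temporal, spatial)                          # (T, J)
    attended = feature_map * weights[:, :, None]                   # (T, J, D)
    context = attended.sum(axis=(0, 1))                            # (D,)
    return weights, attended, context

# Example with made-up sizes: 10 frames, 17 joints, 3-D coordinates.
rng = np.random.default_rng(0)
T, J, D, H = 10, 17, 3, 8
fmap = rng.normal(size=(T, J, D))
h = rng.normal(size=H)
weights, attended, context = co_attention(
    fmap, h, rng.normal(size=(H, D)), rng.normal(size=(H, D))
)
```

In this sketch the attended map and its pooled `context` vector play the role of the motion context that a recurrent cell would consume at each prediction step; how the paper's GRU variant actually uses it is not reproduced here.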



Summary

The paper introduces a Skeleton-joint Co-Attention (SCA) mechanism that learns attention over both the spatial (joint) and temporal (skeleton) dimensions for motion prediction.
It embeds this mechanism in a GRU-based architecture, SC-RNN, which is reported to outperform related methods on the H3.6M dataset, including on complex motions.
The study trains the model with a weighted gram-matrix loss that encourages structural consistency between predicted and ground-truth skeletal motions.

Spatiotemporal Co-attention Recurrent Neural Networks for Human-Skeleton Motion Prediction

The paper presents Skeleton-joint Co-attention Recurrent Neural Networks (SC-RNN), an approach to human-skeleton motion prediction. The goal is to predict future skeletal motion by making better use of the observed motion sequence. Recurrent Neural Networks (RNNs) are a common choice for this task because they model sequential data well. However, existing RNN-based methods do not account for the spatial coherence among joints or the temporal evolution among skeletons, both of which the authors consider crucial characteristics of human motion.

Key Contributions

Skeleton-joint Co-Attention Mechanism (SCA): The paper introduces a Skeleton-joint Co-Attention (SCA) mechanism, which is designed to simultaneously learn attention factors in both spatial and temporal dimensions. This mechanism enhances the ability to refine and utilize observed motion data, allowing for better future motion predictions. The approach dynamically learns a co-attention feature map, taking into account the importance of each joint and skeleton over time.
SC-RNN Architecture: By embedding the SCA within a variant of the Gated Recurrent Unit (GRU), the SC-RNN architecture is established. This configuration models the human-skeleton and joint motions in a cohesive manner across spatiotemporal space, providing improved predictive capabilities over conventional RNN architectures.
Weighted Gram-Matrix Loss: The research proposes a weighted gram-matrix loss for model training, which captures the structural dependencies between predicted and ground-truth motions. The loss encourages the predicted skeletons to stay structurally consistent with the ground truth, including in how frames correlate with one another across time steps.
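To make the idea behind a weighted gram-matrix loss concrete, here is a minimal NumPy sketch. It assumes that each motion sequence is flattened to a (frames x features) matrix, that the Gram matrix is the frame-by-frame inner product, and that the weighting is a per-frame vector `frame_weights`. These choices and the function names are illustrative assumptions; the paper's exact definition and weighting scheme may differ.

```python
import numpy as np

def gram(seq):
    # seq: (T, F) with one flattened skeleton per frame.
    # Entry (i, j) is the inner product between frames i and j.
    return seq @ seq.T

def weighted_gram_loss(pred, target, frame_weights):
    """Hypothetical weighted Gram-matrix loss between two motion sequences.

    pred, target:  (T, F) predicted and ground-truth sequences
    frame_weights: (T,) assumed non-negative per-frame importance weights
    """
    diff = gram(pred) - gram(target)                 # (T, T)
    w = np.outer(frame_weights, frame_weights)       # weight for each frame pair
    return float(np.mean(w * diff ** 2))

# Example with made-up sizes: 25 predicted frames, 17 joints x 3 coordinates.
rng = np.random.default_rng(0)
target = rng.normal(size=(25, 51))
pred = target + 0.1 * rng.normal(size=(25, 51))
frame_weights = np.linspace(1.0, 0.5, 25)  # assumed: earlier frames weigh more

loss_perfect = weighted_gram_loss(target, target, frame_weights)
loss_noisy = weighted_gram_loss(pred, target, frame_weights)
```

Because the Gram matrix compares every pair of frames, this kind of loss penalizes predictions whose frame-to-frame relationships drift from the ground truth, not only per-frame pose errors.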

Experimental Results

The proposed SC-RNN is reported to outperform other state-of-the-art methods on the H3.6M dataset, one of the largest benchmarks for human-skeleton motion prediction. The evaluations highlight its advantage in scenarios involving complex joint interactions and long-term motion forecasting, which the authors attribute to modeling spatial and temporal dependencies together.

Implications and Future Directions

The research carries significant implications for real-time applications involving human-computer interactions, virtual reality, and animation, where accurate motion prediction is essential. By providing a framework that can capture intricate motion patterns more effectively, SC-RNN could serve as a foundational model for future advancements in motion prediction tasks.

For future work, enhancing the scalability and efficiency of SC-RNN to handle larger and more diverse datasets remains an important avenue for exploration. Additionally, the integration of SC-RNN into multi-modal systems that combine skeletal data with other forms of sensory input, such as visual or auditory data, could yield more comprehensive models for understanding human activities.

In conclusion, SC-RNN is a meaningful step toward handling the complexities of human-skeleton motion prediction, and its reported results give future research in this domain a strong baseline to compare against.
