On human motion prediction using recurrent neural networks (1705.02445v1)

Published 6 May 2017 in cs.CV

Abstract: Human motion modelling is a classical problem at the intersection of graphics and computer vision, with applications spanning human-computer interaction, motion synthesis, and motion prediction for virtual and augmented reality. Following the success of deep learning methods in several computer vision tasks, recent work has focused on using deep recurrent neural networks (RNNs) to model human motion, with the goal of learning time-dependent representations that perform tasks such as short-term motion prediction and long-term human motion synthesis. We examine recent work, with a focus on the evaluation methodologies commonly used in the literature, and show that, surprisingly, state-of-the-art performance can be achieved by a simple baseline that does not attempt to model motion at all. We investigate this result, and analyze recent RNN methods by looking at the architectures, loss functions, and training procedures used in state-of-the-art approaches. We propose three changes to the standard RNN models typically used for human motion, which result in a simple and scalable RNN architecture that obtains state-of-the-art performance on human motion prediction.

Citations (876)

Summary

  • The paper demonstrates that a simple zero-velocity baseline can match or exceed current RNN models in short-term prediction tasks.
  • It introduces a Seq2Seq framework, sampling-based loss, and residual architecture to address error propagation and improve motion smoothness.
  • Empirical results on the Human3.6M dataset show significant gains in prediction accuracy and scalability, benefiting diverse applications.

On Human Motion Prediction Using Recurrent Neural Networks

"On Human Motion Prediction Using Recurrent Neural Networks" by Julieta Martinez, Michael J. Black, and Javier Romero addresses the challenging problem of predicting human motion using data-driven approaches with deep learning, particularly Recurrent Neural Networks (RNNs). The research highlights both the potential and limitations of current RNN-based methods for achieving accurate and smooth motion predictions, proposing substantial methodological improvements.

The paper begins by identifying human motion modeling as a problem central to applications ranging from human-computer interaction to motion synthesis for computer graphics and virtual reality. Traditional methods, which often rely on expert knowledge and assumptions such as Markovian properties or smoothness constraints, are contrasted with recent strategies leveraging deep learning. Specifically, the authors focus on RNNs due to their innate ability to model time-dependent data. Despite the success of deep learning in various vision tasks, the authors note that current RNN models still suffer from several shortcomings.

One of the core contributions of this paper is the empirical finding that a simple zero-velocity baseline, which predicts the last observed pose indefinitely, matches or outperforms state-of-the-art RNN-based models on short-term motion prediction tasks. This observation underscores a significant gap in the effectiveness of existing methodologies and motivates the need for improved approaches.
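
To make the baseline concrete, the following is a minimal sketch of a zero-velocity predictor together with a simple angle-space error metric. The pose dimensionality, sequence lengths, and the exact error computation are illustrative assumptions rather than the paper's evaluation code.

```python
import numpy as np

def zero_velocity_baseline(observed_poses: np.ndarray, horizon: int) -> np.ndarray:
    """Repeat the last observed pose for every future frame.

    observed_poses: array of shape (T_obs, D), where D is the pose dimensionality.
    horizon: number of future frames to "predict".
    """
    last_pose = observed_poses[-1]            # (D,)
    return np.tile(last_pose, (horizon, 1))   # (horizon, D)

def angle_space_error(pred: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Per-frame Euclidean error between pose vectors in angle space."""
    return np.linalg.norm(pred - target, axis=-1)

# Toy usage with random stand-ins for motion-capture sequences.
rng = np.random.default_rng(0)
observed = rng.standard_normal((50, 54))      # 50 observed frames, 54-D pose (assumed)
future = rng.standard_normal((25, 54))        # 25 ground-truth future frames
pred = zero_velocity_baseline(observed, horizon=25)
print(angle_space_error(pred, future)[:4])    # error at the first few prediction steps
```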

To address these shortcomings, the authors propose three key modifications, illustrated together in the sketch that follows the list:

  1. Sequence-to-Sequence (Seq2Seq) Architecture: Traditional approaches train RNNs frame by frame, conditioning on the ground-truth pose at every time step. The authors instead adopt a Seq2Seq framework with separate encoding and decoding phases, in which the decoder is conditioned on its own predictions, so that training naturally resembles the test-time scenario where errors accumulate.
  2. Sampling-Based Loss: Rather than complicating training with noise injection or scheduled sampling, the authors compute the loss directly on sequences generated by feeding the network's own predictions back in, which improves robustness to error accumulation.
  3. Residual Architecture: Building on the observation that modeling velocities (first-order motion derivatives) rather than absolute positions can smooth predictions, a residual connection is introduced to represent pose velocities directly.
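
The three changes can be sketched together in a few lines of PyTorch-style code. This is not the authors' released implementation; the single-layer GRU, the hidden size, and the plain L2 loss below are assumptions chosen to keep the example self-contained.

```python
import torch
import torch.nn as nn

class ResidualSeq2SeqGRU(nn.Module):
    """Sketch of a Seq2Seq motion predictor with a residual (velocity) decoder."""

    def __init__(self, pose_dim: int = 54, hidden_dim: int = 1024):
        super().__init__()
        self.encoder = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.decoder_cell = nn.GRUCell(pose_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, observed: torch.Tensor, horizon: int) -> torch.Tensor:
        # observed: (batch, T_obs, pose_dim)
        _, h = self.encoder(observed)          # h: (1, batch, hidden_dim)
        h = h.squeeze(0)
        prev_pose = observed[:, -1]            # seed decoding with the last seen pose
        preds = []
        for _ in range(horizon):
            h = self.decoder_cell(prev_pose, h)
            # Residual connection: the network predicts a pose offset (a velocity),
            # which is added to the previous pose.
            next_pose = prev_pose + self.out(h)
            preds.append(next_pose)
            # Sampling-based idea: feed the model's own prediction back in,
            # during training as well as at test time.
            prev_pose = next_pose
        return torch.stack(preds, dim=1)       # (batch, horizon, pose_dim)

# Toy usage: an L2 loss computed on the model's own (fed-back) predictions.
model = ResidualSeq2SeqGRU()
observed = torch.randn(8, 50, 54)
target = torch.randn(8, 25, 54)
pred = model(observed, horizon=25)
loss = torch.mean((pred - target) ** 2)
loss.backward()
```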

Empirical results indicate that these changes yield a more scalable and effective RNN architecture for human motion prediction. The proposed model—notably simpler than previous multi-layer LSTM and specialized structural models—achieves state-of-the-art performance, particularly on short-term prediction tasks.

Evaluation and Results

The authors' methodology was tested extensively on the Human3.6M dataset, a large corpus of motion-capture data. The evaluation was twofold: quantitative short-term prediction error and qualitative long-term motion plausibility. The proposed model not only lowers short-term prediction errors significantly but also generates smoother and more plausible long-term motion sequences. Scalability was demonstrated by training a single model on multiple actions, a departure from the action-specific models of prior work. The residual architecture, which models velocities, mitigated the severe discontinuities between the conditioning frames and the first predicted frames observed in earlier work.
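
As a rough illustration of the quantitative protocol, the sketch below reports angle-space errors at the short-term horizons commonly used in this literature (80, 160, 320, and 400 ms). The 25 fps frame rate and the millisecond-to-frame mapping are assumptions; the paper's exact evaluation details may differ.

```python
import numpy as np

FPS = 25                         # assumed downsampled frame rate
HORIZONS_MS = [80, 160, 320, 400]

def angle_space_error(pred: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Per-frame Euclidean distance between predicted and ground-truth
    pose vectors, each of shape (T, D); returns shape (T,)."""
    return np.linalg.norm(pred - target, axis=-1)

def report_short_term(pred: np.ndarray, target: np.ndarray) -> dict:
    """Pick out the error at each short-term horizon, in milliseconds."""
    errors = angle_space_error(pred, target)
    return {ms: float(errors[int(round(ms / 1000 * FPS)) - 1]) for ms in HORIZONS_MS}

# Toy usage with random stand-ins for one test sequence.
rng = np.random.default_rng(1)
pred = rng.standard_normal((10, 54))
target = rng.standard_normal((10, 54))
print(report_short_term(pred, target))   # {80: ..., 160: ..., 320: ..., 400: ...}
```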

Implications and Future Work

Practically, the improved human motion prediction models can significantly enhance various applications. In computer vision, better short-term predictions enable robust tracking systems, crucial for interactive environments and surveillance. For graphics and animation, the ability to generate smooth and realistic human motion over longer durations can advance virtual reality and gaming experiences.

Theoretically, this paper sheds light on several areas for future research:

  • Improving Long-Term Predictions: While short-term accuracy saw marked improvements, ensuring long-term plausibility remains challenging. Future work might explore integrating diverse loss functions or adversarial training to balance short- and long-term prediction quality.
  • Leveraging Larger Datasets: The authors relate the performance boost to the availability of extensive training data, suggesting an open avenue for developing even larger, perhaps semi-supervised, learning frameworks that can harness vast motion datasets.
  • Generalization Across Tasks: Since the network architecture is not entirely tailored to specific motion actions, there's potential for applying similar methodologies to other time-dependent prediction problems across different domains.

Overall, the paper's contributions push the boundaries of deep learning-based human motion modeling, offering both practical improvements and theoretical insights that can guide future exploration in motion prediction and related fields.
