- The paper presents the novel PVRED architecture, which leverages both position and velocity information to counter the unnatural poses that existing models produce in long-term prediction.
- It employs quaternion parameterization with a dedicated transformation layer to overcome limitations like gimbal lock and improve stability.
- Experimental results on Human3.6M and CMU datasets show significant improvements in long-term human motion prediction accuracy.
An Analysis of PVRED: A Position-Velocity Recurrent Encoder-Decoder for Human Motion Prediction
The paper presents a novel approach to human motion prediction, tackling a key limitation of existing models: they often collapse to mean or unnatural poses over longer time horizons. The proposed Position-Velocity Recurrent Encoder-Decoder (PVRED) leverages both the positional and velocity information of human motions and adopts quaternion parameterization, in contrast to the exponential map commonly used in prior methods.
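To make the position-velocity idea concrete, the sketch below computes per-frame velocities as simple frame-to-frame differences of the pose vectors. This is a minimal illustration of the general idea, not the paper's exact preprocessing; the function name and tensor layout are assumptions.

```python
import torch

def pose_velocities(poses: torch.Tensor) -> torch.Tensor:
    """Frame-to-frame velocities for a pose sequence (illustrative).

    poses: (seq_len, dim) tensor of per-frame pose vectors, e.g.
    concatenated joint quaternions. The first velocity is set to zero
    so positions and velocities stay aligned in time.
    """
    velocities = torch.zeros_like(poses)
    velocities[1:] = poses[1:] - poses[:-1]
    return velocities
```

Feeding the network both `poses` and `velocities` gives it explicit access to motion trends rather than forcing the recurrent model to infer them from positions alone.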
PVRED builds upon the conventional Recurrent Encoder-Decoder (RED) architecture with three primary innovations: full utilization of pose velocities and temporal positional information, the use of quaternions for joint rotations, and the introduction of a Position-Velocity RNN. Positional embeddings inspired by natural language processing models enable more effective capture of temporal dependencies. This design is well suited to tasks requiring long-term prediction, as it encodes time using sinusoids of varying frequencies.
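The sinusoidal encoding itself is the standard formulation from the Transformer literature; how PVRED combines it with the pose and velocity inputs (e.g., concatenation versus addition) follows the paper's architecture, which this sketch does not reproduce. A minimal version, assuming an even embedding dimension:

```python
import math
import torch

def sinusoidal_position_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal temporal encodings (Vaswani et al., 2017).

    Each time step t receives a dim-dimensional vector whose even entries
    are sin(t / 10000^(2i/dim)) and whose odd entries are the matching
    cosines, so dimensions oscillate at geometrically spaced frequencies.
    Assumes dim is even.
    """
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    freqs = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )  # (dim // 2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(positions * freqs)
    pe[:, 1::2] = torch.cos(positions * freqs)
    return pe
```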
Moreover, the authors emphasize the substantial advantages of quaternion parameterization over exponential maps, designing a Quaternion Transformation (QT) layer that is seamlessly integrated into the network. The quaternion representation circumvents the gimbal lock problem and the discontinuities associated with the exponential map, yielding a more stable and robust prediction framework. A robust loss function defined in the unit-quaternion space is also proposed; it improves training stability by minimizing the angular differences between predicted and ground-truth poses with an L1 loss.
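To illustrate what a loss in unit-quaternion space might look like, here is a hedged sketch: `quat_normalize` stands in for the guarantee the QT layer must ultimately provide (valid unit quaternions), and `quat_angle_l1_loss` is one plausible reading of an angular L1 objective; both names and details are illustrative, not taken from the paper.

```python
import torch

def quat_normalize(q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Project raw 4-D network outputs onto the unit-quaternion sphere."""
    return q / (q.norm(dim=-1, keepdim=True) + eps)

def quat_angle_l1_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean absolute rotation angle between predicted and target quaternions.

    pred, target: (..., 4) unit quaternions. Because q and -q encode the
    same rotation, the absolute dot product is taken before arccos; the
    clamp keeps the gradient of arccos finite near 1.
    """
    dot = (pred * target).sum(dim=-1).abs().clamp(max=1.0 - 1e-7)
    angle = 2.0 * torch.acos(dot)  # geodesic angle in radians, always >= 0
    return angle.mean()            # L1 on a non-negative quantity is its mean
```

Working directly on the rotation angle sidesteps the sign ambiguity of quaternions, which is one reason a quaternion-space loss can train more stably than a loss on raw rotation parameters.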
The experimental evaluation uses two standard benchmarks, Human3.6M and the CMU Motion Capture dataset, both known for the complexity and diversity of their motion data. The results indicate that the approach outperforms existing methods, especially for long-term predictions beyond 500 milliseconds. Notably, PVRED predicts human-like, natural poses up to 4000 milliseconds ahead, both qualitatively and quantitatively surpassing baselines and state-of-the-art models such as the Residual RNN and those built on transformer architectures.
This paper makes a strong case for adopting position embeddings and quaternion transformations in human motion prediction. It shows how techniques from other fields, such as NLP, can be integrated to enhance model performance. Furthermore, the methodological robustness of the quaternion loss formulation invites both practitioners and theoreticians to rethink pose dynamics and rotation modeling in human motion prediction tasks.
Looking ahead, it is plausible that graph neural networks or transformer architectures could be incorporated further, given their inherent capacity to model complex joint dependencies and long sequences; such integrations could yield an even more refined capture and prediction of human motion nuances. Even so, PVRED as proposed represents a substantial step toward finer granularity and accuracy in modeling human motion dynamics.