Human Motion Prediction via Spatio-Temporal Inpainting (1812.05478v2)

Published 13 Dec 2018 in cs.CV

Abstract: We propose a Generative Adversarial Network (GAN) to forecast 3D human motion given a sequence of past 3D skeleton poses. While recent GANs have shown promising results, they can only forecast plausible motion over relatively short periods of time (few hundred milliseconds) and typically ignore the absolute position of the skeleton w.r.t. the camera. Our scheme provides long term predictions (two seconds or more) for both the body pose and its absolute position. Our approach builds upon three main contributions. First, we represent the data using a spatio-temporal tensor of 3D skeleton coordinates which allows formulating the prediction problem as an inpainting one, for which GANs work particularly well. Secondly, we design an architecture to learn the joint distribution of body poses and global motion, capable to hypothesize large chunks of the input 3D tensor with missing data. And finally, we argue that the L2 metric, considered so far by most approaches, fails to capture the actual distribution of long-term human motion. We propose two alternative metrics, based on the distribution of frequencies, that are able to capture more realistic motion patterns. Extensive experiments demonstrate our approach to significantly improve the state of the art, while also handling situations in which past observations are corrupted by occlusions, noise and missing frames.

Citations (198)

Summary

The paper presents a novel spatio-temporal inpainting approach using GANs to predict 3D human motion, capturing both pose and absolute positioning.
It employs a fully convolutional generator paired with three discriminators to maintain temporal coherence and enhance anthropomorphic realism.
The paper proposes frequency-based metrics as an alternative to L2, achieving superior accuracy in long-term motion predictions on standard benchmarks.

Overview of "Human Motion Prediction via Spatio-Temporal Inpainting"

The paper addresses the challenge of predicting 3D human motion by leveraging a Generative Adversarial Network (GAN) specifically designed to handle the spatio-temporal dynamics of human motion sequences. The authors present an approach that extends the capabilities of GANs in forecasting long-term human motion, emphasizing the prediction of both the pose and the absolute position of the human body.

Key Contributions

The paper makes three substantial contributions to the domain of human motion prediction:

Spatio-Temporal Representation: The authors represent human motion data using a spatio-temporal tensor of 3D skeleton coordinates. This transformation allows them to frame the prediction task as an inpainting problem, which aligns well with the strengths of GANs. This formulation addresses the limitations of previous methods that typically ignored global positioning.
Advanced GAN Architecture: The proposed GAN pairs a fully convolutional generator, which preserves temporal coherence, with three independent discriminators that enforce anthropomorphism and improve the realism of the generated motion. The generator is designed to hypothesize large chunks of the spatio-temporal tensor where data is missing, which is what makes long-term prediction possible.
Alternative to L2 Metric: The L2 metric has traditionally been used to measure the distance between predicted and ground-truth motion sequences. The authors argue that it fails to capture the true distribution of human motion over long horizons. As an alternative, they propose two metrics based on the distribution of frequencies, which capture more realistic motion patterns.
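To illustrate why a frequency-based comparison can behave differently from L2, here is a minimal NumPy sketch. It is not the paper's exact metric definition: the function names, the use of a KL divergence between normalized power spectra, and the synthetic sinusoidal "motion" are all assumptions chosen for illustration. The toy data compares a ground-truth periodic motion against a frozen pose and against the same motion in opposite phase.

```python
import numpy as np

def power_spectrum(seq, eps=1e-12):
    """Normalized power spectrum over time of a (frames, joints, 3) sequence.

    Hypothetical helper: the per-coordinate mean is removed so that only
    motion (not the static pose) contributes, then power is pooled over
    all joints and coordinates.
    """
    centered = seq - seq.mean(axis=0, keepdims=True)
    spec = np.abs(np.fft.rfft(centered, axis=0)) ** 2   # (freqs, joints, 3)
    spec = spec.sum(axis=(1, 2)) + eps                   # eps avoids log(0)
    return spec / spec.sum()

def spectral_kl(p, q):
    """KL divergence between two normalized power spectra."""
    return float(np.sum(p * np.log(p / q)))

def l2_error(a, b):
    """Mean per-joint Euclidean distance between two sequences."""
    return float(np.mean(np.linalg.norm(a - b, axis=-1)))

# Synthetic data; frame count, joint count and amplitude are arbitrary.
rng = np.random.default_rng(0)
frames, joints = 50, 17
t = np.arange(frames)[:, None, None]
base = rng.normal(size=(1, joints, 3))                    # static pose
phase = rng.uniform(0, 2 * np.pi, size=(1, joints, 3))
omega = 2 * np.pi * 2 / frames                            # two cycles

gt = base + 0.2 * np.sin(omega * t + phase)               # ground truth
frozen = np.repeat(base, frames, axis=0)                  # no motion at all
shifted = base + 0.2 * np.sin(omega * t + phase + np.pi)  # opposite phase

l2_frozen, l2_shifted = l2_error(gt, frozen), l2_error(gt, shifted)
kl_frozen = spectral_kl(power_spectrum(gt), power_spectrum(frozen))
kl_shifted = spectral_kl(power_spectrum(gt), power_spectrum(shifted))

print(f"L2   frozen={l2_frozen:.3f}  shifted={l2_shifted:.3f}")
print(f"KL   frozen={kl_frozen:.3f}  shifted={kl_shifted:.3f}")
```

In this toy setup L2 ranks the motionless prediction above the phase-shifted one, while the spectral comparison does the opposite, because the phase-shifted sequence has the same frequency content as the ground truth. This is the kind of discrepancy the authors' argument against L2 points to.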

Experimental Insights

Extensive experiments show that the proposed model outperforms existing methods, particularly for long-term prediction (two seconds or more) and when past observations are corrupted by occlusions, noise, or missing frames. The qualitative results in the paper show that predicted motion remains realistic as the prediction horizon grows. Quantitatively, the method improves on the state of the art on established benchmarks.
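Under the inpainting formulation, forecasting and corrupted observations can both be viewed as missing regions of the same spatio-temporal tensor. The sketch below shows one way such an input could be prepared. It assumes a binary visibility mask and zero-filling of missing entries; the summary does not specify how the paper encodes missing data, and the function name, parameters, and tensor sizes are hypothetical.

```python
import numpy as np

def make_inpainting_input(motion, n_observed, occluded_joints=(),
                          dropped_frames=(), noise_std=0.0, seed=0):
    """Build (masked_tensor, mask) from a (frames, joints, 3) motion array.

    mask == 1 marks entries visible to the model; mask == 0 marks entries
    that would have to be inpainted. Hypothetical helper for illustration.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros(motion.shape, dtype=np.float32)
    mask[:n_observed] = 1.0                       # past frames are observed

    # Occlusions: hide selected joints everywhere in the observed part.
    joints_idx = np.asarray(list(occluded_joints), dtype=int)
    mask[:n_observed, joints_idx] = 0.0

    # Missing frames: hide whole frames of the observed part.
    frames_idx = np.asarray(list(dropped_frames), dtype=int)
    mask[frames_idx] = 0.0

    # Sensor noise on the coordinates (only visible entries survive the mask).
    noisy = motion + noise_std * rng.normal(size=motion.shape)
    return (noisy * mask).astype(np.float32), mask

# Example: 100 frames, 17 joints (sizes are arbitrary), first 50 observed.
motion = np.random.default_rng(1).normal(size=(100, 17, 3))
masked, mask = make_inpainting_input(
    motion, n_observed=50, occluded_joints=(3, 4),
    dropped_frames=(10, 11), noise_std=0.01,
)
print(masked.shape, int(mask.sum()))
```

A generator trained for inpainting would then receive `masked` (and possibly `mask`) and be asked to fill every zeroed entry, so the future frames and the corrupted past are handled by the same mechanism.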

Implications and Future Directions

This research has both practical and theoretical implications for human motion analysis and prediction. Practically, accurate motion forecasting has potential applications in animation, virtual reality, and human-computer interaction. Theoretically, the approach could inspire further research into adaptable GAN architectures for time-sequenced data and into evaluation metrics beyond traditional error norms such as L2.

Future work might explore the integration of additional modalities (e.g., skeletal data combined with visual cues) to further improve prediction accuracy. It will also be important to investigate how well these methods generalize across different datasets and motion scenarios. The paper sets a precedent for applying GANs to temporal tasks, not only motion prediction but potentially other domains that require sequence forecasting.