- The paper introduces a sequence-to-sequence LSTM architecture with residual connections that models pose deviations for robust 3D human pose estimation.
- It demonstrates a 12.2% error reduction over state-of-the-art methods by effectively leveraging temporal coherence and smoothness constraints.
- The approach shows resilience against noisy 2D joint detections, offering promising applications in real-time sports analysis and virtual reality systems.
Leveraging Temporal Information in 3D Human Pose Estimation
The paper advances 3D human pose estimation from sequences of 2D joint locations by exploiting temporal information. Its primary novelty lies in treating pose estimation as a sequence-to-sequence problem: a deep learning architecture consumes a sequence of 2D poses and outputs the corresponding sequence of 3D poses.
Methodology
The authors use a sequence-to-sequence framework built from layer-normalized Long Short-Term Memory (LSTM) units with residual connections, applied in the spirit of neural machine translation models. The residual connections, inspired by ResNet architectures, let the network learn pose deviations rather than estimate the absolute pose of each frame from scratch, simplifying the learning problem. A temporal smoothness constraint imposed during training is another key ingredient, pushing the model toward temporally consistent pose sequences. The encoder LSTM block reads the input sequence of 2D joint locations and summarizes it in a high-dimensional state; the decoder is initialized from this state and emits the sequence of 3D joint locations.
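To make the architecture concrete, here is a minimal PyTorch sketch of the general recipe described above: an encoder LSTM summarizes the 2D sequence, a decoder initialized from the encoder's final state emits 3D poses, the output is formed as a simple per-frame lift plus a learned deviation (the residual idea), and the loss adds a first-difference smoothness penalty. The layer sizes, the choice to feed the decoder the 2D inputs, and the loss weight are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch (not the authors' code) of a seq2seq encoder-decoder over
# 2D pose sequences. Dimensions and hyperparameters are illustrative.
import torch
import torch.nn as nn

class Seq2SeqPose(nn.Module):
    def __init__(self, n_joints=17, hidden=1024):
        super().__init__()
        in_dim = 2 * n_joints            # flattened (x, y) per joint
        out_dim = 3 * n_joints           # flattened (x, y, z) per joint
        self.encoder = nn.LSTM(in_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(in_dim, hidden, batch_first=True)
        self.norm = nn.LayerNorm(hidden)       # stands in for layer-normalized LSTM cells
        self.base = nn.Linear(in_dim, out_dim)   # rough per-frame lift of the 2D input
        self.delta = nn.Linear(hidden, out_dim)  # residual: learned deviation from the base

    def forward(self, pose2d_seq):
        # pose2d_seq: (batch, time, 2 * n_joints)
        _, state = self.encoder(pose2d_seq)           # summarize the whole 2D sequence
        dec_out, _ = self.decoder(pose2d_seq, state)  # decoder initialized from encoder state
        dec_out = self.norm(dec_out)
        # Residual-style output: simple lift plus learned deviation
        return self.base(pose2d_seq) + self.delta(dec_out)

def pose_loss(pred3d, gt3d, smooth_weight=0.1):
    # L2 pose error plus a first-difference penalty that discourages jitter
    # between consecutive frames (the temporal smoothness constraint).
    data_term = ((pred3d - gt3d) ** 2).mean()
    smooth_term = ((pred3d[:, 1:] - pred3d[:, :-1]) ** 2).mean()
    return data_term + smooth_weight * smooth_term
```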
Dataset and Implementation
Experimental validation relies primarily on the Human3.6M dataset, a widely used benchmark for 3D pose estimation. The model is evaluated under two protocols: the standard protocol, which compares predictions to ground truth without post-processing, and a second protocol that first aligns each predicted pose to the ground truth with a similarity transformation. A further emphasis of the work is robustness to noisy 2D joint detections, which the authors probe by adding Gaussian noise to the 2D inputs. The network was fine-tuned to keep errors low across the broad spectrum of human activities recorded in the dataset.
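The evaluation details can be sketched in a few lines as well. The helpers below add Gaussian noise to 2D detections for the robustness experiment and compute a mean per-joint error after a similarity (Procrustes) alignment, the kind of post-processing used in the second protocol. Joint counts, noise level, and function names are assumptions for illustration, not the paper's exact evaluation code.

```python
# Illustrative evaluation helpers (assumptions, not the paper's protocol code):
# (1) perturb 2D detections with additive Gaussian noise for a robustness test;
# (2) align a predicted 3D pose to ground truth with a similarity transform
#     (scale + rotation + translation) before measuring mean per-joint error.
import numpy as np

def add_detection_noise(pose2d, std_px=5.0, rng=None):
    # pose2d: (n_joints, 2) in pixels; returns a noisy copy
    rng = np.random.default_rng() if rng is None else rng
    return pose2d + rng.normal(0.0, std_px, size=pose2d.shape)

def aligned_joint_error(pred, gt):
    # pred, gt: (n_joints, 3); mean per-joint error after Procrustes alignment
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    u, s, vt = np.linalg.svd(p.T @ g)
    r = vt.T @ u.T
    if np.linalg.det(r) < 0:          # avoid an improper rotation (reflection)
        vt[-1] *= -1
        s[-1] *= -1
        r = vt.T @ u.T
    scale = s.sum() / (p ** 2).sum()
    aligned = scale * p @ r.T + mu_g
    return np.linalg.norm(aligned - gt, axis=1).mean()
```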
Results and Analysis
Empirical results show that exploiting temporal information substantially improves pose estimation accuracy: the approach outperforms the previous state of the art on Human3.6M with an overall error reduction of approximately 12.2%. Comparisons across different input sequence lengths show stable performance, indicating the model's flexibility and resilience. Reported errors are notably low on sequences with considerable occlusion or rapid motion, scenarios that are traditionally difficult for frame-by-frame estimation methods.
Practical and Theoretical Implications
The strong performance of this temporal approach is practically significant for applications involving continuous human motion, such as sports analysis, surveillance, and real-time virtual reality systems. The method's robustness to noise also suggests it can cope with the non-ideal conditions common in real-world settings, where noisy input from 2D pose detectors is commonplace.
Theoretically, this paper emphasizes the utility of temporal coherence for refining architectural decisions in deep neural networks targeting sequential predictions, especially where prior assumptions about smoothness in the output domain can be leveraged. Future research may explore extending such models to handle multiple interacting bodies or perform simultaneous action classification.
Conclusion
The research presented in this paper illustrates a productive intersection of sequence modeling and computer vision for enhancing 3D human pose estimation. By exploiting temporal sequences and introducing smoothness constraints, the work not only sets a new benchmark on established datasets but also lays a foundation for future investigations that use temporal context in dynamic 3D estimation tasks. Possible next steps include applying such models in multi-view systems or adapting them to other forms of sequence-based data, opening new possibilities in machine perception.