Exploiting temporal context for 3D human pose estimation in the wild (1905.04266v1)

Published 10 May 2019 in cs.CV

Abstract: We present a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos. Unlike previous algorithms which operate on single frames, we show that reconstructing a person over an entire sequence gives extra constraints that can resolve ambiguities. This is because videos often give multiple views of a person, yet the overall body shape does not change and 3D positions vary slowly. Our method improves not only on standard mocap-based datasets like Human 3.6M -- where we show quantitative improvements -- but also on challenging in-the-wild datasets such as Kinetics. Building upon our algorithm, we present a new dataset of more than 3 million frames of YouTube videos from Kinetics with automatically generated 3D poses and meshes. We show that retraining a single-frame 3D pose estimator on this data improves accuracy on both real-world and mocap data by evaluating on the 3DPW and HumanEVA datasets.

Citations (226)

Summary

  • The paper leverages temporal context and bundle adjustment to significantly enhance 3D human pose estimation from monocular video sequences.
  • It refines per-frame SMPL model predictions by enforcing temporal smoothness, reducing MPJPE by 9.4% on the Human 3.6M dataset.
  • The approach scales to real-world data with a large automated dataset, improving pose estimation accuracy for diverse and challenging scenes.

Exploiting Temporal Context for 3D Human Pose Estimation in the Wild

The paper presents a novel approach to 3D human pose estimation that leverages temporal context in monocular video sequences. The authors propose a bundle-adjustment-based method that exploits temporal consistency across video frames to resolve the ambiguities of monocular 3D pose estimation, a traditionally under-constrained problem when only a single image is used. This addresses a key limitation of prior state-of-the-art methods, which typically process frames independently and discard the rich temporal information available in video.

Methodology

The proposed algorithm improves 3D pose estimation by jointly optimizing over all frames of a video using a bundle adjustment technique adapted from multi-view geometry. Temporal consistency is achieved by holding body shape fixed across the sequence and penalizing rapid changes in 3D joint positions over time. Unlike existing methods that rely solely on single-frame predictions, this approach aggregates evidence from the entire video sequence, improving robustness to occlusions, unusual poses, and challenging lighting conditions.

A distinctive feature of the methodology is the use of a parametric 3D body model (SMPL) that represents human shape and pose. The model is initialized from state-of-the-art per-frame estimates and refined by minimizing a cost function comprising a reprojection error term, a temporal smoothness constraint, and priors on plausible human poses.
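
To make the structure of this objective concrete, here is a minimal, hypothetical sketch. It deliberately simplifies: it optimizes free 3D joints per frame under a fixed pinhole camera instead of SMPL pose and shape parameters, and the term weights, dimensions, and the depth prior (standing in for the paper's pose priors) are illustrative only.

```python
# Toy bundle-adjustment-style objective over a short clip (NOT the
# authors' implementation): confidence-weighted 2D reprojection error
# plus a temporal smoothness term, optimized jointly over all frames.
import numpy as np
from scipy.optimize import minimize

T, J, f = 5, 8, 1000.0                         # frames, joints, focal length (px)
rng = np.random.default_rng(0)
joints2d = rng.uniform(-200, 200, (T, J, 2))   # detected 2D keypoints (placeholder)
w2d = rng.uniform(0.5, 1.0, (T, J))            # per-keypoint detector confidences

def project(x3d):
    """Pinhole projection of (T, J, 3) camera-frame joints to pixels."""
    return f * x3d[..., :2] / x3d[..., 2:3]

def cost(flat, w_smooth=1.0, w_prior=1e-3):
    x3d = flat.reshape(T, J, 3) + np.array([0.0, 0.0, 5.0])  # init ~5 m from camera
    reproj = np.sum(w2d[..., None] * (project(x3d) - joints2d) ** 2)
    smooth = np.sum((x3d[1:] - x3d[:-1]) ** 2)   # 3D joints should vary slowly
    prior = np.sum((x3d[..., 2] - 5.0) ** 2)     # crude stand-in for pose priors
    return reproj + w_smooth * smooth + w_prior * prior

res = minimize(cost, np.zeros(T * J * 3), method="L-BFGS-B")
print("optimized sequence cost:", res.fun)
```

In the paper's full formulation, the optimization variables are instead the SMPL shape (shared across the sequence, since the overall body shape does not change) and per-frame pose parameters, rather than raw joint positions.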

Experimental Evaluations

The proposed method was evaluated on standard motion-capture (mocap) datasets such as Human 3.6M as well as in-the-wild footage from Kinetics, showing significant improvement over previous state-of-the-art methods. On Human 3.6M, it reduced mean per joint position error (MPJPE) by 9.4% relative to the initial estimates of the single-frame HMR model, underscoring the benefit of enforcing temporal consistency.
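
For reference, MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joints, typically computed per frame after aligning a designated root joint. A minimal implementation (the (T, J, 3) array layout is an assumption) looks like this:

```python
# Mean per joint position error (MPJPE) with per-frame root alignment.
import numpy as np

def mpjpe(pred, gt, root=0):
    """pred, gt: (T, J, 3) arrays of 3D joints, e.g. in millimetres."""
    pred = pred - pred[:, root:root + 1]   # subtract root joint per frame
    gt = gt - gt[:, root:root + 1]
    return np.linalg.norm(pred - gt, axis=-1).mean()
```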

Additionally, the approach was used to generate a large-scale dataset of more than 3 million frames from YouTube videos in the Kinetics dataset, annotated with automatically generated 3D poses and meshes. Retraining a single-frame 3D pose estimator on this data improved accuracy on both real-world and mocap benchmarks, namely 3DPW and HumanEVA. This is among the first efforts to exploit vast amounts of unlabelled real-world video to improve 3D pose estimation models; a sketch of the retraining step follows.
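
The retraining step amounts to supervised fine-tuning on pseudo-labels, i.e. the bundle-adjusted sequence outputs. The sketch below is a hypothetical stand-in: the two-layer regressor over precomputed image features, the tensor shapes, and the plain MSE loss are assumptions, not the paper's HMR architecture or training recipe.

```python
# Fine-tune a single-frame 3D pose regressor on bundle-adjusted
# pseudo-labels (illustrative model and data, not the paper's setup).
import torch
from torch import nn

features = torch.randn(256, 2048)       # per-frame image features (placeholder)
pseudo3d = torch.randn(256, 17, 3)      # bundle-adjusted 3D joints (placeholder)

model = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, 17 * 3))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(100):
    idx = torch.randint(0, features.shape[0], (32,))    # random mini-batch
    pred = model(features[idx]).view(-1, 17, 3)
    loss = nn.functional.mse_loss(pred, pseudo3d[idx])  # supervise with pseudo-labels
    opt.zero_grad()
    loss.backward()
    opt.step()
```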

Implications and Future Directions

The implications of this work are manifold. Practically, the improved accuracy in 3D human pose estimation can significantly impact applications in areas like augmented reality, animation, and human-computer interaction, where understanding human body configurations is crucial. Theoretically, this work sets the stage for further exploration of temporal constraints in video processing and the development of methods that can generalize across different types of motion and contextual shifts in the environment.

Future research could incorporate richer physical constraints, such as human-object interactions and physics-based plausibility (e.g., ground contact), to constrain and predict human poses more accurately. Improving 2D and 3D pose detection under varying lighting conditions and in complex scenes also remains a promising avenue for further exploration.

This paper represents an important step in bridging the gap between laboratory-based datasets and real-world applications by demonstrating how temporal information in videos can dramatically improve the fidelity of 3D pose estimation systems in uncontrolled settings.