- The paper introduces a framework that combines sparse geometric priors with deep learning to resolve 3D human pose ambiguities from single-camera inputs.
- It employs a dual formulation in which 2D joint positions are either given or treated as latent variables, estimated via a novel EM algorithm that marginalizes over CNN-predicted uncertainty maps.
- Evaluations on the Human3.6M and PennAction datasets show lower mean per-joint errors than state-of-the-art baselines.
Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video
The paper "Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video" presents a comprehensive framework for inferring 3D human poses from sequences captured by a single RGB camera. The problem addressed is significant due to the inherent ambiguities in recovering 3D articulated poses from monocular inputs, compounded by variations in human appearance, varying viewpoints, and occlusions. The research culminates in an approach that integrates sparsity-driven 3D geometric priors with novel deep learning techniques capable of estimating joint location uncertainties from 2D image sequences.
Key Components and Approach
The authors consider two scenarios: one in which the 2D joint locations are known, and one in which they are treated as latent variables. In the first scenario, the 3D pose is represented as a sparse linear combination of basis poses from a learned dictionary, and the combination coefficients are estimated, together with the camera viewpoint, so that the reconstructed pose reprojects onto the given 2D joint estimates (see the sketch below). Temporal smoothness penalties stabilize the 3D pose and camera parameters over time.
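As a concrete illustration of this sparse-dictionary model, the sketch below builds a 3D pose from dictionary coefficients, reprojects it with a weak-perspective camera, and scores the fit against given 2D joints with an L1 sparsity penalty. The variable names (`bases`, `coeffs`, `W2d`) and the exact objective weighting are illustrative assumptions, not the authors' implementation, and the temporal smoothness terms are omitted.

```python
import numpy as np

def reproject(coeffs, R, T, bases):
    """Build a 3D pose from the dictionary and project it to the image.

    coeffs : (K,)       sparse combination weights over K basis poses
    R      : (2, 3)     first two rows of a camera rotation (weak perspective)
    T      : (2, 1)     image-plane translation
    bases  : (K, 3, J)  learned basis poses with J joints each
    """
    S = np.tensordot(coeffs, bases, axes=1)   # (3, J) reconstructed 3D pose
    return R @ S + T                          # (2, J) projected 2D joints

def pose_objective(coeffs, R, T, bases, W2d, lam=0.1):
    """Reprojection error against observed 2D joints W2d (2, J)
    plus an L1 penalty that keeps the dictionary coefficients sparse."""
    residual = W2d - reproject(coeffs, R, T, bases)
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(coeffs))
```

Minimizing such an objective (for example with an alternating or proximal-gradient solver) activates only a few basis poses per frame, which is what keeps the reconstruction well constrained despite the depth ambiguity of a single view.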
When the 2D joint locations are unknown, the framework treats them as latent variables. A convolutional neural network (CNN) predicts per-joint uncertainty maps (heat maps) over the image, and an Expectation-Maximization (EM) algorithm estimates the 3D poses across the sequence, marginalizing over the uncertainty in the 2D joint locations rather than committing to hard detections.
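A minimal sketch of such an EM loop is given below, assuming per-joint heat maps of shape (J, H, W) and the weak-perspective projection conventions of the previous snippet. The E-step computes expected 2D joint locations under a posterior that combines each heat map with a Gaussian centred at the current reprojection; the M-step here refits only the dictionary coefficients by ridge-regularised least squares, whereas the paper also updates the camera and uses a sparsity penalty. All function and variable names are illustrative.

```python
import numpy as np

def expected_joints(heatmaps, proj, sigma=5.0):
    """E-step sketch: expected 2D joint positions under a per-joint posterior
    proportional to the CNN heat map times a Gaussian centred on the current
    reprojection proj (2, J)."""
    J = proj.shape[1]
    H, W = heatmaps.shape[1:]
    ys, xs = np.mgrid[0:H, 0:W]
    out = np.zeros((2, J))
    for j in range(J):
        d2 = (xs - proj[0, j]) ** 2 + (ys - proj[1, j]) ** 2
        post = heatmaps[j] * np.exp(-0.5 * d2 / sigma ** 2)  # unnormalised posterior
        post /= post.sum() + 1e-12
        out[:, j] = [(post * xs).sum(), (post * ys).sum()]   # expected (x, y)
    return out

def refit_coeffs(W2d, R, T, bases, lam=1e-2):
    """M-step sketch: ridge-regularised least squares for the dictionary
    coefficients with the camera held fixed (a simplification of the paper's
    M-step, which also updates R, T and enforces sparsity)."""
    K = bases.shape[0]
    A = np.stack([(R @ bases[k]).ravel() for k in range(K)], axis=1)  # (2J, K)
    b = (W2d - T).ravel()
    return np.linalg.solve(A.T @ A + lam * np.eye(K), A.T @ b)

def em_pose(heatmaps, bases, R, T, n_iters=10):
    """Alternate E- and M-steps to estimate the sparse pose coefficients."""
    coeffs = np.zeros(bases.shape[0])
    for _ in range(n_iters):
        proj = R @ np.tensordot(coeffs, bases, axes=1) + T  # current reprojection
        W2d = expected_joints(heatmaps, proj)               # E-step
        coeffs = refit_coeffs(W2d, R, T, bases)             # M-step
    return coeffs
```

Because the E-step averages over each heat map rather than picking its peak, a spurious detection with a diffuse or multi-modal response pulls the fit less strongly than a confident, sharply peaked one.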
Evaluation and Results
The framework is evaluated on the Human3.6M and PennAction datasets, where it outperforms existing state-of-the-art methods, achieving lower mean per-joint errors than notable baselines. It also outperforms a publicly available 2D pose estimation method on the challenging PennAction dataset, underscoring the robustness and applicability of the framework to unconstrained, real-world video.
Implications and Future Directions
The implications of this work are both practical and theoretical. Practically, accurate 3D human pose estimation from monocular video opens new pathways in fields such as virtual reality, human-computer interaction, and video surveillance. Theoretically, coupling 2D detections with 3D geometric models while explicitly accounting for joint-location uncertainty is a promising direction for bridging the gap between 2D image cues and 3D pose inference.
Looking forward, potential developments include scaling the approach to real-time applications and extending it to track multiple subjects simultaneously. Exploring representations beyond sparse dictionaries and incorporating more expressive CNN architectures could further improve the framework's robustness and adaptability. The paper thus contributes to computer vision and deep learning while laying a foundation for subsequent research on 3D human pose estimation from monocular data.