Towards 3D Human Pose Estimation in the Wild: a Weakly-supervised Approach (1704.02447v2)

Published 8 Apr 2017 in cs.CV

Abstract: In this paper, we study the task of 3D human pose estimation in the wild. This task is challenging due to lack of training data, as existing datasets are either in the wild images with 2D pose or in the lab images with 3D pose. We propose a weakly-supervised transfer learning method that uses mixed 2D and 3D labels in a unified deep neural network that presents a two-stage cascaded structure. Our network augments a state-of-the-art 2D pose estimation sub-network with a 3D depth regression sub-network. Unlike previous two stage approaches that train the two sub-networks sequentially and separately, our training is end-to-end and fully exploits the correlation between the 2D pose and depth estimation sub-tasks. The deep features are better learnt through shared representations. In doing so, the 3D pose labels in controlled lab environments are transferred to in the wild images. In addition, we introduce a 3D geometric constraint to regularize the 3D pose prediction, which is effective in the absence of ground truth depth labels. Our method achieves competitive results on both 2D and 3D benchmarks.

Authors (5)

Xingyi Zhou (26 papers)
Qixing Huang (78 papers)
Xiao Sun (99 papers)
Xiangyang Xue (169 papers)
Yichen Wei (47 papers)

Citations (559)

Summary

Towards 3D Human Pose Estimation in the Wild: A Weakly-Supervised Approach

The paper presents a method for 3D human pose estimation in unconstrained environments, commonly referred to as "in the wild." The task is challenging because no large dataset pairs in-the-wild images with 3D pose annotations: existing datasets offer either 2D annotations on in-the-wild images or 3D annotations captured in controlled lab settings.

Methodology

The authors introduce a weakly-supervised transfer learning method built on a unified deep neural network with a two-stage cascaded structure. The network augments a state-of-the-art 2D pose estimation sub-network with a 3D depth regression sub-network. Unlike earlier two-stage approaches, which train the two sub-networks sequentially and separately, the whole network is trained end-to-end on mixed 2D and 3D labels.

Training end-to-end exploits the correlation between the 2D pose estimation and depth regression sub-tasks, so the deep features are learned better through shared representations. As a result, 3D pose labels collected in controlled lab environments are transferred to in-the-wild images.
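A minimal sketch of how mixed 2D and 3D labels might drive the depth branch during training. It assumes each sample either carries a ground truth depth vector (lab data) or does not (in-the-wild data). The function names, the `geometric_term` callable, and the weights `w_reg` and `w_geo` are illustrative placeholders, not values or identifiers taken from the paper.

```python
from typing import Callable, Optional, Sequence


def squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of squared differences between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def depth_loss(
    pred_depth: Sequence[float],
    gt_depth: Optional[Sequence[float]],
    geometric_term: Callable[[Sequence[float]], float],
    w_reg: float = 1.0,   # placeholder weight for the supervised term
    w_geo: float = 0.01,  # placeholder weight for the weakly-supervised term
) -> float:
    """Select the depth loss according to which labels the sample has.

    Samples with 3D labels use plain regression against the ground truth.
    Samples with only 2D labels fall back to a geometric regularizer
    evaluated on the predicted depths.
    """
    if gt_depth is not None:
        return w_reg * squared_error(pred_depth, gt_depth)
    return w_geo * geometric_term(pred_depth)
```

In this sketch the 2D pose loss would be applied to every sample regardless of its source; only the depth term switches between the supervised and weakly-supervised form.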

Geometric Constraints

A key contribution of the paper is a 3D geometric constraint that regularizes 3D pose predictions when ground truth depth labels are absent. The constraint rests on the observation that ratios between bone lengths stay relatively fixed in a human skeleton, which gives the depth sub-network a training signal on images that carry only 2D annotations.
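One way such a bone-length constraint could be written is as a variance penalty: within each group of bones expected to share a length ratio (for example, the left and right upper arms), divide each predicted bone length by a canonical length and penalize the spread of the resulting ratios. The sketch below is an illustration under that assumption; the grouping, the `canonical` lengths, and the function names are hypothetical and not taken from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple

Joint = Tuple[float, float, float]
Bone = Tuple[int, int]  # pair of joint indices


def bone_length(joints: Sequence[Joint], bone: Bone) -> float:
    """Euclidean distance between the two joints that a bone connects."""
    a, b = joints[bone[0]], joints[bone[1]]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def geometric_loss(
    joints: Sequence[Joint],
    bone_groups: List[List[Bone]],
    canonical: Dict[Bone, float],
) -> float:
    """Sum over groups of the variance of normalized bone lengths.

    The loss is zero when every bone in a group has the same ratio to its
    canonical length, and grows as the ratios within a group diverge.
    """
    total = 0.0
    for group in bone_groups:
        ratios = [bone_length(joints, bone) / canonical[bone] for bone in group]
        mean = sum(ratios) / len(ratios)
        total += sum((r - mean) ** 2 for r in ratios) / len(ratios)
    return total
```

Because the penalty depends only on the predicted 3D joint positions, it can be evaluated on images that have 2D annotations and no depth labels.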

Results and Evaluation

The proposed method achieves competitive results on both 2D and 3D benchmarks, with improvements reported in both supervised and weakly-supervised settings. Training on 2D and 3D datasets together in one unified process improved prediction accuracy, as shown by quantitative evaluations on datasets such as Human3.6M and MPI-INF-3DHP. The model also generalized better to in-the-wild data, which is attributed to the network design and the geometric constraint.

Implications and Future Directions

This research is relevant to human-computer interaction and other applications that depend on accurate human pose estimation. Its success in combining heterogeneous datasets offers a template for weakly-supervised learning in computer vision. Future work could explore additional geometric or domain-alignment constraints to improve transfer learning further and to make 3D human pose estimates more accurate and reliable across diverse environments.

In summary, the paper contributes a robust framework for addressing the challenges in 3D human pose estimation in the wild, showcasing the potential of weakly-supervised learning and end-to-end training architectures in expanding the applicability of pose estimation methodologies beyond controlled environments.
