Insights on Learning 3D Human Pose from 2D Projections
The paper "Can 3D Pose be Learned from 2D Projections Alone?" presents a weakly supervised approach to 3D pose estimation that requires no direct correspondence between 3D and 2D data points. Estimating 3D human pose from a single 2D image is a long-standing challenge in computer vision: depth ambiguity makes the task inherently ill-posed, since many distinct 3D skeletons project to the same 2D pose. To address this, the authors leverage adversarial networks to learn 3D poses from 2D projections alone.
In contrast to fully supervised approaches that require extensive 3D annotations, this work adopts a generative adversarial network (GAN) framework. A generator network hypothesizes 3D skeletons by predicting a depth for each given 2D pose landmark. The generator's output is then projected back into 2D from random orientations, and a discriminator network evaluates the result. The discriminator learns to distinguish poses projected from generated 3D skeletons from poses drawn from a real 2D pose distribution, so no explicit 2D-3D correspondence is needed during training.
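As a rough illustration of the lifting step (not the authors' implementation: the real generator is a deep network, and the stub below simply emits random depths), the 2D-to-3D hypothesis can be sketched as:

```python
import random

# Hypothetical stand-in for the paper's generator: one depth value per
# 2D landmark. The real model is a learned neural network; this stub
# only shows the shape of the data flow.
def generator(pose2d):
    return [random.uniform(-1.0, 1.0) for _ in pose2d]

def lift(pose2d, depths):
    """Combine 2D landmarks (x, y) with hypothesized depths z into a 3D skeleton."""
    return [(x, y, z) for (x, y), z in zip(pose2d, depths)]

pose2d = [(0.0, 0.0), (0.1, 0.5), (-0.1, 0.5)]  # toy 2D landmarks
skeleton3d = lift(pose2d, generator(pose2d))
```

The point is that the generator never sees 3D ground truth; it only proposes a depth per landmark, and the adversarial signal on the re-projected result drives learning.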
A key contribution of this work is the random projection layer, which renders 2D projections of the estimated 3D skeletons from random orientations. The technique exploits the insight that a plausible 3D skeleton should look realistic when projected from any viewpoint. The discriminator network learns to judge the authenticity of these projections, indirectly imposing a structural prior on the predicted 3D skeletons.
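A minimal sketch of such a projection step, assuming a random azimuth rotation about the vertical axis followed by orthographic projection (the paper's layer is differentiable and embedded in the network, so the angle distribution and projection model here are simplifying assumptions):

```python
import math
import random

def random_projection(skeleton3d, rng=random.random):
    """Rotate a 3D skeleton by a random azimuth about the vertical (y) axis,
    then project to 2D orthographically by dropping the depth coordinate."""
    theta = rng() * 2.0 * math.pi
    c, s = math.cos(theta), math.sin(theta)
    projected = []
    for x, y, z in skeleton3d:
        xr = c * x + s * z          # rotate (x, z) in the horizontal plane
        projected.append((xr, y))   # orthographic projection: keep (x', y)
    return projected

skeleton = [(0.0, 1.7, 0.0), (0.2, 1.0, 0.1), (-0.2, 1.0, -0.1)]
view = random_projection(skeleton)
```

Because the rotation is about the vertical axis, each joint's height is preserved in the projection; only the horizontal coordinate changes with the random viewpoint.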
Evaluation on the Human3.6M dataset shows that the proposed method matches or exceeds other weakly supervised methods, and even some fully supervised methods that rely on paired 2D-3D data. Results are reported under two protocols, with notable improvements in mean per joint position error (MPJPE) over existing benchmarks. For example, with ground-truth 2D inputs the method achieves an MPJPE of 34.2mm under Protocol 1. When 2D inputs come from a pose detector (stacked hourglass), the results remain competitive, and the method stays robust on real-world images from datasets such as MPII and LSP.
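MPJPE itself is simple to compute: the mean Euclidean distance between predicted and ground-truth joints, taken after whatever alignment the evaluation protocol prescribes. A small sketch:

```python
import math

def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance between
    predicted and ground-truth 3D joint positions."""
    assert len(pred) == len(gt)
    total = 0.0
    for (px, py, pz), (gx, gy, gz) in zip(pred, gt):
        total += math.sqrt((px - gx) ** 2 + (py - gy) ** 2 + (pz - gz) ** 2)
    return total / len(pred)

# Toy example: two joints, off by 0.03 and 0.04 units respectively.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt   = [(0.0, 0.0, 0.03), (1.0, 0.04, 0.0)]
error = mpjpe(pred, gt)  # ≈ 0.035
```

Reported numbers like the 34.2mm above are this quantity averaged over all test frames, with units in millimeters.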
Theoretical implications of this research underscore the potential for unsupervised or weakly supervised methods to bypass the need for extensive 3D datasets. Practically, this approach opens up the possibility of deploying 3D pose estimation systems in settings where annotated 3D data is scarce, expensive, or infeasible to collect. Future research directions could explore semi-supervised learning frameworks or integrate temporal information to enhance the stability and accuracy of predictions.
Overall, this work contributes significantly to the understanding and methodology of 3D pose estimation from 2D imagery, offering a path toward more accessible and scalable solutions in computer vision and related applications.