Unsupervised 3D Pose Estimation with Geometric Self-Supervision (1904.04812v1)

Published 9 Apr 2019 in cs.CV

Abstract: We present an unsupervised learning approach to recover 3D human pose from 2D skeletal joints extracted from a single image. Our method does not require any multi-view image data, 3D skeletons, correspondences between 2D-3D points, or use previously learned 3D priors during training. A lifting network accepts 2D landmarks as inputs and generates a corresponding 3D skeleton estimate. During training, the recovered 3D skeleton is reprojected on random camera viewpoints to generate new "synthetic" 2D poses. By lifting the synthetic 2D poses back to 3D and re-projecting them in the original camera view, we can define self-consistency loss both in 3D and in 2D. The training can thus be self supervised by exploiting the geometric self-consistency of the lift-reproject-lift process. We show that self-consistency alone is not sufficient to generate realistic skeletons, however adding a 2D pose discriminator enables the lifter to output valid 3D poses. Additionally, to learn from 2D poses "in the wild", we train an unsupervised 2D domain adapter network to allow for an expansion of 2D data. This improves results and demonstrates the usefulness of 2D pose data for unsupervised 3D lifting. Results on Human3.6M dataset for 3D human pose estimation demonstrate that our approach improves upon the previous unsupervised methods by 30% and outperforms many weakly supervised approaches that explicitly use 3D data.

Authors (7)

Ching-Hang Chen
Ambrish Tyagi
Amit Agrawal
Dylan Drover
Rohith MV
Stefan Stojanov
James M. Rehg

Citations (193)

Summary

The paper presents an unsupervised approach to 3D human pose estimation that recovers 3D skeletons from 2D pose landmarks without relying on multi-view images, 3D skeletons, previously learned 3D priors, or 2D-3D point correspondences. It sidesteps the need for 3D pose data, which is difficult and costly to obtain, by using abundant 2D data and exploiting geometric self-supervision.

Methodology Overview

The approach centers on a lifting network that converts 2D skeletal landmarks into a 3D pose estimate. Training is driven by geometric self-supervision, which enforces consistency in both the 2D and 3D domains through a lift-reproject-lift cycle. During training, the 3D skeleton estimated from a 2D input is reprojected under random camera viewpoints to create synthetic 2D poses. These synthetic poses are lifted back to 3D and compared with the original 3D estimate; re-projecting them into the original camera view also allows a comparison with the input 2D pose, so self-consistency losses can be defined in both 3D and 2D. Self-consistency alone, however, is not sufficient to generate realistic skeletons. Adding a 2D pose discriminator addresses this limitation by enabling the lifter to output valid 3D poses.
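
The lift-reproject-lift cycle can be illustrated with a minimal NumPy sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the function names, the fixed camera distance `cam_dist`, the restriction to rotations about the vertical axis, and the placeholder `toy_lifter`, which stands in for the trained lifting network.

```python
import numpy as np

def random_rotation_y(rng):
    """Rotation matrix for a random azimuth about the vertical (y) axis."""
    theta = rng.uniform(-np.pi, np.pi)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(points_3d, cam_dist=10.0):
    """Perspective projection of (J, 3) joints; camera sits cam_dist away along z."""
    z = points_3d[:, 2:3] + cam_dist
    return points_3d[:, :2] / z

def toy_lifter(pose_2d, cam_dist=10.0):
    """Hypothetical stand-in for the lifting network: predicts zero depth offset."""
    depth = np.zeros((pose_2d.shape[0], 1))
    xy = pose_2d * (depth + cam_dist)  # back-project x, y at the assumed depth
    return np.hstack([xy, depth])

def cycle_losses(pose_2d, lifter, rng, cam_dist=10.0):
    """One lift-reproject-lift pass; returns (3D loss, 2D loss)."""
    x3d = lifter(pose_2d)              # lift the input 2D pose
    R = random_rotation_y(rng)
    y3d = x3d @ R.T                    # view the skeleton from a random camera
    y2d = project(y3d, cam_dist)       # synthetic 2D pose
    y3d_hat = lifter(y2d)              # lift the synthetic pose
    x3d_hat = y3d_hat @ R              # rotate back to the original view
    x2d_hat = project(x3d_hat, cam_dist)
    loss_3d = np.mean(np.sum((x3d_hat - x3d) ** 2, axis=1))
    loss_2d = np.mean(np.sum((x2d_hat - pose_2d) ** 2, axis=1))
    return loss_3d, loss_2d
```

With a placeholder lifter the two losses are generally nonzero; in training they would be minimized with respect to the lifter's parameters. As the summary notes, driving them to zero does not by itself guarantee a realistic skeleton, which is why a discriminator is added.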

A crucial aspect of the paper is the introduction of a 2D domain adapter that aligns poses from different source domains, allowing for more diverse and extensive training data. The domain adaptation process uses adversarial techniques to match the semantic distribution of 2D joints between source and target domains.
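
As a rough illustration of adversarial distribution matching, the sketch below writes out a standard non-saturating GAN objective on discriminator outputs. This is an assumed, generic formulation and not necessarily the loss used in the paper; the function names `bce`, `discriminator_loss`, and `adapter_loss` are hypothetical, and the discriminator outputs are taken to be probabilities that a 2D pose comes from the target domain.

```python
import numpy as np

def bce(prob, label):
    """Mean binary cross-entropy between predicted probabilities and labels."""
    prob = np.clip(prob, 1e-7, 1.0 - 1e-7)  # avoid log(0)
    return float(-np.mean(label * np.log(prob) + (1 - label) * np.log(1 - prob)))

def discriminator_loss(d_target, d_adapted):
    """Discriminator should score target-domain poses as 1 and adapted poses as 0."""
    return bce(d_target, np.ones_like(d_target)) + bce(d_adapted, np.zeros_like(d_adapted))

def adapter_loss(d_adapted):
    """Adapter is rewarded when its adapted poses are scored as target-domain (1)."""
    return bce(d_adapted, np.ones_like(d_adapted))
```

When the discriminator cannot tell the domains apart and outputs 0.5 everywhere, `adapter_loss` equals log 2 and `discriminator_loss` equals 2 log 2, the usual equilibrium values for this objective.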

Results and Findings

The approach improves on existing unsupervised methods and on several weakly supervised ones. On the Human3.6M dataset, it improves upon previous unsupervised techniques by 30%. Adding temporal consistency during training, when available, further improves the accuracy of the 3D pose estimates.

On the MPI-INF-3DHP dataset, results comparable to fully supervised and weakly supervised methods were obtained, showcasing the approach's robustness across different datasets.

The paper also analyzes whether geometric self-supervision is sufficient on its own. Self-consistency alone yielded geometrically consistent but unrealistic skeletons, a shortcoming that was addressed by adding the 2D pose discriminator.

Implications and Future Work

This research marks a significant stride in unsupervised 3D pose estimation. It offers a promising direction for leveraging sheer volumes of 2D data to train networks without needing expensive 3D human pose annotations. The findings underline the critical importance of combining geometric intuitions with adversarial feedback mechanisms to ensure both plausibility and realism in generated outputs.

Looking ahead, the presented method opens several avenues for future exploration. Integrating image data directly into the network to perform end-to-end learning and employing advanced techniques for handling incomplete 2D data during training are proposed as potential enhancements. Furthermore, the adaptability of the domain adaptation module suggests that cross-dataset learning could potentially enhance the model's robustness.

In sum, the paper's methods and findings are relevant to researchers working on unsupervised learning in computer vision, offering practical insight into using abundant 2D data for a complex task and pointing to new approaches to human pose estimation.
