Monocular 3D Human Pose Estimation by Generation and Ordinal Ranking

Published 2 Apr 2019 in cs.CV and cs.LG | (1904.01324v2)

Abstract: Monocular 3D human-pose estimation from static images is a challenging problem, due to the curse of dimensionality and the ill-posed nature of lifting 2D to 3D. In this paper, we propose a Deep Conditional Variational Autoencoder (CVAE) based model that synthesizes diverse, anatomically plausible 3D-pose samples conditioned on the estimated 2D-pose. We show that the CVAE-based 3D-pose sample set is consistent with the 2D-pose and helps tackle the inherent ambiguity in 2D-to-3D lifting. We propose two strategies for obtaining the final 3D pose: (a) depth-ordering/ordinal relations to score and weight-average the candidate 3D-poses, referred to as OrdinalScore, and (b) supervision from an Oracle. We report close to state-of-the-art results on two benchmark datasets using OrdinalScore, and state-of-the-art results using the Oracle. We also show that our pipeline yields competitive results without paired image-to-3D annotations. The training and evaluation code is available at https://github.com/ssfootball04/generative_pose.

Citations (145)

Summary

Overview of Monocular 3D Human Pose Estimation by Generation and Ordinal Ranking

The paper presents an approach for estimating 3D human poses from monocular images using a deep conditional variational autoencoder (CVAE). Lifting a 2D pose to 3D is inherently ambiguous, because many 3D poses are consistent with the same 2D pose. Rather than regressing a single answer, the method uses the CVAE to generate a diverse set of anatomically plausible 3D pose candidates conditioned on the estimated 2D pose. It then uses ordinal depth relations between joints to score the candidates and combines them by a score-weighted average into the final 3D pose.
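The sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the decoder is a small two-layer network with random weights standing in for a trained model, and the names `init_decoder`, `sample_3d_candidates`, `NUM_JOINTS`, and `LATENT_DIM`, along with their values, are assumptions made for the example.

```python
import numpy as np

NUM_JOINTS = 16  # assumed skeleton size, not taken from the paper
LATENT_DIM = 8   # assumed latent dimensionality


def init_decoder(rng, hidden=64):
    """Random weights standing in for a trained CVAE decoder (hypothetical)."""
    in_dim = LATENT_DIM + NUM_JOINTS * 2
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, NUM_JOINTS * 3)),
        "b2": np.zeros(NUM_JOINTS * 3),
    }


def sample_3d_candidates(decoder, pose_2d, num_samples, rng):
    """Draw z ~ N(0, I) and decode each z together with the same 2D pose."""
    cond = np.tile(pose_2d.reshape(1, -1), (num_samples, 1))  # (K, 2J)
    z = rng.standard_normal((num_samples, LATENT_DIM))        # (K, D)
    x = np.concatenate([z, cond], axis=1)                     # (K, D + 2J)
    h = np.maximum(0.0, x @ decoder["W1"] + decoder["b1"])    # ReLU layer
    out = h @ decoder["W2"] + decoder["b2"]                   # (K, 3J)
    return out.reshape(num_samples, NUM_JOINTS, 3)


rng = np.random.default_rng(0)
decoder = init_decoder(rng)
pose_2d = rng.uniform(-1.0, 1.0, (NUM_JOINTS, 2))  # toy 2D pose
candidates = sample_3d_candidates(decoder, pose_2d, num_samples=10, rng=rng)
```

Each latent draw yields a different 3D candidate for the same 2D input, which is what lets a generative model represent several plausible liftings instead of one.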

The authors implement two strategies for obtaining the final 3D pose. The first, OrdinalScore, uses predicted depth-ordering relations to score and weight-average the candidates. The second uses supervision from an Oracle and serves as an upper bound on performance. On two established benchmarks, Human3.6M and HumanEva-I, the method reports results close to the state of the art with OrdinalScore and state-of-the-art results with the Oracle. The pipeline also remains competitive when trained without paired image-to-3D annotations.
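A rough sketch of an Oracle-style upper bound is shown below. It assumes the Oracle simply picks the candidate with the lowest mean per-joint Euclidean error against the ground-truth pose; the summary does not spell out the exact rule, so `oracle_select` and `mean_joint_error` are hypothetical helpers and the data is synthetic.

```python
import numpy as np


def mean_joint_error(pred, target):
    """Mean Euclidean distance over joints between two (J, 3) poses."""
    return float(np.linalg.norm(pred - target, axis=-1).mean())


def oracle_select(candidates, ground_truth):
    """Pick the candidate closest to the ground truth (assumed Oracle rule)."""
    errors = np.linalg.norm(candidates - ground_truth[None], axis=-1).mean(axis=1)
    best = int(np.argmin(errors))
    return best, candidates[best], float(errors[best])


# Synthetic candidates: the ground truth plus noise of different magnitudes.
rng = np.random.default_rng(0)
gt = rng.normal(size=(16, 3))
noise_scales = np.array([0.5, 0.05, 0.3, 0.2])
candidates = gt[None] + noise_scales[:, None, None] * rng.normal(size=(4, 16, 3))

best_idx, best_pose, best_err = oracle_select(candidates, gt)
```

Because this selection needs the ground truth, it is not usable at test time. It only indicates how good the sample set is, independently of how well the scoring step exploits it.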

Key Contributions

Generative 3D Pose Modeling: The paper uses a CVAE to generate multiple anatomically plausible 3D-pose candidates from a given 2D pose. Generating a set of candidates, rather than a single estimate, addresses the multi-modality and ambiguity of 2D-to-3D lifting.
Ordinal Scoring Mechanism: Joint-ordinal (depth-ordering) relations are predicted from the input image and the estimated 2D pose, and are then used to score the generated 3D pose candidates and aggregate them by a weighted average. This resolves the problem of choosing among multiple generated candidates.
Training without Paired 3D Data: The components of the pipeline can be trained independently, so the 2D-to-3D model can learn from a separate MoCap dataset without paired image-to-3D annotations. This reduces the dependency on expensive, labor-intensive 3D annotation of images.
High-Performance Upper Bound via Oracle: The Oracle serves as an upper bound: it indicates how accurate the final pose could be if the best use were made of the generated candidates, and so separates the quality of the sample set from the quality of the scoring step.
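The scoring and aggregation idea in the contributions above can be illustrated with a short sketch. It is written under stated assumptions and is not the paper's exact OrdinalScore: relations are encoded as a matrix with entries in {-1, 0, +1}, the score is the fraction of joint pairs whose depth order agrees with the predicted relations, and the weights come from a softmax with an arbitrary `temperature`. All function names are hypothetical.

```python
import numpy as np


def ordinal_matrix(pose_3d, tol=0.0):
    """Entry (i, j) is +1 if joint i is deeper than joint j by more than tol,
    -1 if it is closer by more than tol, and 0 otherwise."""
    z = pose_3d[:, 2]
    diff = z[:, None] - z[None, :]
    return np.sign(diff) * (np.abs(diff) > tol)


def ordinal_scores(candidates, predicted_relations):
    """Fraction of off-diagonal joint pairs whose depth order matches."""
    num_joints = candidates.shape[1]
    off_diag = ~np.eye(num_joints, dtype=bool)
    scores = [
        (ordinal_matrix(pose) == predicted_relations)[off_diag].mean()
        for pose in candidates
    ]
    return np.array(scores)


def weighted_average_pose(candidates, scores, temperature=0.1):
    """Softmax-weight the candidates by score and average them."""
    logits = (scores - scores.max()) / temperature
    weights = np.exp(logits)
    weights /= weights.sum()
    return np.tensordot(weights, candidates, axes=1), weights


rng = np.random.default_rng(0)
true_pose = rng.normal(size=(16, 3))
# Stand-in for relations predicted from the image (here taken from the truth).
relations = ordinal_matrix(true_pose)
# Depth-mirrored pose: identical x, y coordinates but reversed depth order.
flipped = true_pose * np.array([1.0, 1.0, -1.0])
candidates = np.stack([true_pose, flipped])

scores = ordinal_scores(candidates, relations)
final_pose, weights = weighted_average_pose(candidates, scores)
```

In this toy case the two candidates share the same x and y coordinates, so 2D evidence alone cannot separate them, while the ordinal relations place nearly all of the weight on the candidate with the correct depth order.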

Experimental Results

With OrdinalScore, the model achieves near state-of-the-art results on the Human3.6M dataset, and with Oracle supervision it outperforms existing methods. The authors also evaluate an "Unpaired" setting, in which the components are trained on separate datasets without paired image-to-3D annotations, and report that the pipeline remains competitive.

Implications and Future Directions

The proposed method offers a modular route to 3D pose estimation that does not require paired image-to-3D datasets, since its components can be trained separately. The joint-ordinal relation matrices give a concrete way to check generated 3D poses against depth cues from the image, which helps resolve the ambiguity left by the 2D pose alone.

Looking forward, work in this area could simplify the model architecture to allow real-time use on limited hardware. Enforcing temporal coherence across video frames could also improve stability and accuracy for dynamic human pose estimation, which matters for applications such as telepresence, gaming, and surveillance. More broadly, combining CVAE-based sampling with other deep generative models is a possible direction for pose estimation with less supervision.
