- The paper proposes using ordinal depth annotations as a weak supervisory signal to train ConvNets for accurate 3D human pose estimation.
- It demonstrates that models trained on 2D pose datasets augmented with relative depth cues achieve performance competitive with models trained on complete 3D annotations.
- Empirical results on benchmarks like Human3.6M highlight the approach's robustness and potential for practical applications in diverse imaging conditions.
Overview of "Ordinal Depth Supervision for 3D Human Pose Estimation"
The paper "Ordinal Depth Supervision for 3D Human Pose Estimation" presents a methodological advancement in the field of 3D human pose estimation by utilizing ordinal depth annotations as a form of weak supervision. The primary motivation behind this approach is to address the scarcity of high-quality 3D annotations available for natural images, a limitation that constrains the training of end-to-end systems for 3D human pose estimation. The authors suggest employing ordinal depth relations, which are simpler for human annotators to provide, as an alternative supervisory signal to detailed 3D ground truths typically obtained in controlled studio environments.
Key Contributions
- Ordinal Depth Supervision: The authors propose using ordinal depth relations among human joints to train Convolutional Networks (ConvNets) for 3D human pose estimation. Unlike traditional annotations that provide precise 3D joint coordinates, ordinal relations indicate only relative depth ("closer" or "farther") between pairs of joints, making them far easier for human annotators to determine, particularly in in-the-wild images (a sketch of how such pairwise relations can be turned into a training loss follows this list).
- Augmentation of Popular Datasets: The paper illustrates the potential of this approach by augmenting well-known 2D human pose datasets, such as Leeds Sports Pose (LSP) and MPII, with ordinal depth annotations. This allows for both quantitative and qualitative evaluations beyond the confines of controlled studio environments typical in existing 3D datasets.
- Competitiveness with Full Supervision: Empirical evaluations demonstrate that ConvNets trained using ordinal depth relations can achieve competitive performance compared to those trained with precise 3D annotations, across various network configurations. This establishes a compelling case for using ordinal depth as a practical alternative to full 3D data.
- Integration with Volumetric Representations: The research integrates ordinal relations with a volumetric representation of 3D space, preserving the spatial consistency of joint predictions while effectively leveraging the weak supervisory signal (see the soft-argmax sketch after this list for one way to read differentiable joint coordinates out of such a volume).
- Enhancement of Reconstruction Approaches: A reconstruction component that combines estimated 2D keypoints with the predicted depth information yields coherent 3D pose estimates, illustrating how ordinal depth supervision can complement and strengthen existing models by resolving depth ambiguities (a sketch of such a component also appears after this list).
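To make the first contribution concrete, below is a minimal sketch of a pairwise ranking loss over predicted joint depths, in the spirit of the paper's ordinal supervision. The function name, the convention that smaller z means closer to the camera, and the exact penalty forms are illustrative assumptions rather than the authors' implementation.

```python
import torch

def ordinal_depth_loss(z, pairs, relations):
    """Pairwise ordinal depth loss (sketch).

    z         : (J,) tensor of predicted joint depths (smaller = closer, by assumption).
    pairs     : (P, 2) long tensor of joint index pairs (i, j).
    relations : (P,) tensor with +1 if joint i is annotated as closer than j,
                -1 if farther, 0 if roughly at the same depth.
    """
    zi = z[pairs[:, 0]]
    zj = z[pairs[:, 1]]
    diff = zi - zj

    closer = relations == 1
    farther = relations == -1
    equal = relations == 0

    loss = torch.zeros_like(diff)
    # Ranking terms: penalize violations of the annotated ordering.
    loss[closer] = torch.log1p(torch.exp(diff[closer]))     # want z_i < z_j
    loss[farther] = torch.log1p(torch.exp(-diff[farther]))  # want z_i > z_j
    # Similar-depth pairs: pull the two predicted depths together.
    loss[equal] = diff[equal] ** 2
    return loss.mean()
```

Because the loss only constrains depth orderings, it can be computed from sparse human-provided comparisons without any metric 3D ground truth.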
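For the volumetric integration, one common way to obtain differentiable per-joint coordinates (including the depth value fed to an ordinal loss) is a soft-argmax over a per-joint 3D heatmap. The sketch below assumes a D×H×W score volume per joint; it illustrates the general idea rather than the authors' exact architecture.

```python
import torch

def soft_argmax_3d(volumes):
    """Differentiable expected coordinates from per-joint 3D heatmaps.

    volumes : (J, D, H, W) tensor of unnormalized scores,
              one volume per joint (D = depth bins).
    Returns : (J, 3) tensor of (x, y, z) coordinates in voxel units.
    """
    J, D, H, W = volumes.shape
    probs = torch.softmax(volumes.view(J, -1), dim=1).view(J, D, H, W)

    # Marginal distributions along each axis.
    pz = probs.sum(dim=(2, 3))  # (J, D)
    py = probs.sum(dim=(1, 3))  # (J, H)
    px = probs.sum(dim=(1, 2))  # (J, W)

    zs = torch.arange(D, dtype=probs.dtype, device=probs.device)
    ys = torch.arange(H, dtype=probs.dtype, device=probs.device)
    xs = torch.arange(W, dtype=probs.dtype, device=probs.device)

    # Expected value along each axis gives sub-voxel coordinates.
    z = (pz * zs).sum(dim=1)
    y = (py * ys).sum(dim=1)
    x = (px * xs).sum(dim=1)
    return torch.stack([x, y, z], dim=1)
```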
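Finally, the reconstruction component can be pictured as a small fully connected network that maps estimated 2D keypoints plus per-joint depth estimates to a consistent 3D pose. The module name, layer sizes, and framework below are assumptions made for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class PoseReconstructor(nn.Module):
    """Maps 2D keypoints and per-joint depths to a 3D pose (sketch)."""

    def __init__(self, num_joints=17, hidden=1024):
        super().__init__()
        # Input: (x, y) per joint plus one depth value per joint.
        in_dim = num_joints * 3
        out_dim = num_joints * 3
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )
        self.num_joints = num_joints

    def forward(self, keypoints_2d, depths):
        # keypoints_2d: (B, J, 2), depths: (B, J)
        x = torch.cat([keypoints_2d, depths.unsqueeze(-1)], dim=-1)
        out = self.net(x.flatten(1))
        return out.view(-1, self.num_joints, 3)
```

A module of this kind resolves the front/back ambiguities left by 2D keypoints alone, which is where the ordinal depth signal adds the most value.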
Numerical Evaluation and Results
The paper reports new state-of-the-art results on standard benchmarks such as Human3.6M and HumanEva-I, attributing a significant reduction in error metrics to the inclusion of ordinal depth annotations during training. Notably, the model demonstrates robustness to domain shift, as evidenced by its performance on the MPI-INF-3DHP dataset, which covers a broader and more challenging range of imaging conditions, including outdoor scenes.
Implications and Future Directions
From a practical standpoint, this method reduces the dependency on elaborate equipment for obtaining 3D ground truth, such as motion capture systems. The ability to leverage comparative depth annotations opens opportunities for efficiently scaling 3D pose estimation models to diverse environments. Theoretically, this work invites further exploration of weak supervision frameworks, potentially extending to other vision tasks that are similarly constrained by limited high-quality labels.
Future developments may investigate automated techniques for inferring ordinal depth relations or expanding the use of this weak supervision paradigm to additional visual understanding tasks. Such directions would exploit the simplicity and scalability of ordinal annotations while pushing the boundaries of current computer vision systems' generalization capabilities.
Overall, the proposed framework for 3D human pose estimation offers a versatile and effective extension of conventional training methodologies, advocating for a shift in how large-scale, diverse datasets can be leveraged to advance convolutional learning architectures.