Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation (2004.03686v3)

Published 7 Apr 2020 in cs.CV

Abstract: Differently from 2D image datasets such as COCO, large-scale human datasets with 3D ground-truth annotations are very difficult to obtain in the wild. In this paper, we address this problem by augmenting existing 2D datasets with high-quality 3D pose fits. Remarkably, the resulting annotations are sufficient to train from scratch 3D pose regressor networks that outperform the current state-of-the-art on in-the-wild benchmarks such as 3DPW. Additionally, training on our augmented data is straightforward as it does not require to mix multiple and incompatible 2D and 3D datasets or to use complicated network architectures and training procedures. This simplified pipeline affords additional improvements, including injecting extreme crop augmentations to better reconstruct highly truncated people, and incorporating auxiliary inputs to improve 3D pose estimation accuracy. It also reduces the dependency on 3D datasets such as H36M that have restrictive licenses. We also use our method to introduce new benchmarks for the study of real-world challenges such as occlusions, truncations, and rare body poses. In order to obtain such high quality 3D pseudo-annotations, inspired by progress in internal learning, we introduce Exemplar Fine-Tuning (EFT). EFT combines the re-projection accuracy of fitting methods like SMPLify with a 3D pose prior implicitly captured by a pre-trained 3D pose regressor network. We show that EFT produces 3D annotations that result in better downstream performance and are qualitatively preferable in an extensive human-based assessment.

Summary

The paper introduces Exemplar Fine-Tuning (EFT) to generate high-quality 3D pseudo-annotations from large-scale 2D datasets.
It leverages image-conditioned regression with auxiliary inputs like DensePose and segmentation maps to enhance accuracy under occlusions and truncations.
Models trained using EFT outperform state-of-the-art methods on benchmarks such as 3DPW, demonstrating robust real-world performance.

Exemplar Fine-Tuning for 3D Human Model Fitting

The paper "Exemplar Fine-Tuning for 3D Human Model Fitting" presents a novel approach to 3D human pose estimation using a method called Exemplar Fine-Tuning (EFT). Given the scarcity and limitations of 3D ground-truth datasets captured in real-world scenarios, the authors propose augmenting existing large-scale 2D datasets with high-quality 3D pseudo-ground truth annotations. This approach addresses the challenges of acquiring labeled 3D data in-the-wild and offers a streamlined training pipeline for 3D pose regressors, outperforming previous state-of-the-art approaches on benchmarks like 3DPW.

Key Contributions and Methodology

Exemplar Fine-Tuning (EFT) Introduction:
- EFT integrates the advantages of both regression-based and fitting-based methods. Instead of directly optimizing the parameters of the body model, as fitting methods such as SMPLify do, it fine-tunes the weights of a pre-trained regressor network on each individual exemplar. The pre-trained network thereby acts as an implicit 3D pose prior that is conditioned on the image.
Pseudo-ground Truth Generation:
- Using existing 2D datasets, EFT generates 3D pseudo-annotations that are used to train pose regressor networks. This removes the need for complicated architectures or for mixing incompatible 2D and 3D datasets, and it reduces dependency on datasets with restrictive licenses such as H36M.
Auxiliary Inputs and Augmentation:
- The training process benefits from incorporating data augmentations such as extreme crop augmentations, facilitating better handling of truncated human images. Additionally, auxiliary inputs like DensePose maps and segmentation maps were evaluated for their ability to improve 3D pose estimation accuracy.
Benchmarking and Evaluation:
- New benchmarks are proposed to assess model performance under real-world challenges such as occlusions, truncations, and rare body poses. A human study confirmed the qualitative superiority of EFT's pseudo-annotations over conventional SMPLify-based annotations.
Implication of Findings:
- The high-quality pseudo-annotations generated by EFT are shown to be sufficient to train state-of-the-art pose regressors from scratch, even surpassing previous methods trained on both 3D and 2D datasets. Models trained with EFT-based datasets achieve superior performance on challenging benchmarks and demonstrate the practical utility of 3D pose estimation in more diverse settings.
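The central idea in the first item above, updating regressor weights for a single exemplar rather than optimizing body-model parameters, can be illustrated with a toy sketch. This is not the authors' implementation: a linear map stands in for the deep regressor, raw 3D joints stand in for the body model, the camera is a plain orthographic projection, and the names `project`, `regress`, and `exemplar_fine_tune` are hypothetical.

```python
import numpy as np

def project(joints3d):
    """Orthographic projection: keep x, y and drop depth (toy camera, an assumption)."""
    return joints3d[:, :2]

def regress(W, feat):
    """Toy linear 'regressor': image feature (D,) -> 3D joints (J, 3)."""
    return (W @ feat).reshape(-1, 3)

def exemplar_fine_tune(W_pre, feat, kp2d, conf, steps=30, lr=0.5):
    """Fit ONE exemplar by updating a copy of the regressor weights.

    Loss: 0.5 * sum_j conf_j * ||project(joint_j) - kp2d_j||^2
    """
    W = W_pre.copy()  # start from the pre-trained weights, leave them untouched
    for _ in range(steps):
        joints3d = regress(W, feat)
        # Gradient of the 2D loss w.r.t. the predicted joints (zero for depth).
        grad_joints = np.zeros_like(joints3d)
        grad_joints[:, :2] = conf[:, None] * (project(joints3d) - kp2d)
        # Chain rule through the linear layer: dL/dW = outer(dL/dout, feat).
        W -= lr * np.outer(grad_joints.ravel(), feat)
    return W, regress(W, feat)

rng = np.random.default_rng(0)
J, D = 4, 8                          # joints and feature size (arbitrary toy values)
W_pre = rng.normal(size=(J * 3, D))  # stands in for pre-trained weights
feat = rng.normal(size=D)
feat /= np.linalg.norm(feat)         # unit norm keeps the toy step size stable
kp2d = rng.normal(size=(J, 2))       # stands in for annotated 2D keypoints
conf = np.ones(J)                    # per-keypoint confidence

before = regress(W_pre, feat)
W_fit, after = exemplar_fine_tune(W_pre, feat, kp2d, conf)
err_before = np.abs(project(before) - kp2d).max()
err_after = np.abs(project(after) - kp2d).max()
print(err_after < err_before)
```

In this linear toy the 2D loss never touches the depth rows, so depth stays at the pre-trained prediction while the re-projection error shrinks. In a real network the weights are shared, so the behaviour is less clean, but the sketch shows how the starting weights can supply what the 2D keypoints alone do not constrain. The fine-tuned weights are meant to be discarded after each sample.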

Implications and Future Directions

The introduction of EFT as a method for generating high-quality 3D pseudo-ground truth marks a significant step forward in 3D human pose estimation, providing a robust and efficient way to utilize existing 2D datasets. The streamlined training pipeline reduces the reliance on complex training procedures and on datasets with restrictive licenses.

Conceptually, EFT combines the re-projection accuracy of fitting methods with an image-conditioned pose prior that is captured implicitly by a pre-trained regressor. This combination suggests future investigations into internal learning and its use in problems beyond pose estimation.

Practically, EFT enables more accessible training and evaluation setups, potentially broadening the use of 3D human pose estimation in frequently encountered conditions such as occlusions and truncations. By releasing the pseudo-annotated data along with clear evidence of its effectiveness, the paper encourages further exploration and validation within the research community.
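To make the truncation case concrete, the sketch below shows one possible form of an extreme crop augmentation. It is an assumption for illustration, not the paper's exact procedure: the function name `extreme_crop`, the keep-fraction range, and the top-of-image cropping rule are all hypothetical. The point it illustrates is that the image is truncated while the full-body 3D pseudo-annotation can still supervise the network.

```python
import numpy as np

def extreme_crop(image, kp2d, visible, rng, min_keep=0.3, max_keep=0.6):
    """Keep only a random top portion of a person crop to simulate truncation.

    image:   (H, W, C) array, assumed to be a crop around one person
    kp2d:    (J, 2) keypoints as (x, y) pixel coordinates in that crop
    visible: (J,) boolean visibility flags
    Returns the truncated image and the updated visibility flags.
    """
    h = image.shape[0]
    keep = rng.uniform(min_keep, max_keep)  # fraction of rows to keep (hypothetical range)
    cut = max(1, int(h * keep))
    cropped = image[:cut]
    # Keypoints below the cut no longer appear in the image.
    still_visible = visible & (kp2d[:, 1] < cut)
    return cropped, still_visible

rng = np.random.default_rng(0)
image = np.zeros((200, 100, 3), dtype=np.uint8)
kp2d = np.array([[50.0, 20.0],    # head
                 [50.0, 190.0]])  # ankle
visible = np.array([True, True])

cropped, vis = extreme_crop(image, kp2d, visible, rng)
print(cropped.shape, vis)
```

With these toy values the head stays visible and the ankle is cut off for any keep fraction in the chosen range. A 2D loss would then ignore the ankle, whereas a 3D loss on the pseudo-annotation could still cover it.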

In conclusion, EFT presents a compelling approach to addressing key challenges in 3D human modeling, providing both theoretical insights and practical tools for advancing the field of human pose estimation.

GitHub

GitHub - facebookresearch/eft: visualization code for 3D human body annotation by EFT (Exemplar Fine-tuning) (391 stars)