Three Recipes for Better 3D Pseudo-GTs of 3D Human Mesh Estimation in the Wild (2304.04875v1)

Published 10 Apr 2023 in cs.CV

Abstract: Recovering 3D human mesh in the wild is greatly challenging as in-the-wild (ITW) datasets provide only 2D pose ground truths (GTs). Recently, 3D pseudo-GTs have been widely used to train 3D human mesh estimation networks as the 3D pseudo-GTs enable 3D mesh supervision when training the networks on ITW datasets. However, despite the great potential of the 3D pseudo-GTs, there has been no extensive analysis that investigates which factors are important to make more beneficial 3D pseudo-GTs. In this paper, we provide three recipes to obtain highly beneficial 3D pseudo-GTs of ITW datasets. The main challenge is that only 2D-based weak supervision is allowed when obtaining the 3D pseudo-GTs. Each of our three recipes addresses the challenge in each aspect: depth ambiguity, sub-optimality of weak supervision, and implausible articulation. Experimental results show that simply re-training state-of-the-art networks with our new 3D pseudo-GTs elevates their performance to the next level without bells and whistles. The 3D pseudo-GT is publicly available in https://github.com/mks0601/NeuralAnnot_RELEASE.

Summary

The paper presents three novel recipes that improve 3D pseudo-ground truths for human mesh estimation in unstructured environments.
It leverages small ITW datasets with true 3D data and pre-trains on 2D pose estimation to mitigate depth ambiguity and supervision limitations.
The work demonstrates that employing VPoser and L2 regularization produces anatomically plausible meshes with significantly reduced estimation errors.

Three Recipes for Better 3D Pseudo-GTs of 3D Human Mesh Estimation in the Wild

The paper "Three Recipes for Better 3D Pseudo-GTs of 3D Human Mesh Estimation in the Wild" analyzes which factors make 3D pseudo-ground truths (pseudo-GTs) more beneficial for training 3D human mesh estimation networks on in-the-wild (ITW) datasets, which provide only 2D pose annotations. Because reliable 3D ground truths require specialized equipment that is unavailable in everyday settings, the pseudo-GTs must be obtained under 2D-based weak supervision. The researchers identify three deficiencies of this setting and address each with a strategy they call a "recipe."

Challenges and Proposed Solutions

The main obstacles in generating 3D pseudo-GTs include depth ambiguity, sub-optimal results from weak 2D-based supervision, and anatomical implausibility in the estimated articulation of human meshes. The paper methodically addresses each challenge:

Depth Ambiguity: Many 3D configurations that differ only in depth yield the same 2D projection, so 2D supervision alone cannot determine which one is correct. To alleviate this, the authors suggest also using ITW datasets that have actual 3D ground truths, even small ones such as 3D Poses in the Wild (3DPW), whose accurate 3D information is obtained using IMUs. This proves more effective than relying solely on large datasets with only 2D annotations.
Sub-Optimal Results of Weak Supervision: Given the limitations of supervising with 2D data only, the authors demonstrate that pre-training the annotation network on a 2D pose estimation task can significantly improve the outcome. Specifically, initializing the network with weights from a 2D pose estimation network allows for improved feature extraction related to human articulation, thus reducing the resultant 3D errors.
Anatomically Implausible Articulations: The annotation network incorporates VPoser, a variational autoencoder designed as a prior over human poses, together with an L2 regularizer. Producing poses through VPoser constrains the outputs to anatomically plausible articulations, and the L2 regularizer further keeps the latent encodings in a credible range.
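
The first and third points can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the pinhole camera parameters, the L1 form of the 2D loss, the `reg_weight` value, and every function name below are assumptions made for illustration. The sketch shows that two poses with different per-joint depths incur the same 2D reprojection loss, and that an L2 penalty on a pose-prior latent code favors codes closer to the origin when the 2D fit is identical.

```python
def project(points_3d, focal=1000.0, center=(500.0, 500.0)):
    """Pinhole projection of camera-space (x, y, z) points to pixel coordinates."""
    cx, cy = center
    return [(focal * x / z + cx, focal * y / z + cy) for x, y, z in points_3d]


def slide_along_rays(points_3d, factors):
    """Scale each point along its camera ray; depth changes, the 2D projection does not."""
    return [(x * s, y * s, z * s) for (x, y, z), s in zip(points_3d, factors)]


def reprojection_loss(pred_2d, gt_2d):
    """Mean L1 distance between projected and annotated 2D joints (assumed loss form)."""
    total = sum(abs(px - gx) + abs(py - gy)
                for (px, py), (gx, gy) in zip(pred_2d, gt_2d))
    return total / len(gt_2d)


def annotation_loss(pred_2d, gt_2d, latent, reg_weight=0.01):
    """2D weak supervision plus an L2 penalty on the pose-prior latent code.

    reg_weight is an illustrative value, not one taken from the paper.
    """
    l2 = sum(v * v for v in latent)
    return reprojection_loss(pred_2d, gt_2d) + reg_weight * l2


# Three toy joints at 3 m depth, and a second pose with different per-joint depths.
pose_a = [(0.0, -0.5, 3.0), (0.2, 0.0, 3.0), (0.3, 0.4, 3.0)]
pose_b = slide_along_rays(pose_a, [1.0, 1.2, 0.8])

gt_2d = project(pose_a)
loss_a = reprojection_loss(project(pose_a), gt_2d)
loss_b = reprojection_loss(project(pose_b), gt_2d)
# Both losses are numerically zero: the 2D loss cannot tell the two poses apart.

# With an identical 2D fit, the L2 term prefers the latent code nearer the origin.
near_prior = annotation_loss(project(pose_a), gt_2d, latent=[0.1, -0.2])
far_from_prior = annotation_loss(project(pose_a), gt_2d, latent=[3.0, -4.0])
```

Because `loss_a` and `loss_b` are indistinguishable, only additional information, such as 3D ground truths from a dataset like 3DPW or a pose prior, can separate `pose_a` from `pose_b`.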

Methodological Framework and Experiments

The research follows a two-stage pipeline: an annotation network first generates the 3D pseudo-GTs, which are then used to train a separate estimation network. The recipes apply only to the annotation network; the estimation network benefits indirectly through the improved 3D pseudo-GTs. Experiments show that the proposed strategies raise the performance of various 3D human mesh estimation networks. On benchmarks such as 3DPW and MuPoTS, networks trained with the new pseudo-GTs reach significantly lower errors than the same networks trained with more conventional alternatives.
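
The two-stage structure described above can be sketched as follows. This is a schematic with toy stand-ins, not the released code: `Sample`, `annotate`, `train_estimator`, and the callables passed to them are hypothetical names, and the "networks" are placeholder functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Sample:
    image_id: str
    joints_2d: List[Tuple[float, float]]       # 2D pose GT, available for ITW data
    mesh_params: Optional[List[float]] = None  # 3D pseudo-GT, filled in by stage 1


def annotate(samples: Sequence[Sample],
             annotation_net: Callable[[Sample], List[float]]) -> List[Sample]:
    """Stage 1: attach a 3D pseudo-GT from the annotation network to each sample."""
    return [Sample(s.image_id, s.joints_2d, annotation_net(s)) for s in samples]


def train_estimator(samples: Sequence[Sample],
                    train_step: Callable[[Sample], None]) -> int:
    """Stage 2: supervise a separate estimation network with the pseudo-GTs."""
    used = 0
    for s in samples:
        if s.mesh_params is None:  # skip anything stage 1 did not annotate
            continue
        train_step(s)
        used += 1
    return used


# Toy stand-ins: a "network" that returns fixed parameters, and a step that logs ids.
itw = [Sample("img_0", [(10.0, 20.0)]), Sample("img_1", [(30.0, 40.0)])]
pseudo_labeled = annotate(itw, annotation_net=lambda s: [0.0, 0.0, 0.0])

seen = []
num_used = train_estimator(pseudo_labeled, train_step=lambda s: seen.append(s.image_id))
```

Keeping the two stages separate reflects the summary's point that the recipes change only how the pseudo-GTs are produced; any estimation network can then be re-trained on them unchanged.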

Implications and Future Perspectives

The findings suggest that adding even a small ITW dataset with 3D ground truths can substantially improve the quality of 3D pseudo-GTs. The recommended initialization from a 2D pose estimation network and the pose-constraining strategies could also inspire similar frameworks in related domains, such as robotics and augmented reality, where accurate spatial understanding of human movement is paramount.

In conclusion, the paper presents practical methods for improving the quality of 3D pseudo-GTs, which in turn makes 3D human mesh estimation more robust in unstructured environments. It also shows how careful dataset selection and pre-training can help bridge the gap between 2D annotations and 3D modeling.

GitHub

GitHub - mks0601/NeuralAnnot_RELEASE: 3D Pseudo-GTs of "NeuralAnnot: Neural Annotator for 3D Human Mesh Training Sets", CVPRW 2022 Oral. (184 stars)