End-to-end Recovery of Human Shape and Pose
The paper "End-to-end Recovery of Human Shape and Pose" by Kanazawa et al. introduces a comprehensive framework named Human Mesh Recovery (HMR) for predicting a detailed 3D mesh of human bodies from single RGB images. This work advances the state-of-the-art in human pose and shape estimation by leveraging a generative model for human bodies (SMPL) and adversarial learning techniques to infer detailed 3D human body parameters without relying on intermediate 2D joint detection stages.
Abstract
HMR is an end-to-end system for reconstructing a full 3D human body mesh from a single RGB image. Methods that recover only 2D or 3D joint locations capture the body sparsely and miss much of its structure and pose. HMR instead produces a rich, parameterized mesh representation in terms of 3D joint angles and shape parameters, trained by minimizing the reprojection loss of 2D keypoints. The novelty lies in using adversarial learning to address the under-constrained nature of lifting 2D observations to 3D: the model is trained with both paired and unpaired annotations, enforcing anthropometric validity even without direct 3D supervision.
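The keypoint reprojection loss can be sketched as follows. This is a minimal illustration, assuming the weak-perspective camera parameterization (scale plus 2D translation) used by HMR; the function name and exact array shapes are my own.

```python
import numpy as np

def reprojection_loss(pred_joints_3d, gt_keypoints_2d, vis, cam):
    """L1 distance between projected 3D joints and annotated 2D keypoints.

    cam = (s, tx, ty): weak-perspective scale and translation.
    vis: per-keypoint visibility mask, so occluded/missing annotations
    contribute nothing to the loss.
    """
    s, tx, ty = cam
    # Orthographic projection: drop depth, scale, and translate in the image plane.
    proj = s * pred_joints_3d[:, :2] + np.array([tx, ty])
    return np.sum(vis[:, None] * np.abs(proj - gt_keypoints_2d)) / max(vis.sum(), 1)
```

Because only visible 2D keypoints are penalized, this loss alone leaves depth and body shape under-constrained, which is precisely the gap the adversarial prior fills.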
Introduction
HMR generates a comprehensive 3D mesh output that encapsulates the full human body structure, addressing limitations of traditional methods that only predict sparse 3D joint locations. By utilizing the SMPL generative model, which represents both global shape and local pose details through a low-dimensional parameter space, HMR ensures a high level of detail, suitable for applications requiring accurate body modeling, such as animation and human-computer interaction. The framework's holistic output also remains robust against occlusion and truncation, an advantage visualized through qualitative results on in-the-wild images.
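The low-dimensional parameter space mentioned above can be made concrete. HMR regresses an 85-dimensional vector combining SMPL pose (72 axis-angle values: a global rotation plus 23 joint rotations), SMPL shape (the first 10 PCA shape coefficients), and a 3-value weak-perspective camera; the helper below is an illustrative sketch of that layout, not code from the paper.

```python
import numpy as np

def split_theta(theta):
    """Split HMR's 85-D output into camera, pose, and shape components.

    Layout (per the SMPL model and HMR's camera parameterization):
      - 3 camera values: weak-perspective scale and 2D translation
      - 72 pose values: 24 rotations x 3 axis-angle parameters
      - 10 shape values: PCA coefficients of the SMPL shape space
    """
    assert theta.shape[-1] == 85
    cam = theta[..., :3]        # (s, tx, ty)
    pose = theta[..., 3:75]     # global rotation + 23 joint rotations
    shape = theta[..., 75:85]   # first 10 shape coefficients
    return cam, pose, shape
```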
Contribution and Methodology
HMR makes several key contributions:
- Direct Mesh Reconstruction: In contrast to multi-stage methods, HMR directly infers SMPL parameters from image data through an iterative regression approach, avoiding the intermediate step of 2D keypoint detection and thus preserving critical image information.
- Adversarial Training: To handle the high-dimensional and underconstrained nature of 3D inference, HMR employs adversarial training using a discriminator network based on large-scale datasets of 3D human meshes. This enforces realistic body shapes and poses even when trained on 2D annotations alone.
- End-to-end Training: HMR allows seamless integration of diverse data sources, leveraging a blend of supervised and weakly-supervised learning. This results in robust performance across various contexts and improved generalizability to real-world images.
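The adversarial component above can be sketched with the least-squares GAN objective that HMR adopts: the discriminator, trained on real SMPL parameters from motion-capture data, scores regressed parameters for plausibility. The function below is a schematic of those two objectives on raw discriminator scores, not the paper's implementation.

```python
import numpy as np

def adversarial_losses(d_real, d_fake):
    """Least-squares adversarial objectives (schematic).

    d_real: discriminator scores on real SMPL parameters (mocap data).
    d_fake: discriminator scores on parameters regressed by the encoder.
    The discriminator pushes real samples toward 1 and regressed ones
    toward 0; the encoder tries to make its outputs score as 1.
    """
    loss_d = np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)
    loss_enc = np.mean((d_fake - 1.0) ** 2)
    return loss_d, loss_enc
```

Since the discriminator only sees SMPL parameters, never images, it can be trained on unpaired 3D mesh data, which is what lets HMR learn without image-aligned 3D ground truth.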
The iterative regression mechanism in HMR refines its parameter estimate over several feedback steps rather than predicting it in a single shot, effectively narrowing the search space. Regressing rotations directly, rather than discretizing joint angles into classification bins as some prior approaches did, avoids quantization error; plausibility constraints such as valid joint-angle limits are instead enforced by the discriminator, which provides a strong anthropometric prior. The encoder and discriminator networks are jointly optimized, significantly enhancing the realism of the inferred 3D bodies.
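The iterative error feedback loop can be sketched as below: the regressor repeatedly sees the image features concatenated with the current estimate and predicts an additive correction (the paper uses 3 iterations). The function signature here is illustrative.

```python
import numpy as np

def iterative_regression(features, theta_init, regressor, n_iter=3):
    """Iterative error feedback (schematic).

    Instead of mapping features to parameters in one shot, the regressor
    takes [features, current_theta] and outputs a residual update,
    progressively refining the estimate over n_iter steps.
    """
    theta = theta_init
    for _ in range(n_iter):
        delta = regressor(np.concatenate([features, theta]))
        theta = theta + delta  # additive update toward the final parameters
    return theta
```

Conditioning each step on the current estimate is what lets the network correct its own errors instead of solving the full regression at once.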
Numerical Results
Experimental results on standard benchmarks such as Human3.6M and MPI-INF-3DHP demonstrate HMR's competitive performance in 3D joint location estimation. The system outperforms prior methods that recover SMPL parameters, particularly those dependent on intermediate keypoint detection stages. Notably, the model's ability to produce plausible 3D reconstructions without any paired 3D supervision highlights its adaptability and robustness.
- Human3.6M: HMR achieves a mean per joint position error (MPJPE) of 87.97mm and a reconstruction error (after rigid alignment) of 58.1mm, outperforming approaches such as SMPLify that also estimate SMPL parameters.
- MPI-INF-3DHP: On this more diverse dataset, HMR maintains strong performance with 72.9% PCK and 36.5 AUC, with scores improving further after rigid alignment.
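The metrics quoted above are straightforward to compute from predicted and ground-truth 3D joints; a minimal sketch (assuming joint coordinates in millimetres and the 150mm PCK threshold commonly used on MPI-INF-3DHP):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance
    between predicted and ground-truth 3D joints, in input units (mm)."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def pck(pred, gt, thresh=150.0):
    """Percentage of correct keypoints: fraction of joints whose error
    falls below `thresh` mm (150mm is standard for MPI-INF-3DHP)."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1) < thresh)
```

The "reconstruction error" variant applies a rigid (Procrustes) alignment of the prediction to the ground truth before computing MPJPE, so it measures pose quality independently of global rotation, translation, and scale.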
Theoretical and Practical Implications
From a theoretical perspective, the paper underscores the efficacy of adversarial learning in ensuring realistic 3D human body modeling without direct 3D supervision. This approach, effectively using non-synthetic datasets for training, sets a precedent for future works in leveraging large-scale 2D annotations for 3D inference tasks.
Practically, HMR’s ability to generate detailed 3D meshes directly from images opens up new possibilities in animation, augmented reality, virtual try-ons, and ergonomic studies. The method’s real-time processing capability, given an initial bounding box, ensures practicality for various real-world applications requiring instantaneous 3D human modeling.
Speculation on Future Developments
Future research may expand on optimizing the adversarial training framework, possibly incorporating more sophisticated priors reflecting dynamic human motion. Integrating temporal consistency could also improve the robustness of HMR in video data, bridging the gap between static image reconstruction and dynamic scene understanding.
Overall, the HMR framework by Kanazawa et al. signifies a notable step forward in holistic, detailed human body modeling, providing a versatile foundation for subsequent advancements in 3D vision and AI-driven human pose estimation.