Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image (1607.08128v1)

Published 27 Jul 2016 in cs.CV

Abstract: We describe the first method to automatically estimate the 3D pose of the human body as well as its 3D shape from a single unconstrained image. We estimate a full 3D mesh and show that 2D joints alone carry a surprising amount of information about body shape. The problem is challenging because of the complexity of the human body, articulation, occlusion, clothing, lighting, and the inherent ambiguity in inferring 3D from 2D. To solve this, we first use a recently published CNN-based method, DeepCut, to predict (bottom-up) the 2D body joint locations. We then fit (top-down) a recently published statistical body shape model, called SMPL, to the 2D joints. We do so by minimizing an objective function that penalizes the error between the projected 3D model joints and detected 2D joints. Because SMPL captures correlations in human shape across the population, we are able to robustly fit it to very little data. We further leverage the 3D model to prevent solutions that cause interpenetration. We evaluate our method, SMPLify, on the Leeds Sports, HumanEva, and Human3.6M datasets, showing superior pose accuracy with respect to the state of the art.

Summary

The paper presents SMPLify, an automatic method that estimates 3D human pose and shape from a single image.
It combines CNN-based 2D joint detections with a statistical 3D body model to address challenges like occlusion and articulation.
Evaluations on the Leeds Sports, HumanEva, and Human3.6M datasets show pose accuracy superior to the prior state of the art.

Introduction

The paper introduces a method that automatically estimates both the 3D pose and the 3D shape of a human body from a single unconstrained image. The problem is hard because of body articulation, occlusion, clothing, lighting, and the inherent ambiguity of inferring 3D structure from 2D evidence. The technique combines CNN-based 2D joint detection with a statistical 3D body model, and the authors describe it as the first method to perform this estimation automatically.

Methodology

The process has two stages. First, a CNN-based method called DeepCut predicts the 2D body joint locations (the bottom-up step). Second, a statistical body shape model called SMPL is fitted to those 2D joints (the top-down step). Because SMPL captures correlations in human shape across the population, the fit remains robust even though the 2D joints provide very little data, and it yields a plausible full 3D mesh with both pose and shape. The fit is obtained by minimizing an objective function that penalizes the error between the projected 3D model joints and the detected 2D joints.
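The joint-fitting objective described above can be illustrated with a minimal sketch. It assumes a simple pinhole camera and a confidence-weighted squared error; the function names `project` and `joint_data_term`, the camera parameters, and the weighting scheme are all hypothetical choices made for illustration, and the paper's actual penalty and camera model may differ.

```python
import numpy as np

def project(points_3d, focal, center):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixels.

    Assumption: a single focal length and a principal point; not taken
    from the paper.
    """
    points_3d = np.asarray(points_3d, dtype=float)
    xy = points_3d[:, :2] / points_3d[:, 2:3]  # perspective divide by depth
    return focal * xy + np.asarray(center, dtype=float)

def joint_data_term(joints_3d, joints_2d, conf, focal, center):
    """Confidence-weighted squared distance between projected model joints
    and detected 2D joints (hypothetical form of the data term)."""
    residual = project(joints_3d, focal, center) - np.asarray(joints_2d, dtype=float)
    per_joint = np.sum(residual ** 2, axis=1)  # squared pixel error per joint
    return float(np.sum(np.asarray(conf, dtype=float) * per_joint))

# Two model joints at depth 2, compared against two detections.
model_joints = [[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]]
detections = [[50.0, 50.0], [103.0, 54.0]]
confidences = [1.0, 0.5]
energy = joint_data_term(model_joints, detections, confidences,
                         focal=100.0, center=(50.0, 50.0))
```

In a full fitting pipeline the 3D joints would be a function of the body model's pose and shape parameters, and an optimizer would adjust those parameters to reduce this energy. That part is omitted here.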

Enhancements and Evaluations

Fitting a 3D body to 2D data alone can produce physically implausible poses in which body parts intersect one another. Because the method works with a full 3D body model, it can detect and penalize such interpenetration. To do this efficiently, the researchers introduce a differentiable interpenetration term that approximates the body segments with a set of capsules.
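To make the idea of a smooth, capsule-based interpenetration penalty concrete, here is a minimal sketch. It assumes each capsule is represented by a few spheres sampled along its axis and that overlap is scored with a Gaussian falloff in the distance between sphere centers. The helper names `capsule_spheres` and `overlap_penalty`, the number of spheres, and the use of the capsule radius as the falloff scale are illustrative assumptions, not details confirmed by this summary.

```python
import numpy as np

def capsule_spheres(p0, p1, radius, n=5):
    """Sample n sphere centers evenly along the capsule axis p0 -> p1.

    Returns (centers of shape (n, 3), radii of shape (n,)).
    """
    t = np.linspace(0.0, 1.0, n)[:, None]
    centers = (1.0 - t) * np.asarray(p0, dtype=float) + t * np.asarray(p1, dtype=float)
    return centers, np.full(n, float(radius))

def overlap_penalty(centers_a, radii_a, centers_b, radii_b):
    """Smooth overlap score between two sphere sets.

    Each pair contributes exp(-d^2 / (r_a^2 + r_b^2)), which is close to 1
    when the centers coincide and decays smoothly to 0 as they separate.
    Assumption: the radii double as the Gaussian width.
    """
    diff = centers_a[:, None, :] - centers_b[None, :, :]
    d2 = np.sum(diff ** 2, axis=2)                      # pairwise squared distances
    s2 = radii_a[:, None] ** 2 + radii_b[None, :] ** 2  # pairwise falloff scales
    return float(np.sum(np.exp(-d2 / s2)))

# Two nearly coincident "limbs" versus one placed far away.
ca, ra = capsule_spheres([0.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.1)
cb, rb = capsule_spheres([0.05, 0.0, 0.0], [0.05, 1.0, 0.0], 0.1)
cc, rc = capsule_spheres([5.0, 0.0, 0.0], [5.0, 1.0, 0.0], 0.1)
near = overlap_penalty(ca, ra, cb, rb)
far = overlap_penalty(ca, ra, cc, rc)
```

Because the score is a smooth function of the sphere centers, it can be added to a fitting objective and minimized with gradient-based optimization, which is the property the text attributes to the paper's term.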

Evaluations on synthetic data show that the algorithm can infer 3D shape from 2D joints, even when the joints are noisy. On the HumanEva and Human3.6M benchmarks, the method achieves better pose accuracy than the state-of-the-art methods it is compared against. It is also robust on challenging in-the-wild images such as those in the Leeds Sports Pose Dataset, which indicates practical applicability.

Conclusions and Potential

The method, named SMPLify, is a fully automatic technique for estimating human pose and shape from a single ordinary image. It shows how a full statistical body model can guide pose reconstruction from very little data. The results are encouraging and suggest possibilities for enhancement and application, from image analysis and virtual reality to ergonomic design and health-related research. The paper closes by acknowledging areas for extending the method, including the integration of additional image cues and the development of a facial pose detector.
