Expressive Body Capture: 3D Hands, Face, and Body from a Single Image (1904.05866v1)

Published 11 Apr 2019 in cs.CV

Abstract: To facilitate the analysis of human actions, interactions and emotions, we compute a 3D model of human body pose, hand pose, and facial expression from a single monocular image. To achieve this, we use thousands of 3D scans to train a new, unified, 3D model of the human body, SMPL-X, that extends SMPL with fully articulated hands and an expressive face. Learning to regress the parameters of SMPL-X directly from images is challenging without paired images and 3D ground truth. Consequently, we follow the approach of SMPLify, which estimates 2D features and then optimizes model parameters to fit the features. We improve on SMPLify in several significant ways: (1) we detect 2D features corresponding to the face, hands, and feet and fit the full SMPL-X model to these; (2) we train a new neural network pose prior using a large MoCap dataset; (3) we define a new interpenetration penalty that is both fast and accurate; (4) we automatically detect gender and the appropriate body models (male, female, or neutral); (5) our PyTorch implementation achieves a speedup of more than 8x over Chumpy. We use the new method, SMPLify-X, to fit SMPL-X to both controlled images and images in the wild. We evaluate 3D accuracy on a new curated dataset comprising 100 images with pseudo ground-truth. This is a step towards automatic expressive human capture from monocular RGB data. The models, code, and data are available for research purposes at https://smpl-x.is.tue.mpg.de.

Summary

The paper presents SMPL-X, a unified 3D human body model with fully articulated hands and an expressive face, and SMPLify-X, a method that fits the model to a single image.
The fitting pipeline uses a variational pose prior (VPoser), a fast interpenetration penalty, and automatic gender detection to select the body model.
Evaluations on the new EHF dataset show that SMPL-X achieves lower 3D joint and vertex-to-vertex errors than SMPL and SMPL+H.

Expressive Body Capture: 3D Hands, Face, and Body from a Single Image

The paper "Expressive Body Capture: 3D Hands, Face, and Body from a Single Image" addresses holistic 3D human modeling from a single RGB image. The authors introduce SMPL-X, a new model that represents the human body, hands, and face in one unified framework. This addresses a limitation of earlier models, which typically handled these components in isolation.

Methodological Enhancements

The model training and optimization pipeline incorporates several novel methodologies:

SMPL-X Model:
- SMPL-X extends the SMPL model by incorporating the FLAME head model and the MANO hand model. This allows for detailed and expressive representations of the body, face, and hands.
- The model's parameters cover body pose, body shape, facial expression, and hand pose. The model is trained on 5586 3D scans so that it captures natural correlations between body parts.
SMPLify-X Optimization:
- The authors adopt 2D feature detection followed by model fitting, akin to the SMPLify approach. However, they introduce significant improvements:
  - Detection of 2D features for the face, hands, and feet.
  - A trained neural network pose prior using a large MoCap dataset.
  - A new interpenetration penalty that is both accurate and computationally efficient.
  - Automatic gender detection for more accurate body model selection.
  - Implementation in PyTorch, achieving a speedup of more than 8x over the Chumpy implementation.
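
The fit-to-2D-features step listed above can be illustrated with a small sketch of a reprojection data term. It assumes a simple pinhole camera and a Geman-McClure robust penalty, a common choice in this family of fitting methods. The function names, the camera parameters, and the `sigma` value are illustrative assumptions and not the authors' implementation.

```python
from typing import Sequence, Tuple

Point3 = Tuple[float, float, float]
Point2 = Tuple[float, float]


def project(point: Point3, focal: float, center: Point2) -> Point2:
    """Pinhole projection of a camera-space 3D point (assumes z > 0)."""
    x, y, z = point
    return (focal * x / z + center[0], focal * y / z + center[1])


def geman_mcclure(residual_sq: float, sigma: float) -> float:
    """Robust penalty that saturates at 1 for large squared residuals."""
    return residual_sq / (sigma ** 2 + residual_sq)


def reprojection_energy(
    joints_3d: Sequence[Point3],
    keypoints_2d: Sequence[Point2],
    confidences: Sequence[float],
    focal: float,
    center: Point2,
    sigma: float = 100.0,  # hypothetical robustifier scale, in pixels
) -> float:
    """Confidence-weighted robust distance between projected joints and
    detected 2D keypoints (body, hands, face, feet)."""
    energy = 0.0
    for joint, keypoint, conf in zip(joints_3d, keypoints_2d, confidences):
        u, v = project(joint, focal, center)
        residual_sq = (u - keypoint[0]) ** 2 + (v - keypoint[1]) ** 2
        energy += conf * geman_mcclure(residual_sq, sigma)
    return energy
```

In a full fitting pipeline this term would be minimized over the model parameters together with the prior and penalty terms. Weighting each keypoint by its detection confidence lets unreliable detections contribute less to the fit.
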
Variational Pose Prior (VPoser):
- A variational autoencoder trained on a large corpus of motion capture data provides a robust prior for body pose, penalizing implausible poses while accommodating realistic variations.
- The training is formulated so that the decoder outputs valid rotation matrices and the model does not overfit.
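
As a rough illustration of how a variational autoencoder can act as a pose prior, the sketch below shows two standard ingredients: the KL divergence term that pulls the encoder's latent distribution toward a standard normal during training, and a squared-norm penalty on the latent code that can be used at fitting time. This is a generic VAE sketch; the function names are hypothetical and the encoder and decoder networks are omitted.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), the usual VAE regularizer."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


def latent_pose_penalty(z: Sequence[float]) -> float:
    """Squared L2 norm of a latent pose code. Codes near the origin decode
    to poses the prior considers likely, so a large norm is penalized."""
    return sum(value * value for value in z)
```

Because the latent space is trained to follow a standard normal distribution, a simple quadratic penalty on the latent code serves as a differentiable measure of pose plausibility.
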
Collision Detection:
- A new, efficient collision penalty term discourages body parts from interpenetrating, which is critical for realistic contact between the body, hands, and face.
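
To make the idea of an interpenetration penalty concrete, here is a deliberately simplified sketch that approximates body parts with spheres and penalizes overlap depth. This is a much simpler stand-in than the mesh-based term the paper describes, and the sphere representation and function name are assumptions made for illustration only.

```python
import math
from typing import Sequence, Tuple

Sphere = Tuple[Tuple[float, float, float], float]  # (center, radius)


def sphere_penetration_penalty(spheres: Sequence[Sphere]) -> float:
    """Sum of squared overlap depths over all pairs of proxy spheres.
    Returns 0.0 when no two spheres intersect."""
    penalty = 0.0
    for i in range(len(spheres)):
        for j in range(i + 1, len(spheres)):
            (center_i, radius_i), (center_j, radius_j) = spheres[i], spheres[j]
            depth = radius_i + radius_j - math.dist(center_i, center_j)
            if depth > 0.0:
                penalty += depth ** 2
    return penalty
```

A practical version would skip pairs of neighboring parts that always touch (for example adjacent finger segments) and would avoid the quadratic pairwise loop with a spatial acceleration structure.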

Evaluation and Results

The quantitative and qualitative evaluations underscore the superior performance of SMPL-X in capturing expressive 3D representations:

Dataset:
- A new curated dataset, EHF (Expressive Hands and Faces), is introduced. It consists of 100 frames selected from the SMPL+H dataset, with pseudo ground-truth.
- The dataset enables vertex-to-vertex (v2v) error metric evaluations, providing a stricter accuracy measure than 3D joint errors.
Performance:
- SMPL-X outperforms SMPL and SMPL+H in terms of both v2v error and 3D joint error, demonstrating that a more expressive model leads to more accurate reconstructions.
- Ablation studies highlight the contribution of different components, such as the variational body pose prior and the collision penalty, to the overall accuracy.
Real-World Applicability:
- SMPLify-X fits SMPL-X to in-the-wild images from multiple datasets, illustrating its robustness and practical utility.
- Comparative figures illustrate that SMPL-X offers competitive performance even when compared to models using extensive multi-camera setups.
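
The vertex-to-vertex (v2v) error mentioned above can be sketched as the mean Euclidean distance between corresponding vertices of a predicted and a reference mesh. The sketch assumes both meshes share the same topology and vertex order, and it leaves out any rigid alignment that an evaluation protocol may apply before measuring. The function name is illustrative.

```python
import math
from typing import Sequence, Tuple

Vertex = Tuple[float, float, float]


def v2v_error(predicted: Sequence[Vertex], reference: Sequence[Vertex]) -> float:
    """Mean per-vertex Euclidean distance between two meshes with
    identical topology (same vertex count and ordering)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("meshes must be non-empty and have the same vertex count")
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / len(predicted)
```

Because every surface vertex contributes, this metric is sensitive to errors in shape, finger articulation, and facial expression that a sparse set of 3D joints would not reveal.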

Implications and Future Directions

The research has significant implications for both theoretical developments and practical applications in AI, computer vision, and human-computer interaction:

Practical Applications:
- Enhanced 3D human modeling facilitates better animation, virtual reality experiences, and more nuanced human-computer interactions.
- The automatic gender detection and robust optimization pipeline make SMPL-X suitable for diverse real-world settings, extending its utility across numerous industries.
Theoretical Contributions:
- The introduction of SMPL-X advances the field of holistic 3D human modeling, emphasizing the integrated capture of body, hands, and face.
- The enhancements in pose priors and collision detection create a foundation for future models to build upon.
Future Work:
- Potential advancements include curating a larger dataset of in-the-wild SMPL-X fits and developing methods to regress SMPL-X parameters directly from RGB images, further simplifying and speeding up the process.

In summary, the paper presents a unified body model and a robust fitting method for single-image 3D human capture, improving on prior models in the reported evaluations and opening avenues for future research such as direct regression of SMPL-X parameters.