- The paper presents a shape-aware multi-view framework for 3D human pose and shape reconstruction, improving accuracy with direct shape supervision.
- The framework uses a large synthetic dataset with ground-truth shapes, enabling accurate shape recovery under garment occlusion in real images.
- The multi-view method improves pose estimation and achieves better shape recovery on real-world images compared to single-view baselines like HMR.
The paper addresses the problem of 3D human body reconstruction from multi-view images, leveraging the Skinned Multi-Person Linear (SMPL) model to represent pose and shape. The core idea is that using multiple views reduces projection ambiguity and improves reconstruction accuracy, especially for clothed humans.
Here's a breakdown:
- The paper introduces a learning-based, shape-aware framework for human body mesh reconstruction that uses SMPL parameters for pose and shape estimation and is directly supervised on shape parameters.
- A scalable, end-to-end, multi-view multi-stage learning framework accounts for the inherent ambiguity of reconstructing 3D human geometry from 2D images and yields improved estimation results.
- A large simulated dataset, including clothed human bodies with corresponding ground-truth parameters, improves reconstruction accuracy, especially for shape estimation, for which the real-world datasets provide no ground truth or supervision.
- Accurate shape recovery under garment occlusion is achieved by providing direct shape supervision and by the deeper network structure that the multi-view framework affords.
The paper uses a multi-view multi-stage network structure to capture visual features of garments, which are considered important cues for shape estimation. The model builds on SMPL, which represents body shape with Principal Component Analysis (PCA) coefficients and pose with per-joint rotation parameters.
To train the model, the authors generate a synthetic dataset of multi-view human motion sequences with varying poses, shapes, and clothes using physically-based simulation. The use of this dataset makes the model "shape-aware," capturing the statistical correlation between garment features and human body shapes.
Model Architecture:
- The network iteratively refines pose and shape estimates through multiple stages, processing multi-view images one at a time.
- A shared-parameter prediction block computes corrections based on image features and previous estimates.
- Camera and human body parameters are estimated simultaneously, and predicted 3D joints are projected back to 2D for loss computation.
- Pose and shape parameters are shared among views, while each view maintains its camera calibration and global rotation.
- The loss function combines 2D joint loss, 3D joint loss, and SMPL parameter loss (a minimal sketch follows this list):
$$L_i = \lambda_0 L_{\text{2D joint}} + \lambda_1 L_{\text{3D joint}} + L_{\text{SMPL}}$$
where:
- $L_i$ is the loss at stage $i$
- $L_{\text{2D joint}}$ is the 2D joint loss
- $L_{\text{3D joint}}$ is the 3D joint loss
- $L_{\text{SMPL}}$ is the SMPL parameter loss
- $\lambda_0$ and $\lambda_1$ are weighting factors.
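A minimal NumPy sketch of this staged loss, assuming simple squared-error terms; the helper names and default weights are illustrative, not taken from the paper's code:

```python
import numpy as np

def stage_loss(pred_j2d, gt_j2d, pred_j3d, gt_j3d,
               pred_smpl, gt_smpl, lam0=1.0, lam1=1.0):
    """Weighted sum of 2D joint, 3D joint, and SMPL parameter terms.

    pred_j2d/gt_j2d: (J, 2) joints, pred_j3d/gt_j3d: (J, 3) joints,
    pred_smpl/gt_smpl: stacked pose and shape parameter vectors.
    """
    l_2d = np.mean(np.sum((pred_j2d - gt_j2d) ** 2, axis=-1))
    l_3d = np.mean(np.sum((pred_j3d - gt_j3d) ** 2, axis=-1))
    # The SMPL term is only available where ground-truth parameters
    # exist (e.g. on the synthetic dataset), per the paper's setup.
    l_smpl = np.mean((pred_smpl - gt_smpl) ** 2)
    return lam0 * l_2d + lam1 * l_3d + l_smpl
```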
- The SMPL model generates a human body mesh from pose and shape parameters (see the sketch after this list):
$$X(\theta, \beta) = W\,G(\theta)\left(X_0 + S\beta + P\,R(\theta)\right)$$
where:
- $X$ are the computed vertices
- $\theta$ are the rotations of each joint plus the global rotation
- $\beta$ are the shape PCA coefficients
- $W$, $S$, and $P$ are trained matrices
- $G(\theta)$ is the global transformation
- $X_0$ are the mean body vertices
- $R(\theta)$ is the relative rotation matrix.
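A schematic NumPy rendition of this forward pass, assuming the array shapes noted in the docstring; a real SMPL implementation additionally regresses joint locations from the shaped template and composes the joint transforms along the kinematic tree, which is elided here:

```python
import numpy as np

def smpl_vertices(rotmats, beta, X0, S, P, W, G):
    """Simplified SMPL-style mesh generation:
    X(theta, beta) = W G(theta) (X0 + S beta + P R(theta)).

    rotmats: (J, 3, 3) per-joint rotations, beta: (10,) shape coefficients,
    X0: (V, 3) mean vertices, S: (V, 3, 10) shape basis,
    P: (V, 3, 9*(J-1)) pose basis, W: (V, J) skinning weights,
    G: (J, 4, 4) per-joint global transforms derived from theta.
    """
    # Shape blend shapes: template plus PCA shape offsets.
    v = X0 + np.einsum('vdk,k->vd', S, beta)
    # Pose blend shapes: offsets driven by the relative joint rotations
    # R(theta), flattened with the identity subtracted, as in SMPL.
    rel = (rotmats[1:] - np.eye(3)).reshape(-1)
    v = v + np.einsum('vdk,k->vd', P, rel)
    # Linear blend skinning: each vertex is moved by a weighted
    # combination of the joints' global transforms.
    v_h = np.concatenate([v, np.ones((v.shape[0], 1))], axis=1)
    T = np.einsum('vj,jab->vab', W, G)
    return np.einsum('vab,vb->va', T, v_h)[:, :3]
```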
- The 3D body is projected back to 2D using an orthographic (weak-perspective) projection (a minimal sketch follows this list):
$$x = s\,X(\theta, \beta)\,R^T + t$$
where:
- $x$ are the 2D projected vertices
- $s$ is the scale
- $R$ is the camera rotation matrix for the orthographic projection
- $t$ is the translation.
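A minimal sketch of this weak-perspective projection, with shapes as noted; dropping the depth coordinate after the rotation is the assumption that makes the projection orthographic:

```python
import numpy as np

def project_weak_perspective(X, R, s, t):
    """Orthographic projection x = s * X R^T + t.

    X: (N, 3) 3D vertices or joints, R: (3, 3) camera rotation,
    s: scalar scale, t: (2,) image-plane translation.
    """
    rotated = X @ R.T              # rotate into the camera frame
    return s * rotated[:, :2] + t  # drop depth, then scale and translate
```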
The framework uses a recurrent structure, making it applicable to any number of views. The multi-view multi-stage framework couples the image inputs through regression blocks with shared parameters, where the shared information is the predicted human body parameters. Inspired by residual networks, each regression block predicts corrective values rather than the updated parameters themselves, which prevents vanishing gradients; a sketch of this update loop follows.
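A sketch of that loop, assuming a hypothetical `regressor` callable that maps image features and current estimates to parameter corrections; the staging and parameter sharing mirror the description above, not the authors' code:

```python
def multi_view_multi_stage(features, pose, shape, cameras,
                           regressor, num_stages=3):
    """Recurrent multi-view multi-stage refinement.

    features: list of per-view image feature vectors,
    pose/shape: shared SMPL parameter estimates,
    cameras: list of per-view calibration/global-rotation parameters,
    regressor: shared-parameter block (feature, estimates) -> corrections.
    """
    cameras = list(cameras)
    for _ in range(num_stages):
        for v, feat in enumerate(features):  # views processed one at a time
            # Predict corrective residuals rather than absolute values,
            # so the update behaves like a residual-network step.
            d_pose, d_shape, d_cam = regressor(feat, pose, shape, cameras[v])
            pose, shape = pose + d_pose, shape + d_shape  # shared across views
            cameras[v] = cameras[v] + d_cam               # view-specific
    return pose, shape, cameras
```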
Data Preparation:
- The authors generate a synthetic dataset with ground-truth human body shapes in which garments are simulated as worn clothing rather than pasted onto the skin.
- The CMU MoCap dataset is used as the pose subspace, and shape parameters are uniformly sampled.
- An optimization scheme avoids inter-penetration in the generated human mesh.
- Cloth registration and simulation are performed using ArcSim, with random sampling of cloth tightness.
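A small illustrative sketch of the shape-sampling step; the number of coefficients and the uniform bounds are assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_body_shape(num_coeffs=10, bound=2.0):
    """Uniformly sample SMPL shape coefficients for one synthetic body."""
    return rng.uniform(-bound, bound, size=num_coeffs)

# Each sampled shape would then be posed with CMU MoCap sequences,
# optimized to remove self-penetration, and dressed via ArcSim cloth
# simulation with randomly sampled garment tightness.
betas = [sample_body_shape() for _ in range(4)]
```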
Results:
The model was evaluated on the Human3.6M and MPI-INF-3DHP datasets and compared against the HMR method. The results show that the multi-view approach achieves higher accuracy, especially in shape estimation.
The authors use Mean Per Joint Position Error (MPJPE), Percentage of Correct Keypoints (PCK), and Area Under the Curve (AUC) as pose estimation metrics, and the Hausdorff distance between meshes to capture shape differences; minimal sketches of MPJPE and the Hausdorff metric follow.
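Minimal NumPy sketches of the two distance-based metrics, treating the meshes as plain vertex sets (a brute-force Hausdorff computation, adequate for illustration but quadratic in vertex count):

```python
import numpy as np

def mpjpe(pred_j3d, gt_j3d):
    """Mean Per Joint Position Error: mean Euclidean distance between
    predicted and ground-truth 3D joints of shape (..., J, 3)."""
    return np.mean(np.linalg.norm(pred_j3d - gt_j3d, axis=-1))

def hausdorff(verts_a, verts_b):
    """Symmetric Hausdorff distance between two (V, 3) vertex sets,
    used as a proxy for the mesh-to-mesh shape difference."""
    d = np.linalg.norm(verts_a[:, None, :] - verts_b[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```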
The paper finds that multi-view input significantly improves pose estimation accuracy, and that the synthetic training data improves shape estimation without causing overfitting. The model is also shown to perform better on real-world images, capturing body shape more accurately than the baseline method.
A key advantage of the method is that the multi-view inputs need not be captured in exactly the same pose: the model's error-correction structure allows it to be applied as long as the poses across views do not differ significantly.