- The paper presents a novel end-to-end framework that integrates dense IUV mapping and adversarial priors for joint 3D pose and shape estimation.
- It employs a deep neural network with a ResNet-50 backbone and a differentiable renderer to optimize reconstruction losses using pixel-to-surface correspondences.
- Experimental results on benchmark datasets like Human3.6M and MPI-INF-3DHP show improved accuracy in pose estimation, segmentation, and mesh reconstruction.
DenseRaC: Joint 3D Pose and Shape Estimation by Dense Render-and-Compare
The paper "DenseRaC: Joint 3D Pose and Shape Estimation by Dense Render-and-Compare" presents a comprehensive methodology for estimating 3D human pose and body shape from monocular RGB images. This work introduces DenseRaC, an end-to-end framework that utilizes dense render-and-compare techniques, integrating dense body landmarks, body part masks, and adversarial priors to optimize 3D body reconstruction losses.
Framework Overview
DenseRaC processes a monocular RGB image in two steps. It first estimates a pixel-to-surface correspondence map (IUV map), which serves as an intermediate representation. A deep neural network then minimizes 3D body reconstruction errors by comparing rendered images against this estimate, bridging the gap between 2D observations and 3D body configurations.
The framework is detailed as follows:
- Step 1: A dense IUV map is generated, providing pixel-to-surface correspondences. This is used to infer parameterized human pose and body shape.
- Step 2: A deep network regresses the coefficients of a parametric human body model, encoding 3D pose and shape, from the IUV map; the resulting mesh is re-rendered and compared against the input-derived map under a set of dense reconstruction losses (see the sketch after this list).
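A minimal sketch of this two-step pipeline in PyTorch. The module interfaces, tensor shapes, and the SMPL-style body model are illustrative assumptions rather than the paper's actual implementation:

```python
# Sketch of the two-step DenseRaC pipeline (illustrative: module interfaces,
# tensor shapes, and the SMPL-style body model are assumptions).
import torch
import torch.nn as nn

class DenseRaCPipeline(nn.Module):
    def __init__(self, iuv_estimator, param_regressor, body_model, renderer):
        super().__init__()
        self.iuv_estimator = iuv_estimator      # Step 1: image -> IUV map
        self.param_regressor = param_regressor  # Step 2: IUV -> (pose, shape, cam)
        self.body_model = body_model            # parametric mesh model
        self.renderer = renderer                # differentiable renderer

    def forward(self, image: torch.Tensor):
        # Step 1: estimate the pixel-to-surface correspondence map
        # (part index I plus surface coordinates U, V).
        iuv = self.iuv_estimator(image)              # (B, 3, H, W)
        # Step 2: regress pose, shape, and camera parameters from the IUV map.
        pose, shape, cam = self.param_regressor(iuv)
        # Decode parameters into a mesh and re-render an IUV map, which the
        # render-and-compare losses match against the Step-1 estimate.
        vertices = self.body_model(pose, shape)      # (B, V, 3)
        rendered_iuv = self.renderer(vertices, cam)  # (B, 3, H, W)
        return iuv, rendered_iuv, pose, shape
```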
Methodology
DenseRaC employs several key components to ensure accurate 3D reconstruction:
- Deep Neural Network: A generator with a ResNet-50 backbone extracts feature maps, and a regressor iteratively refines the 3D pose, shape, and camera parameters.
- Differentiable Renderer: Produces 2D projections from the 3D body model for the render-and-compare operation, allowing gradients to flow back through the network.
- Adversarial Network: A discriminator trained with adversarial learning constrains the regressor toward plausible human pose and shape configurations (a loss-assembly sketch follows this list).
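A hedged sketch of how these components might combine into a training objective. The loss weights, the LSGAN-style generator term, and the discriminator interface are assumptions; the paper's exact formulation may differ:

```python
# Illustrative loss assembly for render-and-compare training (weights,
# the LSGAN-style adversarial term, and interfaces are assumptions).
import torch.nn.functional as F

def denserac_loss(iuv_pred, iuv_rendered, joints3d_pred, joints3d_gt,
                  params_pred, discriminator,
                  w_render=1.0, w_joint=1.0, w_adv=0.1):
    # Dense render-and-compare term: the re-rendered IUV map should match
    # the IUV map estimated from the input image, pixel by pixel.
    loss_render = F.l1_loss(iuv_rendered, iuv_pred)
    # Standard 3D joint supervision where paired ground truth exists.
    loss_joint = F.mse_loss(joints3d_pred, joints3d_gt)
    # Adversarial prior: the discriminator scores pose/shape plausibility,
    # pushing the regressor toward configurations it accepts.
    loss_adv = ((discriminator(params_pred) - 1.0) ** 2).mean()
    return w_render * loss_render + w_joint * loss_joint + w_adv * loss_adv
```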
Dataset Augmentation with MOCA
To broaden training coverage, DenseRaC leverages a newly constructed large-scale synthetic dataset called MOCA. The dataset synthesizes 3D human animations covering diverse poses, body shapes, and camera views, and it supplies abundant paired ground truth that helps bridge the gap between synthetic and real-world data and improves model robustness (a generation sketch follows).
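A hypothetical sketch of how one MOCA-style training pair could be produced; the sampling ranges, the mocap pose source, and the `render_rgb`/`render_iuv` renderer methods are all assumptions for illustration:

```python
# Hypothetical generation of one MOCA-style synthetic pair; the sampling
# ranges, mocap source, and renderer methods are assumptions.
import torch

def synthesize_sample(body_model, renderer, mocap_poses):
    # Sample a mocap pose, a random body shape, and a random camera view.
    pose = mocap_poses[torch.randint(len(mocap_poses), (1,))]
    shape = torch.randn(1, 10)                    # shape coefficients
    cam = {"yaw": 360.0 * torch.rand(1),          # full camera orbit
           "scale": 0.8 + 0.4 * torch.rand(1)}    # mild zoom variation
    vertices = body_model(pose, shape)
    image = renderer.render_rgb(vertices, cam)    # synthetic input image
    iuv_gt = renderer.render_iuv(vertices, cam)   # paired dense ground truth
    return image, iuv_gt, pose, shape
```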
Experimental Results
DenseRaC outperforms existing state-of-the-art methods on public benchmarks, demonstrating superior effectiveness across various human-related tasks:
- 3D Pose Estimation: Notably reduced mean per joint position error (MPJPE) on benchmarks such as Human3.6M and MPI-INF-3DHP.
- Semantic Body Segmentation: Higher accuracy and F1 scores, confirming the benefit of the dense IUV representation for body part segmentation.
- 3D Reconstruction: Reduced mean per vertex position error (MPVPE), reflecting improved mesh-level reconstruction accuracy (both metrics are defined in the sketch below).
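For reference, the two reported error metrics are standard averages of Euclidean distances; a minimal implementation, assuming joints of shape (N, J, 3) and vertices of shape (N, V, 3):

```python
# Minimal MPJPE / MPVPE implementations; tensor shapes are assumed to be
# (N, J, 3) joints and (N, V, 3) vertices, typically in millimeters.
import torch

def mpjpe(pred_joints: torch.Tensor, gt_joints: torch.Tensor) -> torch.Tensor:
    """Mean Per Joint Position Error: average Euclidean distance per joint."""
    return (pred_joints - gt_joints).norm(dim=-1).mean()

def mpvpe(pred_verts: torch.Tensor, gt_verts: torch.Tensor) -> torch.Tensor:
    """Mean Per Vertex Position Error: the mesh-level analogue of MPJPE."""
    return (pred_verts - gt_verts).norm(dim=-1).mean()
```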
Implications and Future Work
The proposed DenseRaC framework showcases the potential for several practical applications, including surveillance, entertainment, and augmented/virtual reality (AR/VR). Its integration of dense IUV mapping with synthetic data resources promotes robust learning paradigms capable of handling occlusion, varied body shapes, and rich action dynamics.
Looking ahead, future enhancements may explore advanced handling of occlusions, interaction modeling through multi-view fusion, and temporal smoothness in dynamic scenes, thereby expanding the applicability of DenseRaC in more complex scenarios.