- The paper introduces a novel voxel-based volumetric inference method that predicts 3D occupancy grids for detailed human shape reconstruction.
- The paper employs integrated multi-task learning, leveraging intermediate 2D pose, body-part segmentation, and 3D pose predictions to improve the final shape estimate.
- The paper demonstrates state-of-the-art performance on SURREAL and Unite the People datasets, enabling applications in video editing, animation, and AR.
Overview of "BodyNet: Volumetric Inference of 3D Human Body Shapes"
This paper introduces BodyNet, a novel approach for estimating 3D human body shape from a single image using volumetric inference with a neural network. It addresses the challenges of 3D human shape estimation, which is relevant to applications in video editing, animation, and fashion. Unlike conventional methods that regress the parameters of a predefined parametric body model, BodyNet predicts a voxel-based representation, offering a richer and more flexible description of a person's shape.
Key Contributions
- Volumetric Representation: BodyNet eschews direct regression of body model parameters in favor of predicting a 3D occupancy grid that represents the person's shape. This representation captures more detailed shape information than a low-dimensional parameter vector, at the cost of additional computation, and can express uncertainty over multiple plausible shapes (a minimal occupancy-grid sketch follows this list).
- Integrated Multi-Task Learning: The network is designed in a modular fashion, where intermediate tasks such as 2D pose estimation, 2D body-part segmentation, and 3D pose estimation guide the 3D shape prediction. This multi-task setup improves overall performance by feeding the auxiliary predictions into the shape subnetwork (see the wiring sketch after this list).
- End-to-End Training with Re-Projection Losses: BodyNet incorporates a multi-view re-projection loss that compares views of the predicted volume with 2D body silhouettes, enforcing boundary constraints on the predicted shape. This loss is instrumental in sharpening body boundaries, notably the limbs, which are typically harder to estimate (see the projection-loss sketch after this list).
- Evaluation and Results: The approach demonstrates state-of-the-art performance on the SURREAL and Unite the People (UP) datasets, achieving lower surface errors than existing methods. BodyNet also extends naturally to volumetric body-part segmentation, showcasing its versatility.
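The occupancy-grid output referenced in the first bullet can be pictured with a short sketch. This is not the authors' code: the decoder, feature dimension, and 32³ grid resolution are placeholder choices, used only to show the per-voxel formulation and a binary cross-entropy training loss over the volume.

```python
# Illustrative sketch (not the authors' code): per-voxel occupancy prediction
# trained with binary cross-entropy, as one way to realize the voxel output
# described in the paper. Decoder, feature size, and grid size are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

GRID = 32  # coarse illustrative resolution, not the paper's exact grid size

class VoxelHead(nn.Module):
    """Maps an image feature vector to a GRID^3 occupancy volume (logits)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.fc = nn.Linear(feat_dim, GRID * GRID * GRID)

    def forward(self, feats):                     # feats: (B, feat_dim)
        logits = self.fc(feats)                   # (B, GRID^3)
        return logits.view(-1, GRID, GRID, GRID)  # (B, D, H, W)

def occupancy_loss(pred_logits, gt_occupancy):
    """Per-voxel binary cross-entropy between predicted and true occupancy."""
    return F.binary_cross_entropy_with_logits(pred_logits, gt_occupancy)

# Usage with random stand-in data (real training uses voxelized body meshes)
feats = torch.randn(2, 256)
gt = (torch.rand(2, GRID, GRID, GRID) > 0.5).float()
loss = occupancy_loss(VoxelHead()(feats), gt)
```

The multi-task wiring can likewise be sketched. This is again an illustration rather than the paper's architecture: the subnetwork below simply concatenates the RGB image with assumed 2D pose heatmaps and part-segmentation maps before decoding a voxel volume; the hourglass stacks and the 3D-pose stream used in the paper are omitted, and all channel counts are assumptions.

```python
# Illustrative sketch (not the authors' code) of intermediate-task conditioning:
# 2D pose heatmaps and part-segmentation maps are concatenated with the RGB
# image before the volumetric shape subnetwork. Channel counts are assumptions.
import torch
import torch.nn as nn

class ShapeSubnet(nn.Module):
    """Consumes RGB plus auxiliary predictions; emits an occupancy volume."""
    def __init__(self, n_joints=16, n_parts=15, grid=32):
        super().__init__()
        in_ch = 3 + n_joints + n_parts          # RGB + pose heatmaps + segm maps
        self.grid = grid
        self.encode = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.decode = nn.Linear(64, grid ** 3)

    def forward(self, rgb, pose2d_heatmaps, segm_maps):
        x = torch.cat([rgb, pose2d_heatmaps, segm_maps], dim=1)
        feats = self.encode(x).flatten(1)
        return self.decode(feats).view(-1, self.grid, self.grid, self.grid)

# Stand-in inputs: batch of 2, 256x256 images, 16 joints, 15 body parts
voxels = ShapeSubnet()(torch.randn(2, 3, 256, 256),
                       torch.rand(2, 16, 256, 256),
                       torch.rand(2, 15, 256, 256))
```

One plausible realization of the multi-view re-projection loss is to collapse the predicted occupancy volume into orthographic silhouettes (here via a max along the depth and width axes) and compare each with a 2D body mask. The axis conventions and the use of a max projection are assumptions of this sketch, not a specification taken from the paper.

```python
# Illustrative sketch (not the authors' code) of a multi-view re-projection
# loss: the predicted occupancy volume is flattened to front- and side-view
# silhouettes by a max along one axis, and each silhouette is compared to a
# 2D ground-truth mask. Axis conventions here are an assumption.
import torch
import torch.nn.functional as F

def reprojection_loss(voxel_probs, front_mask, side_mask):
    """voxel_probs: (B, D, H, W) occupancy probabilities in [0, 1]."""
    front_sil = voxel_probs.max(dim=1).values   # collapse depth -> (B, H, W)
    side_sil = voxel_probs.max(dim=3).values    # collapse width -> (B, D, H)
    return (F.binary_cross_entropy(front_sil, front_mask) +
            F.binary_cross_entropy(side_sil, side_mask))

# Stand-in data: 32^3 volume with matching binary silhouettes
probs = torch.rand(2, 32, 32, 32)
loss = reprojection_loss(probs,
                         (torch.rand(2, 32, 32) > 0.5).float(),
                         (torch.rand(2, 32, 32) > 0.5).float())
```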
Experimental Findings
The authors conduct extensive evaluations of BodyNet, demonstrating its effectiveness across several dimensions:
- Quantitative Performance: The method surpasses prior techniques, including approaches based on SMPL model fitting (e.g., SMPLify) and direct regression of shape parameters.
- Modular Performance Improvements: Components such as the re-projection losses and the multi-task intermediate supervision each contribute measurable gains in accuracy, as shown by ablations across the test sets.
- Versatile Deployment: By producing plausible volumetric part segmentations, BodyNet is well positioned for integration with applications that require detailed human body models (a minimal multi-class voxel-loss sketch follows below).
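The part-segmentation extension mentioned above can be viewed as widening the single occupancy channel into one channel per body part plus background, trained with a per-voxel cross-entropy. The part count and the loss choice below are assumptions made for illustration, not the paper's specification.

```python
# Sketch (assumed, not the authors' code): multi-class voxel labels for
# volumetric body-part segmentation, trained with per-voxel cross-entropy.
import torch
import torch.nn.functional as F

N_PARTS = 6   # hypothetical count: e.g. head, torso, two arms, two legs
GRID = 32

def part_voxel_loss(part_logits, part_labels):
    """part_logits: (B, N_PARTS+1, D, H, W); part_labels: (B, D, H, W) ints."""
    return F.cross_entropy(part_logits, part_labels)  # class 0 = background

logits = torch.randn(2, N_PARTS + 1, GRID, GRID, GRID)
labels = torch.randint(0, N_PARTS + 1, (2, GRID, GRID, GRID))
loss = part_voxel_loss(logits, labels)

# At inference, the per-voxel argmax yields a labeled 3D body-part volume
part_volume = logits.argmax(dim=1)                    # (B, D, H, W)
```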
Implications and Future Directions
The implications of this work are multifaceted:
- Practical Applications: With its precision and full 3D body reconstruction capability, BodyNet is applicable to virtual try-on, cinematic editing, and augmented reality.
- Architecture Flexibility: The modular design can be combined with other tasks that require 3D understanding of human figures, paving the way for interactive systems built around dynamic human modeling.
- Potential Extensions: Future studies could delve further into capturing shape under varied clothing or incorporating dynamic elements to understand the underlying body shapes during motion.
Overall, BodyNet's volumetric inference marks a significant advance in non-rigid shape modeling, bridging the gap between 2D image inputs and full 3D shape outputs. The work establishes a solid foundation for applying neural architectures to challenging 3D inference tasks, benefiting both academic and industrial computer vision.