- The paper introduces a robust pipeline that reconstructs accurate, animation-ready 3D human models from a single monocular video.
- It augments the SMPL body model with per-vertex offsets and unposes silhouette cones into a canonical frame, enabling detailed shape and texture estimation.
- Empirical results demonstrate an average reconstruction error of 4.5 mm, opening new opportunities in VR, biometric analysis, and e-commerce.
Video-Based Reconstruction of 3D People Models: A Technical Overview
The paper "Video Based Reconstruction of 3D People Models" presents a novel method to construct detailed 3D human body models from single, monocular video sequences. This research introduces a robust pipeline that enables the generation of accurate 3D models capturing personal details, textures, and animation-ready skeletons. The authors achieve a reconstruction accuracy of 4.5 mm, a notable accomplishment given the inherent challenges of monocular video data and dynamic motion.
Methodology
The proposed approach builds on a parametric body model, the SMPL (Skinned Multi-Person Linear) model, extended with per-vertex offsets that capture personal details and clothing variation. The core innovation is transforming the dynamically posed body into a canonical frame of reference: silhouette cones from each video frame are "unposed" into this common frame, so a visual hull of the moving person can be computed as if they had stood still, enabling efficient shape and texture estimation.
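To make the shape representation concrete, here is a minimal NumPy sketch of an SMPL-style model with free-form offsets: the template mesh is deformed by shape and pose blendshapes plus the per-vertex offsets D, then posed with linear blend skinning. All function names and array shapes are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of SMPL-with-offsets posing, assuming illustrative shapes.
import numpy as np

def posed_vertices(template, shape_dirs, pose_dirs, offsets,
                   betas, pose_feat, bone_transforms, skin_weights):
    """Pose an SMPL-style mesh extended with per-vertex offsets D.

    template:        (V, 3)      mean template mesh
    shape_dirs:      (V, 3, 10)  shape blendshape basis
    pose_dirs:       (V, 3, P)   pose-corrective blendshape basis
    offsets:         (V, 3)      free-form per-vertex offsets D
    betas:           (10,)       shape coefficients
    pose_feat:       (P,)        pose-dependent feature vector
    bone_transforms: (J, 4, 4)   rigid transform per joint for this frame
    skin_weights:    (V, J)      linear blend skinning weights
    """
    # Canonical (unposed) surface: template + shape + pose correctives + D
    t = (template
         + shape_dirs @ betas        # identity-dependent deformation
         + pose_dirs @ pose_feat     # pose-dependent correctives
         + offsets)                  # personal detail and clothing

    # Per-vertex transform: skinning-weighted sum of bone transforms
    per_vertex = np.einsum('vj,jab->vab', skin_weights, bone_transforms)

    # Apply in homogeneous coordinates
    t_h = np.concatenate([t, np.ones((t.shape[0], 1))], axis=1)
    posed = np.einsum('vab,vb->va', per_vertex, t_h)
    return posed[:, :3]
```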
The method comprises three main steps, each illustrated with a short code sketch after the list:
- Pose Reconstruction: Uses the SMPL model to estimate the 3D pose in each frame by fitting to 2D keypoint detections, with silhouette terms for accuracy and smoothness terms for temporal coherence.
- Consensus Shape Estimation: Unposes the silhouette cones so that constraints from every frame apply to the body shape in a single canonical T-pose, enabling one comprehensive optimization of shape and personalized surface detail across the whole sequence.
- Texture Generation and Frame Refinement: Focuses on capturing temporal variations and generating coherent textures. The refined shapes allow for high-quality texture mapping from multiple frames.
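For the pose reconstruction step, the following is a compact sketch of the kind of per-frame objective involved: the posed model's joints are projected into the image and compared against 2D keypoint detections, weighted by detection confidence. The camera model and all names here are illustrative assumptions; the paper's full objective also includes silhouette and temporal-coherence terms.

```python
# Confidence-weighted 2D joint reprojection energy (illustrative sketch).
import numpy as np

def joint_reprojection_energy(joints_3d, joints_2d, confidences, K):
    """Sum of confidence-weighted squared reprojection errors.

    joints_3d:   (J, 3) posed model joints in camera coordinates
    joints_2d:   (J, 2) detected 2D keypoints (e.g. from a CNN detector)
    confidences: (J,)   per-detection confidence weights
    K:           (3, 3) camera intrinsics
    """
    proj = (K @ joints_3d.T).T          # homogeneous image coordinates
    proj = proj[:, :2] / proj[:, 2:3]   # perspective divide
    residuals = proj - joints_2d
    return float((confidences * (residuals ** 2).sum(axis=1)).sum())
```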
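For consensus shape estimation, the unposing idea can be sketched as follows: the inverse of each vertex's blend-skinning transform maps a point observed in the posed frame back into the canonical T-pose, so constraints from every frame act on one consensus shape. Function and variable names are illustrative, not the authors' code.

```python
# Map per-frame observations back to the canonical pose (illustrative).
import numpy as np

def unpose_points(points, vertex_ids, skin_weights, bone_transforms):
    """Invert linear blend skinning for points tied to model vertices.

    points:          (N, 3)    points observed in the posed frame, e.g.
                               closest points on silhouette rays
    vertex_ids:      (N,)      model vertex each point is associated with
    skin_weights:    (V, J)    linear blend skinning weights
    bone_transforms: (J, 4, 4) per-joint rigid transforms for this frame
    """
    # Skinning transform of each associated vertex, then its inverse
    G = np.einsum('nj,jab->nab', skin_weights[vertex_ids], bone_transforms)
    G_inv = np.linalg.inv(G)

    p_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    unposed = np.einsum('nab,nb->na', G_inv, p_h)
    return unposed[:, :3]
```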
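For texture generation, one simple way to realize multi-frame fusion is a robust per-texel median over all frames in which the texel is visible, which suppresses outliers from slight misalignment or lighting changes. Sampling and visibility computation are abstracted away here, and the median fusion rule is an assumption for illustration, not necessarily the paper's exact scheme.

```python
# Robust per-texel fusion of colors sampled from many frames (illustrative).
import numpy as np

def fuse_texture(samples, valid):
    """Fuse per-frame texel colors into a single texture map.

    samples: (F, H, W, 3) color sampled for each texel from each frame
    valid:   (F, H, W)    per-frame texel visibility mask
    """
    masked = np.where(valid[..., None], samples, np.nan)
    # Median over the frames that actually observed each texel
    texture = np.nanmedian(masked, axis=0)
    # Texels never observed stay NaN; fill with neutral gray
    return np.where(np.isnan(texture), 0.5, texture)
```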
The paper also discusses the trade-off between free-form and model-based reconstruction, and argues for combining parametric constraints with free-form surface optimization to achieve high-fidelity results from limited input data.
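A minimal sketch of what such a combination can look like, assuming a simple least-squares formulation: per-vertex offsets are driven by data residuals, while a Laplacian smoothness term and a magnitude prior keep the free-form surface well behaved and close to the parametric body. The weights and specific terms are illustrative, not the paper's exact energy.

```python
# Parametric-plus-free-form energy: data term with regularized offsets.
import numpy as np

def smoothness_energy(offsets, neighbors):
    """Uniform-Laplacian regularizer on per-vertex offsets D.

    offsets:   (V, 3) current free-form offsets
    neighbors: list of V index arrays, the one-ring of each vertex
    """
    energy = 0.0
    for v, nbrs in enumerate(neighbors):
        lap = offsets[v] - offsets[nbrs].mean(axis=0)  # discrete Laplacian
        energy += float(lap @ lap)
    return energy

def total_energy(data_residuals, offsets, neighbors, w_lap=1.0, w_mag=0.1):
    """Silhouette-style data term plus smoothness and magnitude priors."""
    e_data = float((data_residuals ** 2).sum())  # e.g. ray-to-surface distances
    e_lap = smoothness_energy(offsets, neighbors)
    e_mag = float((offsets ** 2).sum())          # stay close to the SMPL body
    return e_data + w_lap * e_lap + w_mag * e_mag
```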
Results and Evaluation
The effectiveness of the method is demonstrated on multiple datasets, including BUFF and Dynamic FAUST, which provide ground-truth 3D scans under varied clothing conditions. The results show an average reconstruction error of 4.5 mm, improving to 3.1 mm when ground-truth poses are known. Achieving such detailed reconstruction from the RGB data of a single camera is a significant technical achievement.
Comparisons with existing methods, such as KinectCap, which relies on RGB-D data, highlight the method's competitive performance despite the weaker input. In practical terms, the approach significantly lowers the barrier to 3D model generation, requiring only a standard RGB camera.
Implications and Future Directions
This research has broad implications, impacting fields ranging from virtual and augmented reality to online retail and surveillance. By enabling fully animatable digital doubles from simple video input, the method opens new prospects for personalized VR experiences, biometric analysis, and virtual try-on scenarios in e-commerce.
Future work could refine the model's ability to handle extreme variations in clothing and hair, and improve recovery of concave regions, which silhouette-based methods cannot observe directly. Integrating lighting and material estimation could further enable more realistic rendering and video enhancement.
This comprehensive approach to video-based 3D reconstruction not only presents a technical advancement in computer graphics and vision but also paves the way for widespread accessibility and applications of personalized 3D modeling.