- The paper presents Octopus, a model that reconstructs 3D human shape with clothing and hair from 1–8 monocular RGB frames while achieving 4–5 mm accuracy in under 10 seconds.
- It leverages both bottom-up and top-down processing with canonical T-pose encoding to produce pose-invariant and personalized reconstructions.
- Trained solely on synthetic 3D data, the framework democratizes 3D avatar creation for applications in VR, AR, gaming, and cinematography.
Learning to Reconstruct People in Clothing from a Single RGB Camera
The paper presents Octopus, a learning-based framework that reconstructs the 3D shape of a person, including clothing and hair, from a small number of frames (1–8) captured with a monocular RGB camera. The authors focus on cutting the computational cost and runtime of previous methods, reaching a reconstruction accuracy of 4–5 mm in under 10 seconds. This advance has significant implications for applications spanning virtual reality, augmented reality, gaming, and cinematography.
A key strength of Octopus is that it predicts personalized shape details, such as clothing and hair, in a canonical T-pose space: each image is encoded into a pose-invariant representation, which keeps the encoding consistent across frames. The model combines a bottom-up prediction stream with a top-down refinement stream, producing a fast initial prediction that is then refined for greater accuracy and better agreement with the input data, as sketched below.
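To make the two streams concrete, here is a minimal PyTorch sketch of the predict-then-refine idea. Everything here is illustrative: the layer sizes, the `predict`/`refine`/`data_term` helpers, and the stand-in data term are assumptions for exposition, not the paper's actual architecture or losses.

```python
import torch
import torch.nn as nn

LATENT_DIM = 256
NUM_VERTS = 6890  # SMPL vertex count

# Bottom-up stream: each frame -> a latent code meant to live in a
# pose-invariant, canonical T-pose space (layer sizes are illustrative).
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, LATENT_DIM),
)
# Decoder maps the fused code to SMPL shape parameters (10 betas) plus
# free-form per-vertex offsets that capture clothing and hair.
decoder = nn.Linear(LATENT_DIM, 10 + NUM_VERTS * 3)

def predict(frames):
    """Fast bottom-up pass: one code per frame, averaged across frames."""
    code = encoder(frames).mean(dim=0)          # frames: (F, 3, H, W)
    out = decoder(code)
    return code, out[:10], out[10:].view(NUM_VERTS, 3)

def data_term(betas, offsets, frames):
    """Stand-in for the paper's image-based losses (silhouette overlap,
    joint re-projection); here just a dummy penalty so the sketch runs."""
    return offsets.pow(2).mean() + betas.pow(2).mean()

def refine(code, frames, steps=30, lr=1e-2):
    """Top-down pass: tune the latent code by gradient descent so the
    decoded avatar better matches the observed frames."""
    code = code.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([code], lr=lr)
    for _ in range(steps):
        out = decoder(code)
        betas, offsets = out[:10], out[10:].view(NUM_VERTS, 3)
        loss = data_term(betas, offsets, frames)
        opt.zero_grad(); loss.backward(); opt.step()
    return code
```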
Octopus is trained exclusively on synthetic 3D data, eliminating the need for real images paired with ground-truth 3D annotations, which are difficult to obtain. At inference time the model accepts anywhere from one to eight images, and still preserves an accuracy of about 5 mm when given only a single image (see the usage example below). The authors demonstrate the effectiveness of this approach across several datasets, showcasing its robustness and applicability.
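Continuing the sketch above, the frame-averaging inside `predict` is what lets one network serve any number of input frames; here it runs on one and on eight hypothetical random inputs:

```python
frames_1 = torch.randn(1, 3, 128, 128)   # single-image input
frames_8 = torch.randn(8, 3, 128, 128)   # eight-frame input
code_1, betas_1, offsets_1 = predict(frames_1)
code_8, betas_8, offsets_8 = predict(frames_8)
print(code_1.shape, code_8.shape)        # same fixed-size code either way
```

Averaging per-frame codes makes the fusion permutation-invariant and independent of the frame count, which is one plausible reading of why accuracy degrades gracefully down to a single image.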
The methodology builds on the SMPL (Skinned Multi-Person Linear) model to represent the undressed human body and adds per-vertex offsets to capture details such as clothing. Encoding semantic information with convolutional neural networks, Octopus generates shape predictions that are subsequently optimized against silhouette overlap and joint re-projection error. This two-stage predict-then-refine process keeps the model consistent with the real-world observations.
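The body representation can be written compactly. The sketch below, again with stand-in tensors (`template`, `shape_dirs`) in place of the real SMPL model data, shows the dressed template as the undressed SMPL shape plus vertex offsets, together with a schematic version of the silhouette-plus-joints data term used during refinement:

```python
import torch

NUM_VERTS = 6890
template = torch.zeros(NUM_VERTS, 3)        # SMPL mean T-pose mesh (stand-in)
shape_dirs = torch.zeros(NUM_VERTS, 3, 10)  # shape blend shapes (stand-in)

def dressed_template(betas, D):
    """T(beta, D) = T_mean + B_s(beta) + D: the undressed SMPL body shape
    plus free-form vertex offsets D modelling clothing and hair."""
    return template + torch.einsum('vck,k->vc', shape_dirs, betas) + D

def refinement_loss(pred_sil, obs_sil, pred_j2d, obs_j2d, w_joints=1.0):
    """Schematic data term: silhouette overlap plus 2D joint re-projection
    error. A real pipeline would render pred_sil differentiably from the
    posed mesh and project 3D joints through the camera model."""
    sil_term = (pred_sil - obs_sil).pow(2).mean()
    joint_term = (pred_j2d - obs_j2d).pow(2).sum(-1).mean()
    return sil_term + w_joints * joint_term
```

The paper's exact loss terms differ; the point is only the split between a parametric body (the betas) and per-vertex detail (D), with image-space terms driving the refinement.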
Practically, this paper moves toward democratizing 3D avatar creation by providing a fast and minimally intrusive way to acquire personalized digital models. Validation on real-world datasets such as LifeScans and PeopleSnapshot substantiates the model's generalizability and its applicability outside synthetic environments.
Speculatively, the advances made by Octopus could serve as a foundation for future work that prioritizes speed and personalization in non-cooperative scenarios, such as extracting models from legacy media or ad-hoc environments. Future work could also extend the framework to more complex attire and to the diverse human poses found outside controlled conditions.
In conclusion, Octopus takes a significant stride toward lowering the barriers to entry for 3D body reconstruction, with broad implications for fields that rely on accurate human modeling. It adds to the growing repertoire of tools enabling personalized digital experiences.