
ICON: Implicit Clothed humans Obtained from Normals (2112.09127v2)

Published 16 Dec 2021 in cs.CV, cs.AI, and cs.GR

Abstract: Current methods for learning realistic and animatable 3D clothed avatars need either posed 3D scans or 2D images with carefully controlled user poses. In contrast, our goal is to learn an avatar from only 2D images of people in unconstrained poses. Given a set of images, our method estimates a detailed 3D surface from each image and then combines these into an animatable avatar. Implicit functions are well suited to the first task, as they can capture details like hair and clothes. Current methods, however, are not robust to varied human poses and often produce 3D surfaces with broken or disembodied limbs, missing details, or non-human shapes. The problem is that these methods use global feature encoders that are sensitive to global pose. To address this, we propose ICON ("Implicit Clothed humans Obtained from Normals"), which, instead, uses local features. ICON has two main modules, both of which exploit the SMPL(-X) body model. First, ICON infers detailed clothed-human normals (front/back) conditioned on the SMPL(-X) normals. Second, a visibility-aware implicit surface regressor produces an iso-surface of a human occupancy field. Importantly, at inference time, a feedback loop alternates between refining the SMPL(-X) mesh using the inferred clothed normals and then refining the normals. Given multiple reconstructed frames of a subject in varied poses, we use SCANimate to produce an animatable avatar from them. Evaluation on the AGORA and CAPE datasets shows that ICON outperforms the state of the art in reconstruction, even with heavily limited training data. Additionally, it is much more robust to out-of-distribution samples, e.g., in-the-wild poses/images and out-of-frame cropping. ICON takes a step towards robust 3D clothed human reconstruction from in-the-wild images. This enables creating avatars directly from video with personalized and natural pose-dependent cloth deformation.

Citations (249)

Summary

  • The paper introduces a dual-module approach that fuses SMPL(-X)-guided normal prediction with local feature-based implicit surface reconstruction.
  • It demonstrates superior generalization on AGORA and CAPE datasets, achieving comparable results with only 12% of training data.
  • ICON enables scalable creation of animatable 3D avatars from single images, with significant applications in VR, AR, and digital content creation.

Insights into "ICON: Robust 3D Clothed Human Reconstruction from In-the-Wild Images"

The paper presents ICON, a methodology for reconstructing 3D representations of clothed humans from single RGB images. This task is particularly complex due to the need to accurately capture the detailed geometry of clothing and human form under varying poses and occlusions.

Background and Motivation

Traditionally, creating animatable 3D avatars requires either 3D scans or controlled 2D imagery, which is not scalable. Existing parametric models capture the basic human form but lack clothing detail and flexibility for varied poses. Implicit functions have been explored to capture finer details but face challenges in robustness across diverse, real-world scenarios. ICON addresses these limitations by integrating implicit-function-based methods with body-model priors, specifically the SMPL(-X) model.

Methodological Innovations

ICON comprises two main modules:

  1. Normal Prediction: An SMPL(-X) mesh guides the prediction of clothed-body normals. Conditioned on the rendered SMPL(-X) normal maps, the network predicts front and back normals of the clothed human, which helps resolve occluded regions.
  2. Local Feature-Based Implicit Surface Reconstruction: Methods that rely on global feature encoders are sensitive to pose variation. ICON instead uses local, visibility-aware features derived from the SMPL(-X) body, which are largely independent of global pose, to regress the occupancy of the 3D surface. This markedly improves robustness to out-of-distribution body poses.

At inference time, a feedback loop alternates between refining the SMPL(-X) mesh using the inferred clothed normals and re-predicting the normals from the refined mesh.
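To illustrate the local-feature idea (a toy sketch, not the authors' implementation), the per-query feature fed to an implicit regressor can be approximated as the signed distance to the nearest body point plus that point's normal. Unlike a global image encoding, such features depend only on local geometry and are therefore insensitive to global pose:

```python
import numpy as np

def local_body_features(query_pts, body_pts, body_normals):
    """Toy local features per query point: distance to the nearest body
    point, signed via the body normal (negative inside the body), plus
    that normal itself. Pose-local, unlike a global image encoding."""
    feats = []
    for q in query_pts:
        d = np.linalg.norm(body_pts - q, axis=1)
        i = np.argmin(d)                          # nearest body point
        offset = q - body_pts[i]
        sign = np.sign(offset @ body_normals[i])  # + outside, - inside
        feats.append(np.concatenate(([sign * d[i]], body_normals[i])))
    return np.array(feats)                        # shape (N, 4)

# A unit sphere as a stand-in "body": surface points with outward normals.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(500, 3))
body = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

# One query outside the body (positive signed distance), one inside (negative).
f = local_body_features(np.array([[0.0, 0.0, 2.0], [0.0, 0.0, 0.0]]), body, body)
```

In ICON these features are learned from the SMPL(-X) fit and combined with the predicted normal maps; the sketch above only conveys why they remain stable under pose changes.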

Evaluation and Performance

ICON's efficacy was demonstrated through evaluations on the AGORA and CAPE datasets. Notably, ICON exhibited superior generalization to in-the-wild poses, outperforming both baseline models and existing state-of-the-art methods. The quantitative metrics—Chamfer distance, point-to-surface (P2S) distance, and surface-normal difference—highlight ICON's capacity to handle complex cases with higher accuracy. It also generalizes better to real-world scenarios, showing notable resilience to out-of-frame cropping and scale disparities, failure modes that hampered prior methods.
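The two distance metrics above can be sketched with a simple point-cloud approximation (the paper evaluates against meshes; this is a simplified stand-in):

```python
import numpy as np

def p2s(points, surface_pts):
    """Point-to-surface: mean distance from each reconstructed point to
    its nearest ground-truth surface point (surface given as samples)."""
    d = np.linalg.norm(points[:, None, :] - surface_pts[None, :, :], axis=2)
    return d.min(axis=1).mean()

def chamfer(a, b):
    """Symmetric Chamfer distance: average of the two directed P2S terms."""
    return 0.5 * (p2s(a, b) + p2s(b, a))

a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[0.0, 0.0], [1.0, 1.0]])
print(chamfer(a, b))  # 0.5: each set's unmatched point is 1.0 from its nearest neighbor
```

Lower is better for both metrics; the normal-difference metric additionally compares surface orientation, penalizing reconstructions that are close in position but wrongly oriented.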

Importantly, ICON maintains performance with significantly less training data, a key advantage given that robust model training typically demands extensive data. Even with only 12% of the training data, ICON achieved results comparable to training on the full dataset.

Practical and Theoretical Implications

ICON facilitates the conversion of simple video frames into comprehensive 3D avatars that animate accurately with clothing deformation. This ability can significantly impact virtual reality (VR), augmented reality (AR), and the burgeoning development of a digital "metaverse." By providing a scalable, cost-effective alternative to 3D scanning technology, ICON can drive innovations across sectors requiring virtual human models, including entertainment, remote training, education, and digital content creation.

Theoretically, ICON's approach provides a roadmap for further integration of parametric models with deep learning methodologies, emphasizing the benefits of local feature extraction over global encoding processes.

Limitations and Future Directions

While ICON overcomes many existing roadblocks, it has limitations, particularly with clothing that deviates substantially from the body surface, such as skirts or loose garments. Moreover, large discrepancies between the estimated and actual SMPL(-X) fits can lead to significant reconstruction errors.

Future research could explore the extension of ICON's capabilities to more complex garment types and further improve the refinement loop for SMPL(-X) optimization. Additionally, creating datasets featuring diverse clothing types and poses would bolster the model's generalization capabilities.

ICON represents a critical advancement in 3D human reconstruction with potential long-lasting impacts on the fields of computer vision and graphics, alongside practical applications far beyond the scope covered in this work. As the technology matures, it will allow broader accessibility and usability of 3D modeling, transcending current technological and data limitations.
