
TeCH: Text-guided Reconstruction of Lifelike Clothed Humans (2308.08545v2)

Published 16 Aug 2023 in cs.CV, cs.AI, and cs.GR

Abstract: Despite recent research advancements in reconstructing clothed humans from a single image, accurately restoring the "unseen regions" with high-level details remains an unsolved challenge that lacks attention. Existing methods often generate overly smooth back-side surfaces with a blurry texture. But how to effectively capture all visual attributes of an individual from a single image, which are sufficient to reconstruct unseen areas (e.g., the back view)? Motivated by the power of foundation models, TeCH reconstructs the 3D human by leveraging 1) descriptive text prompts (e.g., garments, colors, hairstyles) which are automatically generated via a garment parsing model and Visual Question Answering (VQA), 2) a personalized fine-tuned Text-to-Image diffusion model (T2I) which learns the "indescribable" appearance. To represent high-resolution 3D clothed humans at an affordable cost, we propose a hybrid 3D representation based on DMTet, which consists of an explicit body shape grid and an implicit distance field. Guided by the descriptive prompts + personalized T2I diffusion model, the geometry and texture of the 3D humans are optimized through multi-view Score Distillation Sampling (SDS) and reconstruction losses based on the original observation. TeCH produces high-fidelity 3D clothed humans with consistent & delicate texture, and detailed full-body geometry. Quantitative and qualitative experiments demonstrate that TeCH outperforms the state-of-the-art methods in terms of reconstruction accuracy and rendering quality. The code will be publicly available for research purposes at https://huangyangyi.github.io/TeCH

Text-guided Reconstruction of Lifelike Clothed Humans: An Overview

The paper Text-guided Reconstruction of Lifelike Clothed Humans presents a novel approach for reconstructing 3D human figures with detailed geometry and high-resolution textures from a single input image. Unlike traditional methods that struggle with unseen regions or produce overly smooth and blurry reconstructions, the authors leverage advanced foundation models, including a personalized Text-to-Image (T2I) diffusion model and visual question answering (VQA), to guide the reconstruction process.

Key Contributions

The major contributions of this paper can be summarized as follows:

  1. Hybrid 3D Representation: The authors propose a hybrid 3D representation based on DMTet, consisting of an explicit body shape grid and an implicit distance field. This allows for high-resolution detail at a manageable computational cost.
  2. Text-guided Reconstruction: The method employs descriptive text prompts derived from visual question answering (VQA) to capture specific attributes of the human figure, such as garment types, colors, and facial features. These prompts guide the personalized diffusion model during the reconstruction.
  3. Multi-stage Optimization: A multi-stage optimization strategy is employed, comprising a geometry stage and a texture stage. Each stage combines Score Distillation Sampling (SDS) with reconstruction losses to refine the model (the SDS gradient is written out just below).
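
For reference, contribution 3's SDS loss is usually stated as a gradient rather than a closed-form objective. Following the standard DreamFusion-style notation (TeCH applies this per view, with view-conditioned prompts), it reads:

$$
\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right]
$$

where $x = g(\theta)$ is a rendering of the 3D representation with parameters $\theta$, $x_t$ is its noised version at diffusion timestep $t$, $\hat{\epsilon}_\phi$ is the (here personalized) diffusion model's noise prediction conditioned on the text prompt $y$, and $w(t)$ is a timestep weighting.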

Methodology

The proposed method is divided into several key steps:

  1. Text Guidance Extraction: A fine-tuned SegFormer model parses the input image into clothing items and other attributes, which a VQA model then converts into descriptive text prompts (steps 1 and 2 are illustrated in the first sketch after this list).
  2. Personalized T2I Diffusion Model: A T2I diffusion model is fine-tuned with DreamBooth on a few augmented images of the subject, binding a unique token to the individual's specific appearance. This token is combined with the descriptive text to form the final prompt for reconstruction.
  3. 3D Representation: A hybrid representation combines an explicit body shape grid with an implicit distance field, so both the overall body structure and fine surface details can be optimized (see the representation sketch below).
  4. Optimization Stages: Optimization proceeds in two stages. Geometry is first refined with a silhouette loss, SDS loss on rendered normal maps, and geometric regularization; texture is then optimized with SDS loss on color renderings plus a color consistency loss that harmonizes visible and occluded regions (see the loss sketch below).
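
To make steps 1 and 2 concrete, here is a minimal sketch of VQA-driven prompt construction. The checkpoint, question set, prompt template, and the `[V]` subject token are illustrative assumptions; the paper uses its own fine-tuned garment parser and VQA setup and binds the token via DreamBooth fine-tuning.

```python
# Sketch: build a descriptive prompt from VQA answers (illustrative models,
# questions, and template; not the paper's exact pipeline).
from PIL import Image
from transformers import pipeline

# Off-the-shelf VQA model; TeCH fine-tunes its own parsing/VQA stack.
vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

QUESTIONS = {  # attribute -> question (hypothetical question set)
    "top": "What kind of top is the person wearing?",
    "top_color": "What color is the top?",
    "bottom": "What kind of pants or skirt is the person wearing?",
    "hair": "What hairstyle does the person have?",
}

def build_prompt(image_path: str, subject_token: str = "[V]") -> str:
    image = Image.open(image_path)
    answers = {k: vqa(image=image, question=q, top_k=1)[0]["answer"]
               for k, q in QUESTIONS.items()}
    # Combine the DreamBooth-style subject token with parsed attributes.
    return (f"a photo of {subject_token} person, wearing "
            f"{answers['top_color']} {answers['top']} and {answers['bottom']}, "
            f"with {answers['hair']} hair")

print(build_prompt("input.jpg"))
```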
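
Step 3's hybrid representation can be pictured as learnable per-vertex SDF values (plus small explicit deformations) on a fixed tetrahedral grid, initialized from the estimated body shape. The PyTorch sketch below is a simplified rendition under that assumption; `marching_tetrahedra` stands in for a differentiable marching-tetrahedra routine (Kaolin ships one), and the paper's exact grid construction and initialization differ.

```python
import torch
import torch.nn as nn

class HybridDMTet(nn.Module):
    """Simplified DMTet-style field: learnable SDF values on a tet grid.

    verts: (V, 3) tetrahedral-grid vertex positions; tets: (T, 4) indices.
    The SDF is initialized from an explicit body shape (e.g., a fitted
    SMPL-X mesh) and then optimized; small per-vertex deformations give
    the explicit grid extra geometric freedom.
    """
    def __init__(self, verts, tets, init_sdf):
        super().__init__()
        self.register_buffer("verts", verts)
        self.register_buffer("tets", tets)
        self.sdf = nn.Parameter(init_sdf.clone())            # implicit field
        self.deform = nn.Parameter(torch.zeros_like(verts))  # explicit offsets

    def extract_mesh(self):
        # Bounded vertex deformation keeps the grid well-conditioned.
        v = self.verts + 0.05 * torch.tanh(self.deform)
        # Placeholder for a differentiable routine, e.g.
        # kaolin.ops.conversions.marching_tetrahedra.
        return marching_tetrahedra(v, self.tets, self.sdf)
```

Initializing the SDF from the fitted body means optimization starts from a plausible human shape rather than from noise.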
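
Step 4 then reads as an optimization loop over that representation. The skeleton below shows the geometry stage's loss composition; `render_silhouette`, `render_normals`, `sds_loss`, `sample_random_view`, `prompt_for_view`, and `laplacian_smoothing` are hypothetical helpers standing in for a differentiable rasterizer, the personalized-diffusion guidance, and standard mesh regularizers, and the loss weights are illustrative, not the paper's.

```python
# Sketch of the geometry-stage objective (hypothetical helpers,
# illustrative weights).
model = HybridDMTet(verts, tets, init_sdf=body_sdf)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(num_steps):
    mesh = model.extract_mesh()
    cam = sample_random_view()                 # hypothetical view sampler

    # Reconstruction term: match the input-view silhouette.
    loss_sil = (render_silhouette(mesh, input_cam) - input_mask).abs().mean()

    # SDS term: score rendered normal maps with the personalized T2I model,
    # using a view-dependent prompt (e.g. "back view of [V] person ...").
    loss_sds = sds_loss(render_normals(mesh, cam), prompt_for_view(cam))

    # Regularization: keep the evolving surface smooth.
    loss_reg = laplacian_smoothing(mesh)

    loss = loss_sil + 0.1 * loss_sds + 0.01 * loss_reg
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The texture stage follows the same pattern, swapping normal renders for color renders and adding the color consistency loss across visible and occluded regions.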

Experimental Results

The authors conducted both quantitative and qualitative evaluations to demonstrate the effectiveness of their method. They utilized datasets such as CAPE and THuman2.0 for geometric accuracy and texture quality comparison. Metrics like Chamfer distance, point-to-surface distance (P2S), and normal error were used for 3D evaluation, whereas PSNR, SSIM, and LPIPS were used for assessing texture quality.
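
The 3D metrics are standard; as a quick reference, the sketch below computes bi-directional Chamfer distance and a point-to-surface error (approximated here as point-to-nearest-point from ground-truth samples to the prediction) with SciPy. This mirrors common evaluation practice, not necessarily the paper's exact protocol.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_p2s(pred_pts: np.ndarray, gt_pts: np.ndarray):
    """pred_pts, gt_pts: (N, 3) points sampled from each surface."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)   # prediction -> GT
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # GT -> prediction
    chamfer = 0.5 * (d_pred_to_gt.mean() + d_gt_to_pred.mean())
    p2s = d_gt_to_pred.mean()   # GT points to predicted surface (approx.)
    return chamfer, p2s
```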

The results indicate that their method outperforms existing state-of-the-art techniques in both geometric accuracy and texture quality, with notable improvements on both the 3D and 2D metrics. Additionally, qualitative comparisons and a perceptual study confirm the superior realism and consistency of the outputs generated by the proposed method.

Implications and Future Work

The practical implications of this research are significant for applications in augmented and virtual reality, gaming, social media, and beyond. The ability to reconstruct lifelike, highly detailed 3D humans from a single image opens new opportunities for immersive experiences and personalized digital avatars.

From a theoretical perspective, the integration of text guidance and personalized diffusion models represents a novel fusion of natural language processing and computer vision techniques for 3D reconstruction. This approach can be extended to other reconstruction tasks beyond human figures, opening new research directions in multimodal learning.

Future work could focus on addressing limitations such as handling extremely loose clothing and improving the robustness of pose estimation. Leveraging controllable T2I models and investigating compositional generation of separate body components could lead to further advancements. Additionally, improvements in computational efficiency would make this method more accessible for real-time applications.

Conclusion

Text-guided Reconstruction of Lifelike Clothed Humans presents a significant advancement in the field of 3D reconstruction. By combining VQA-based text guidance with a personalized T2I diffusion model, the authors offer a method that not only reconstructs detailed full-body geometry but also generates high-quality textures, even in unseen regions. This innovative approach sets a new benchmark for future research and practical applications in creating lifelike digital humans.

Authors (7)
  1. Yangyi Huang
  2. Hongwei Yi
  3. Yuliang Xiu
  4. Tingting Liao
  5. Jiaxiang Tang
  6. Deng Cai
  7. Justus Thies
Citations (57)