- The paper introduces a novel 3D-decoupling transformer that reconstructs clothed avatars from single images.
- It leverages a Vision Transformer encoder and a 3D-decoupling decoder with cross-attention to extract global features and convert 2D data into detailed 3D embeddings.
- Experimental results showing reduced Chamfer distances and improved PSNR underscore its superior geometric and texture reconstruction for real-world applications.
This essay analyzes a transformative approach to reconstructing three-dimensional (3D) clothed avatars from single images, a challenging task owing to the wide variation in human poses and clothing styles. The paper presents a novel model, the Global-correlated 3D-decoupling Transformer for clothed Avatar reconstruction (GTA), which advances previous methods in accuracy and reliability.
Model Architecture and Components
GTA is built around three key components:
- Vision Transformer Encoder: At the core of GTA is the Vision Transformer encoder, which comprehensively captures global-correlated features from a monocular input image. This architecture capitalizes on the transformer's ability to model long-range dependencies, contributing to a more complete and context-aware initial feature extraction.
- 3D-decoupling Decoder: The model features an innovative decoder that utilizes cross-attention mechanisms to decouple the image into tri-plane features. This is achieved by learning embeddings that guide the transformation of 2D image data into detailed 3D planes, enabling a robust characterization of complex human shapes.
- Hybrid Prior Fusion Strategy: A key addition to the reconstruction process is a fusion strategy that combines spatial queries with prior-enhanced queries. Blending human body prior knowledge with direct spatial information improves the model's ability to predict both the geometry and the texture of the avatar.
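The decoupling idea behind the second component can be sketched concretely. The following is a minimal, illustrative example (not the authors' implementation): a set of learnable per-plane queries cross-attends to the encoder's image tokens, producing one embedding set per canonical plane, which together form the tri-plane features. All names, shapes, and values here are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query attends over all keys
    and returns a weighted average of the corresponding values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical setup: 4 image tokens (e.g. from a ViT encoder), feature dim 3.
image_tokens = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6],
                [0.2, 0.1, 0.0], [0.9, 0.8, 0.7]]

# One (learnable, here fixed) query set per canonical plane; 2 queries each.
plane_queries = {
    "xy": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "xz": [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]],
    "yz": [[0.5, 0.5, 0.5], [0.0, 1.0, 1.0]],
}

# Decoupling: each plane's queries cross-attend to the shared image tokens,
# yielding plane-specific embeddings that together form the tri-plane features.
triplane = {name: cross_attention(q, image_tokens, image_tokens)
            for name, q in plane_queries.items()}
```

In the real model the queries, keys, and values pass through learned projections and many attention layers; the sketch only shows why shared 2D features can yield three distinct plane-wise embeddings.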
Experimental Evaluation
The method was evaluated on several datasets, including CAPE and THuman2.0, showing superior performance relative to state-of-the-art models on reconstruction tasks. Notably, GTA achieved a significant reduction in Chamfer distance, indicating high geometric accuracy. It also excelled in texture reconstruction, as reflected by higher Peak Signal-to-Noise Ratio (PSNR) scores.
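The two metrics above can be stated concretely. A minimal sketch, assuming point clouds given as lists of 3D points and images given as 2D grayscale grids with values in [0, 1]:

```python
import math

def chamfer_distance(A, B):
    """Symmetric Chamfer distance between two point sets: the mean
    nearest-neighbour distance from A to B plus that from B to A.
    Lower values mean the reconstructed surface is geometrically closer
    to the ground truth."""
    def one_way(P, Q):
        return sum(min(math.dist(p, q) for q in Q) for p in P) / len(P)
    return one_way(A, B) + one_way(B, A)

def psnr(img, ref, max_val=1.0):
    """Peak Signal-to-Noise Ratio between a reconstructed texture and a
    reference, in decibels. Higher values mean a closer match."""
    n = sum(len(row) for row in img)
    mse = sum((a - b) ** 2
              for ra, rb in zip(img, ref)
              for a, b in zip(ra, rb)) / n
    return float("inf") if mse == 0 else 10 * math.log10(max_val ** 2 / mse)
```

For example, `chamfer_distance([(0, 0, 0)], [(1, 0, 0)])` is 2.0 (one unit in each direction), and `psnr` of an image with itself is infinite since the mean squared error is zero.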
Implications and Future Directions
The research exhibits both practical and theoretical implications. Practically, GTA paves the way for applications such as virtual reality (VR), augmented reality (AR), and digital fashion by enabling high-fidelity avatar reconstruction from single images. Theoretically, it suggests new directions for exploiting transformer architectures in computer vision, particularly for tasks necessitating comprehensive feature representation from limited input data.
Challenges and Considerations
Despite these advances, the model's reliance on pretrained body priors can introduce errors when the initial body estimate is inaccurate, and the approach can struggle with loose or unconventional clothing that deviates greatly from the body shape. Both point to areas for future improvement.
Conclusion
In conclusion, the Global-correlated 3D-decoupling Transformer represents a substantive contribution to the field of single-image 3D human reconstruction, combining innovative model architecture with effective feature fusion strategies to achieve superior results. Continued exploration of transformer-based approaches for varied vision tasks is likely to yield further insights and improvements in both efficiency and performance.