- The paper proposes a generative 3D-GS model leveraging 2D multi-view diffusion priors for high-fidelity avatar reconstruction.
- It integrates 2D diffusion with 3D reconstruction through an iterative refinement process to ensure robust multi-view consistency.
- Experimental results demonstrate state-of-the-art performance in both geometry and appearance across diverse datasets.
Overview of Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models
The paper entitled "Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models" presents a rigorous approach to generating realistic avatars from a single RGB image. The authors address a significant challenge in computer vision and graphics by introducing a novel method that couples 2D multi-view diffusion models with 3D reconstruction techniques to ensure 3D consistency.
Key Contributions
The main contributions of the paper can be summarized as follows:
- Generative 3D-GS Model: The authors propose a generative model for 3D Gaussian Splats (3D-GS) that leverages 2D multi-view diffusion priors, ensuring realistic avatar creation that maintains high-fidelity geometry and texture.
- 3D Consistent Diffusion: By coupling 2D multi-view diffusion models with a 3D reconstruction model, the approach enforces 3D consistency across multi-view renderings, addressing the inconsistencies that arise in purely 2D diffusion frameworks.
- Iterative Refinement Process: The paper introduces a diffusion procedure that iteratively refines the reverse-sampling trajectory using the explicit 3D representation, harmonizing the denoising process across views to improve multi-view consistency.
- State-of-the-Art Performance: Experiments demonstrate that the proposed framework surpasses existing state-of-the-art methods in creating realistic avatars from single images, achieving substantial improvements in both geometric and appearance metrics.
Detailed Methodology
Generative 3D-GS Reconstruction
The authors present a novel 3D-GS generation model, $g_\phi$, conditioned on the input context image to perform 3D reconstruction. Rather than diffusing directly in the space of the 3D representation, the model operates on multi-view 2D renderings of the 3D-GS. Leveraging priors from a 2D multi-view diffusion model, $\epsilon_\theta$, it iteratively reconstructs a 3D representation that maintains multi-view consistency.
Joint Training and Consistent Sampling
The training process jointly optimizes the 2D diffusion model and the 3D generative model by coupling them at each timestep of the diffusion process. Concretely, 3D-GS generation is enhanced with priors from the estimated clean multi-views $\mathbf{x}^{\text{tgt}}_0$ at each step, and the reverse sampling process is refined iteratively using 3D-GS renderings. This maintains 3D consistency across the generated views and ultimately yields coherent 3D avatars.
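Below is a minimal sketch of this coupled reverse sampling, assuming a standard epsilon-prediction DDIM update with eta = 0. Here eps_theta, g_phi, and render stand in for the 2D multi-view diffusion model, the 3D-GS reconstruction model, and a differentiable splat renderer; the authors' exact update rule may differ.

```python
import torch

@torch.no_grad()
def sample_3d_consistent(eps_theta, g_phi, render, context, cams,
                         alphas_bar, steps):
    """At every timestep, the 2D prior's clean estimate is replaced by
    renderings of a reconstructed 3D-GS before the trajectory continues."""
    x_t = torch.randn(len(cams), 3, 256, 256)          # noisy multi-views x_T
    for i, t in enumerate(steps):                      # e.g. [999, 979, ..., 0]
        a_t = alphas_bar[t]
        eps = eps_theta(x_t, t, context)               # 2D multi-view prior
        x0_hat = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # clean estimate
        gaussians = g_phi(context, x0_hat)             # lift estimate to 3D-GS
        x0_3d = render(gaussians, cams)                # re-render: 3D-consistent
        if i + 1 == len(steps):
            return gaussians, x0_3d
        a_prev = alphas_bar[steps[i + 1]]
        # DDIM (eta = 0) step using the refined, 3D-consistent estimate.
        eps_refined = (x_t - a_t.sqrt() * x0_3d) / (1 - a_t).sqrt()
        x_t = a_prev.sqrt() * x0_3d + (1 - a_prev).sqrt() * eps_refined
```

The key point is that x0_3d, not x0_hat, drives the next step, so every denoising step is anchored to a single explicit 3D representation.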
Experimental Results
The authors train and evaluate on several datasets, including custom high-quality human scans and publicly available datasets such as SIZER and IIIT, ensuring diverse and comprehensive testing. Performance is measured quantitatively with Chamfer Distance (CD), F-score, Normal Consistency (NC), SSIM, LPIPS, PSNR, and FID.
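For illustration, a brute-force symmetric Chamfer Distance between two sampled point clouds can be computed as follows; real evaluation protocols typically align and normalize the meshes and sample a fixed number of surface points, which is omitted here.

```python
import torch

def chamfer_distance(p1: torch.Tensor, p2: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer Distance between point sets of shape (N, 3) and (M, 3)."""
    d = torch.cdist(p1, p2)              # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# Example: points sampled from a predicted and a ground-truth surface.
cd = chamfer_distance(torch.rand(10_000, 3), torch.rand(10_000, 3))
```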
Comparative Analysis
Experiments indicate that Human 3Diffusion outperforms existing methods, including SiTH, PIFu, and LGM, in both geometry and appearance metrics. The method generalizes well, producing high-quality avatars even in out-of-distribution scenarios such as the traditional Indian attire in the IIIT dataset.
Ablation Studies
These studies underscore the importance of the 2D multi-view priors and the iterative refinement process: removing either leads to a noticeable degradation in the quality of the generated avatars, reinforcing the authors' methodological choices.
Implications and Future Directions
Practical Applications
The proposed method holds significant potential for practical applications in augmented and virtual reality, gaming, and film, where rapid and realistic avatar creation from minimal input is crucial. The strong performance in generating accurate textures and maintaining 3D consistency suggests that this method can enhance user experience and create new opportunities for interactive media.
Theoretical Implications
Theoretically, the paper's approach of coupling 2D and 3D diffusion models to ensure multi-view consistency can be extended to other domains where 3D reconstruction from 2D inputs is required. This coupling methodology opens avenues for future research in improving diffusion techniques and integrating them with 3D generative models.
Conclusion
The paper "Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models" presents a forward-thinking approach to avatar creation, emphasizing the integration of 2D diffusion priors with 3D generative models to ensure consistency and high fidelity. The rigorous experimental evaluation, comprehensive ablation studies, and significant practical implications highlight the robustness and applicability of the proposed method. Future advancements in higher resolution multi-view diffusion models and handling complex poses are anticipated to further enhance the method’s applicability and performance.