MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement (2408.14211v1)

Published 26 Aug 2024 in cs.CV and cs.AI

Abstract: Existing works in single-image human reconstruction suffer from weak generalizability due to insufficient training data or 3D inconsistencies for a lack of comprehensive multi-view knowledge. In this paper, we introduce MagicMan, a human-specific multi-view diffusion model designed to generate high-quality novel view images from a single reference image. As its core, we leverage a pre-trained 2D diffusion model as the generative prior for generalizability, with the parametric SMPL-X model as the 3D body prior to promote 3D awareness. To tackle the critical challenge of maintaining consistency while achieving dense multi-view generation for improved 3D human reconstruction, we first introduce hybrid multi-view attention to facilitate both efficient and thorough information interchange across different views. Additionally, we present a geometry-aware dual branch to perform concurrent generation in both RGB and normal domains, further enhancing consistency via geometry cues. Last but not least, to address ill-shaped issues arising from inaccurate SMPL-X estimation that conflicts with the reference image, we propose a novel iterative refinement strategy, which progressively optimizes SMPL-X accuracy while enhancing the quality and consistency of the generated multi-views. Extensive experimental results demonstrate that our method significantly outperforms existing approaches in both novel view synthesis and subsequent 3D human reconstruction tasks.

Authors: Xu He, Xiaoyu Li, Di Kang, Jiangnan Ye, Chaopeng Zhang, Liyang Chen, Xiangjun Gao, Han Zhang, Zhiyong Wu, Haolin Zhuang

Summary

The paper introduces a novel view synthesis method that combines a pre-trained 2D diffusion model with the parametric SMPL-X body model to generate consistent multi-view human images from a single reference image.
It uses a hybrid 1D-3D attention mechanism and a geometry-aware dual branch to keep the generated views consistent with one another while limiting memory cost.
Iterative refinement of the SMPL-X pose parameters corrects inaccurate initial estimates, and the paper reports better 3D human reconstruction accuracy than prior state-of-the-art methods.

MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement

MagicMan is a human-specific multi-view diffusion model for novel view synthesis and subsequent 3D human reconstruction. Its primary objective is to generate high-quality, consistent multi-view images of a person from a single reference image. The method combines a pre-trained 2D diffusion model, which supplies a generative prior for generalizability, with the parametric SMPL-X model, which supplies a 3D body prior for 3D awareness.

Core Components and Innovative Techniques

The MagicMan framework has four main components, each aimed at a specific difficulty in generating consistent multi-view human images for 3D reconstruction:

Conditional Diffusion Model:
- Uses the denoising UNet of a pre-trained 2D diffusion model (Stable Diffusion 1.5) as the generative backbone, so that large-scale image priors carry over.
- Incorporates a reference UNet to extract features from the provided human image, enhancing the consistency between generated images and the reference.
- Controls the viewpoint through camera embeddings, and uses normal and segmentation maps from the SMPL-X model as geometric guidance.
Hybrid Multi-View Attention:
- Introduces a hybrid 1D-3D attention mechanism to balance memory efficiency against cross-view consistency.
- 1D attention connects the views cheaply by attending along the view dimension only.
- 3D attention attends jointly over the spatial and view dimensions for more thorough information interchange; it is restricted to a sparse subset of selected views to limit memory overhead.
Geometry-Aware Dual Branch:
- A dual-branch approach generates both RGB images and normal maps, with shared blocks ensuring feature fusion across domains.
- This technique improves geometric consistency and enhances the accuracy of generated details.
Iterative Refinement Strategy:
- Progressively optimizes the SMPL-X pose parameters using feedback from the generated multi-view images.
- By iteratively improving SMPL-X accuracy, the model addresses the ill-shaped geometry issues commonly arising from inaccurate initial estimates.
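The trade-off behind the hybrid multi-view attention can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names (`view_1d_attention`, `sparse_3d_attention`, `score_entries`), the tensor layout `(views, pixels, channels)`, and the omission of learned query/key/value projections are all assumptions made for illustration. The sketch only shows which tokens attend to which, and how the attention score count grows in each case.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention; q: (..., Nq, C), k and v: (..., Nk, C).
    scale = q.shape[-1] ** -0.5
    weights = softmax(q @ np.swapaxes(k, -1, -2) * scale, axis=-1)
    return weights @ v

def view_1d_attention(x):
    # x: (V, P, C). Each pixel position attends only along the view axis.
    # Assumption: identity projections instead of learned Q/K/V weights.
    xt = np.swapaxes(x, 0, 1)        # (P, V, C): one short sequence per pixel
    out = attention(xt, xt, xt)      # scores per pixel are V x V
    return np.swapaxes(out, 0, 1)    # back to (V, P, C)

def sparse_3d_attention(x, key_views):
    # Every token of every view attends to all tokens (all pixels) of a
    # sparse subset of views, i.e. jointly over space and view.
    V, P, C = x.shape
    q = x.reshape(V * P, C)
    kv = x[key_views].reshape(len(key_views) * P, C)
    return attention(q, kv, kv).reshape(V, P, C)

def score_entries(V, P, K):
    # Number of attention scores computed by each variant.
    return {
        "1d": P * V * V,              # per-pixel attention across views
        "sparse_3d": V * P * K * P,   # all tokens vs. K selected views
        "full_3d": (V * P) ** 2,      # all tokens vs. all tokens
    }

# Hypothetical sizes: 8 views, 16 pixels per view, 4 channels.
x = np.random.default_rng(0).standard_normal((8, 16, 4))
y = sparse_3d_attention(view_1d_attention(x), key_views=[0, 4])
print(y.shape, score_entries(V=8, P=16, K=2))
```

With these hypothetical sizes, the 1D variant computes far fewer scores than full 3D attention, and restricting the keys to a few selected views keeps the 3D variant between the two. How the views are selected and how the two attention types are interleaved inside the UNet is not specified in this summary.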

Experimental Results and Evaluation

MagicMan was evaluated on several datasets, including THuman2.1, CustomHumans, and diverse in-the-wild images. The paper reports that it generates more consistent, higher-quality multi-view images and reconstructs more detailed 3D human meshes than the compared methods. The key findings are:

Novel View Synthesis:
- MagicMan significantly outperformed existing methods, such as Zero123, SV3D, and animation-based approaches, in terms of PSNR, SSIM, LPIPS, and CLIP scores.
- The proposed hybrid attention mechanism and geometry-aware dual branch were shown to be critical in achieving consistent multi-view generation.
3D Human Reconstruction:
- MagicMan demonstrated substantial improvements in Chamfer, P2S, and normal errors when compared to state-of-the-art reconstruction methods like PIFu, PaMIR, ICON, and ECON.
- The iterative refinement strategy effectively mitigated the ill-shaped issues, resulting in accurate and consistent geometric structures.
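For readers unfamiliar with the metrics named above, the following is a minimal sketch of two of them: PSNR for image quality and Chamfer distance for geometry. The function names and conventions are assumptions. In particular, Chamfer distance is defined in several ways in the literature (squared or unsquared distances, summed or averaged), and this summary does not state which convention the paper uses. P2S differs in that it measures distance from points to a mesh surface rather than to sampled points, and is not shown here.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    # Peak signal-to-noise ratio in dB for images with values in [0, max_val].
    mse = np.mean((np.asarray(pred) - np.asarray(target)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def chamfer_distance(a, b):
    # a: (N, 3), b: (M, 3) point sets sampled from two surfaces.
    # Assumed convention: average of the two mean nearest-neighbor
    # (unsquared Euclidean) distances. Brute force, O(N * M) memory.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M)
    return float(0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean()))

# Usage with synthetic data (not results from the paper).
rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
noisy = np.clip(img + 0.05 * rng.standard_normal(img.shape), 0.0, 1.0)
pts = rng.random((200, 3))
print(psnr(noisy, img), chamfer_distance(pts, pts + 0.01))
```

Higher PSNR is better, while lower Chamfer distance is better. For large point clouds a spatial index such as a k-d tree would replace the brute-force distance matrix.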

Implications and Future Directions

The main practical implication concerns 3D human modeling workflows. The ability to generate dense, consistent multi-view images from a single reference image could simplify digital human creation in applications such as virtual reality, gaming, and digital entertainment.

In terms of future developments, several areas of improvement and exploration are noted:

Enhanced Backbones: Utilizing more advanced diffusion models like SDXL or exploring higher-resolution models could further improve the quality of generated hands and faces.
Robust Reconstruction Techniques: Integrating techniques such as score distillation sampling (SDS) and image-level losses can reduce reliance on strict multi-view consistency, potentially leading to sharper textures in reconstructed meshes.
Specialized Techniques: Incorporating specialized methods for specific human body parts may address current limitations in detailed depiction.

Conclusion

MagicMan advances multi-view human image generation and 3D reconstruction by combining a pre-trained diffusion prior, an SMPL-X body prior, hybrid multi-view attention, and a geometry-aware dual branch. The iterative refinement strategy stands out as a direct answer to a common problem in single-image 3D human reconstruction: ill-shaped geometry caused by an inaccurate initial SMPL-X estimate. These techniques are likely to serve as a basis for further work on accurate and efficient digital human modeling.
