PSHuman: Photorealistic Single-image 3D Human Reconstruction using Cross-Scale Multiview Diffusion and Explicit Remeshing (2409.10141v2)

Published 16 Sep 2024 in cs.CV

Abstract: Detailed and photorealistic 3D human modeling is essential for various applications and has seen tremendous progress. However, full-body reconstruction from a monocular RGB image remains challenging due to the ill-posed nature of the problem and sophisticated clothing topology with self-occlusions. In this paper, we propose PSHuman, a novel framework that explicitly reconstructs human meshes utilizing priors from the multiview diffusion model. It is found that directly applying multiview diffusion on single-view human images leads to severe geometric distortions, especially on generated faces. To address it, we propose a cross-scale diffusion that models the joint probability distribution of global full-body shape and local facial characteristics, enabling detailed and identity-preserved novel-view generation without any geometric distortion. Moreover, to enhance cross-view body shape consistency of varied human poses, we condition the generative model on parametric models like SMPL-X, which provide body priors and prevent unnatural views inconsistent with human anatomy. Leveraging the generated multi-view normal and color images, we present SMPLX-initialized explicit human carving to recover realistic textured human meshes efficiently. Extensive experimental results and quantitative evaluations on CAPE and THuman2.1 datasets demonstrate PSHuman's superiority in geometry details, texture fidelity, and generalization capability.

Summary

The paper introduces PSHuman, a novel diffusion-based framework that reconstructs detailed 3D human models from a single image while addressing geometric distortions.
It employs cross-scale diffusion and SMPL-X conditioned multi-view generation to preserve facial and body details, achieving reconstructions in about one minute.
Quantitative tests on CAPE and THuman2.1 datasets confirm significant improvements in geometry accuracy, texture fidelity, and overall robustness compared to previous methods.

Photorealistic Human Reconstruction via Cross-Scale Diffusion

The paper describes PSHuman, a framework for reconstructing detailed, photorealistic 3D human models from a single image using multi-view diffusion. It targets two main challenges of single-view human reconstruction: generating realistic 3D geometry and texture, and avoiding geometric distortions, especially under complex poses and clothing topologies.

PSHuman uses priors from multi-view diffusion models to reconstruct human meshes. Its key innovations are a body-face cross-scale diffusion and an SMPL-X conditioned multi-view diffusion. Together they preserve local features such as facial characteristics while keeping the full-body shape consistent and free from distortion across views.
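To make the "cross-scale" idea concrete, the sketch below prepares two inputs from one image: the full-body view and an enlarged crop of the face region. It is only an illustration of pairing a global and a local scale. The function names, the `(top, left, bottom, right)` box format, and nearest-neighbour enlargement are assumptions for this example, not details taken from the paper, and the image is a plain nested list standing in for an RGB array.

```python
from typing import List, Tuple

Image = List[List[float]]          # rows of pixel values (stand-in for an RGB array)
Box = Tuple[int, int, int, int]    # hypothetical format: (top, left, bottom, right)


def crop(image: Image, box: Box) -> Image:
    """Cut out the rows top..bottom-1 and columns left..right-1."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def upsample_nearest(image: Image, factor: int) -> Image:
    """Enlarge an image by repeating each pixel `factor` times in both axes."""
    out: Image = []
    for row in image:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def cross_scale_inputs(image: Image, face_box: Box, factor: int) -> Tuple[Image, Image]:
    """Return the global (full-body) view and an enlarged local (face) view."""
    return image, upsample_nearest(crop(image, face_box), factor)


if __name__ == "__main__":
    body = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    global_view, face_view = cross_scale_inputs(body, (0, 1, 2, 3), 2)
    print(len(global_view), len(face_view), len(face_view[0]))
```

In a real system the face box would come from a face detector or from the fitted body model, and both views would be passed to the generative model jointly so that the face is synthesized at a higher effective resolution than the body.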

Quantitative evaluations on the CAPE and THuman2.1 datasets show that PSHuman outperforms existing methods in geometric detail, texture fidelity, and generalization capability. The improvements are most visible in facial features and fabric textures. The framework is also efficient: it reconstructs a model in approximately one minute, whereas some alternative state-of-the-art methods require multi-hour optimization.
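This summary does not name the geometry metrics used. Chamfer distance between points sampled from the reconstructed and ground-truth surfaces is a common choice for this kind of evaluation, so the sketch below shows how such a score can be computed. It is an assumption about the style of metric, not a reproduction of the paper's evaluation protocol, and the brute-force search is only suitable for small point sets.

```python
import math
from typing import Sequence

Point = Sequence[float]


def one_sided_distance(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Mean distance from each point in `src` to its nearest point in `dst`."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: average of the two one-sided distances."""
    return 0.5 * (one_sided_distance(a, b) + one_sided_distance(b, a))


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 0.0)]
    print(chamfer_distance(predicted, reference))
```

Lower values mean the two surfaces are closer. Conventions differ between papers (sum versus mean of the two directions, squared versus unsquared distances), so scores are only comparable when computed the same way.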

Technical Insights and Performance

PSHuman builds on diffusion models, specifically by fine-tuning pretrained text-to-image models for multi-view generation. The pipeline consists of several stages; notably, an SMPLX-initialized explicit human carving module recovers high-fidelity textured 3D human meshes from the generated multi-view normal and color images. The reported experiments indicate strong full-body reconstruction performance, even under varying poses and occlusions.
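One way to read "explicit" here is that mesh vertices are adjusted directly, starting from a body template, rather than being extracted from an implicit field. The toy sketch below illustrates only that idea. It is not the paper's carving procedure: per-vertex target positions stand in for the image-space losses a real pipeline would use, and `refine_vertices`, `neighbours`, `lr`, and `smooth` are hypothetical names chosen for this example.

```python
from typing import Dict, List

Vec = List[float]


def refine_vertices(
    init: List[Vec],
    targets: List[Vec],
    neighbours: Dict[int, List[int]],
    steps: int = 200,
    lr: float = 0.1,
    smooth: float = 0.0,
) -> List[Vec]:
    """Iteratively move template vertices toward targets.

    Each step pulls a vertex toward its target (data term) and, when
    `smooth` > 0, toward the average of its neighbours (smoothing term).
    """
    verts = [list(v) for v in init]  # copy so the template is not modified
    for _ in range(steps):
        updated: List[Vec] = []
        for i, v in enumerate(verts):
            nbrs = neighbours.get(i, [])
            new_v: Vec = []
            for k in range(len(v)):
                data_pull = v[k] - targets[i][k]
                if nbrs:
                    avg = sum(verts[j][k] for j in nbrs) / len(nbrs)
                    smooth_pull = v[k] - avg
                else:
                    smooth_pull = 0.0
                new_v.append(v[k] - lr * (data_pull + smooth * smooth_pull))
            updated.append(new_v)
        verts = updated
    return verts


if __name__ == "__main__":
    template = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
    observed = [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
    print(refine_vertices(template, observed, {0: [1], 1: [0]}))
```

Starting from a plausible body shape means the optimization only has to account for clothing and fine detail, which is one reason template initialization can be faster than optimizing a shape from scratch.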

The paper compares PSHuman against prior methodologies, from implicit function-based reconstruction and explicit shape-based approaches to recent diffusion-based methods, and reports significant progress on their limitations. A series of ablation studies establishes the contribution of each technical component to the overall robustness and fidelity of the proposed method.

Implications and Future Research

The practical implications of such a reconstruction framework are numerous. Potential applications span industries from fashion and gaming to film and virtual/augmented reality, where precise and realistic human models are essential. On the theoretical side, the research is a significant step forward in applying diffusion models to the generation of complex, self-occluded subjects.

Future developments could focus on mitigating error propagation from pose estimation, enhancing robustness to in-the-wild scenarios, and integrating more comprehensive datasets to further improve performance. The use of neural networks for rendering such detailed 3D structures might also be extended, broadening the capability of the framework across diverse environments and applications.

In conclusion, the PSHuman framework presents a significant advancement in the field of photorealistic 3D human reconstruction. By combining cross-scale diffusion models with SMPL-X conditioning, the paper sets a new standard for efficiency and realism in single-image modeling techniques. These contributions not only address persistent challenges in the field but also open new avenues for research and application in real-world scenarios.
