- The paper presents a two-stage approach that couples a multi-view diffusion model with GS-LRM to reconstruct detailed 3D heads from a single image.
- The method preserves identity and maintains view consistency by generating six virtual viewpoints from a single frontal image.
- Evaluations on synthetic and real datasets show superior performance on metrics such as PSNR, SSIM, and LPIPS compared to previous techniques.
Overview of the "FaceLift: Single Image to 3D Head with View Generation and GS-LRM" Paper
The paper "FaceLift: Single Image to 3D Head with View Generation and GS-LRM" addresses the longstanding challenge of reconstructing high-fidelity 3D human heads from a single image, a significant problem in the fields of computer vision and graphics. Traditional techniques for 3D head synthesis typically depend on parametric models derived from extensive 3D scan datasets, which often result in outputs lacking fine geometric details. Recent advancements in generative models, such as GANs and diffusion models, have opened new avenues for developing more complex and detailed 3D representations without needing large, structured datasets.
Methodology
FaceLift operates through a two-stage pipeline that pairs diffusion-based multi-view generation with a large reconstruction model (GS-LRM). The methodology breaks down into two key phases:
- Multi-view Generation using Diffusion Models: The first phase uses a diffusion model to generate consistent side and back views of the human head from a single input image. The model runs a latent diffusion process, conditioned on the frontal input image, to produce six virtual viewpoints. This multi-view setup supports robust identity preservation and helps ensure view consistency across angles (a simplified sketch of this stage follows the list).
- 3D Reconstruction via GS-LRM: In the second phase, the generated views serve as inputs to GS-LRM, a large reconstruction model that translates them into a 3D head represented as Gaussian splats (see the parameter sketch below). This model captures detailed facial geometry and texture by leveraging strong priors learned from synthetic data. The authors also emphasize the importance of training GS-LRM on both synthetic head datasets and large-scale datasets such as Objaverse to enhance its realism and applicability to real-world images.
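To make the first stage concrete, here is a minimal, heavily simplified sketch of image-conditioned multi-view sampling. The `MultiViewDiffusion` class, its toy denoiser, and the fixed update rule are illustrative assumptions, not the paper's actual architecture or noise schedule:

```python
import torch

# Hypothetical stand-in for stage one: an image-conditioned diffusion sampler
# that produces one latent per virtual viewpoint. The real FaceLift model's
# architecture and noise schedule are not reproduced here.
class MultiViewDiffusion(torch.nn.Module):
    def __init__(self, num_views: int = 6, latent_dim: int = 64):
        super().__init__()
        self.num_views = num_views
        # Toy denoiser: predicts a noise estimate from (noisy latent, condition).
        self.denoiser = torch.nn.Linear(latent_dim * 2, latent_dim)

    @torch.no_grad()
    def sample(self, frontal_latent: torch.Tensor, steps: int = 50) -> torch.Tensor:
        # Start every view latent from Gaussian noise; all six views share
        # the same frontal-image condition, which is what ties them to one identity.
        x = torch.randn(self.num_views, frontal_latent.shape[-1])
        cond = frontal_latent.expand(self.num_views, -1)
        for _ in range(steps):
            eps = self.denoiser(torch.cat([x, cond], dim=-1))
            x = x - 0.1 * eps  # simplified update; real samplers follow a noise schedule
        return x  # one latent per viewpoint, to be decoded into side/back views

# Usage: six view latents conditioned on one (toy) frontal-image latent.
views = MultiViewDiffusion().sample(torch.randn(64))
```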
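For the second stage, the snippet below shows the kind of per-splat parameter set a Gaussian-splat reconstructor predicts. The field names follow the standard 3D Gaussian splatting formulation; the `GaussianSplats` container and its tensor layout are assumptions for illustration, since the summary does not specify GS-LRM's exact output format:

```python
import torch

# Minimal container for per-splat parameters, in the style a GS-LRM-type model
# would predict. Names follow standard 3D Gaussian splatting conventions.
class GaussianSplats:
    def __init__(self, n: int):
        self.means = torch.zeros(n, 3)            # 3D centers of the Gaussians
        self.log_scales = torch.zeros(n, 3)       # per-axis extents, stored in log-space
        self.quats = torch.zeros(n, 4)            # unit quaternions for orientation
        self.quats[:, 0] = 1.0                    # identity rotation (w, x, y, z)
        self.opacities = torch.full((n, 1), 0.5)  # alpha used during compositing
        self.colors = torch.zeros(n, 3)           # RGB (full 3DGS uses spherical harmonics)

    def covariances(self) -> torch.Tensor:
        """Build each 3x3 covariance as R S S^T R^T, as in standard 3DGS."""
        w, x, y, z = self.quats.unbind(-1)
        R = torch.stack([
            1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y),
            2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x),
            2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y),
        ], dim=-1).reshape(-1, 3, 3)
        S = torch.diag_embed(self.log_scales.exp())
        M = R @ S
        return M @ M.transpose(1, 2)
```

Storing scales in log-space and rotations as quaternions keeps each covariance positive semi-definite by construction, which is the usual design choice in Gaussian splatting pipelines.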
Evaluation and Results
FaceLift is evaluated on both synthetic and real-world datasets, where it proves robust and outperforms previous state-of-the-art approaches. Quantitative analysis across PSNR, SSIM, LPIPS, and identity preservation measured with ArcFace confirms that it produces consistent, high-fidelity 3D reconstructions. Qualitative comparisons also show that FaceLift handles complex facial features and varying lighting conditions better than prior methods such as PanoHead and other reconstruction models.
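For readers who want to run this style of evaluation themselves, the sketch below computes PSNR, SSIM, and LPIPS between a rendered view and a ground-truth image. It uses the scikit-image and lpips packages rather than the authors' evaluation code, which this summary does not include:

```python
import numpy as np
import torch
import lpips  # pip install lpips; learned perceptual distance (AlexNet/VGG backbones)
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_view(rendered: np.ndarray, reference: np.ndarray) -> dict:
    """Compare two HxWx3 uint8 images with the three metrics reported in the paper."""
    psnr = peak_signal_noise_ratio(reference, rendered, data_range=255)
    ssim = structural_similarity(reference, rendered, channel_axis=-1, data_range=255)

    # LPIPS expects float tensors in [-1, 1] with shape (N, 3, H, W).
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    loss_fn = lpips.LPIPS(net='alex')
    with torch.no_grad():
        lp = loss_fn(to_tensor(rendered), to_tensor(reference)).item()

    return {"psnr": psnr, "ssim": ssim, "lpips": lp}
```

Higher PSNR and SSIM indicate better reconstructions, while lower LPIPS indicates closer perceptual similarity.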
Implications and Future Prospects
The implications of FaceLift are substantial for applications in augmented reality, digital entertainment, and telepresence. The ability to generate a photorealistic 3D head model from a single image not only simplifies data acquisition but also enhances the realism and interactivity of virtual content. The integration of video inputs for 4D novel view synthesis further positions FaceLift as a potentially transformative technology for real-time applications.
However, there are limitations, most notably in handling accessories absent from the training data, such as hats or glasses, which can lead to erroneous reconstructions. Future work could address this by broadening the diversity and compositionality of the training data. Exploring how text-prompted generation could enrich output detail and accuracy is another promising direction.
Overall, FaceLift sets a benchmark for single-image 3D head reconstruction by combining state-of-the-art diffusion-based multi-view generation with advanced reconstruction modeling, pointing toward more detailed and accessible 3D head modeling that generalizes across diverse input images.