Overview of Pixel3DMM for Single-Image 3D Face Reconstruction
The paper "Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction" addresses the intricate problem of reconstructing 3D human faces from single RGB images. This problem is notably under-constrained, characterized by ambiguities in depth perception, and challenges like disambiguating identity, expression information, lighting conditions, and occlusions.
Pixel3DMM introduces an optimization-based framework powered by foundation models, specifically vision transformers (ViTs), that predicts per-pixel geometric cues to constrain the fitting of a 3D morphable model (3DMM). Central to the approach is the combination of latent features from the DINO foundation model with specialized prediction heads for surface normal and UV-coordinate prediction. These per-pixel predictions drive the fitting of a FLAME model through a proposed optimization method, and the results are evaluated on a new benchmark spanning diverse facial expressions, viewing angles, and ethnicities.
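To make this architecture concrete, below is a minimal sketch (not the authors' code) of the prediction pipeline: a frozen DINOv2 ViT backbone supplies patch features, and two lightweight convolutional heads decode them into dense surface-normal and UV-coordinate maps. The head design, feature dimensions, backbone variant, and output parameterizations are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelHead(nn.Module):
    """Decodes coarse ViT patch features into a dense per-pixel map."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_dim, 256, 3, padding=1), nn.GELU(),
            nn.Conv2d(256, 128, 3, padding=1), nn.GELU(),
            nn.Conv2d(128, out_dim, 1),
        )

    def forward(self, feats: torch.Tensor, out_size: int) -> torch.Tensor:
        x = self.net(feats)
        # Upsample from the coarse patch grid to full image resolution.
        return F.interpolate(x, size=(out_size, out_size),
                             mode="bilinear", align_corners=False)

class GeometricCuePredictor(nn.Module):
    def __init__(self, feat_dim: int = 768, patch: int = 14):
        super().__init__()
        self.patch = patch
        # Frozen foundation-model backbone (DINOv2 ViT-B/14 chosen here
        # as an illustrative stand-in; downloads weights via torch.hub).
        self.backbone = torch.hub.load("facebookresearch/dinov2",
                                       "dinov2_vitb14")
        self.backbone.requires_grad_(False)
        self.normal_head = PixelHead(feat_dim, 3)  # unit surface normals
        self.uv_head = PixelHead(feat_dim, 2)      # UV coordinates

    def forward(self, img: torch.Tensor):
        b, _, h, w = img.shape
        tokens = self.backbone.forward_features(img)["x_norm_patchtokens"]
        # Reshape the token sequence back into a 2D feature grid.
        gh, gw = h // self.patch, w // self.patch
        feats = tokens.transpose(1, 2).reshape(b, -1, gh, gw)
        normals = F.normalize(self.normal_head(feats, h), dim=1)
        uv = torch.sigmoid(self.uv_head(feats, h))  # map into [0, 1]^2
        return normals, uv

# Usage: image sides must be divisible by the 14-pixel patch size.
model = GeometricCuePredictor().eval()
with torch.no_grad():
    normals, uv = model(torch.rand(1, 3, 518, 518))
```

Freezing the backbone reflects the general strategy of reusing generalized foundation-model features while training only small task-specific heads.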
Strong Numerical Results
The effectiveness of Pixel3DMM is demonstrated against competitive baselines: the method achieves over 15% improvement in geometric accuracy for posed facial expressions, highlighting its robustness under conditions that typically complicate face reconstruction.
Novel Contributions and Bold Claims
Key contributions of the paper include powerful geometric cue predictions built on foundation-model features, a novel approach to 3D face reconstruction that fits a 3DMM to predicted UV-map correspondences and surface normals, and a new benchmark built from high-fidelity multi-view face captures. Importantly, this benchmark allows simultaneous assessment of posed and neutral facial geometry, enabling direct comparison of methods on both fitting fidelity and their ability to disentangle expression from identity.
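The following is a hedged sketch of how a UV-correspondence term in such a fitting objective can work: each constrained pixel's predicted UV coordinate identifies a point on the model surface, and the optimization pulls the projection of that surface point onto the pixel. A generic linear morphable model and weak-perspective camera stand in for FLAME and the paper's exact energy; the surface-normal term, which would require a differentiable renderer, is omitted, and all names here are illustrative.

```python
import torch

def linear_3dmm(mean, shape_basis, expr_basis, shape, expr):
    """verts = mean + B_shape @ shape + B_expr @ expr  (N x 3)."""
    return mean + (shape_basis @ shape + expr_basis @ expr).reshape(-1, 3)

def fit(mean, shape_basis, expr_basis, surf_pts_idx, pixel_xy, steps=200):
    # surf_pts_idx: for each constrained pixel, the index of the model
    # vertex its predicted UV coordinate maps to (nearest-vertex lookup
    # assumed). pixel_xy: the 2D image locations of those pixels in [-1, 1].
    shape = torch.zeros(shape_basis.shape[1], requires_grad=True)
    expr = torch.zeros(expr_basis.shape[1], requires_grad=True)
    scale = torch.ones(1, requires_grad=True)
    trans = torch.zeros(2, requires_grad=True)
    opt = torch.optim.Adam([shape, expr, scale, trans], lr=1e-2)
    for _ in range(steps):
        verts = linear_3dmm(mean, shape_basis, expr_basis, shape, expr)
        # Weak-perspective projection of the corresponded surface points.
        proj = scale * verts[surf_pts_idx, :2] + trans
        corr_loss = (proj - pixel_xy).square().sum(-1).mean()
        # Quadratic prior keeps shape/expression parameters plausible.
        reg = 1e-3 * (shape.square().mean() + expr.square().mean())
        opt.zero_grad()
        (corr_loss + reg).backward()
        opt.step()
    return shape.detach(), expr.detach()

# Toy usage with random stand-in bases (FLAME assets not bundled here).
N, S, E, P = 5023, 100, 50, 2000
mean = torch.randn(N, 3)
fit(mean, torch.randn(3 * N, S), torch.randn(3 * N, E),
    torch.randint(0, N, (P,)), torch.rand(P, 2) * 2 - 1)
```

Because every constrained pixel contributes a correspondence, this kind of dense objective constrains the fit far more than the sparse landmark terms used by many earlier 3DMM pipelines.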
Practical and Theoretical Implications
The practical implications of this work are significant across domains such as computer games, film production, telecommunication, and AR/VR, all of which can leverage the method for facial animation and tracking from single images. By reducing reliance on multi-view rigs or extensive video sequences, the approach is readily applicable to existing single-image datasets.
Theoretically, Pixel3DMM advances the understanding of how generalized features from foundation models can be exploited in the constrained domain of 3D face reconstruction, opening new avenues for integrating such models with traditional 3DMM optimization strategies.
Future Directions
While Pixel3DMM presents notable advances, future work could extend the feed-forward networks to multi-view or temporal inputs for greater robustness to facial expressions. Another direction is distilling the predictions into a single consolidated model for faster reconstruction, potentially trained on large-scale datasets such as LAION-Face. Furthermore, improving the disentanglement of identity and expression parameters within the optimization could refine neutral-geometry predictions.
In conclusion, Pixel3DMM represents a significant step in leveraging contemporary foundation models for challenging tasks such as single-image 3D face reconstruction, setting the stage for continued advances in both theoretical frameworks and practical applications. Researchers in computer vision and AI can draw valuable insights from its approach, particularly its use of pre-trained transformers for dense geometric prediction.