Pixel3DMM: 3D Face Reconstruction
- Pixel3DMM is a framework for single-image 3D face reconstruction that uses dense per-pixel predictions (surface normals and UV coordinates) from vision transformers.
- The system couples these predictions with an optimization-driven fitting of the FLAME 3D morphable model, achieving a 17% reduction in L2 error compared to baselines.
- Benchmarking on a new multi-view evaluation set demonstrates improved robustness across expressions, occlusions, and lighting, and evaluates both posed and neutral geometry recovery.
Pixel3DMM is a framework for single-image 3D face reconstruction leveraging per-pixel geometric cues—surface normals and UV-coordinates—predicted by vision transformers that operate in screen space. By coupling these predictions with an optimization-driven fitting of a 3D morphable model (3DMM), specifically the FLAME face model, Pixel3DMM achieves high accuracy in recovering expressive and neutral 3D facial geometry from unconstrained RGB images. The system integrates foundation model features, customized transformer heads, and a dataset pipeline involving large-scale registration to unified mesh topology, and sets a new benchmark for both quantitative and qualitative 3D face evaluation (Giebenhain et al., 1 May 2025).
1. Model Architecture
Pixel3DMM comprises two parallel vision-transformer-based networks, each initialized with pretrained DINOv2 ViT-Base weights and adapted for geometric regression. The architectural components are as follows:
- Shared DINOv2 Backbone: Both networks share the first 12 transformer blocks (D=768, 12 heads, 16×16 patches), optionally fine-tuned with a reduced learning rate.
- Normal Prediction Network: maps an RGB input $I \in \mathbb{R}^{H \times W \times 3}$ to a per-pixel surface normal map $\hat{N} \in \mathbb{R}^{H \times W \times 3}$.
- UV-Coordinate Prediction Network: maps the same RGB input to a per-pixel UV-coordinate map $\hat{U} \in \mathbb{R}^{H \times W \times 2}$.
- Prediction Heads: After the shared backbone, each network has 4 additional transformer layers with multi-head self-attention and MLPs, followed by 3 transpose-convolution layers for upsampling and a final linear “unpatchify” layer that produces the full-resolution output map.
- Supervision: Training is supervised by ground-truth surface normals and UV maps generated from registered 3D scans.
Schematic (textual description):

RGB image → DINOv2 backbone (12 blocks) → [4 transformer blocks + transpose convolutions + linear unpatchify] → surface normal map or UV map
Fine-tuning runs with Adam (batch size 40) for roughly 3 days on 2×48 GB GPUs, with the backbone learning rate set 10× lower than that of the prediction heads.
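A minimal PyTorch sketch of one prediction branch is given below. The module sizes, the 512-pixel output resolution, and the pixel-shuffle realization of the “unpatchify” step are assumptions for illustration, not the released implementation; the expected input is backbone patch tokens of shape (B, N, 768) from the shared DINOv2 encoder.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Sketch of one Pixel3DMM prediction branch: 4 transformer blocks on top of
    backbone patch tokens, 3 transpose-convolution upsampling stages, and a final
    linear "unpatchify" step. Sizes are illustrative, not the released implementation."""

    def __init__(self, dim=768, out_channels=3, patch=16, img_size=512):
        super().__init__()
        self.grid = img_size // patch                        # side length of the patch grid
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.upconv = nn.Sequential(                         # 3 transpose-conv layers: 8x upsampling
            nn.ConvTranspose2d(dim, 256, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(256, 128, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.GELU(),
        )
        self.out = nn.Conv2d(64, out_channels * 4, 1)        # per-pixel linear projection
        self.unpatchify = nn.PixelShuffle(2)                 # final 2x rearrangement to full resolution

    def forward(self, tokens):                               # tokens: (B, N, dim) backbone features
        x = self.blocks(tokens)                              # 4 additional transformer blocks
        B, N, D = x.shape
        x = x.transpose(1, 2).reshape(B, D, self.grid, self.grid)  # tokens -> feature grid
        x = self.upconv(x)                                   # upsample to img_size // 2
        return self.unpatchify(self.out(x))                  # full-resolution normal or UV map
```

Two such heads, one with 3 output channels (normals) and one with 2 (UV coordinates), would sit on top of the shared backbone.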
2. Geometric Cue Prediction
Given an input image, the networks output dense per-pixel estimates:
- $\hat{N}_p$ — the predicted surface normal at pixel $p$.
- $\hat{U}_p$ — the predicted UV-face coordinate at pixel $p$.

Let $M$ denote the binary face mask derived from registration. The objective functions are masked per-pixel regression losses of the form

$$\mathcal{L}_{\text{normal}} = \frac{1}{|M|} \sum_{p \in M} \big\lVert \hat{N}_p - N_p \big\rVert_2^2, \qquad \mathcal{L}_{\text{uv}} = \frac{1}{|M|} \sum_{p \in M} \big\lVert \hat{U}_p - U_p \big\rVert_2^2,$$

where $N_p$ and $U_p$ denote the ground-truth normal and UV values.
The system infers geometric structure in a dense, screen-space fashion, enabling granular 3D priors for model fitting. This design contrasts with prior approaches such as Pix2face (Crispell et al., 2017), which regresses per-pixel “PNCC” (mean-face correspondence) and offsets using a modified U-Net architecture without transformer-based priors.
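A minimal sketch of the masked per-pixel supervision, assuming the predicted and ground-truth maps plus the face mask are given as tensors; the squared-error form mirrors the objectives written above and is illustrative.

```python
import torch

def masked_map_loss(pred, target, mask):
    """Mean per-pixel squared error restricted to valid face pixels.

    pred, target: (B, C, H, W) predicted / ground-truth normal or UV maps
    mask:         (B, 1, H, W) binary face mask from the registration
    """
    sq_err = ((pred - target) ** 2 * mask).sum()       # zero out non-face pixels
    n_valid = mask.sum().clamp(min=1) * pred.shape[1]  # valid pixels x channels
    return sq_err / n_valid

# One loss per branch; both branches would be supervised jointly, e.g.
# loss = masked_map_loss(pred_normals, gt_normals, mask) + masked_map_loss(pred_uv, gt_uv, mask)
```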
3. FLAME Fitting Optimization
At inference, the Pixel3DMM outputs are used to solve for FLAME 3DMM parameters via an energy-based optimization. The parameters are grouped as:
- FLAME identity, expression, and pose: $\boldsymbol{\beta}$, $\boldsymbol{\psi}$, $\boldsymbol{\theta}$
- Camera: rotation $R$, translation $\mathbf{t}$, and focal length $f$
For each mesh vertex $v_i$ with canonical UV coordinate $u_i$, the closest pixel $p_i$ in the predicted UV map is located, and reprojection consistency between the projected vertex $\pi(v_i)$ and $p_i$ is enforced whenever the UV match falls within a threshold $\tau$:

$$E_{\text{uv}} = \sum_i \mathbb{1}\big[\lVert \hat{U}_{p_i} - u_i \rVert_2 < \tau\big] \, \big\lVert \pi(v_i) - p_i \big\rVert_2^2 .$$

A surface normal consistency term $E_{\text{normal}}$ compares the predicted normal map to normals rendered from the current mesh. Regularization terms enforce proximity to the MICA-derived identity code and penalize expression magnitude:

$$E_{\text{reg}} = \lambda_{\beta} \lVert \boldsymbol{\beta} - \boldsymbol{\beta}_{\text{MICA}} \rVert_2^2 + \lambda_{\psi} \lVert \boldsymbol{\psi} \rVert_2^2 .$$

The total fitting energy is

$$E = \lambda_{\text{uv}} E_{\text{uv}} + \lambda_{\text{normal}} E_{\text{normal}} + E_{\text{reg}},$$

minimized over all FLAME and camera parameters.
Optimization runs for 500 steps with Adam and takes ≈30 s per image.
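A condensed sketch of this fitting loop is shown below. The callables `flame`, `render_normals`, and `uv_energy`, the parameter dimensionalities, and the loss weights are assumptions standing in for the differentiable FLAME layer, the normal rasterizer, and the UV reprojection term described above.

```python
import torch

def fit_flame(pred_uv, pred_normals, beta_mica, flame, render_normals, uv_energy,
              cam_init, steps=500, lr=1e-2):
    """Sketch of the test-time fitting loop. `flame`, `render_normals`, and `uv_energy`
    are hypothetical callables standing in for a differentiable FLAME layer, a normal
    rasterizer, and the UV-correspondence term; loss weights are illustrative."""
    beta  = beta_mica.clone().requires_grad_(True)        # identity, initialized from MICA
    psi   = torch.zeros(100, requires_grad=True)          # expression coefficients
    theta = torch.zeros(15, requires_grad=True)           # pose (global, neck, jaw, eyes)
    cam   = cam_init.clone().requires_grad_(True)         # rotation / translation / focal length
    opt = torch.optim.Adam([beta, psi, theta, cam], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        verts = flame(beta, psi, theta)                              # posed FLAME vertices
        e_uv  = uv_energy(verts, pred_uv, cam)                       # vertex/pixel reprojection term
        e_n   = (render_normals(verts, cam) - pred_normals).abs().mean()  # normal consistency
        e_reg = ((beta - beta_mica) ** 2).sum() + (psi ** 2).sum()   # identity + expression priors
        energy = 1.0 * e_uv + 0.1 * e_n + 1e-3 * e_reg               # illustrative weights
        energy.backward()
        opt.step()
    return beta, psi, theta, cam
```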
4. Data Registration and Training Datasets
Training and evaluation rely on large-scale, high-fidelity 3D face datasets registered to the unified FLAME mesh (with dense correspondence):
| Dataset | Identities | Expressions | Views/Cameras | RGB/Normal/UV Pairs |
|---|---|---|---|---|
| NPHM | 470 | 23 | 40 | 376,000 triplets |
| FaceScape | 350 | 20 | 50 | 350,000 triplets |
| Ava256 | video | 50 (FPS sampling) | 20 | 250,000 RGB/UV (no normals) |
All training examples undergo random lighting synthesis (point lights, IC-Light relighting), random background, and randomized camera intrinsics, exposing the network to a wide appearance distribution.
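As a small illustration of the randomized camera intrinsics, the following sketch samples a pinhole intrinsics matrix; the field-of-view range and image resolution are assumed for illustration, not the paper’s exact augmentation distribution.

```python
import torch

def sample_intrinsics(img_size=512, fov_range_deg=(20.0, 60.0)):
    """Sample a randomized pinhole intrinsics matrix for augmentation.
    Ranges and resolution are illustrative, not the paper's exact distribution."""
    fov = torch.empty(1).uniform_(*fov_range_deg)              # random field of view
    f = 0.5 * img_size / torch.tan(torch.deg2rad(fov) / 2.0)   # focal length in pixels
    cx = cy = img_size / 2.0                                   # principal point at the image center
    return torch.tensor([[f.item(), 0.0, cx],
                         [0.0, f.item(), cy],
                         [0.0, 0.0, 1.0]])
```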
5. Benchmarking and Evaluation
Pixel3DMM introduces a new benchmark based on NeRSemble multi-view video scans, including:
- Subjects: 21 identities
- Expressions: 420 expressive frames (20 per identity)
- Neutral: 21 high-resolution neutral scans (COLMAP-reconstructed)
Tasks:
(a) Posed Geometry Reconstruction: Given an expressive/posed image, recover expression and identity geometry.
(b) Neutral Geometry Recovery: Given a posed image, reconstruct subject's neutral mesh.
The system achieves a 17% reduction in error for posed meshes over previous approaches.
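For reference, a minimal sketch of the kind of per-point L2 error such benchmarks report; the Pixel3DMM benchmark’s exact alignment, cropping, and masking protocol is not reproduced here, and the brute-force distance computation is only suitable for small point counts.

```python
import numpy as np

def mean_scan_to_mesh_error(scan_pts, pred_verts):
    """Mean nearest-neighbor L2 distance from ground-truth scan points to predicted
    mesh vertices (a common single-direction variant; illustrative only).

    scan_pts:   (M, 3) scan points; pred_verts: (N, 3) mesh vertices, both pre-aligned.
    """
    d = np.linalg.norm(scan_pts[:, None, :] - pred_verts[None, :, :], axis=-1)  # (M, N) distances
    return d.min(axis=1).mean()                                                 # average closest distance
```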
6. Strengths, Limitations, and Prospects
Strengths:
- Dense per-pixel normal and UV predictions markedly improve 3DMM fitting robustness, especially under extreme facial expressions, occlusion, and non-neutral lighting.
- Generalization to in-the-wild imagery is bolstered by both architectural and data diversity.
- The benchmark supports comprehensive analysis: it is the first to evaluate posed and neutral geometry simultaneously.
Limitations:
- Expression and identity disentanglement in the optimization can be confounded, resulting in modest performance gains on the neutral recovery task.
- Priors are derived from a single image; multi-view or temporal correlations are not currently exploited.
- The method is not real-time. Each inference requires test-time fitting (∼30s per image), precluding feed-forward applications.
Future Work (as stated):
- Distillation of geometric priors into a direct FLAME regressor for speed.
- Extension to multi-view and video-based architectures.
- More advanced disentanglement energies between identity and expression.
A plausible implication is that Pixel3DMM’s screen-space prior strategy may inspire similar hybrid frameworks integrating dense geometric learning with parametric model-based optimization, particularly as high-capacity transformers and large annotated 3D datasets continue to expand the practical frontier of single-view 3D perception.