Pixel3DMM: 3D Face Reconstruction

Updated 9 November 2025
  • Pixel3DMM is a framework for single-image 3D face reconstruction that uses dense per-pixel predictions (surface normals and UV coordinates) from vision transformers.
  • The system couples these predictions with an optimization-driven fitting of the FLAME 3D morphable model, achieving a 17% reduction in L2 error compared to baselines.
  • Benchmarking on a new NeRSemble-based multi-view evaluation shows improved robustness across expressions, occlusions, and lighting, with state-of-the-art results for both posed and neutral geometry recovery.

Pixel3DMM is a framework for single-image 3D face reconstruction leveraging per-pixel geometric cues—surface normals and UV-coordinates—predicted by vision transformers that operate in screen space. By coupling these predictions with an optimization-driven fitting of a 3D morphable model (3DMM), specifically the FLAME face model, Pixel3DMM achieves high accuracy in recovering expressive and neutral 3D facial geometry from unconstrained RGB images. The system integrates foundation model features, customized transformer heads, and a dataset pipeline involving large-scale registration to unified mesh topology, and sets a new benchmark for both quantitative and qualitative 3D face evaluation (Giebenhain et al., 1 May 2025).

1. Model Architecture

Pixel3DMM comprises two parallel vision-transformer-based networks, each initialized with pretrained DINOv2 ViT-Base weights and adapted for geometric regression. The architectural components are as follows:

  • Shared DINOv2 Backbone: Both networks share the first 12 transformer blocks (D=768, 12 heads, 16×16 patches), optionally fine-tuned with a reduced learning rate.
  • Normal Prediction Network: Maps an RGB input $I \in \mathbb{R}^{512 \times 512 \times 3}$ to a per-pixel surface normal $\hat n(u,v) \in [-1,1]^3$.
  • UV-Coordinate Prediction Network: Maps the same RGB input to a per-pixel UV coordinate $\hat u(u,v) \in [0,1]^2$.
  • Prediction Heads: After the shared backbone, each network has 4 additional transformer layers with multi-head self-attention and MLPs, followed by 3 transpose-convolution layers (yielding $256 \times 256$) and a final linear “unpatchify” layer to the $512 \times 512$ output.
  • Supervision: Training is supervised by ground-truth surface normals and UV maps generated from registered 3D scans.

Schematic (editor’s textual description):

RGB image → DINOv2 Backbone (12 blocks) 
         → [4 transformer blocks + up-conv + linear]
         → [Surface Normal map or UV map]
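
A minimal PyTorch-style sketch of this two-head design is given below. It follows the shapes stated above (512×512 input, D=768 tokens, 16×16 patches giving a 32×32 token grid, 4 head transformer blocks, 3 transpose-convolutions, linear unpatchify); the class names, output activations, and backbone interface are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the two-head predictor described above (assumptions noted in comments).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictionHead(nn.Module):
    """4 transformer blocks -> 3 transpose-convs (to 256x256) -> linear unpatchify to 512x512."""
    def __init__(self, dim=768, out_channels=3, grid=32):
        super().__init__()
        self.grid = grid
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
            num_layers=4,
        )
        self.up = nn.Sequential(
            nn.ConvTranspose2d(dim, 256, kernel_size=2, stride=2), nn.GELU(),  # 32 -> 64
            nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2), nn.GELU(),  # 64 -> 128
            nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2), nn.GELU(),   # 128 -> 256
        )
        # Final linear "unpatchify": each 256x256 cell predicts a 2x2 patch of outputs.
        self.unpatchify = nn.Conv2d(64, out_channels * 4, kernel_size=1)

    def forward(self, tokens):                                     # tokens: (B, 32*32, 768)
        x = self.blocks(tokens)
        B, N, D = x.shape
        x = x.transpose(1, 2).reshape(B, D, self.grid, self.grid)  # (B, 768, 32, 32)
        x = self.up(x)                                             # (B, 64, 256, 256)
        x = self.unpatchify(x)                                     # (B, 4*C, 256, 256)
        return F.pixel_shuffle(x, 2)                               # (B, C, 512, 512)

class Pixel3DMMPredictors(nn.Module):
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone                    # shared DINOv2 ViT-B: image -> patch tokens
        self.normal_head = PredictionHead(out_channels=3)
        self.uv_head = PredictionHead(out_channels=2)

    def forward(self, img):                         # img: (B, 3, 512, 512)
        tokens = self.backbone(img)                 # assumed to return (B, 1024, 768) patch tokens
        normals = torch.tanh(self.normal_head(tokens))   # in [-1, 1]^3 (activation is an assumption)
        uv = torch.sigmoid(self.uv_head(tokens))         # in [0, 1]^2 (activation is an assumption)
        return normals, uv
```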

Fine-tuning runs with Adam (batch size 40) for roughly 3 days on two 48 GB GPUs; the backbone learning rate is set 10× lower than that of the prediction heads.
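
The 10× learning-rate gap maps directly onto Adam parameter groups; a sketch assuming the module names from the snippet above and an illustrative base learning rate:

```python
# Separate parameter groups so the pretrained backbone is fine-tuned 10x slower than the heads.
# base_lr is an illustrative value; only the 10x ratio is stated in the text.
base_lr = 1e-4
model = Pixel3DMMPredictors(backbone)  # hypothetical instance from the sketch above
optimizer = torch.optim.Adam([
    {"params": model.backbone.parameters(), "lr": base_lr * 0.1},
    {"params": model.normal_head.parameters(), "lr": base_lr},
    {"params": model.uv_head.parameters(), "lr": base_lr},
])
```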

2. Geometric Cue Prediction

Given an input image, the networks output dense per-pixel estimates:

  • $\hat n(u,v)$: predicted surface normal at pixel $(u,v)$.
  • $\hat u(u,v)$: predicted UV-face coordinate at pixel $(u,v)$.

Let $M$ denote the binary face mask derived from registration. The objective functions are:

$$\mathcal{L}_{\text{normal}} = \sum_{(u,v) \in M} \left\| \hat n(u,v) - n^{GT}(u,v) \right\|_2^2$$

$$\mathcal{L}_{uv} = \sum_{(u,v) \in M} \left\| \hat u(u,v) - u^{GT}(u,v) \right\|_2^2$$
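
A sketch of these masked objectives, assuming batched (B, C, H, W) tensors with the face mask broadcast over channels; the tensor names are illustrative:

```python
import torch

def masked_l2(pred, gt, mask):
    """pred, gt: (B, C, H, W); mask: (B, 1, H, W) binary face mask M."""
    sq_err = ((pred - gt) ** 2).sum(dim=1, keepdim=True)  # per-pixel squared L2 norm
    # Summed over masked pixels, matching the equations above; dividing by mask.sum()
    # would give a mean instead, a common (but unstated) normalization choice.
    return (sq_err * mask).sum()

# total = masked_l2(pred_normals, gt_normals, face_mask) + masked_l2(pred_uv, gt_uv, face_mask)
```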

The system infers geometric structure in a dense, screen-space fashion, providing fine-grained 3D priors for model fitting. This design contrasts with prior approaches such as Pix2Face (Crispell et al., 2017), which regresses per-pixel “PNCC” (mean-face correspondence) coordinates and offsets with a modified U-Net, without transformer-based priors.

3. FLAME Fitting Optimization

At inference, the Pixel3DMM outputs are used to solve for FLAME 3DMM parameters via an energy-based optimization. The parameters are grouped as:

  • FLAME identity, expression, and pose: $\Omega_{FLAME} = \{\, z_{id} \in \mathbb{R}^{300},\; z_{ex} \in \mathbb{R}^{100},\; \theta \in SO(3) \,\}$
  • Camera: $\Omega_{cam} = \{\, R \in SO(3),\; t \in \mathbb{R}^3,\; fl \in \mathbb{R}^{+},\; pp \in \mathbb{R}^2 \,\}$

For each mesh vertex $v$ with canonical UV coordinate $T^{uv}_v$, the closest pixel $p^*_v$ in the predicted UV map is located, and reprojection consistency is enforced within a threshold $\delta_{uv}$:

$$p^*_v = \operatorname*{arg\,min}_{p \in \text{Image}} \left\| T^{uv}_v - \hat u(p) \right\|_2$$

$$\mathcal{L}_{uv}^{2D} = \sum_{v} \mathbb{1}_{\|T^{uv}_v - \hat u(p^*_v)\| < \delta_{uv}} \cdot \left\| p^*_v - \pi(v;\, \Omega_{cam}, \Omega_{FLAME}) \right\|_2$$
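
The nearest-pixel lookup can be sketched as a brute-force distance query between each vertex's canonical UV coordinate and the predicted UV map. Tensor names and the threshold value are illustrative; an efficient implementation would chunk or approximate the query.

```python
import torch

def uv_reprojection_loss(T_uv, uv_map, projected, delta_uv=0.01):
    """
    T_uv:      (V, 2)    canonical UV coordinate per FLAME vertex
    uv_map:    (H, W, 2) predicted per-pixel UV coordinates
    projected: (V, 2)    pixel location of each vertex under pi(v; Omega_cam, Omega_FLAME)
    delta_uv:  UV-space threshold (the value 0.01 is illustrative)
    """
    H, W, _ = uv_map.shape
    flat_uv = uv_map.reshape(-1, 2)                    # (H*W, 2)
    d = torch.cdist(T_uv, flat_uv)                     # (V, H*W) UV-space distances; chunk in practice
    best_dist, best_idx = d.min(dim=1)                 # closest pixel p*_v per vertex
    px = torch.stack((best_idx % W, best_idx // W), dim=1).float()  # (x, y) pixel coordinates
    valid = (best_dist < delta_uv).float()             # indicator 1_{||T_uv - u(p*)|| < delta_uv}
    return ((px - projected).norm(dim=1) * valid).sum()
```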

A surface normal consistency loss compares predicted normals to those rendered from the mesh. Regularization terms enforce proximity to MICA-derived identity and penalize expression magnitude:

$$\mathcal{R} = \lambda_{id} \left\| z_{id} - z_{id}^{MICA} \right\|_2^2 + \lambda_{ex} \left\| z_{ex} \right\|_2^2$$

The total fitting energy is:

$$E(\Omega_{FLAME}, \Omega_{cam}) = \lambda_{uv}\, \mathcal{L}_{uv}^{2D} + \lambda_{n}\, \mathcal{L}_{n} + \mathcal{R}$$

Optimization runs for 500 steps with Adam and takes ≈30 s per image.
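
A sketch of the overall fitting loop under these definitions. Here `flame_forward`, `project`, and `render_normals` stand in for the FLAME forward pass, the camera projection $\pi$, and a differentiable normal renderer; the learning rate and loss weights are illustrative, not the paper's values.

```python
import torch

# Illustrative loss weights; the actual values are not given in this summary.
lam_uv, lam_n, lam_id, lam_ex = 1.0, 1.0, 1e-3, 1e-3

# Parameters: identity initialized from MICA, expression/pose/camera from zero.
z_id = z_id_mica.clone().requires_grad_(True)  # (300,) MICA identity prediction (placeholder tensor)
z_ex = torch.zeros(100, requires_grad=True)
pose = torch.zeros(3, requires_grad=True)      # axis-angle head pose theta
cam  = torch.zeros(9, requires_grad=True)      # R (3), t (3), focal length (1), principal point (2)

opt = torch.optim.Adam([z_id, z_ex, pose, cam], lr=1e-2)  # lr is illustrative

for step in range(500):                        # 500 steps, as stated above
    opt.zero_grad()
    verts = flame_forward(z_id, z_ex, pose)    # (V, 3) FLAME vertices (placeholder FLAME layer)
    projected = project(verts, cam)            # (V, 2) pi(v; Omega_cam, Omega_FLAME) (placeholder)
    loss_uv = uv_reprojection_loss(T_uv, uv_map, projected)             # sketch above
    loss_n = (render_normals(verts, cam) - pred_normals).pow(2).mean()  # normal consistency (placeholder renderer)
    reg = lam_id * (z_id - z_id_mica).pow(2).sum() + lam_ex * z_ex.pow(2).sum()
    energy = lam_uv * loss_uv + lam_n * loss_n + reg
    energy.backward()
    opt.step()
```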

4. Data Registration and Training Datasets

Training and evaluation rely on large-scale, high-fidelity 3D face datasets registered to the unified FLAME mesh (with dense correspondence):

| Dataset | Identities | Expressions | Views/Cameras | RGB/Normal/UV Pairs |
|---|---|---|---|---|
| NPHM | 470 | 23 | 40 | 376,000 triplets |
| FaceScape | 350 | 20 | 50 | 350,000 triplets |
| Ava256 (video) | 50 | (FPS sampling) | 20 | 250,000 RGB/UV pairs (no normals) |

All training examples undergo random lighting synthesis (point lights, IC-Light relighting), random background compositing, and randomized camera intrinsics, exposing the networks to a wide appearance distribution.
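
A sketch of two of these augmentations, background compositing and randomized field of view; the relighting stages are omitted since they require a renderer and the IC-Light model, and all names and ranges below are illustrative.

```python
import torch

def augment(rgb, alpha, backgrounds, fov_range=(20.0, 60.0)):
    """
    rgb:         (3, H, W) rendered face image
    alpha:       (1, H, W) foreground matte from the registration
    backgrounds: list of (3, H, W) background images
    fov_range:   illustrative range for the randomized field of view
    """
    H = rgb.shape[-1]
    # Randomized intrinsics: sample a field of view and convert to a focal length in pixels.
    # In the full pipeline this focal length would be passed to the renderer that produces
    # the RGB/normal/UV triplet, not applied after the fact.
    fov = torch.empty(1).uniform_(*fov_range)
    focal = 0.5 * H / torch.tan(torch.deg2rad(fov) / 2)
    # Random background compositing.
    bg = backgrounds[torch.randint(len(backgrounds), (1,)).item()]
    composite = alpha * rgb + (1 - alpha) * bg
    return composite, focal
```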

5. Benchmarking and Evaluation

Pixel3DMM introduces a new benchmark based on NeRSemble multi-view video scans, including:

  • Subjects: 21 identities
  • Expressions: 420 expressive frames (20 per identity)
  • Neutral: 21 high-resolution neutral scans (COLMAP-reconstructed)

Tasks:

(a) Posed Geometry Reconstruction: Given an expressive/posed image, recover expression and identity geometry.

(b) Neutral Geometry Recovery: Given a posed image, reconstruct the subject's neutral mesh.

| Metric | Pixel3DMM (posed) | Best Baseline (posed) |
|---|---|---|
| $L_2$ (mm) | 1.11 | 1.33 |
| NC (cosine) | 0.884 | 0.879 |
| $R^{2.5}$ (Recall) | 0.916 | 0.879 |

| Metric | Pixel3DMM (neutral) | Best Baseline (neutral) |
|---|---|---|
| $L_2$ (mm) | 1.12 | 1.14 |
| NC (cosine) | 0.883 | — |
| $R^{2.5}$ (Recall) | 0.912 | — |

The system achieves a 17% reduction in $L_2$ error for posed meshes over previous approaches.
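
For concreteness, a sketch of how such metrics are commonly computed on sampled surface points follows; the benchmark's exact protocol (alignment, sampling, correspondence) is not specified in this summary, so the definitions below are assumptions.

```python
import torch
import torch.nn.functional as F

def evaluate(pred_pts, pred_normals, gt_pts, gt_normals, tau_mm=2.5):
    """pred_pts, gt_pts: (N, 3) surface samples in mm; *_normals: (N, 3) unit normals."""
    d = torch.cdist(pred_pts, gt_pts)          # pairwise distances (N_pred, N_gt)
    nn_dist, nn_idx = d.min(dim=1)             # nearest GT point per predicted point
    l2_mm = nn_dist.mean()                                                           # "L2 (mm)"
    nc = F.cosine_similarity(pred_normals, gt_normals[nn_idx], dim=1).abs().mean()   # "NC (cosine)", assumed convention
    recall = (nn_dist < tau_mm).float().mean()                                       # "R^2.5": fraction within 2.5 mm
    return l2_mm, nc, recall
```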

6. Strengths, Limitations, and Prospects

Strengths:

  • Dense per-pixel normal and UV predictions markedly improve 3DMM fitting robustness, especially under extreme facial expressions, occlusion, and non-neutral lighting.
  • Generalization to in-the-wild imagery is bolstered by both architectural and data diversity.
  • The benchmark supports comprehensive analysis: it is the first to evaluate posed and neutral geometry simultaneously.

Limitations:

  • Identity and expression can be confounded during the optimization, which limits gains on the neutral-geometry recovery task relative to posed reconstruction.
  • Priors are derived from a single image; multi-view or temporal correlations are not currently exploited.
  • The method is not real-time: each image requires test-time fitting (≈30 s), which precludes feed-forward applications.

Future Work (as stated):

  • Distillation of geometric priors into a direct FLAME regressor for speed.
  • Extension to multi-view and video-based architectures.
  • More advanced disentanglement energies between identity and expression.

A plausible implication is that Pixel3DMM’s screen-space prior strategy may inspire similar hybrid frameworks integrating dense geometric learning with parametric model-based optimization, particularly as high-capacity transformers and large annotated 3D datasets continue to expand the practical frontier of single-view 3D perception.
