Pixel3DMM: 3D Face Reconstruction
- Pixel3DMM is a framework for single-image 3D face reconstruction that uses dense per-pixel predictions (surface normals and UV coordinates) from vision transformers.
- The system couples these predictions with an optimization-driven fitting of the FLAME 3D morphable model, achieving a 17% reduction in L2 error compared to baselines.
- Benchmarking on a new multi-view evaluation set demonstrates improved robustness across expressions, occlusions, and lighting, and evaluates both posed and neutral geometry recovery.
Pixel3DMM is a framework for single-image 3D face reconstruction leveraging per-pixel geometric cues—surface normals and UV-coordinates—predicted by vision transformers that operate in screen space. By coupling these predictions with an optimization-driven fitting of a 3D morphable model (3DMM), specifically the FLAME face model, Pixel3DMM achieves high accuracy in recovering expressive and neutral 3D facial geometry from unconstrained RGB images. The system integrates foundation model features, customized transformer heads, and a dataset pipeline involving large-scale registration to unified mesh topology, and sets a new benchmark for both quantitative and qualitative 3D face evaluation (Giebenhain et al., 1 May 2025).
1. Model Architecture
Pixel3DMM comprises two parallel vision-transformer-based networks, each initialized with pretrained DINOv2 ViT-Base weights and adapted for geometric regression. The architectural components are as follows:
- Shared DINOv2 Backbone: Both networks share the first 12 transformer blocks (D=768, 12 heads, 16×16 patches), optionally fine-tuned with a reduced learning rate.
- Normal Prediction Network: maps an RGB input $I \in \mathbb{R}^{H \times W \times 3}$ to a per-pixel surface normal map $\hat{N} \in \mathbb{R}^{H \times W \times 3}$.
- UV-Coordinate Prediction Network: maps the same RGB input to a per-pixel UV-coordinate map $\hat{U} \in \mathbb{R}^{H \times W \times 2}$.
- Prediction Heads: After the shared backbone, each network has 4 additional transformer layers with multi-head self-attention and MLPs, followed by 3 transpose-convolution layers for upsampling and a final linear “unpatchify” layer that produces the full-resolution output map.
- Supervision: Training is supervised by ground-truth surface normals and UV maps generated from registered 3D scans.
Schematic (textual description):

RGB image → DINOv2 backbone (12 blocks) → [4 transformer blocks + transpose convolutions + linear unpatchify] → surface normal map or UV map
Fine-tuning runs with Adam (batch size 40) for roughly 3 days on 2×48 GB GPUs, with the backbone learning rate set 10× lower than that of the prediction heads.
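A minimal PyTorch sketch of one prediction branch is given below. The module sizes, the 512-pixel output resolution, and the pixel-shuffle realization of the “unpatchify” step are assumptions for illustration, not the released implementation; the expected input is backbone patch tokens of shape (B, N, 768) from the shared DINOv2 encoder.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Sketch of one Pixel3DMM prediction branch: 4 transformer blocks on top of
    backbone patch tokens, 3 transpose-convolution upsampling stages, and a final
    linear "unpatchify" step. Sizes are illustrative, not the released implementation."""

    def __init__(self, dim=768, out_channels=3, patch=16, img_size=512):
        super().__init__()
        self.grid = img_size // patch                        # side length of the patch grid
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.upconv = nn.Sequential(                         # 3 transpose-conv layers: 8x upsampling
            nn.ConvTranspose2d(dim, 256, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(256, 128, 2, stride=2), nn.GELU(),
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.GELU(),
        )
        self.out = nn.Conv2d(64, out_channels * 4, 1)        # per-pixel linear projection
        self.unpatchify = nn.PixelShuffle(2)                 # final 2x rearrangement to full resolution

    def forward(self, tokens):                               # tokens: (B, N, dim) backbone features
        x = self.blocks(tokens)                              # 4 additional transformer blocks
        B, N, D = x.shape
        x = x.transpose(1, 2).reshape(B, D, self.grid, self.grid)  # tokens -> feature grid
        x = self.upconv(x)                                   # upsample to img_size // 2
        return self.unpatchify(self.out(x))                  # full-resolution normal or UV map
```

Two such heads, one with 3 output channels (normals) and one with 2 (UV coordinates), would sit on top of the shared backbone.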
2. Geometric Cue Prediction
Given an input image, the networks output dense per-pixel estimates:
- $\hat{N}_p$ — the predicted surface normal at pixel $p$.
- $\hat{U}_p$ — the predicted UV-face coordinate at pixel $p$.

Let $M$ denote the binary face mask derived from registration. The objective functions are masked per-pixel regression losses of the form

$$\mathcal{L}_{\text{normal}} = \frac{1}{|M|} \sum_{p \in M} \big\lVert \hat{N}_p - N_p \big\rVert_2^2, \qquad \mathcal{L}_{\text{uv}} = \frac{1}{|M|} \sum_{p \in M} \big\lVert \hat{U}_p - U_p \big\rVert_2^2,$$

where $N_p$ and $U_p$ denote the ground-truth normal and UV values.
The system infers geometric structure in a dense, screen-space fashion, enabling granular 3D priors for model fitting. This design contrasts with prior approaches such as Pix2face (Crispell et al., 2017), which regresses per-pixel “PNCC” (mean-face correspondence) and offsets using a modified U-Net architecture without transformer-based priors.
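A minimal sketch of the masked per-pixel supervision, assuming the predicted and ground-truth maps plus the face mask are given as tensors; the squared-error form mirrors the objectives written above and is illustrative.

```python
import torch

def masked_map_loss(pred, target, mask):
    """Mean per-pixel squared error restricted to valid face pixels.

    pred, target: (B, C, H, W) predicted / ground-truth normal or UV maps
    mask:         (B, 1, H, W) binary face mask from the registration
    """
    sq_err = ((pred - target) ** 2 * mask).sum()       # zero out non-face pixels
    n_valid = mask.sum().clamp(min=1) * pred.shape[1]  # valid pixels x channels
    return sq_err / n_valid

# One loss per branch; both branches would be supervised jointly, e.g.
# loss = masked_map_loss(pred_normals, gt_normals, mask) + masked_map_loss(pred_uv, gt_uv, mask)
```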
3. FLAME Fitting Optimization
At inference, the Pixel3DMM outputs are used to solve for FLAME 3DMM parameters via an energy-based optimization. The parameters are grouped as:
- FLAME identity, expression, and pose: $\boldsymbol{\beta}$, $\boldsymbol{\psi}$, $\boldsymbol{\theta}$
- Camera: rotation $R$, translation $\mathbf{t}$, and focal length $f$
For each mesh vertex $v_i$ with canonical UV coordinate $u_i$, the closest pixel $p_i$ in the predicted UV map is located, and reprojection consistency between the projected vertex $\pi(v_i)$ and $p_i$ is enforced whenever the UV match falls within a threshold $\tau$:

$$E_{\text{uv}} = \sum_i \mathbb{1}\big[\lVert \hat{U}_{p_i} - u_i \rVert_2 < \tau\big] \, \big\lVert \pi(v_i) - p_i \big\rVert_2^2 .$$

A surface normal consistency term $E_{\text{normal}}$ compares the predicted normal map to normals rendered from the current mesh. Regularization terms enforce proximity to the MICA-derived identity code and penalize expression magnitude:

$$E_{\text{reg}} = \lambda_{\beta} \lVert \boldsymbol{\beta} - \boldsymbol{\beta}_{\text{MICA}} \rVert_2^2 + \lambda_{\psi} \lVert \boldsymbol{\psi} \rVert_2^2 .$$

The total fitting energy is

$$E = \lambda_{\text{uv}} E_{\text{uv}} + \lambda_{\text{normal}} E_{\text{normal}} + E_{\text{reg}},$$

minimized over all FLAME and camera parameters.
Optimization runs for 500 steps with Adam and takes ≈30 s per image.
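A condensed sketch of this fitting loop is shown below. The callables `flame`, `render_normals`, and `uv_energy`, the parameter dimensionalities, and the loss weights are assumptions standing in for the differentiable FLAME layer, the normal rasterizer, and the UV reprojection term described above.

```python
import torch

def fit_flame(pred_uv, pred_normals, beta_mica, flame, render_normals, uv_energy,
              cam_init, steps=500, lr=1e-2):
    """Sketch of the test-time fitting loop. `flame`, `render_normals`, and `uv_energy`
    are hypothetical callables standing in for a differentiable FLAME layer, a normal
    rasterizer, and the UV-correspondence term; loss weights are illustrative."""
    beta  = beta_mica.clone().requires_grad_(True)        # identity, initialized from MICA
    psi   = torch.zeros(100, requires_grad=True)          # expression coefficients
    theta = torch.zeros(15, requires_grad=True)           # pose (global, neck, jaw, eyes)
    cam   = cam_init.clone().requires_grad_(True)         # rotation / translation / focal length
    opt = torch.optim.Adam([beta, psi, theta, cam], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        verts = flame(beta, psi, theta)                              # posed FLAME vertices
        e_uv  = uv_energy(verts, pred_uv, cam)                       # vertex/pixel reprojection term
        e_n   = (render_normals(verts, cam) - pred_normals).abs().mean()  # normal consistency
        e_reg = ((beta - beta_mica) ** 2).sum() + (psi ** 2).sum()   # identity + expression priors
        energy = 1.0 * e_uv + 0.1 * e_n + 1e-3 * e_reg               # illustrative weights
        energy.backward()
        opt.step()
    return beta, psi, theta, cam
```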
4. Data Registration and Training Datasets
Training and evaluation rely on large-scale, high-fidelity 3D face datasets registered to the unified FLAME mesh (with dense correspondence):
| Dataset | Identities | Expressions | Views/Cameras | RGB/Normal/UV Pairs |
|---|---|---|---|---|
| NPHM | 470 | 23 | 40 | 376,000 triplets |
| FaceScape | 350 | 20 | 50 | 350,000 triplets |
| Ava256 | video | 50 (FPS sampling) | 20 | 250,000 RGB/UV (no normals) |
All training examples undergo random lighting synthesis (point lights, IC-Light relighting), random background, and randomized camera intrinsics, exposing the network to a wide appearance distribution.
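As a small illustration of the randomized camera intrinsics, the following sketch samples a pinhole intrinsics matrix; the field-of-view range and image resolution are assumed for illustration, not the paper’s exact augmentation distribution.

```python
import torch

def sample_intrinsics(img_size=512, fov_range_deg=(20.0, 60.0)):
    """Sample a randomized pinhole intrinsics matrix for augmentation.
    Ranges and resolution are illustrative, not the paper's exact distribution."""
    fov = torch.empty(1).uniform_(*fov_range_deg)              # random field of view
    f = 0.5 * img_size / torch.tan(torch.deg2rad(fov) / 2.0)   # focal length in pixels
    cx = cy = img_size / 2.0                                   # principal point at the image center
    return torch.tensor([[f.item(), 0.0, cx],
                         [0.0, f.item(), cy],
                         [0.0, 0.0, 1.0]])
```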
5. Benchmarking and Evaluation
Pixel3DMM introduces a new benchmark based on NeRSemble multi-view video scans, including:
- Subjects: 21 identities
- Expressions: 420 expressive frames (20 per identity)
- Neutral: 21 high-resolution neutral scans (COLMAP-reconstructed)
Tasks:
(a) Posed Geometry Reconstruction: Given an expressive/posed image, recover expression and identity geometry.
(b) Neutral Geometry Recovery: Given a posed image, reconstruct subject's neutral mesh.
The system achieves a 17% reduction in error for posed meshes over previous approaches.
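For reference, a minimal sketch of the kind of per-point L2 error such benchmarks report; the Pixel3DMM benchmark’s exact alignment, cropping, and masking protocol is not reproduced here, and the brute-force distance computation is only suitable for small point counts.

```python
import numpy as np

def mean_scan_to_mesh_error(scan_pts, pred_verts):
    """Mean nearest-neighbor L2 distance from ground-truth scan points to predicted
    mesh vertices (a common single-direction variant; illustrative only).

    scan_pts:   (M, 3) scan points; pred_verts: (N, 3) mesh vertices, both pre-aligned.
    """
    d = np.linalg.norm(scan_pts[:, None, :] - pred_verts[None, :, :], axis=-1)  # (M, N) distances
    return d.min(axis=1).mean()                                                 # average closest distance
```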
6. Strengths, Limitations, and Prospects
Strengths:
- Dense per-pixel normal and UV predictions markedly improve 3DMM fitting robustness, especially under extreme facial expressions, occlusion, and non-neutral lighting.
- Generalization to in-the-wild imagery is bolstered by both architectural and data diversity.
- The benchmark supports comprehensive analysis: it is the first to evaluate posed and neutral geometry simultaneously.
Limitations:
- Expression and identity disentanglement in the optimization can be confounded, resulting in modest performance gains on the neutral recovery task.
- Priors are derived from a single image; multi-view or temporal correlations are not currently exploited.
- The method is not real-time. Each inference requires test-time fitting (∼30s per image), precluding feed-forward applications.
Future Work (as stated):
- Distillation of geometric priors into a direct FLAME regressor for speed.
- Extension to multi-view and video-based architectures.
- More advanced disentanglement energies between identity and expression.
A plausible implication is that Pixel3DMM’s screen-space prior strategy may inspire similar hybrid frameworks integrating dense geometric learning with parametric model-based optimization, particularly as high-capacity transformers and large annotated 3D datasets continue to expand the practical frontier of single-view 3D perception.