PercHead: 3D Head Reconstruction & Editing
- PercHead is a method for converting a single RGB face image into a complete 3D head model with semantic editing capabilities, integrating perceptual supervision and a dual-branch transformer architecture.
- It achieves robust novel view synthesis and strong identity preservation by applying cross-attention between 2D encoder features and 3D patch embeddings, with rendering performed through differentiable Gaussian splatting.
- Its semantic editing mode enables independent manipulation of head geometry and style using segmentation maps and CLIP-encoded text or image prompts, offering high flexibility and control.
PercHead is a method for single-image 3D head reconstruction and semantic 3D editing that integrates perceptual supervision, dual-branch transformer architectures, and differentiable Gaussian splatting for high-fidelity and view-consistent output. It enables transforming a single RGB face image into a full 3D head representation and further supports semantic editing—where head geometry and appearance can be independently manipulated through segmentation maps and prompts—surpassing previous models in novel view synthesis, robustness to occlusions, and editing flexibility (Oroz et al., 4 Nov 2025).
1. Model Architecture
PercHead leverages a dual-branch encoder and a transformer-based 3D decoder to lift 2D appearance signals into a unified 3D head representation.
- Input Preprocessing: Each RGB face image is cropped and background-masked with a tracker adapted from GAGAvatar.
- Dual-Branch Encoder:
- Branch 1: A frozen DINOv2 backbone extracts dense patch features from intermediate layers, capturing multi-scale semantic and low-level structure.
- Branch 2: A lightweight ViT (inspired by MAE), trained from scratch, captures fine-grained, context-specific features not available in DINOv2.
- Background patches (approx. 30%) are masked out to reduce computation. The surviving patches' feature vectors from both branches are concatenated and projected into the decoder embedding space via an MLP,
$$f = \mathrm{MLP}\big([\, f_{\mathrm{DINO}} \,;\, f_{\mathrm{ViT}} \,]\big) \in \mathbb{R}^{d},$$
where $d$ is the decoder dimension (typically 512).
- ViT-Based 3D Decoder with Iterative Cross-Attention:
- The head is initialized from a fixed FLAME mesh, partitioned into 4096 3D patches, each with 16 vertices and a learnable embedding.
- Each decoder layer $\ell$ applies cross-attention between the current 3D patch embeddings $Z^{(\ell)}$ and the 2D encoder features $F$, then refines the result with a per-patch MLP and residual connections (a schematic sketch follows this list):
$$\tilde{Z}^{(\ell)} = Z^{(\ell)} + \mathrm{Attn}\big(Z^{(\ell)} W_Q,\; F W_K,\; F W_V\big), \qquad Z^{(\ell+1)} = \tilde{Z}^{(\ell)} + \mathrm{MLP}\big(\tilde{Z}^{(\ell)}\big).$$
- The cross-attention itself is the standard scaled dot-product form,
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$
where $Q = Z^{(\ell)} W_Q$, $K = F W_K$, $V = F W_V$, and $d_k$ is the key dimension.
- 3D-to-3D self-attention is deliberately omitted to limit computational cost; coherence arises from global cross-attention to the shared 2D context.
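Below is a minimal PyTorch-style sketch of this 2D-to-3D lifting step, covering the feature projection and one cross-attention decoder layer. The dimensions (e.g., a 768-D DINOv2 branch and a 384-D auxiliary ViT branch), module names, and layer layout are illustrative assumptions, not taken from the released implementation.

```python
import torch
import torch.nn as nn


class FeatureProjection(nn.Module):
    """Concatenate dual-branch patch features and project to the decoder space.

    The branch widths (768 / 384) and the two-layer MLP are assumptions.
    """

    def __init__(self, d_dino: int = 768, d_vit: int = 384, d_model: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_dino + d_vit, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, f_dino: torch.Tensor, f_vit: torch.Tensor) -> torch.Tensor:
        # f_dino: (B, N, 768), f_vit: (B, N, 384) -> (B, N, 512)
        return self.proj(torch.cat([f_dino, f_vit], dim=-1))


class CrossAttentionDecoderLayer(nn.Module):
    """One decoder layer: 3D patch tokens cross-attend to 2D image features.

    No 3D-to-3D self-attention; coherence comes from the shared 2D context.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(d_model)
        self.norm_kv = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(                       # per-patch refinement MLP
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, z3d: torch.Tensor, f2d: torch.Tensor) -> torch.Tensor:
        # z3d: (B, 4096, 512) learnable 3D patch embeddings (one per FLAME patch)
        # f2d: (B, N, 512)    projected 2D encoder features (non-background patches)
        attn_out, _ = self.cross_attn(
            self.norm_q(z3d), self.norm_kv(f2d), self.norm_kv(f2d)
        )
        z3d = z3d + attn_out        # residual: inject image evidence into 3D tokens
        z3d = z3d + self.mlp(z3d)   # residual: per-patch MLP, no 3D-3D attention
        return z3d
```

A stack of such layers is applied iteratively, so the 4096 patch embeddings are progressively refined before being decoded into Gaussians (Section 3).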
2. Perceptual Supervision and Losses
PercHead eschews pixel-wise losses in favor of multi-layer perceptual signals:
- Supervision Sources:
- DINOv2 (layers 8, 11): Provides high-frequency and semantic correspondence signals, promoting accurate facial identity and detail.
- SAM 2.1 image encoder: Yields segmentation-oriented features emphasizing correct geometry and part delineation, e.g., eyes, hairline, mouth region.
- Loss Formulation:
- Training of the lifting model is driven by a purely perceptual loss,
$$\mathcal{L}_{\mathrm{perc}} = \sum_{i} \lambda_i \,\big\| \Phi_i(\hat{I}) - \Phi_i(I) \big\|,$$
where $\Phi_i$ is the $i$-th feature extractor (DINOv2 or SAM 2.1), $\lambda_i$ its loss weight, $\hat{I}$ the rendered image, and $I$ the ground-truth view.
- After freezing the 3D decoder, a 2D CNN "refinement head" is trained with a combination of $\mathcal{L}_1$, LPIPS, and the above perceptual losses.
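The sketch below shows how such a purely perceptual objective can be assembled. The `feature_extractors` callables stand in for the frozen DINOv2 and SAM 2.1 encoders, and the L1 feature distance and weighting scheme are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def perceptual_loss(render: torch.Tensor,
                    target: torch.Tensor,
                    feature_extractors,   # e.g. frozen DINOv2 / SAM 2.1 feature hooks
                    weights) -> torch.Tensor:
    """Weighted sum of feature-space distances; no pixel-wise term.

    Each feature_extractors[i] maps an image batch (B, 3, H, W) to a feature
    tensor. The choice of layers, norm, and weights is illustrative.
    """
    loss = render.new_zeros(())
    for phi, lam in zip(feature_extractors, weights):
        with torch.no_grad():
            feat_target = phi(target)     # target features need no gradient
        feat_render = phi(render)         # gradients flow back through the renderer
        loss = loss + lam * F.l1_loss(feat_render, feat_target)
    return loss
```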
3. Differentiable Rendering via Gaussian Splatting
For efficient 3D-to-2D synthesis, PercHead utilizes differentiable Gaussian splatting:
- Each of the 4096 decoded 3D patches is upsampled to 16 Gaussians via PixelShuffle, yielding $4096 \times 16 = 65{,}536$ Gaussians in total.
- Each Gaussian $i$ is parameterized by:
- Mean $\mu_i \in \mathbb{R}^3$
- Covariance $\Sigma_i$ (diagonal scale and rotation)
- Color $c_i \in \mathbb{R}^3$
- Opacity $\alpha_i \in [0, 1]$
- The contribution of Gaussian $i$ to pixel $p$ is
$$w_i(p) = \alpha_i \exp\!\Big(-\tfrac{1}{2}\,\big(p - \mu_i'\big)^{\top} \Sigma_i'^{-1} \big(p - \mu_i'\big)\Big),$$
where $\mu_i'$ is the projection of the 3D center and $\Sigma_i'$ the corresponding projected 2D covariance.
- The final color is obtained by front-to-back alpha compositing,
$$C(p) = \sum_i c_i\, w_i(p) \prod_{j < i} \big(1 - w_j(p)\big),$$
and the result is sharpened with a lightweight 2D CNN.
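A naive per-pixel version of this compositing is sketched below for readability; practical Gaussian-splatting renderers use a depth-sorted, tile-based CUDA rasterizer, so this is a reference implementation of the equations rather than the production pipeline.

```python
import torch


def composite_pixel(colors: torch.Tensor,     # (N, 3) per-Gaussian RGB
                    opacities: torch.Tensor,  # (N,)   per-Gaussian opacity in [0, 1]
                    means2d: torch.Tensor,    # (N, 2) projected centers mu_i'
                    covs2d: torch.Tensor,     # (N, 2, 2) projected covariances Sigma_i'
                    pixel: torch.Tensor       # (2,)   pixel coordinate p
                    ) -> torch.Tensor:
    """Alpha-composite depth-sorted (front-to-back) Gaussians at one pixel."""
    d = pixel - means2d                                    # (N, 2)
    inv_cov = torch.linalg.inv(covs2d)                     # (N, 2, 2)
    maha = torch.einsum('ni,nij,nj->n', d, inv_cov, d)     # (p - mu)^T Sigma^{-1} (p - mu)
    w = opacities * torch.exp(-0.5 * maha)                 # contributions w_i(p)

    # Transmittance T_i = prod_{j < i} (1 - w_j(p)), assuming front-to-back order.
    transmittance = torch.cumprod(
        torch.cat([w.new_ones(1), 1.0 - w[:-1]]), dim=0
    )
    return (colors * (w * transmittance).unsqueeze(-1)).sum(dim=0)   # C(p)
```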
4. Semantic 3D Editing: Disentangling Geometry and Style
PercHead supports semantic editing by encoder swapping and careful input modality disentanglement:
- Editing Encoder:
- Geometry is prescribed by a 19-channel FARL segmentation map (semantic part labels).
- Appearance ("style") is specified either by a CLIP-encoded text prompt or a reference image; the global 512D CLIP style token is appended to each patch embedding.
- Editing Decoder: Remains unchanged from the base reconstruction model.
- Disentanglement Mechanism:
- Geometry is determined solely by the segmentation map, with no influence from CLIP style.
- Style is governed exclusively by the CLIP embedding, having no spatial effect on geometry.
- This approach achieves clean separation, with the model attending to the correct modality for each subtask.
- Editing-Specific Loss: The base perceptual loss is sufficient; an optional CLIP-guided loss may be used, but is not required.
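A minimal sketch of how such an editing encoder might assemble its conditioning tokens follows, assuming a 19-channel one-hot segmentation input and a 512-D CLIP embedding that is broadcast and fused into every patch token; the patchification and fusion layers are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn


class EditConditioner(nn.Module):
    """Build decoder conditioning from a segmentation map (geometry) and a
    CLIP embedding (style). Sketch only; layer choices are assumptions.
    """

    def __init__(self, n_classes: int = 19, d_clip: int = 512,
                 d_model: int = 512, patch: int = 16):
        super().__init__()
        # Patchify the segmentation map into geometry tokens.
        self.seg_embed = nn.Conv2d(n_classes, d_model, kernel_size=patch, stride=patch)
        # Fuse the global style token into each geometry token.
        self.fuse = nn.Linear(d_model + d_clip, d_model)

    def forward(self, seg_map: torch.Tensor, clip_style: torch.Tensor) -> torch.Tensor:
        # seg_map:    (B, 19, H, W) semantic part labels   -> controls geometry only
        # clip_style: (B, 512) CLIP text or image embedding -> controls appearance only
        geo = self.seg_embed(seg_map).flatten(2).transpose(1, 2)   # (B, N, d_model)
        style = clip_style.unsqueeze(1).expand(-1, geo.shape[1], -1)
        return self.fuse(torch.cat([geo, style], dim=-1))          # (B, N, d_model)
```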
5. Quantitative and Qualitative Performance
5.1. Novel View Synthesis
PercHead demonstrates strong results on Ava-256 (novel-view benchmarks, 11 heads × 5 views) and NeRSemble (5 heads × 16 views):
| Method | PSNR (Ava-256) | LPIPS (Ava-256) | ArcFace (Ava-256) | PSNR (NeRSemble) | LPIPS (NeRSemble) | ArcFace (NeRSemble) |
|---|---|---|---|---|---|---|
| PercHead | 16.08 | 0.2666 | 0.2935 | 18.04 | 0.1854 | 0.2559 |
| GAGAvatar | 15.87 | 0.2739 | 0.3481 | 16.88 | 0.2169 | 0.2883 |
Higher PSNR and lower LPIPS and ArcFace distance values reflect greater reconstruction, perceptual, and identity fidelity, respectively.
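For reference, the sketch below shows how the reconstruction and identity metrics are typically computed for a rendered/ground-truth image pair; the `embed` callable stands in for a pretrained face-recognition network such as ArcFace and is an assumption about the evaluation setup rather than the paper's exact protocol.

```python
import torch
import torch.nn.functional as F


def psnr(render: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = F.mse_loss(render, target)
    return 10.0 * torch.log10(max_val ** 2 / mse)


def identity_distance(render: torch.Tensor, target: torch.Tensor, embed) -> torch.Tensor:
    """Cosine distance between face-recognition embeddings; lower means better
    identity preservation. `embed` is a placeholder for a pretrained network."""
    e_render, e_target = embed(render), embed(target)
    return 1.0 - F.cosine_similarity(e_render, e_target, dim=-1).mean()
```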
5.2. Extreme View Robustness
On challenging extreme left/right and vertical angles:
| Method | PSNR (Ava-256) | LPIPS (Ava-256) | ArcFace (Ava-256) |
|---|---|---|---|
| PercHead | 15.58 | 0.2866 | 0.2812 |
| GAGAvatar | 13.54 | 0.3643 | 0.5228 |
PercHead degrades only mildly under these extreme viewpoints, whereas the GAGAvatar baseline loses substantially more perceptual and identity fidelity.
5.3. Qualitative and Video Consistency
Frame-by-frame analysis and visualizations demonstrate identity-preserving, detail-faithful renderings—maintained across time and pose transitions in both image and video domains.
6. Implementation Considerations and Limitations
- Data: PercHead is trained on NeRSemble (real multi-view), Cafca (synthetic multi-view), and FFHQ (single-view) datasets.
- Computation:
- Base lifting model: 70 h on one RTX 3090 (perceptual losses only).
- CNN refinement: additional 24 h.
- Editing fine-tune: 30 h.
- Optimizer: AdamW with standard hyperparameters (a schematic sketch follows at the end of this section).
- Runtime: Inference requires several seconds per frame, dominated by Gaussian splatting and CNN stages.
- Limitations:
- No explicit dynamic expression transfer (e.g., reenactment is unsupported).
- Absence of real-time performance.
- No relighting: lighting is entangled in the input image and cannot be edited in isolation.
- Failure cases can arise from extreme occlusions or nonhuman accessories.
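As a rough illustration of the staged training referenced above, the sketch below uses placeholder modules and common AdamW settings; the learning rate and weight decay are generic defaults, not values reported for PercHead.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the actual networks (assumptions).
lifting_model = nn.Sequential(nn.Linear(512, 512))          # encoder ViT branch + 3D decoder
refine_cnn = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))   # 2D refinement head

# Stage 1: optimize the lifting model with perceptual losses only (~70 h, RTX 3090).
opt_lift = torch.optim.AdamW(lifting_model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), weight_decay=0.01)

# Stage 2: freeze the lifting model, then train only the refinement CNN (~24 h).
for p in lifting_model.parameters():
    p.requires_grad_(False)
opt_refine = torch.optim.AdamW(refine_cnn.parameters(), lr=1e-4,
                               betas=(0.9, 0.999), weight_decay=0.01)
```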
7. Context and Significance
PercHead embodies a convergence of self-supervised vision, modern transformer design, and differentiable rendering. Its dual-branch encoder separates global, semantic cues (frozen DINOv2) from local, fine-grained ones (the trainable ViT branch), and iterative cross-attention bridges the 2D-to-3D relationship without computationally expensive 3D-to-3D self-attention. Perceptual supervision, which avoids pixel-wise losses such as MSE or SSIM except in the final CNN refinement, directly optimizes fidelity in the embedding space of robust vision backbones (DINOv2, SAM 2.1).
The semantic editing variant operationalizes the disentanglement of head geometry and style—enabling interactive tools where geometry and appearance can be sculpted and styled independently by end-users, with low parameter overhead.
Personalized head geometry also matters beyond graphics, for example in classical HRTF-individualization pipelines for binaural audio, where user-specific head and ear shape drives acoustic personalization (Guezenoc et al., 2020). While PercHead itself focuses on visual 3D head synthesis, its single-image, perceptually supervised approach intersects with this broader trend toward perceptual, user-specific modeling in audio, graphics, and multimodal systems.