CodeFormer++: Modular Blind Face Restoration
- CodeFormer++ is a modular framework that decomposes blind face restoration into identity preservation, high-quality generation, deformable alignment, and texture-identity fusion.
- It integrates a deformable face registration module and a texture attention network to dynamically balance realistic texture details with identity consistency.
- Metric learning with a hard anchor-positive approach ensures that the final output achieves state-of-the-art performance in both perceptual quality and identity fidelity.
CodeFormer++ is a modular framework for blind face restoration (BFR) that systematically addresses the challenge of reconstructing high-fidelity, identity-preserving faces from inputs subject to complex, unknown degradations. The framework advances beyond earlier generative-prior-based solutions by decomposing BFR into structurally distinct stages: identity-preserving restoration, high-quality generation, semantic alignment via deformable registration, and dynamic fusion through metric learning. This architectural separation enables CodeFormer++ to circumvent the conventional trade-off between visual fidelity and identity consistency.
1. Problem Decomposition and Pipeline Structure
CodeFormer++ explicitly partitions the restoration problem into four consecutive stages:
- Identity-preserving restoration (CF-ID): Runs CodeFormer (Zhou et al., 2022) with its fidelity weight set to $w = 1$, producing an output $I_{id}$ that leverages maximal information from the degraded input and prioritizes facial identity at the expense of realistic texture.
- High-quality generation (CF-GP): Runs CodeFormer with $w = 0$, synthesizing an output $I_{gp}$ primarily from the generative prior, optimizing for rich texture and natural appearance but potentially compromising identity cues.
- Deformable alignment: Applies spatial registration to semantically align $I_{gp}$ with $I_{id}$, preparing the pair for effective fusion.
- Texture-identity fusion (metric learning): Integrates realistic texture details from the aligned generative output into the identity-preserving base using deep metric learning for optimal perceptual and identity balance.
This pipeline is underpinned by three key modules: a deformable face registration network (DAM), a texture-guided restoration network (TGRN) with a texture attention module (TAM), and a deep metric learning supervision regime.
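To make the staged decomposition concrete, here is a minimal Python sketch of how the four stages could be orchestrated. The callables `codeformer`, `dam`, and `tgrn`, and the `w` keyword interface, are illustrative assumptions for this sketch, not the authors' released API.

```python
import torch

def restore(lq_image: torch.Tensor, codeformer, dam, tgrn) -> torch.Tensor:
    """Hypothetical CodeFormer++ pipeline: two CodeFormer passes, alignment, fusion."""
    # Stage 1: identity-preserving pass (fidelity weight w = 1 keeps input cues).
    i_id = codeformer(lq_image, w=1.0)
    # Stage 2: generative-prior pass (w = 0 favors rich codebook texture).
    i_gp = codeformer(lq_image, w=0.0)
    # Stage 3: deformable registration warps the texture-rich result onto
    # the identity-preserving geometry.
    i_gp_warped = dam(moving=i_gp, fixed=i_id)
    # Stage 4: texture-guided fusion blends the two branches adaptively.
    return tgrn(i_id, i_gp_warped)
```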
2. Deformable Face Registration Module (DAM)
Objective: DAM mitigates semantic and geometric discrepancies between the identity-preserving output $I_{id}$ and the generative-prior output $I_{gp}$, especially across structurally critical facial regions.
Mechanism:
- Deformable registration: A trainable registration network $f_\theta$ predicts a dense flow field for non-rigid warping: $\phi = f_\theta(I_{gp}, I_{id})$.
- Warping operation: $I_{gp}$ is spatially transformed via $\phi$ using a differentiable spatial sampler, yielding $I_{gp}^{w} = I_{gp} \circ \phi$, which aligns with $I_{id}$ while retaining the texture of $I_{gp}$.
Loss functions:
- Similarity loss ($\mathcal{L}_{sim}$): Negative local normalized cross-correlation between $I_{gp}^{w}$ and $I_{id}$ over image patches, maximizing spatial correspondence.
- Smoothness loss ($\mathcal{L}_{smooth}$): Encourages spatial coherence in the flow by penalizing its spatial gradients: $\mathcal{L}_{smooth} = \sum_{p} \lVert \nabla \phi(p) \rVert^{2}$.
- Total DAM loss: Weighted sum $\mathcal{L}_{DAM} = \mathcal{L}_{sim} + \lambda\,\mathcal{L}_{smooth}$.
Significance: DAM generates a texture-rich, structurally aligned prior $I_{gp}^{w}$ for downstream fusion, addressing the spatial misalignment that impedes naive combination of the two CodeFormer outputs. A minimal sketch of this registration step follows.
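The sketch below shows the differentiable warp and the two DAM losses in PyTorch, assuming a VoxelMorph-style formulation with the flow expressed in normalized grid coordinates; the flow-prediction network itself is omitted, and the helper names (`warp`, `ncc_loss`, `smoothness_loss`) are this sketch's own.

```python
import torch
import torch.nn.functional as F

def warp(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp `image` (B,C,H,W) with a dense flow field (B,2,H,W) via grid_sample."""
    b, _, h, w = image.shape
    # Base sampling grid in [-1, 1] normalized coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=image.device),
        torch.linspace(-1, 1, w, device=image.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Flow is assumed to be expressed in normalized units here.
    grid = base + flow.permute(0, 2, 3, 1)
    return F.grid_sample(image, grid, align_corners=True)

def ncc_loss(a: torch.Tensor, b: torch.Tensor, win: int = 9) -> torch.Tensor:
    """Negative local normalized cross-correlation over win x win patches."""
    pad = win // 2
    c = a.shape[1]
    kernel = torch.ones(1, c, win, win, device=a.device) / (win * win * c)
    mu_a = F.conv2d(a, kernel, padding=pad)
    mu_b = F.conv2d(b, kernel, padding=pad)
    var_a = F.conv2d(a * a, kernel, padding=pad) - mu_a ** 2
    var_b = F.conv2d(b * b, kernel, padding=pad) - mu_b ** 2
    cov = F.conv2d(a * b, kernel, padding=pad) - mu_a * mu_b
    ncc = (cov ** 2) / (var_a * var_b + 1e-5)
    return -ncc.mean()

def smoothness_loss(flow: torch.Tensor) -> torch.Tensor:
    """Penalize spatial gradients of the flow field for coherent deformation."""
    dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean()
    dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()
    return dx + dy
```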
3. Texture-Guided Restoration Network (TGRN) and Texture Attention Module (TAM)
Architecture:
- Inputs: Identity restoration $I_{id}$ and aligned generative prior $I_{gp}^{w}$.
- TGRN backbone: A U-Net structure processes $I_{id}$; multi-scale encoder features $F_{id}^{(s)}$ are extracted.
- TAM: Extracts multi-level texture features $F_{tex}^{(s)}$ from $I_{gp}^{w}$ via hierarchical convolutional and residual blocks. Adaptive pooling synchronizes spatial dimensions with the encoder features.
Fusion mechanism:
- Global descriptors: Mean-pooled vectors $g_{id}^{(s)}$ (identity) and $g_{tex}^{(s)}$ (texture) computed at each scale $s$.
- MLP fusion weights: The concatenated descriptors are passed through an MLP to yield fusion coefficients $\alpha^{(s)} \in [0, 1]$.
- Feature blending: Elementwise: $F_{fused}^{(s)} = \alpha^{(s)} \odot F_{id}^{(s)} + (1 - \alpha^{(s)}) \odot F_{tex}^{(s)}$.
- Decoding: Fused features are decoded to obtain the final restoration, balancing structure and perceptual realism (a minimal sketch of the scale-wise blending follows).
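A minimal PyTorch sketch of this per-scale adaptive blending; the layer sizes and the sigmoid gating are assumptions of this sketch, not the published architecture.

```python
import torch
import torch.nn as nn

class ScaleFusion(nn.Module):
    """Per-scale adaptive blending of identity and texture features."""

    def __init__(self, channels: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
            nn.Sigmoid(),  # per-channel fusion coefficient alpha in [0, 1]
        )

    def forward(self, f_id: torch.Tensor, f_tex: torch.Tensor) -> torch.Tensor:
        # Global descriptors: spatial mean pooling of each branch -> (B, C).
        g_id = f_id.mean(dim=(2, 3))
        g_tex = f_tex.mean(dim=(2, 3))
        # MLP maps the concatenated descriptors to fusion weights.
        alpha = self.mlp(torch.cat([g_id, g_tex], dim=1))[:, :, None, None]
        # Elementwise blend: alpha * identity + (1 - alpha) * texture.
        return alpha * f_id + (1.0 - alpha) * f_tex
```

Gating on pooled descriptors rather than on raw feature maps keeps the fusion decision global per scale, which matches the article's description of scale-wise blending.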
Loss regime:
- Regression loss ($\mathcal{L}_{reg}$): Pixel-wise error against the ground truth.
- Adversarial loss ($\mathcal{L}_{adv}$): A GAN discriminator encourages realism.
- Identity loss ($\mathcal{L}_{id}$): ArcFace embedding distances enforce identity fidelity.
- Metric learning loss ($\mathcal{L}_{ml}$): Supervises fusion at the representation level (Section 4).
- Joint objective: $\mathcal{L}_{TGRN} = \lambda_{reg}\mathcal{L}_{reg} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{id}\mathcal{L}_{id} + \lambda_{ml}\mathcal{L}_{ml}$, with the $\lambda$ weights balancing the terms.
Context: Classic loss-only fusion can suppress artifacts but fails to exploit texture from the prior. The adaptive fusion enabled by TAM addresses this shortcoming. A compact sketch of assembling the joint objective follows.
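For completeness, a compact sketch of the joint objective as a weighted sum; the weight values shown are placeholders, not the published settings.

```python
def tgrn_objective(l_reg, l_adv, l_id, l_ml,
                   w_reg=1.0, w_adv=0.1, w_id=0.5, w_ml=0.5):
    """Weighted sum of the four TGRN loss terms (weights are illustrative)."""
    return w_reg * l_reg + w_adv * l_adv + w_id * l_id + w_ml * l_ml
```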
4. Deep Metric Learning Integration
Purpose: Metric learning constrains the fused output to capture both texture realism and identity preservation.
Anchor-positive construction:
- Hard anchor-positive ($I_{ap}$): Fuses facial features (eyes, nose, mouth) from $I_{gp}^{w}$ with context/skin from $I_{id}$ using a binary mask $M$: $I_{ap} = M \odot I_{gp}^{w} + (1 - M) \odot I_{id}$.
- Negative sample: $I_{id}$ (lacks realistic texture).
Triplet loss (cosine embedding):
- Embeddings $e_{ap}$ (hard anchor-positive), $e_{out}$ (network output), and $e_{neg}$ (from $I_{id}$) are extracted via a pretrained VGG network.
- Loss formulation: $\mathcal{L}_{ml} = \big(1 - \cos(e_{out}, e_{ap})\big) + \max\big(0, \cos(e_{out}, e_{neg}) - m\big)$, with margin $m$.
Encourages the final output to approach the hard anchor-positive and diverge from the identity-only baseline.
Effect: This construction makes the metric learning signal meaningful in the restoration context; naively using the ground truth as the positive is ineffective, since it is generally out-of-distribution with respect to the warped generative prior. A minimal sketch of the anchor construction and loss follows.
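The sketch below shows the anchor-positive composition and a cosine-embedding triplet loss, assuming the embeddings are flattened activations from a fixed pretrained VGG layer; the mask source (e.g., a face parser), the region assignment, and the margin value are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def hard_anchor_positive(i_gp_warped: torch.Tensor,
                         i_id: torch.Tensor,
                         mask: torch.Tensor) -> torch.Tensor:
    """Compose the hard positive: facial components (eyes, nose, mouth) from
    the warped generative prior, remaining skin/context from the identity
    branch. `mask` is a binary map of the facial-component regions (B,1,H,W)."""
    return mask * i_gp_warped + (1.0 - mask) * i_id

def cosine_triplet_loss(e_out: torch.Tensor,
                        e_ap: torch.Tensor,
                        e_neg: torch.Tensor,
                        margin: float = 0.2) -> torch.Tensor:
    """Cosine-embedding triplet: pull the output embedding toward the hard
    positive, push it away from the identity-only negative (margin is a
    placeholder, not the published setting)."""
    pos = 1.0 - F.cosine_similarity(e_out, e_ap, dim=1)
    neg = torch.clamp(F.cosine_similarity(e_out, e_neg, dim=1) - margin, min=0.0)
    return (pos + neg).mean()
```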
5. Component Synergy and Architectural Diagram
The CodeFormer++ pipeline is characterized by a staged, modular synergy:
- DAM aligns and prepares features for fusion, removing geometric bias.
- TGRN/TAM dynamically attend to and blend identity and texture at varying scales.
- Metric learning supervises this fusion in the embedding space, optimizing for the delicate equilibrium between sharpness and identity.
Architectural flow:
- Input LQ image → CodeFormer with $w = 1$ (CF-ID, output $I_{id}$) and $w = 0$ (CF-GP, output $I_{gp}$)
- $I_{gp}$ and $I_{id}$ → DAM → $I_{gp}^{w}$
- $I_{gp}^{w}$ and $I_{id}$ → TGRN with TAM → final output (supervised by the metric learning loss)
6. Experimental Validation and Generalization
Benchmark datasets: FFHQ (train), CelebA-Test (synthetic), LFW-Test, WebPhoto-Test, WIDER-Test (real-world, mixed degradations).
Metrics: PSNR, SSIM, FID, NIQE, LPIPS (perceptual), LMD (identity/landmark).
Results:
- CelebA-Test: CodeFormer++ reports FID = 38.13 (best), LPIPS = 0.341 (second-best), and LMD = 5.41 (second-best), demonstrating strong perceptual quality with near state-of-the-art identity fidelity.
- Real-world datasets: Outperforms competing methods in FID and matches or exceeds them on NIQE and identity scores.
- Qualitative: Visual results exhibit sharp detail and accurate identity retention under severe degradation in the paper's qualitative comparisons.
Ablation studies:
- DAM alone enhances LMD but introduces artifacts.
- TGRN with classic loss removes artifacts, but texture is insufficient.
- Naive metric learning with ground truth positive fails.
- Hard anchor-positive construction yields truly balanced fusion, confirming its necessity (Table 3, Fig. 5).
Extension: The fusion and metric learning pipeline generalizes to other generative priors (RestoreFormer, DAEFR, DifFace); improvements in identity metrics are retained across backbone swaps (Table 4, Fig. 6).
7. Core Contributions and Scientific Impact
- Modular decomposition of BFR tasks allows targeted optimization and circumvents oversimplified trade-offs in prior work (Reddem et al., 2025).
- Learning-based deformable registration achieves semantically aligned, texture-rich feature fusion without geometric compromise.
- Adaptive attention-based fusion determines optimal blending at each encoder level.
- Hard metric learning creates a meaningful relational embedding for facial restoration tasks.
- State-of-the-art performance verified through robust benchmarks and thorough ablations.
A plausible implication is that the architectural separation and learned alignment mechanisms of CodeFormer++ provide a blueprint for restoration pipelines in which the interplay of semantic structure and texture is critical to balancing perceptual quality and identity fidelity. The generalizable fusion design further suggests utility in domains beyond BFR, wherever a generative prior and input evidence must be dynamically and adaptively merged.