CodeFormer++: Modular Blind Face Restoration
- CodeFormer++ is a modular framework that decomposes blind face restoration into identity preservation, high-quality generation, deformable alignment, and texture-identity fusion.
- It integrates a deformable face registration module and a texture attention network to dynamically balance realistic texture details with identity consistency.
- Metric learning with a hard anchor-positive approach ensures that the final output achieves state-of-the-art performance in both perceptual quality and identity fidelity.
CodeFormer++ is a modular framework for blind face restoration (BFR) that systematically addresses the challenge of reconstructing high-fidelity, identity-preserving faces from inputs subject to complex, unknown degradations. The framework advances beyond earlier generative-prior-based solutions by decomposing BFR into structurally distinct stages: identity-preserving restoration, high-quality generation, semantic alignment via deformable registration, and dynamic fusion through metric learning. This architectural separation enables CodeFormer++ to circumvent the conventional trade-off between visual fidelity and identity consistency.
1. Problem Decomposition and Pipeline Structure
CodeFormer++ explicitly partitions the restoration problem into four consecutive stages:
- Identity-preserving restoration (CF-ID): Runs CodeFormer (Zhou et al., 2022) with its fidelity weight set to $w = 1$, producing an output $I_{id}$ that leverages maximal information from the degraded input and prioritizes facial identity at the expense of realistic texture.
- High-quality generation (CF-GP): Runs CodeFormer with $w = 0$, synthesizing an output $I_{gp}$ primarily from the generative prior, optimizing for rich texture and natural appearance but potentially compromising identity cues.
- Deformable alignment: Applies spatial registration to semantically align $I_{gp}$ with $I_{id}$, preparing the pair for effective fusion.
- Texture-identity fusion (metric learning): Integrates realistic texture details from the aligned generative output into the identity-preserving base using deep metric learning for optimal perceptual and identity balance.
This pipeline is underpinned by three key modules: a deformable face registration network (DAM), a texture-guided restoration network (TGRN) with a texture attention module (TAM), and a deep metric learning supervision regime.
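To make the staged decomposition concrete, here is a minimal Python sketch of how the four stages could be orchestrated. The callables `codeformer`, `dam`, and `tgrn`, and the `w` keyword interface, are illustrative assumptions for this sketch, not the authors' released API.

```python
import torch

def restore(lq_image: torch.Tensor, codeformer, dam, tgrn) -> torch.Tensor:
    """Hypothetical CodeFormer++ pipeline: two CodeFormer passes, alignment, fusion."""
    # Stage 1: identity-preserving pass (fidelity weight w = 1 keeps input cues).
    i_id = codeformer(lq_image, w=1.0)
    # Stage 2: generative-prior pass (w = 0 favors rich codebook texture).
    i_gp = codeformer(lq_image, w=0.0)
    # Stage 3: deformable registration warps the texture-rich result onto
    # the identity-preserving geometry.
    i_gp_warped = dam(moving=i_gp, fixed=i_id)
    # Stage 4: texture-guided fusion blends the two branches adaptively.
    return tgrn(i_id, i_gp_warped)
```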
2. Deformable Face Registration Module (DAM)
Objective: DAM mitigates semantic and geometric discrepancies between the identity-preserving output $I_{id}$ and the generative-prior output $I_{gp}$, especially across structurally critical facial regions.
Mechanism:
- Deformable registration: A trainable registration network $f_\theta$ predicts a dense flow field for non-rigid warping: $\phi = f_\theta(I_{gp}, I_{id})$.
- Warping operation: $I_{gp}$ is spatially transformed via $\phi$ using a differentiable spatial sampler, yielding $I_{gp}^{w} = I_{gp} \circ \phi$, which aligns with $I_{id}$ while retaining the texture of $I_{gp}$.
Loss functions:
- Similarity loss ($\mathcal{L}_{sim}$): Negative local normalized cross-correlation between $I_{gp}^{w}$ and $I_{id}$ over image patches, maximizing spatial correspondence.
- Smoothness loss ($\mathcal{L}_{smooth}$): Encourages spatial coherence in the flow by penalizing its spatial gradients: $\mathcal{L}_{smooth} = \sum_{p} \lVert \nabla \phi(p) \rVert^{2}$.
- Total DAM loss: Weighted sum $\mathcal{L}_{DAM} = \mathcal{L}_{sim} + \lambda\,\mathcal{L}_{smooth}$.
Significance: DAM generates a texture-rich, structurally aligned prior $I_{gp}^{w}$ for downstream fusion, addressing the spatial misalignment that impedes naive combination of the two CodeFormer outputs. A minimal sketch of this registration step follows.
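The sketch below shows the differentiable warp and the two DAM losses in PyTorch, assuming a VoxelMorph-style formulation with the flow expressed in normalized grid coordinates; the flow-prediction network itself is omitted, and the helper names (`warp`, `ncc_loss`, `smoothness_loss`) are this sketch's own.

```python
import torch
import torch.nn.functional as F

def warp(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp `image` (B,C,H,W) with a dense flow field (B,2,H,W) via grid_sample."""
    b, _, h, w = image.shape
    # Base sampling grid in [-1, 1] normalized coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=image.device),
        torch.linspace(-1, 1, w, device=image.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Flow is assumed to be expressed in normalized units here.
    grid = base + flow.permute(0, 2, 3, 1)
    return F.grid_sample(image, grid, align_corners=True)

def ncc_loss(a: torch.Tensor, b: torch.Tensor, win: int = 9) -> torch.Tensor:
    """Negative local normalized cross-correlation over win x win patches."""
    pad = win // 2
    c = a.shape[1]
    kernel = torch.ones(1, c, win, win, device=a.device) / (win * win * c)
    mu_a = F.conv2d(a, kernel, padding=pad)
    mu_b = F.conv2d(b, kernel, padding=pad)
    var_a = F.conv2d(a * a, kernel, padding=pad) - mu_a ** 2
    var_b = F.conv2d(b * b, kernel, padding=pad) - mu_b ** 2
    cov = F.conv2d(a * b, kernel, padding=pad) - mu_a * mu_b
    ncc = (cov ** 2) / (var_a * var_b + 1e-5)
    return -ncc.mean()

def smoothness_loss(flow: torch.Tensor) -> torch.Tensor:
    """Penalize spatial gradients of the flow field for coherent deformation."""
    dx = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean()
    dy = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()
    return dx + dy
```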
3. Texture-Guided Restoration Network (TGRN) and Texture Attention Module (TAM)
Architecture:
- Inputs: Identity restoration $I_{id}$ and aligned generative prior $I_{gp}^{w}$.
- TGRN backbone: A U-Net structure processes $I_{id}$; multi-scale encoder features $F_{id}^{(s)}$ are extracted.
- TAM: Extracts multi-level texture features $F_{tex}^{(s)}$ from $I_{gp}^{w}$ via hierarchical convolutional and residual blocks. Adaptive pooling synchronizes spatial dimensions with the encoder features.
Fusion mechanism:
- Global descriptors: Mean-pooled vectors $g_{id}^{(s)}$ (identity) and $g_{tex}^{(s)}$ (texture) computed at each scale $s$.
- MLP fusion weights: The concatenated descriptors are passed through an MLP to yield fusion coefficients $\alpha^{(s)} \in [0, 1]$.
- Feature blending: Elementwise: $F_{fused}^{(s)} = \alpha^{(s)} \odot F_{id}^{(s)} + (1 - \alpha^{(s)}) \odot F_{tex}^{(s)}$.
- Decoding: Fused features are decoded to obtain the final restoration, balancing structure and perceptual realism (a minimal sketch of the scale-wise blending follows).
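A minimal PyTorch sketch of this per-scale adaptive blending; the layer sizes and the sigmoid gating are assumptions of this sketch, not the published architecture.

```python
import torch
import torch.nn as nn

class ScaleFusion(nn.Module):
    """Per-scale adaptive blending of identity and texture features."""

    def __init__(self, channels: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
            nn.Sigmoid(),  # per-channel fusion coefficient alpha in [0, 1]
        )

    def forward(self, f_id: torch.Tensor, f_tex: torch.Tensor) -> torch.Tensor:
        # Global descriptors: spatial mean pooling of each branch -> (B, C).
        g_id = f_id.mean(dim=(2, 3))
        g_tex = f_tex.mean(dim=(2, 3))
        # MLP maps the concatenated descriptors to fusion weights.
        alpha = self.mlp(torch.cat([g_id, g_tex], dim=1))[:, :, None, None]
        # Elementwise blend: alpha * identity + (1 - alpha) * texture.
        return alpha * f_id + (1.0 - alpha) * f_tex
```

Gating on pooled descriptors rather than on raw feature maps keeps the fusion decision global per scale, which matches the article's description of scale-wise blending.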
Loss regime:
- Regression loss ($\mathcal{L}_{reg}$): Pixel-wise error against the ground truth.
- Adversarial loss ($\mathcal{L}_{adv}$): A GAN discriminator encourages realism.
- Identity loss ($\mathcal{L}_{id}$): ArcFace embedding distances enforce identity fidelity.
- Metric learning loss ($\mathcal{L}_{ml}$): Supervises fusion at the representation level (Section 4).
- Joint objective: $\mathcal{L}_{TGRN} = \lambda_{reg}\mathcal{L}_{reg} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{id}\mathcal{L}_{id} + \lambda_{ml}\mathcal{L}_{ml}$, with the $\lambda$ weights balancing the terms.
Context: Classic loss-only fusion can suppress artifacts but fails to exploit texture from the prior. The adaptive fusion enabled by TAM addresses this shortcoming. A compact sketch of assembling the joint objective follows.
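For completeness, a compact sketch of the joint objective as a weighted sum; the weight values shown are placeholders, not the published settings.

```python
def tgrn_objective(l_reg, l_adv, l_id, l_ml,
                   w_reg=1.0, w_adv=0.1, w_id=0.5, w_ml=0.5):
    """Weighted sum of the four TGRN loss terms (weights are illustrative)."""
    return w_reg * l_reg + w_adv * l_adv + w_id * l_id + w_ml * l_ml
```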
4. Deep Metric Learning Integration
Purpose: Metric learning constrains the fused output to capture both texture realism and identity preservation.
Anchor-positive construction:
- Hard anchor-positive ($I_{ap}$): Fuses facial features (eyes, nose, mouth) from $I_{gp}^{w}$ with context/skin from $I_{id}$ using a binary mask $M$: $I_{ap} = M \odot I_{gp}^{w} + (1 - M) \odot I_{id}$.
- Negative sample: $I_{id}$ (lacks realistic texture).
Triplet loss (cosine embedding):
- Embeddings $e_{ap}$ (hard anchor-positive), $e_{out}$ (network output), and $e_{neg}$ (from $I_{id}$) are extracted via a pretrained VGG network.
- Loss formulation: $\mathcal{L}_{ml} = \big(1 - \cos(e_{out}, e_{ap})\big) + \max\big(0, \cos(e_{out}, e_{neg}) - m\big)$, with margin $m$.
Encourages the final output to approach the hard anchor-positive and diverge from the identity-only baseline.
Effect: This construction makes the metric learning signal meaningful in the restoration context; naively using the ground truth as the positive is ineffective, since it is generally out-of-distribution with respect to the warped generative prior. A minimal sketch of the anchor construction and loss follows.
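The sketch below shows the anchor-positive composition and a cosine-embedding triplet loss, assuming the embeddings are flattened activations from a fixed pretrained VGG layer; the mask source (e.g., a face parser), the region assignment, and the margin value are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def hard_anchor_positive(i_gp_warped: torch.Tensor,
                         i_id: torch.Tensor,
                         mask: torch.Tensor) -> torch.Tensor:
    """Compose the hard positive: facial components (eyes, nose, mouth) from
    the warped generative prior, remaining skin/context from the identity
    branch. `mask` is a binary map of the facial-component regions (B,1,H,W)."""
    return mask * i_gp_warped + (1.0 - mask) * i_id

def cosine_triplet_loss(e_out: torch.Tensor,
                        e_ap: torch.Tensor,
                        e_neg: torch.Tensor,
                        margin: float = 0.2) -> torch.Tensor:
    """Cosine-embedding triplet: pull the output embedding toward the hard
    positive, push it away from the identity-only negative (margin is a
    placeholder, not the published setting)."""
    pos = 1.0 - F.cosine_similarity(e_out, e_ap, dim=1)
    neg = torch.clamp(F.cosine_similarity(e_out, e_neg, dim=1) - margin, min=0.0)
    return (pos + neg).mean()
```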
5. Component Synergy and Architectural Diagram
The CodeFormer++ pipeline is characterized by a staged, modular synergy:
- DAM aligns and prepares features for fusion, removing geometric bias.
- TGRN/TAM dynamically attend to and blend identity and texture at varying scales.
- Metric learning supervises this fusion in the embedding space, optimizing for the delicate equilibrium between sharpness and identity.
Architectural flow:
- Input LQ image → CodeFormer with $w = 1$ (CF-ID, output $I_{id}$) and $w = 0$ (CF-GP, output $I_{gp}$)
- $I_{gp}$ and $I_{id}$ → DAM → $I_{gp}^{w}$
- $I_{gp}^{w}$ and $I_{id}$ → TGRN with TAM → final output (supervised by the metric learning loss)
6. Experimental Validation and Generalization
Benchmark datasets: FFHQ (train), CelebA-Test (synthetic), LFW-Test, WebPhoto-Test, WIDER-Test (real-world, mixed degradations).
Metrics: PSNR, SSIM, FID, NIQE, LPIPS (perceptual), LMD (identity/landmark).
Results:
- CelebA-Test: CodeFormer++ reports FID = 38.13 (best), LPIPS = 0.341 (second-best), and LMD = 5.41 (second-best), demonstrating strong perceptual quality with near state-of-the-art identity fidelity.
- Real-world datasets: Outperforms competing methods in FID and matches or exceeds them on NIQE and identity scores.
- Qualitative: Visual results exhibit sharp detail and accurate identity retention under severe degradation in the paper's qualitative comparisons.
Ablation studies:
- DAM alone enhances LMD but introduces artifacts.
- TGRN with classic loss removes artifacts, but texture is insufficient.
- Naive metric learning with ground truth positive fails.
- Hard anchor-positive construction yields truly balanced fusion, confirming its necessity (Table 3, Fig. 5).
Extension: The fusion and metric learning pipeline generalizes to other generative priors (RestoreFormer, DAEFR, DifFace); improvements in identity metrics are retained across backbone swaps (Table 4, Fig. 6).
7. Core Contributions and Scientific Impact
- Modular decomposition of BFR tasks allows targeted optimization and circumvents oversimplified trade-offs in prior work (Reddem et al., 2025).
- Learning-based deformable registration achieves semantically aligned, texture-rich feature fusion without geometric compromise.
- Adaptive attention-based fusion determines optimal blending at each encoder level.
- Hard metric learning creates a meaningful relational embedding for facial restoration tasks.
- State-of-the-art performance verified through robust benchmarks and thorough ablations.
A plausible implication is that the architectural separation and learned alignment mechanisms of CodeFormer++ provide a blueprint for restoration pipelines in which the interplay of semantic structure and texture is critical to balancing perceptual quality and identity fidelity. The generalizable fusion design further suggests utility in domains beyond BFR, wherever a generative prior and input evidence must be dynamically and adaptively merged.