CodeFormer++: Modular Blind Face Restoration

Updated 4 November 2025
  • CodeFormer++ is a modular framework that decomposes blind face restoration into identity preservation, high-quality generation, deformable alignment, and texture-identity fusion.
  • It integrates a deformable face registration module and a texture attention network to dynamically balance realistic texture details with identity consistency.
  • Metric learning with a hard anchor-positive approach ensures that the final output achieves state-of-the-art performance in both perceptual quality and identity fidelity.

CodeFormer++ is a modular framework for blind face restoration (BFR) that systematically addresses the challenge of reconstructing high-fidelity, identity-preserving faces from inputs subject to complex, unknown degradations. The framework advances beyond earlier generative-prior-based solutions by decomposing BFR into structurally distinct stages: identity-preserving restoration, high-quality generation, semantic alignment via deformable registration, and dynamic fusion through metric learning. This architectural separation enables CodeFormer++ to circumvent the conventional trade-off between visual fidelity and identity consistency.

1. Problem Decomposition and Pipeline Structure

CodeFormer++ explicitly partitions the restoration problem into four consecutive domains:

  1. Identity-preserving restoration (CF-ID): Generates output leveraging maximal information from the degraded input, prioritizing facial identity (by setting the fusion scalar $w=1$ in CodeFormer (Zhou et al., 2022)) at the expense of realistic texture.
  2. High-quality generation (CF-GP): Synthesizes images primarily from generative priors, optimizing for rich texture and natural appearance ($w=0$), but potentially compromising identity cues.
  3. Deformable alignment: Applies spatial registration to semantically align CF-GP with CF-ID, preparing them for effective fusion.
  4. Texture-identity fusion (metric learning): Integrates realistic texture details from the aligned generative output into the identity-preserving base using deep metric learning for optimal perceptual and identity balance.

This pipeline is underpinned by three key modules: a deformable face registration module (DAM), a texture-guided restoration network (TGRN) with a texture attention module (TAM), and a deep metric learning supervision regime.
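The staged data flow can be summarized in a few lines of code. The following is a minimal sketch only: `codeformer`, `dam`, and `tgrn` are hypothetical callables standing in for the pretrained CodeFormer backbone, the registration module, and the texture-guided restoration network, and the call signatures are assumptions rather than the paper's implementation.

```python
def restore(lq_image, codeformer, dam, tgrn):
    # Stages 1-2: two passes of the CodeFormer backbone with opposite
    # fidelity weights (hypothetical call signature).
    i_f = codeformer(lq_image, w=1.0)   # CF-ID: identity-preserving output
    i_g = codeformer(lq_image, w=0.0)   # CF-GP: texture-rich generative output

    # Stage 3: deformable alignment of CF-GP onto CF-ID (assumed here to
    # return the warped image I_warp directly).
    i_warp = dam(i_f, i_g)

    # Stage 4: texture-identity fusion via TGRN/TAM; at training time this
    # stage is supervised by the metric learning loss described in Section 4.
    return tgrn(i_f, i_warp)
```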

2. Deformable Face Registration Module (DAM)

Objective: DAM mitigates semantic and geometric discrepancies between CF-ID ($I_F$) and CF-GP ($I_G$), especially across structurally critical facial regions.

Mechanism:

  • Deformable registration: A trainable function $R_\theta(I_F, I_G)$ predicts a dense flow field $\phi$ for non-rigid warping:

$$R_\theta(I_F, I_G) = \phi$$

  • Warping operation: $I_G$ is spatially transformed via $\phi$ using a differentiable spatial sampler, yielding $I_{\text{warp}}$, which aligns with $I_F$ while retaining $I_G$'s texture (see the sketch below).
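A minimal sketch of such a differentiable spatial sampler, assuming the flow field stores per-pixel (x, y) displacements in pixel units; the paper's exact flow parameterization may differ.

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    # Warp `image` (B, C, H, W) with a dense flow field `flow` (B, 2, H, W)
    # holding per-pixel (x, y) displacements in pixels.
    _, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=image.device, dtype=image.dtype),
        torch.arange(w, device=image.device, dtype=image.dtype),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)   # (1, 2, H, W) pixel grid
    coords = base + flow                               # displaced sampling positions
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    x_norm = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    y_norm = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((x_norm, y_norm), dim=-1)       # (B, H, W, 2)
    return F.grid_sample(image, grid, mode="bilinear", align_corners=True)
```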

Loss functions:

  • Similarity loss ($L_{\text{sim}}$): Implements negative local normalized cross-correlation over image patches to maximize spatial correspondence.
  • Smoothness loss ($L_{\text{smooth}}$): Encourages spatial coherence in the flow:

$$L_{\text{smooth}}(\phi) = \sum_{p\in\Omega} \|\nabla \phi(p)\|^2$$

  • Total DAM loss: Weighted sum:

$$L(I_F, I_G, \phi) = L_{\text{sim}}(I_F, I_G(\phi)) + \lambda_\phi\, L_{\text{smooth}}(\phi)$$

Significance: DAM generates a texture-rich, structurally aligned prior ($I_{\text{warp}}$) for downstream fusion, addressing the spatial bias that impedes naive generative combination.
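A minimal sketch of the DAM objective under the definitions above; `sim_fn` stands in for the negative local normalized cross-correlation term, the smoothness term is averaged rather than summed (which only rescales the weight), and the value of `lambda_phi` is illustrative rather than the paper's setting.

```python
def smoothness_loss(flow):
    # L_smooth: squared spatial gradients of the flow field (B, 2, H, W),
    # averaged over pixels in this sketch.
    dx = flow[:, :, :, 1:] - flow[:, :, :, :-1]
    dy = flow[:, :, 1:, :] - flow[:, :, :-1, :]
    return (dx ** 2).mean() + (dy ** 2).mean()

def dam_loss(i_f, i_warp, flow, sim_fn, lambda_phi=0.01):
    # Total DAM objective: similarity between CF-ID and the warped CF-GP plus
    # a weighted smoothness penalty; `sim_fn` is a placeholder for the negative
    # local normalized cross-correlation, and lambda_phi is illustrative.
    return sim_fn(i_f, i_warp) + lambda_phi * smoothness_loss(flow)
```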

3. Texture-Guided Restoration Network (TGRN) and Texture Attention Module (TAM)

Architecture:

  • Inputs: Identity restoration ($I_F$) and aligned generative prior ($I_{\text{warp}}$).
  • TGRN backbone: U-Net structure processes $I_F$; multi-scale encoder features $Z^i_e$ are extracted.
  • TAM: Extracts multi-level texture features $Z^i_t$ from $I_{\text{warp}}$ via hierarchical convolutional and residual blocks. Adaptive pooling synchronizes spatial dimensions.

Fusion mechanism:

  • Global descriptors: Mean-pooled vectors $v^i_e$ (identity) and $v^i_t$ (texture) computed at each scale.
  • MLP fusion weights: Concatenated descriptors passed through an MLP yield fusion coefficients $[w^i_e, w^i_t]$.
  • Feature blending: Elementwise:

$$Z^i_m = w^i_e \odot Z^i_e + w^i_t \odot Z^i_t$$

  • Decoding: Fused features are decoded to obtain the final restoration, balancing structure and perceptual realism.
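The scale-wise blending can be sketched as a small PyTorch module. This is an illustrative implementation under assumed (B, C, H, W) feature shapes; the softmax normalization of the two fusion weights is an assumption of this sketch, not a detail confirmed by the paper.

```python
import torch
import torch.nn as nn

class TextureFusion(nn.Module):
    # Scale-wise fusion of identity features z_e and texture features z_t,
    # both (B, C, H, W) at a given encoder level.
    def __init__(self, channels):
        super().__init__()
        # MLP maps the concatenated global descriptors to two fusion weights;
        # the softmax normalization is an assumption for this sketch.
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, 2),
            nn.Softmax(dim=-1),
        )

    def forward(self, z_e, z_t):
        v_e = z_e.mean(dim=(2, 3))                    # global identity descriptor v_e
        v_t = z_t.mean(dim=(2, 3))                    # global texture descriptor v_t
        w = self.mlp(torch.cat([v_e, v_t], dim=-1))   # fusion coefficients [w_e, w_t]
        w_e = w[:, 0].view(-1, 1, 1, 1)
        w_t = w[:, 1].view(-1, 1, 1, 1)
        return w_e * z_e + w_t * z_t                  # fused features Z_m
```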

Loss regime:

  • Regression loss ($L_1$): Pixel-wise error to ground truth.
  • Adversarial loss ($L_{adv}$): GAN-based discriminator encourages realism.
  • Identity loss ($L_{id}$): ArcFace embedding distances enforce identity fidelity.
  • Metric learning loss ($L_{triplet}$): Supervises fusion at the representation level.
  • Joint objective:

$$L_{total} = \lambda_{l1} L_1 + \lambda_{adv} L_{adv} + \lambda_{id} L_{id} + L_{triplet}$$
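A direct transcription of the joint objective; the weighting coefficients below are placeholders, not the paper's reported values.

```python
def total_loss(l1_loss, adv_loss, id_loss, triplet_loss,
               lam_l1=1.0, lam_adv=0.1, lam_id=0.1):
    # L_total = lambda_l1 * L_1 + lambda_adv * L_adv + lambda_id * L_id + L_triplet
    # (the lambda values here are illustrative placeholders).
    return lam_l1 * l1_loss + lam_adv * adv_loss + lam_id * id_loss + triplet_loss
```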

Context: Classic loss-only fusion can suppress artifacts but fails to exploit texture from the prior. The adaptive fusion enabled by TAM addresses this shortcoming.

4. Deep Metric Learning Integration

Purpose: Metric learning constrains the fused output to capture both texture realism and identity preservation.

Anchor-positive construction:

  • Hard anchor-positive ($I_{AP}$): Fuses facial features (eyes, nose, mouth) from $I_F$ with context/skin from $I_{\text{warp}}$ using a binary mask $M$:

$$I_{AP} = I_F * M + I_{\text{warp}} * (1 - M)$$

  • Negative sample: $I_F$ (lacks realistic texture).
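The composite follows directly from the mask equation above. In this sketch, `mask` is assumed to be a binary tensor broadcastable to the image shape, equal to 1 over the facial-component regions and 0 elsewhere.

```python
def hard_anchor_positive(i_f, i_warp, mask):
    # I_AP: facial components (eyes, nose, mouth) come from the identity-
    # preserving output I_F; skin and context come from the warped generative
    # output I_warp. `mask` is 1 over facial components, 0 elsewhere.
    return i_f * mask + i_warp * (1.0 - mask)
```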

Triplet loss (cosine embedding):

  • Embeddings $f_p$ ($I_{AP}$), $f_a$ (network output), $f_n$ ($I_F$) extracted via pretrained VGG.
  • Loss formulation:

$$L_{triplet} = -\lambda_{triplet}\, \log \frac{e^{\cos(\theta^+)}}{e^{\cos(\theta^+)} + e^{\cos(\theta^-)}}$$

This loss encourages the final output to approach the hard anchor-positive and diverge from the identity-only baseline.
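A minimal sketch of the cosine-softmax triplet term, assuming `f_a`, `f_p`, and `f_n` are (B, D) VGG embeddings of the network output, the hard anchor-positive, and the identity-only negative; the `lam_triplet` value is illustrative.

```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(f_a, f_p, f_n, lam_triplet=1.0):
    # Pull the output embedding toward the hard anchor-positive and push it
    # away from the identity-only negative via a softmax over cosine similarities.
    cos_pos = F.cosine_similarity(f_a, f_p, dim=-1)   # cos(theta+)
    cos_neg = F.cosine_similarity(f_a, f_n, dim=-1)   # cos(theta-)
    softmax_pos = torch.exp(cos_pos) / (torch.exp(cos_pos) + torch.exp(cos_neg))
    return (-lam_triplet * torch.log(softmax_pos)).mean()
```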

Effect: This construction makes the metric learning signal meaningful in the restoration context; naively using the ground truth as the positive is ineffective because it is generally out-of-distribution with respect to the warped generative prior.

5. Component Synergy and Architectural Diagram

The CodeFormer++ pipeline is characterized by a staged, modular synergy:

  • DAM aligns and prepares features for fusion, removing geometric bias.
  • TGRN/TAM dynamically attend to and blend identity and texture at varying scales.
  • Metric learning supervises this fusion in the embedding space, optimizing for the delicate equilibrium between sharpness and identity.

Architectural flow:

  • Input LQ image → CodeFormer with $w=1$ (CF-ID) and $w=0$ (CF-GP)
  • CF-GP and CF-ID → DAM → $I_{\text{warp}}$
  • $I_F$ and $I_{\text{warp}}$ → TGRN with TAM → output (metric learning loss supervision)

6. Experimental Validation and Generalization

Benchmark datasets: FFHQ (train), CelebA-Test (synthetic), LFW-Test, WebPhoto-Test, WIDER-Test (real-world, mixed degradations).

Metrics: PSNR, SSIM, FID, NIQE, LPIPS (perceptual), LMD (identity/landmark).

Results:

  • CelebA-Test: CodeFormer++ reports FID = 38.13 (best), LPIPS = 0.341 (second-best), and LMD = 5.41 (second-best), demonstrating strong perceptual quality and near-state-of-the-art identity fidelity.
  • Real-world datasets: Outperforms all competitors in FID and matches or exceeds them on NIQE and identity scores.
  • Qualitative: Visual results exhibit sharp detail and accurate identity retention under severe degradation.

Ablation studies:

  1. DAM alone enhances LMD but introduces artifacts.
  2. TGRN with classic loss removes artifacts, but texture is insufficient.
  3. Naive metric learning with the ground truth as the positive fails.
  4. Hard anchor-positive construction yields truly balanced fusion, confirming its necessity (Table 3, Fig. 5).

Extension: The fusion and metric learning pipeline generalizes to other generative priors (RestoreFormer, DAEFR, DifFace); improvements in identity metrics are retained across backbone swaps (Table 4, Fig. 6).

7. Core Contributions and Scientific Impact

  1. Modular decomposition of BFR tasks allows targeted optimization and circumvents oversimplified trade-offs in prior work (Reddem et al., 6 Oct 2025).
  2. Learning-based deformable registration achieves semantically aligned, texture-rich feature fusion without geometric compromise.
  3. Adaptive attention-based fusion determines optimal blending at each encoder level.
  4. Hard metric learning creates a meaningful relational embedding for facial restoration tasks.
  5. State-of-the-art performance verified through robust benchmarks and thorough ablations.

A plausible implication is that the architectural separation and learned alignment mechanisms of CodeFormer++ provide a blueprint for restoration pipelines in which the interplay of semantic structure and texture is critical to balancing perceptual quality against application-specific identity requirements. Its generalizable fusion design further suggests utility in domains beyond BFR, wherever a prior and observed evidence must be dynamically and adaptively merged.
