CodeFormer++: Blind Face Restoration
- CodeFormer++ is a blind face restoration framework that combines generative priors, deformable registration, and deep metric learning to overcome the quality versus identity trade-off.
- It decomposes the restoration process into dedicated identity and generative branches, using adaptive fusion and deformable alignment to merge texture and structural features.
- Deep metric learning with cosine triplet loss ensures that restored images maintain both perceptual detail and identity accuracy, outperforming previous methods.
CodeFormer++ is a blind face restoration framework that integrates generative priors, deformable registration, and deep metric learning to reconstruct high-fidelity faces from degraded inputs. It is designed to mitigate the canonical trade-off between visual quality and identity fidelity encountered in conventional generative face restoration pipelines. The system decomposes restoration into dedicated subtasks, pairs an identity-preserving branch with a generative branch, and employs adaptive fusion mechanisms under metric supervision to deliver realistic textures while preserving individual identity.
1. Architectural Design and Core Workflow
CodeFormer++ adopts a modular pipeline comprising identity restoration, generative face synthesis, deformable alignment, and texture-guided fusion:
- Identity Preservation Branch (CF-ID): Processes the degraded input to restore facial structure and identity via a discriminative network.
- Generative Prior Branch (CF-GP): Utilizes learned generative priors to synthesize high-quality, texture-rich facial imagery from the same input.
- Deformable Alignment Module (DAM): Learns a dense deformation field $\phi = R_\theta(I_{\mathrm{gp}}, I_{\mathrm{id}})$ through a registration function $R_\theta$ that semantically and spatially aligns the outputs of CF-ID and CF-GP. The generative image is warped as $\hat{I}_{\mathrm{gp}} = I_{\mathrm{gp}} \circ \phi$.
- Texture Guided Restoration Network (TGRN): A three-level U-Net encoder-decoder structure enhanced by a Texture Attention Module (TAM) and dynamic fusion blocks. Identity and texture features are adaptively fused channel-wise at each level $l$:

$$F^{(l)}_{\mathrm{fused}} = w^{(l)} \odot F^{(l)}_{\mathrm{id}} + \bigl(1 - w^{(l)}\bigr) \odot F^{(l)}_{\mathrm{tex}},$$

where $w^{(l)}$ are learned channel-wise fusion weights.
This sequence ensures that the final output merges the semantic integrity of the restored identity with the perceptual richness of the generative prior’s textures.
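The four-stage workflow above can be sketched end to end. Every function name here (`cf_id`, `cf_gp`, `dam_align`, `tgrn_fuse`) is a placeholder introduced for exposition, not the authors' API; each stub stands in for a trained network:

```python
import numpy as np

def cf_id(degraded):
    """Identity branch stub: returns an identity-restored image."""
    return degraded  # placeholder for a discriminative restoration network

def cf_gp(degraded):
    """Generative-prior branch stub: returns a texture-rich image."""
    return degraded  # placeholder for a generative-prior decoder

def dam_align(i_gp, i_id):
    """Deformable alignment stub: warp i_gp toward i_id (identity warp here)."""
    return i_gp

def tgrn_fuse(i_id, i_gp_aligned, w=0.5):
    """Texture-guided fusion stub: channel-wise convex combination."""
    return w * i_id + (1.0 - w) * i_gp_aligned

def restore(degraded):
    i_id = cf_id(degraded)            # identity-preserving estimate
    i_gp = cf_gp(degraded)            # texture-rich generative estimate
    i_gp_hat = dam_align(i_gp, i_id)  # deformable registration
    return tgrn_fuse(i_id, i_gp_hat)  # adaptive fusion -> final output

out = restore(np.random.rand(64, 64, 3))
print(out.shape)  # (64, 64, 3)
```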
2. Deformable Facial Registration Mechanism
The DAM addresses the challenge of facial misalignment between the identity and generative branches, which can lead to structural artifacts. Using a parameterized registration function $R_\theta$, DAM estimates a pixel-wise deformation field $\phi$ that warps the generative image toward the identity image. The warping is fully differentiable and optimized via a combination of a similarity loss (based on local normalized cross-correlation) and a spatial smoothness regularization that preserves realistic spatial continuity. This alignment is critical to ensure that texture transfer occurs coherently and without unintended distortion.
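The similarity and smoothness terms can be illustrated with a minimal NumPy sketch. Global rather than local NCC and nearest-neighbour rather than differentiable sampling are simplifications for brevity; none of the names below come from the paper:

```python
import numpy as np

def ncc(a, b, eps=1e-8):
    # Global normalized cross-correlation (the paper uses a local/windowed
    # variant; the global form is a simplification for illustration).
    a0, b0 = a - a.mean(), b - b.mean()
    return float((a0 * b0).sum() / (np.sqrt((a0**2).sum() * (b0**2).sum()) + eps))

def smoothness(field):
    # Spatial smoothness regularizer: mean squared finite-difference
    # gradient of the dense deformation field (H x W x 2).
    dy = np.diff(field, axis=0)
    dx = np.diff(field, axis=1)
    return float((dy**2).mean() + (dx**2).mean())

def warp(img, field):
    # Backward warping with nearest-neighbour sampling; `field` holds
    # per-pixel (dy, dx) displacements.
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(np.round(ys + field[..., 0]).astype(int), 0, h - 1)
    sx = np.clip(np.round(xs + field[..., 1]).astype(int), 0, w - 1)
    return img[sy, sx]

rng = np.random.default_rng(0)
i_id = rng.random((32, 32))
field = np.zeros((32, 32, 2))   # identity field: no displacement
i_gp_warped = warp(i_id, field)

print(ncc(i_gp_warped, i_id))   # ~1.0 for a perfect match
print(smoothness(field))        # 0.0 for the identity field
```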
3. Texture Guided Restoration and Dynamic Feature Fusion
The Texture Guided Restoration Network (TGRN) performs adaptive fusion of multi-scale identity and texture features. The encoder processes the CF-ID output (the identity-restored image) while TAM processes the aligned generative prior. The fusion at each network level uses channel-wise learned weights computed by a multi-layer perceptron fed with the global-average-pooled identity and texture feature vectors. The fused representation is then decoded to yield the final restored face.
TGRN allows dynamic adaptation, emphasizing texture details in regions where generative priors offer enhanced realism and enforcing identity preservation where feature integrity is paramount. This design mitigates the tendency of early generative methods to either oversmooth details or compromise identity.
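A minimal NumPy sketch of this channel-wise fusion at one network level follows. The random MLP weights stand in for learned parameters, and the channel count and layer sizes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
C = 8                               # channel count (assumed, for illustration)
f_id = rng.random((16, 16, C))      # identity features at one TGRN level
f_tex = rng.random((16, 16, C))     # aligned texture features at the same level

def gap(f):
    # Global average pooling -> one descriptor value per channel.
    return f.mean(axis=(0, 1))

# Tiny two-layer MLP (random weights stand in for learned ones) mapping the
# concatenated pooled descriptors to per-channel fusion weights in (0, 1).
w1 = rng.standard_normal((2 * C, C))
b1 = np.zeros(C)
w2 = rng.standard_normal((C, C))
b2 = np.zeros(C)

def fusion_weights(f_id, f_tex):
    z = np.concatenate([gap(f_id), gap(f_tex)])
    h = np.maximum(z @ w1 + b1, 0.0)             # ReLU
    return 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))  # sigmoid -> (0, 1)

w = fusion_weights(f_id, f_tex)       # shape (C,)
fused = w * f_id + (1.0 - w) * f_tex  # broadcast channel-wise over H x W

print(fused.shape)  # (16, 16, 8)
```

Because the weights are recomputed from the pooled statistics of each input pair, the blend adapts per image: texture-rich channels can dominate where the generative prior is reliable, identity channels elsewhere.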
4. Deep Metric Learning for Identity-Texture Balancing
Deep metric learning further enforces the joint optimization of visual fidelity and identity preservation. The framework generates an anchor-positive sample $I_p$ by merging the CF-ID output $I_{\mathrm{id}}$ and the aligned generative-prior textures $\hat{I}_{\mathrm{gp}}$ using a binary semantic mask $M$:

$$I_p = M \odot I_{\mathrm{id}} + (1 - M) \odot \hat{I}_{\mathrm{gp}}.$$

A cosine triplet loss is adopted with positive $I_p$, anchor the restored output $\hat{I}$, and hard negative the CF-ID output $I_{\mathrm{id}}$:

$$\mathcal{L}_{\mathrm{tri}} = \max\Bigl(0,\; \cos\bigl(f(\hat{I}), f(I_{\mathrm{id}})\bigr) - \cos\bigl(f(\hat{I}), f(I_p)\bigr) + m\Bigr),$$

where $f(\cdot)$ denotes a deep feature embedding and $m$ is a margin. This loss encourages the restored output's features to closely resemble those of $I_p$, which possesses both identity and textural fidelity, while distancing them from representations that are solely identity-driven. The overall objective additionally includes an image reconstruction loss, an adversarial loss, and an identity loss via ArcFace-style embedding supervision.
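The masked merge and the cosine triplet loss can be sketched as follows. The flattened-pixel "embedding" is a stand-in for the learned feature extractor, and the margin value is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def cos(u, v, eps=1e-8):
    # Cosine similarity between two feature vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def cosine_triplet(f_a, f_p, f_n, margin=0.3):
    # Pull the anchor toward the positive and push it away from the
    # negative in cosine similarity, up to a margin.
    return max(0.0, cos(f_a, f_n) - cos(f_a, f_p) + margin)

# Build the anchor-positive by masked merging: identity pixels where the
# binary semantic mask M is 1, aligned generative-prior texture elsewhere.
i_id = rng.random((8, 8))
i_gp = rng.random((8, 8))
M = (rng.random((8, 8)) > 0.5).astype(float)
i_p = M * i_id + (1.0 - M) * i_gp

# "Embeddings" via flattening (a learned network in practice); the anchor
# mimics a restored output close to the merged positive.
f_a = i_p.ravel() * 0.9 + rng.random(64) * 0.1
f_p = i_p.ravel()
f_n = i_id.ravel()

loss = cosine_triplet(f_a, f_p, f_n)
print(round(loss, 4))
```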
5. Experimental Validation and Comparative Analysis
CodeFormer++ is trained on images with synthetic degradations (Gaussian blur, noise, downsampling, JPEG compression) and evaluated on benchmarks (CelebA-Test, LFW-Test, WebPhoto, WIDER-Test) using PSNR, SSIM, LPIPS, FID, NIQE, and landmark distance (LMD):
Metric | CodeFormer++ | Competing Methods
---|---|---
Perceptual quality (FID, lower is better) | Best (lowest) | Higher (lossier)
Identity (LMD, lower is better) | Competitive to superior | Higher (identity distorted)
Qualitative | Natural faces, faithful identities | Artifacts, identity drift
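The synthetic training degradations listed above can be approximated with a small NumPy sketch. JPEG compression is omitted for brevity, and the kernel size, blur strength, scale factor, and noise level are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(3)

def gaussian_kernel1d(sigma, radius=3):
    # Normalized 1-D Gaussian kernel for separable blurring.
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(img, sigma):
    # Separable Gaussian blur with edge padding (2-D grayscale image).
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2
    p = np.pad(img, r, mode="edge")
    tmp = np.apply_along_axis(lambda row: np.convolve(row, k, "valid"), 1, p)
    return np.apply_along_axis(lambda col: np.convolve(col, k, "valid"), 0, tmp)

def degrade(img, sigma=1.5, scale=4, noise_std=0.02):
    # blur -> downsample -> additive noise; JPEG compression, also part of
    # the paper's pipeline, is left out of this sketch.
    x = blur(img, sigma)
    x = x[::scale, ::scale]                      # nearest downsampling
    x = x + rng.normal(0.0, noise_std, x.shape)  # sensor-like noise
    return np.clip(x, 0.0, 1.0)

hq = rng.random((64, 64))
lq = degrade(hq)
print(lq.shape)  # (16, 16)
```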
Ablation studies demonstrate that both deformable registration and triplet-based metric learning are necessary for optimal results. Under severe degradation, previous methods were observed either to oversmooth details or to distort identity, failure modes that CodeFormer++ successfully ameliorates.
6. Applications and Broader Implications
CodeFormer++ is applicable to restoration of low-quality photographs, archival materials, surveillance footage, and smartphone imaging. The ability to reliably preserve identity enables its deployment in forensic, security, and heritage domains. The system's modularity permits extension to other generative priors, transformer-based fusion strategies, and generalization to tasks beyond face restoration, such as video enhancement and VR content reconstruction.
A plausible implication is that CodeFormer++ sets a new standard for dynamic fusion in blind face restoration by demonstrating that explicit deformable alignment and metric-driven optimization are essential to overcoming the fidelity-quality trade-off. The paradigm of decomposing restoration and integrating learned priors with adaptive registration has potential relevance for other ill-posed visual restoration problems.
7. Related Work and Positioning
CodeFormer++ builds on the earlier CodeFormer framework (Zhou et al., 2022), which introduced codebook lookup transformers and controllable feature transformation for blind face restoration. Subsequent comparative studies employing CodeFormer in contexts such as webcam-based pupil diameter prediction (Shah et al., 19 Aug 2024) have affirmed its efficacy in face-centric super-resolution, though with observed trade-offs specific to downstream prediction tasks. The innovations in CodeFormer++—specifically, deformable alignment and deep metric learning—address limitations in fidelity-quality balancing that remained in prior generative models, such as PULSE and GFP-GAN.
In summary, CodeFormer++ advances a multi-stage, metric-optimized restoration framework that unifies semantic alignment and adaptive fusion, achieving high perceptual realism and robust identity preservation under challenging degradation—thereby contributing a substantial technical progression in the field of blind face restoration (Reddem et al., 6 Oct 2025).