OCFormer: Over-Smoothing Correction Transformer

Updated 29 November 2025
  • OCFormer is a transformer-based module that explicitly mitigates over-smoothing by integrating fidelity correction and latent-prior guided attention to preserve diverse features.
  • It combines Degradation-Resistant Attention and Prior-Guided Detail Recovery to effectively recover fine details lost in deep transformer layers.
  • Empirical results show consistent gains in image restoration PSNR and in standard transformer benchmarks (classification accuracy, segmentation mIoU, language-modeling perplexity), demonstrating its practical impact on preserving high-frequency content.

The Over-Smoothing Correction Transformer (OCFormer) is a class of transformer-based modules designed to explicitly address the over-smoothing phenomenon in deep transformer networks, particularly in the context of blind image restoration and generic deep transformer architectures. Over-smoothing refers to the degradation of representation diversity, where deep layers produce nearly constant token or feature vectors, suppressing high-frequency content and fine-grained discriminative features. OCFormer modules integrate fidelity correction principles or latent-prior–guided attention architectures to maintain semantic diversity and recover fine detail, making them crucial in contexts where preservation of high-frequency or local information is essential (He et al., 22 Nov 2025, Nguyen et al., 2023).

1. Over-Smoothing in Transformer Networks

Over-smoothing arises because, as depth increases, transformer networks implicitly minimize a nonlocal functional that enforces global consistency among token or feature representations. This process, formalized as the minimization of the smoothness-promoting energy functional

$$L_0(u) = \frac{1}{2} \iint_{\Omega\times\Omega} \|u(x)-u(y)\|^2\, k(x, y)\, dx\, dy$$

drives the features $u(x)$, representing token or spatial embeddings, toward homogeneous fixed points. In traditional transformers, this phenomenon is exacerbated by repeated softmax self-attention layers, eventually causing token representations to become nearly identical and thus undermining the model's representational power in downstream tasks (Nguyen et al., 2023).

In image restoration, a related issue emerges when deep unfolding networks (DUNs) employ gradient-based steps dominated by low-frequency signals. The “proximal” module, if not carefully designed, can propagate and reinforce this smoothness, leading to the loss of high-frequency texture and detail (He et al., 22 Nov 2025).
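As a minimal illustration (not taken from either paper), the following PyTorch sketch repeatedly applies a plain softmax self-attention update to random token embeddings and tracks the mean squared pairwise distance, a discrete proxy for $L_0(u)$; the spread decays toward zero, i.e., the tokens homogenize:

```python
import torch

torch.manual_seed(0)

n_tokens, dim, depth = 16, 32, 24
u = torch.randn(n_tokens, dim)           # random token embeddings u(x)
W = torch.randn(dim, dim) / dim ** 0.5   # fixed projection used to form attention logits

def token_spread(u: torch.Tensor) -> float:
    """Mean squared pairwise distance -- a discrete proxy for L0(u)."""
    diffs = u.unsqueeze(0) - u.unsqueeze(1)
    return diffs.pow(2).sum(-1).mean().item()

for layer in range(depth):
    # Plain softmax self-attention update: each token becomes a convex
    # combination of all tokens, which strictly shrinks the spread.
    attn = torch.softmax(u @ W @ u.T / dim ** 0.5, dim=-1)
    u = attn @ u
    if layer % 6 == 0 or layer == depth - 1:
        print(f"layer {layer:2d}: spread = {token_spread(u):.4f}")
# The printed spread decays toward 0: all tokens converge to a common vector.
```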

2. Architectural Principles of OCFormer

Two principal architectural paradigms for OCFormer have been developed, corresponding to their application domains:

  • Latent-Driven U-Shaped Transformer (in UnfoldLDM): In blind image restoration, OCFormer is implemented as a multi-scale, U-shaped transformer within every unfolding stage. Each block consists of two key components:
    • Degradation-Resistant Attention (DRA): Standard self-attention modulated by degradation-aware input features, with outputs summed residually back to the features.
    • Prior-Guided Detail Recovery (PDR): Conditional affine transformations and corrections guided by a latent diffusion prior, introduced through conditional layer normalization and non-linear gating.
    • Input/Output Interface: Receives the outputs from a multi-granularity degradation-aware (MGDA) module ($\hat{x}_k, \tilde{x}_k \in \mathbb{R}^{C\times H\times W}$) concatenated along the channel axis, and a compact prior vector ($P_k^h \in \mathbb{R}^{C_p}$). Outputs a restored feature map $\mathbf{x}_k$ (He et al., 22 Nov 2025).
  • Fidelity-Regularized Attention (in generic transformers): In standard NLP/CV transformers, OCFormer instantiates a fidelity-regularized update per layer:

$$u^{(\ell)}(i) = \mathrm{SoftmaxAttention}\bigl(q^{(\ell)}(i), K^{(\ell)}, V^{(\ell)}\bigr) + \tilde{\lambda}\bigl[V^{(0)}(i) - V^{(\ell)}(i)\bigr]$$

Here, a residual correction proportional to the discrepancy between the original layer-0 embedding and the current feature is added, where $\tilde{\lambda}$ controls the strength of the anti-smoothing correction (Nguyen et al., 2023); a code sketch of this update follows below.
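Below is a minimal PyTorch sketch of this update. The module and parameter names, single-head layout, and scaling are illustrative assumptions rather than the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FidelityCorrectedAttention(nn.Module):
    """Single-head self-attention with the fidelity correction term
    lambda_tilde * (V^(0) - V^(l)).  Names and single-head layout are
    illustrative, not taken from the reference implementation."""

    def __init__(self, dim: int, lambda_tilde: float = 0.5):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.lambda_tilde = lambda_tilde

    def forward(self, x: torch.Tensor, v0: torch.Tensor) -> torch.Tensor:
        # x : (batch, tokens, dim) current features at layer l
        # v0: (batch, tokens, dim) value embeddings from layer 0, cached once
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        # Standard softmax attention output plus the anti-smoothing correction.
        return attn @ v + self.lambda_tilde * (v0 - v)
```

In a deep stack, $V^{(0)}$ is computed once from the input embeddings and cached, so the correction adds only one subtraction and one scaled addition per layer, consistent with the negligible overhead reported below.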

3. Mathematical Formulation

OCFormer in UnfoldLDM

  • Feature Extraction: Input features are mapped as

$$\mathbf{F} = \mathrm{Conv}_{3\times 3}\bigl([\hat{x}_k;\, \tilde{x}_k]\bigr)$$

  • DRA Block: Single-head attention with depthwise convolutional $W_Q, W_K, W_V$ and residual addition ensures both modeling flexibility and gradient flow:

$$\begin{aligned} \mathbf{Q} &= W_Q \mathbf{F},\quad \mathbf{K} = W_K \mathbf{F},\quad \mathbf{V} = W_V \mathbf{F} \\ \mathbf{A} &= \mathrm{Softmax}\bigl(\mathbf{Q}\mathbf{K}^\top / I\bigr) \\ \mathbf{F}' &= \mathbf{A}\,\mathbf{V} + \mathbf{F} \end{aligned}$$

  • PDR Block: Given the latent prior $P_k^h$,

$$\begin{aligned} \mathbf{F}'' &= \mathrm{Linear}_1(P^h_k) \odot \mathrm{LN}(\mathbf{F}') + \mathrm{Linear}_2(P^h_k) \\ \mathbf{F}_\mathrm{out} &= \mathbf{F}' + \mathrm{GELU}(W_G \mathbf{F}'') \odot W_H \mathbf{F}'' \end{aligned}$$

  • Final Output: After sequential DRA+PDR blocks, a final convolution maps back to the image feature space:

$$\mathbf{x}_k = \mathrm{Conv}_{3\times 3}(\mathbf{F}_\mathrm{final})$$
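The following is a compact PyTorch sketch of one OCFormer block assembled from the equations above (it also covers the prior-conditioned modulation discussed in Section 4). Class names, the use of channel-wise (Restormer-style) attention, the GroupNorm stand-in for LN, and treating $I$ as a learnable scale are assumptions for illustration, not the released implementation:

```python
import torch
import torch.nn as nn


class DRA(nn.Module):
    """Degradation-Resistant Attention: single-head attention with depthwise
    convolutional Q/K/V projections and a residual connection."""

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.k = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.v = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        # Learnable scaling factor standing in for "I" in Softmax(Q K^T / I).
        self.scale = nn.Parameter(torch.tensor(float(channels) ** 0.5))

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        q = self.q(f).flatten(2)                 # (B, C, HW)
        k = self.k(f).flatten(2)
        v = self.v(f).flatten(2)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.scale, dim=-1)  # (B, C, C)
        out = (attn @ v).view(b, c, h, w)
        return out + f                           # residual: F' = A V + F


class PDR(nn.Module):
    """Prior-Guided Detail Recovery: normalization modulated by the latent
    prior P_k^h, followed by GELU gating."""

    def __init__(self, channels: int, prior_dim: int):
        super().__init__()
        self.scale = nn.Linear(prior_dim, channels)
        self.shift = nn.Linear(prior_dim, channels)
        self.norm = nn.GroupNorm(1, channels)    # LayerNorm-style normalization
        self.w_g = nn.Conv2d(channels, channels, 1)
        self.w_h = nn.Conv2d(channels, channels, 1)

    def forward(self, f: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        s = self.scale(prior)[:, :, None, None]  # broadcast over H, W
        t = self.shift(prior)[:, :, None, None]
        f2 = s * self.norm(f) + t                # F'' = Lin1(P) ⊙ LN(F') + Lin2(P)
        return f + torch.nn.functional.gelu(self.w_g(f2)) * self.w_h(f2)


class OCFormerBlock(nn.Module):
    """One OCFormer stage: 3x3 conv on the concatenated MGDA outputs,
    DRA + PDR, then a 3x3 conv back to the feature space."""

    def __init__(self, channels: int, prior_dim: int):
        super().__init__()
        self.proj_in = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.dra = DRA(channels)
        self.pdr = PDR(channels, prior_dim)
        self.proj_out = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x_hat, x_tilde, prior):
        f = self.proj_in(torch.cat([x_hat, x_tilde], dim=1))
        return self.proj_out(self.pdr(self.dra(f), prior))
```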

OCFormer as Fidelity Regularized Transformer

  • The regularization framework introduces a correction term $\tilde{\lambda}\,(V^{(0)} - V^{(\ell)})$ into each self-attention output, directly derived from the gradient flow of the sum of a smoothness and a fidelity regularizer:

$$L(u; f) = L_0(u) + \lambda\, R(u; f)$$
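A brief sketch of how the correction term follows from this objective, assuming the standard quadratic fidelity term $R(u; f) = \frac{1}{2}\int_\Omega \|u(x) - f(x)\|^2\, dx$ (an assumption; the exact form used in the source paper may differ): the gradient flow of $L(u; f)$ and one explicit Euler step of size $\Delta t$ read

$$\frac{\partial u}{\partial t} = -\nabla_u L_0(u) + \lambda\,(f - u), \qquad u^{(\ell+1)} = u^{(\ell)} - \Delta t\, \nabla_u L_0\bigl(u^{(\ell)}\bigr) + \underbrace{\lambda\, \Delta t}_{\tilde{\lambda}}\bigl(f - u^{(\ell)}\bigr).$$

Identifying the $-\nabla_u L_0$ step with the softmax-attention update and the fidelity target $f$ with the layer-0 values $V^{(0)}$ recovers the correction term above, with $\tilde{\lambda} = \lambda\, \Delta t$.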

4. Conditioning on Latent Diffusion Priors

In UnfoldLDM, OCFormer's PDR block depends critically on latent priors produced by a denoising diffusion process:

  • Diffusion Prior Extraction: A degradation-resistant latent diffusion model (DR-LDM) infers a compact prior $P^h_k$ from the MGDA output, itself conditioned via a learned encoder $\mathrm{PI}'(\cdot)$ that analyzes the concatenated MGDA stream.
  • This prior acts as a gating and normalization signal, enabling the PDR block to reintroduce and amplify high-frequency components lost due to gradient descent or over-smoothing. It is a form of conditional layer normalization where the prior modulates both scale and bias, informed by clean high-level features (He et al., 22 Nov 2025).

5. Training Objectives and Optimization

Training of OCFormer-based architectures proceeds in two phases:

  • Phase I (Prior Inference Pre-training):
    • Reconstruction Loss: $\mathcal{L}_\mathrm{Rec} = \| \mathbf{x}_K - \mathbf{x}_\mathrm{GT} \|_1$
    • Intra-Stage Degradation-Aware Loss: $\mathcal{L}_\mathrm{ISDA} = \sum_{k=2}^{K} 2^{k-K} \| \hat{x}_k - \tilde{x}_k \|_1$
    • The total Phase I loss: $\mathcal{L}^{I}_\mathrm{Total} = \mathcal{L}_\mathrm{Rec} + \zeta_1 \mathcal{L}_\mathrm{ISDA}$
  • Phase II (Diffusion Prior Optimization):
    • Diffusion Consistency Loss: $\mathcal{L}_\mathrm{Diff} = \| \hat{P}^h_k - P^h_k \|_1$
    • The total loss: $\mathcal{L}^{II}_\mathrm{Total} = \mathcal{L}_\mathrm{Rec} + \zeta_2 \mathcal{L}_\mathrm{ISDA} + \zeta_3 \mathcal{L}_\mathrm{Diff}$
    • In practice, $\zeta_1 = \zeta_2 = \zeta_3 = 1$.

For fidelity-regularized transformers, no separate loss term beyond existing objectives is required; the correction arises from architectural modification (Nguyen et al., 2023).
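A minimal sketch of assembling the two-phase objectives above, assuming per-pixel mean $\ell_1$ losses and that the diffusion consistency term is summed over stages (both assumptions); function and variable names are illustrative, not from the released code:

```python
import torch


def phase1_loss(x_k_list, x_hat_list, x_tilde_list, x_gt, zeta1=1.0):
    """Phase I objective: reconstruction + intra-stage degradation-aware loss.
    x_k_list[-1] is the final stage output x_K; lists are indexed by stage k."""
    l_rec = (x_k_list[-1] - x_gt).abs().mean()
    K = len(x_hat_list)
    l_isda = sum(
        2.0 ** (k - K) * (x_hat_list[k - 1] - x_tilde_list[k - 1]).abs().mean()
        for k in range(2, K + 1)
    )
    return l_rec + zeta1 * l_isda


def phase2_loss(x_k_list, x_hat_list, x_tilde_list, x_gt,
                prior_pred_list, prior_target_list,
                zeta2=1.0, zeta3=1.0):
    """Phase II adds the diffusion consistency term on the latent priors."""
    base = phase1_loss(x_k_list, x_hat_list, x_tilde_list, x_gt, zeta1=zeta2)
    l_diff = sum(
        (p_hat - p).abs().mean()
        for p_hat, p in zip(prior_pred_list, prior_target_list)
    )
    return base + zeta3 * l_diff
```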

6. Empirical Performance and Ablation Results

Quantitative ablation studies demonstrate the indispensability of OCFormer components in both image restoration and generic transformer tasks:

  • Blind Image Restoration (LOL-v2):
    • Removing DRA ("w/o DRA"): PSNR drops from 23.58→23.26 dB (real) and 27.92→27.38 dB (synthetic)
    • Removing PDR ("w/o PDR"): PSNR drops further to 22.39 dB / 25.96 dB
    • Removing the latent diffusion prior ("w/o DR-LDM"): PSNR collapses to 22.08 dB / 25.27 dB
    • These findings indicate DRA recovers complementary textures, PDR (with latent prior) reinstates suppressed high-frequency details, and the prior itself is crucial for distinguishing true detail from noise (He et al., 22 Nov 2025).
  • Generic Transformer Architectures:
    • Classification (ImageNet/DeiT-tiny): Top-1 accuracy improved from 72.17% (baseline) to 73.01% with OCFormer.
    • Segmentation (ADE20K): mIoU improved from 35.72 to 37.24.
    • Language Modeling (WikiText-103): Perplexity reduced from 33.15 to 32.60 (valid).
    • Over-smoothing Metric: Average pairwise cosine similarity in deep layers is >0.9 in standard transformers, but stays at ≈0.6–0.7 in OCFormer, indicating preserved diversity (Nguyen et al., 2023).

Computational overhead is negligible; for example, the fidelity correction term adds roughly 0.005% extra FLOPs.
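The over-smoothing metric quoted above, average pairwise cosine similarity among a layer's token features, can be computed in a few lines; this is a generic sketch, not code from either paper:

```python
import torch


def avg_pairwise_cosine_similarity(tokens: torch.Tensor) -> float:
    """Average cosine similarity over all distinct token pairs in one layer.
    tokens: (n_tokens, dim) feature matrix from a given transformer layer."""
    x = torch.nn.functional.normalize(tokens, dim=-1)
    sim = x @ x.T                                  # (n, n) cosine similarities
    n = sim.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()    # exclude self-similarity
    return (off_diag / (n * (n - 1))).item()
```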

7. Practical Considerations and Compatibility

OCFormer modules can be integrated as plug-and-play enhancements. In the UnfoldLDM framework, all OCFormer blocks share weights across stages, and the design is directly compatible with existing DUN-based methods (He et al., 22 Nov 2025). For generic transformers, substituting the standard attention step with the OCFormer correction term requires only minor modification and retains full interoperability with modern frameworks (PyTorch, HuggingFace Transformers). No extra memory or significant computational penalty is incurred.

OCFormer’s effects are most prominent in deeper architectures and in tasks requiring fine-grained detail retention, such as segmentation and language modeling with long context dependencies. Training is stable under standard protocols, with the anti-smoothing hyperparameter $\tilde{\lambda}$ tuned per architecture and task (Nguyen et al., 2023).


In summary, the Over-Smoothing Correction Transformer synthesizes architectural innovations—fidelity correction via energy functional discretization or latent-prior–guided conditional normalization—to overcome the intrinsic over-smoothing bias of both deep transformer and unfolding-based networks. This enables effective recovery and preservation of high-frequency, discriminative detail in both visual and language domains (He et al., 22 Nov 2025, Nguyen et al., 2023).
