Denoising Vision Transformers

Updated 7 June 2026

Denoising Vision Transformers (DVT) are methods that mitigate grid artifacts in Vision Transformer outputs by modeling artifact-free features using neural fields.
The two-stage workflow combines per-image artifact removal with a lightweight predictor, significantly improving metrics like mIoU for dense tasks.
DVT techniques enable robust dense prediction and guide enhanced transformer designs by addressing translation equivariance issues inherent in traditional positional encodings.

Denoising Vision Transformers (DVT) refer to a set of methodologies and architectures for mitigating characteristic artifacts and noise in features or outputs produced by Vision Transformer (ViT) models. Originally introduced to address persistent "grid-like" artifacts stemming from the naive application of positional embeddings in ViTs, DVTs now encompass a growing body of approaches designed to enhance the fidelity of feature representations for dense prediction, low-level restoration, generative modeling, and beyond. The ubiquity of structured noise and spatial artifacts across ViT variants has established denoising as a critical step for the practical deployment of transformers in downstream dense and perceptual vision tasks (Yang et al., 2024).

1. Origin and Nature of Artifact Patterns in Vision Transformers

ViTs typically rely on additive, often learnable, positional embeddings that are introduced at the patch embedding stage to endow the model with 2D spatial inductive bias. However, it has been empirically shown that these positional encodings impart persistent, input-independent "grid" patterns to the intermediate feature maps: even for a zero input (an image with all pixels set to zero), the resulting features after LayerNorm display a fixed spatial artifact pattern. When a ViT is trained without positional embeddings ("PE-free ViT") and the rest of the setup is left unchanged, these artifacts vanish, pinpointing the embeddings as the source (Yang et al., 2024).

This artifact pattern is rigid under random crops, flips, and rescalings of the input, remaining anchored in absolute image coordinates rather than moving with the content. Such behavior indicates that the artifacts arise because patch-aligned positional encodings violate translation equivariance, a phenomenon now widely observed across many ViT backbones and pretraining strategies.
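The zero-input probe can be illustrated with a toy model. The following is a minimal NumPy sketch rather than the authors' code: `toy_embed`, the random weights, and the bias-free patch projection are assumptions, chosen so that an all-zero image produces all-zero tokens before any positional embedding is added.

```python
import numpy as np

def layer_norm(tokens, eps=1e-6):
    """Normalize each token (row) to zero mean and unit variance."""
    mean = tokens.mean(axis=-1, keepdims=True)
    var = tokens.var(axis=-1, keepdims=True)
    return (tokens - mean) / np.sqrt(var + eps)

def toy_embed(image_patches, w_patch, pos_embed=None):
    """Hypothetical patch embedding: linear projection, optional additive PE, LayerNorm."""
    tokens = image_patches @ w_patch
    if pos_embed is not None:
        tokens = tokens + pos_embed
    return layer_norm(tokens)

rng = np.random.default_rng(0)
num_patches, patch_dim, channels = 16, 48, 8
w_patch = rng.normal(size=(patch_dim, channels))
pos_embed = rng.normal(size=(num_patches, channels))
zero_image = np.zeros((num_patches, patch_dim))

with_pe = toy_embed(zero_image, w_patch, pos_embed)
without_pe = toy_embed(zero_image, w_patch, None)

# Variance across patch positions, averaged over channels.
spatial_var_with_pe = with_pe.var(axis=0).mean()
spatial_var_without_pe = without_pe.var(axis=0).mean()
print(spatial_var_with_pe, spatial_var_without_pe)
```

In this toy setting the zero image still yields features that differ from patch to patch when the positional embedding is present, while the variation across positions is exactly zero without it.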

2. Adverse Impact on Dense Prediction and Clustering Tasks

The presence of these artifacts negatively affects several classes of vision tasks, especially those requiring pixel-level feature consistency or partitioning:

Semantic segmentation and depth estimation: Grid noise disrupts the feature coherence within small objects, causing constituent patches to diverge toward distinct clusters, thus reducing mean Intersection-over-Union (mIoU) in segmentation or increasing RMSE in depth regression.
Object discovery and clustering: Conventional clustering methods (e.g., K-means, spectral clustering) on raw ViT features often segment along artifact-induced grid lines rather than semantic boundaries, degrading scores such as Adjusted Rand Index (ARI) or feature similarity.
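The loss of within-object coherence can be made concrete with synthetic data. This is an illustrative sketch under stated assumptions: four patches of one object share a single semantic vector, and the position-dependent artifact is modeled as an independent random offset per patch, which is a simplification of real grid patterns.

```python
import numpy as np

def mean_pairwise_cosine(features):
    """Average cosine similarity over all distinct pairs of rows."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = unit @ unit.T
    count = len(features)
    # Subtract the diagonal (self-similarity of 1 per row) before averaging.
    return float((similarity.sum() - count) / (count * (count - 1)))

rng = np.random.default_rng(0)
channels = 32
object_semantics = rng.normal(size=channels)

clean = np.tile(object_semantics, (4, 1))      # four patches, same object
artifact = rng.normal(size=(4, channels))      # assumed per-position offsets
noisy = clean + artifact

clean_similarity = mean_pairwise_cosine(clean)
noisy_similarity = mean_pairwise_cosine(noisy)
print(clean_similarity, noisy_similarity)
```

Patches that are identical in content become measurably dissimilar once a position-dependent term is added, which is the condition under which a clustering method may split them apart.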

A plausible implication is that naive positional encoding schemes can fundamentally limit the feature locality and geometric fidelity required for competitive performance in dense downstream tasks (Yang et al., 2024).
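For reference, the two metrics named in this section can be computed as follows. This is a minimal sketch using the usual definitions; the label maps and depth values are made-up toy inputs, and classes absent from both prediction and ground truth are skipped, which is one common convention.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over classes present in pred or target."""
    ious = []
    for cls in range(num_classes):
        pred_mask = pred == cls
        target_mask = target == cls
        union = np.logical_or(pred_mask, target_mask).sum()
        if union == 0:
            continue  # class absent from both maps
        intersection = np.logical_and(pred_mask, target_mask).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

def rmse(pred, target):
    """Root-mean-square error, as used for depth regression."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

target_labels = np.array([[0, 0, 1, 1],
                          [0, 0, 1, 1]])
pred_labels = np.array([[0, 0, 1, 1],
                        [0, 1, 1, 1]])   # one patch flipped from class 0 to 1

segmentation_miou = mean_iou(pred_labels, target_labels, num_classes=2)
depth_rmse = rmse(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))
print(segmentation_miou, depth_rmse)
```

A single mislabeled patch lowers the IoU of both classes involved, which is why patch-level feature inconsistencies show up directly in mIoU.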

3. Canonical Denoising Vision Transformer (DVT) Methodology

The canonical DVT workflow, as proposed by Yang et al. (2024), comprises two distinct stages:

Stage 1: Per-Image Denoising via Neural Fields

The raw ViT output on an image $x$ is regarded as a discrete patch map $y \in \mathbb{R}^{N \times C}$. Clean, artifact-free features are modeled as a continuous neural field $f_\theta: \mathbb{R}^2 \to \mathbb{R}^C$, typically a small MLP with spatial coordinate input (normalized to $[0,1]^2$).
A cross-view feature consistency loss enforces that $f_\theta$ yields nearly identical features for spatially transformed copies of the same image: $\mathcal{L}_{\mathrm{consistency}} = \sum_{i=1}^N \| f_\theta(u_i + \Delta_i) - f_\theta(u_i) \|_2^2$, where $u_i$ and $\Delta_i$ describe the patch coordinates and transform-induced shifts, respectively.
The raw ViT features $y$ are reconstructed using $f_\theta$, a per-image artifact field $g_\phi$ (dependent only on absolute position), and a local residual network $h_\psi$: $y_i \approx f_\theta(u_i) + g_\phi(p_i) + h_\psi(y_i)$, where $p_i$ is the absolute position of patch $i$.
A two-phase optimization first jointly fits $f_\theta$ and $g_\phi$ to minimize feature distance to $y$, then refines two of the three components under regularized loss terms while freezing the third.
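The cross-view consistency loss can be evaluated directly once a coordinate field is available. The sketch below is illustrative only: `make_field` builds a hypothetical two-layer MLP with random, untrained weights, and the shifts are arbitrary small offsets rather than shifts derived from real image transforms.

```python
import numpy as np

def make_field(rng, hidden=16, channels=4):
    """Hypothetical coordinate MLP mapping (N, 2) coordinates to (N, C) features."""
    w1 = rng.normal(size=(2, hidden))
    b1 = rng.normal(size=hidden)
    w2 = rng.normal(size=(hidden, channels))

    def field(coords):
        return np.tanh(coords @ w1 + b1) @ w2

    return field

def consistency_loss(field, coords, shifts):
    """Sum over patches of the squared L2 distance between shifted and unshifted features."""
    diff = field(coords + shifts) - field(coords)
    return float((diff ** 2).sum())

rng = np.random.default_rng(0)
field = make_field(rng)

# Patch-center coordinates on a 4x4 grid, normalized to [0, 1]^2.
side = 4
axis = (np.arange(side) + 0.5) / side
coords = np.stack(np.meshgrid(axis, axis, indexing="ij"), axis=-1).reshape(-1, 2)

loss_no_shift = consistency_loss(field, coords, np.zeros_like(coords))
loss_shifted = consistency_loss(field, coords, np.full_like(coords, 0.05))
print(loss_no_shift, loss_shifted)
```

The loss is zero when the shifts vanish and grows as the field's output varies under the shifts; in an actual fit it would be minimized with respect to the field parameters by gradient descent, which this sketch does not do.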

At convergence, $f_\theta$ yields an artifact-free semantic feature map $\tilde{y}$ for each input image.
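The idea that content moves with the view while the artifact stays fixed in absolute position can be shown on a one-dimensional toy problem. Note the substitution: this sketch replaces the neural fields with a plain linear least-squares solve and omits the residual term, so it illustrates the separation principle, not the DVT optimization itself. All sizes and offsets are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
world_len, view_len = 12, 6
offsets = [0, 1, 2, 3, 4, 5, 6]              # crop start positions

semantics = rng.normal(size=world_len)        # content, indexed in world coordinates
artifact = rng.normal(size=view_len)          # artifact, indexed in view coordinates

# Each observation is semantics[offset + i] + artifact[i].
rows, observations = [], []
for offset in offsets:
    for i in range(view_len):
        row = np.zeros(world_len + view_len)
        row[offset + i] = 1.0                 # semantic unknown at the world position
        row[world_len + i] = 1.0              # artifact unknown at the view position
        rows.append(row)
        observations.append(semantics[offset + i] + artifact[i])

design = np.array(rows)
solution, *_ = np.linalg.lstsq(design, np.array(observations), rcond=None)
estimated_artifact = solution[world_len:]

# The split is only determined up to a constant moved between the two terms,
# so compare after removing the mean.
centered_estimate = estimated_artifact - estimated_artifact.mean()
centered_truth = artifact - artifact.mean()
print(np.abs(centered_estimate - centered_truth).max())
```

With several overlapping views the position-anchored term is recovered up to a constant offset, which is the same ambiguity any additive semantic-plus-artifact decomposition has.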

Stage 2: Lightweight Clean-Feature Prediction

A small neural predictor $D_\omega$ (single Transformer block, augmented with a learned positional embedding $E_{\mathrm{pos}}$) is trained to directly map the raw ViT features $y$ to their denoised versions $\tilde{y}$ from Stage 1: $D_\omega(y) \approx \tilde{y}$.
The prediction loss combines $\ell_1$ (robustness, sparsity) and $\ell_2$ (stability) terms: $\mathcal{L}_{\mathrm{pred}} = \lambda_1 \| D_\omega(y) - \tilde{y} \|_1 + \lambda_2 \| D_\omega(y) - \tilde{y} \|_2^2$.
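A single-block predictor of this kind might look roughly like the following forward pass. This is a hedged sketch, not the published architecture: `TinyDenoiser` is a hypothetical name, the block is pre-norm with one attention head and a ReLU MLP, the weights are random and untrained, and adding the positional embedding to the input is an assumption about how the embedding is used.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

class TinyDenoiser:
    """Hypothetical one-block Transformer mapping raw patch features to denoised ones."""

    def __init__(self, num_patches, channels, rng):
        scale = channels ** -0.5
        self.pos_embed = rng.normal(scale=0.02, size=(num_patches, channels))
        self.w_q, self.w_k, self.w_v, self.w_o = (
            rng.normal(scale=scale, size=(channels, channels)) for _ in range(4)
        )
        self.w_up = rng.normal(scale=scale, size=(channels, 4 * channels))
        self.w_down = rng.normal(scale=(4 * channels) ** -0.5, size=(4 * channels, channels))

    def __call__(self, raw_features):
        x = raw_features + self.pos_embed                 # learned PE (assumed additive)
        normed = layer_norm(x)
        q, k, v = normed @ self.w_q, normed @ self.w_k, normed @ self.w_v
        attention = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        x = x + (attention @ v) @ self.w_o                # attention sublayer + residual
        x = x + np.maximum(layer_norm(x) @ self.w_up, 0.0) @ self.w_down  # MLP + residual
        return x

rng = np.random.default_rng(0)
raw_features = rng.normal(size=(16, 8))                   # 16 patches, 8 channels
denoiser = TinyDenoiser(num_patches=16, channels=8, rng=rng)
predicted = denoiser(raw_features)
print(predicted.shape)
```

The output has the same shape as the input, so such a block can sit between a frozen backbone and any downstream head without other changes.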

This two-stage paradigm decouples per-image artifact removal (offline) from a scalable, architecture-agnostic inference-time denoiser requiring negligible additional latency.
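A combined training objective for the Stage 2 predictor could be sketched as below. The exact form is an assumption: the sketch takes the two terms to be a mean absolute error and a mean squared error, and `l1_weight` and `l2_weight` are hypothetical hyperparameters not given in the text.

```python
import numpy as np

def prediction_loss(predicted, target, l1_weight=1.0, l2_weight=1.0):
    """Weighted sum of mean absolute error and mean squared error."""
    residual = predicted - target
    l1_term = np.abs(residual).mean()
    l2_term = (residual ** 2).mean()
    return float(l1_weight * l1_term + l2_weight * l2_term)

predicted = np.array([[1.0, 2.0], [3.0, 4.0]])
target = np.array([[1.0, 0.0], [3.0, 2.0]])

# Residuals are [[0, 2], [0, 2]]: mean |r| = 1.0 and mean r^2 = 2.0.
loss_value = prediction_loss(predicted, target)
print(loss_value)
```

The absolute-error term grows linearly with large residuals and is therefore less dominated by outliers, while the squared term gives smooth gradients near zero.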

4. Empirical Evaluation and Ablation

DVT has demonstrated consistent and significant improvements across diverse ViT backbones (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and dense vision datasets (Yang et al., 2024):

Semantic segmentation (VOC2012 mIoU / ADE20K mIoU): DeiT-III improves from 70.62 → 73.36 on VOC2012 and from 32.73 → 36.57 on ADE20K (+2.74 and +3.84 absolute); CLIP improves from 77.78 → 79.01 and from 40.51 → 41.10 (+1.23 and +0.59 absolute).
Depth prediction (NYU-Depth RMSE): Small but consistent reductions in error for key models.
Ablations:
- Omitting the artifact field $g_\phi$ (i.e., removing explicit per-location bias modeling) drops segmentation mIoU by ~3 points.
- The single-Transformer block denoiser outperforms lightweight 1×1 or 3×3 convolutional alternatives by over 1.5 points in mIoU.
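The absolute gains implied by the segmentation numbers above can be checked with a few lines of arithmetic. The figures are copied from this section; the dictionary layout is just an illustrative choice.

```python
# (backbone, dataset) -> (mIoU before DVT, mIoU after DVT), as quoted in this section
reported = {
    ("DeiT-III", "VOC2012"): (70.62, 73.36),
    ("DeiT-III", "ADE20K"): (32.73, 36.57),
    ("CLIP", "VOC2012"): (77.78, 79.01),
    ("CLIP", "ADE20K"): (40.51, 41.10),
}

# Absolute improvement in mIoU points, rounded to two decimals.
gains = {key: round(after - before, 2) for key, (before, after) in reported.items()}

for (backbone, dataset), gain in gains.items():
    print(f"{backbone} on {dataset}: +{gain:.2f} mIoU")
```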

DVTs are model-agnostic, requiring no retraining of the base ViT, and introduce negligible (<5%) inference overhead (Yang et al., 2024).

5. Impact on Transformer Design and Generalization

The identification of grid artifacts introduces fundamental questions about ViT positional encoding design:

Interpretability: Artifact fields are objectively measurable by feeding zero-inputs and observing persistent spatial structure; such artifacts are not random but induced by architectural choices.
Warning for naive PEs: Fixed or learnable additive PEs leave models vulnerable to input-independent grid noise, which motivates exploring relative PEs (e.g., RoPE, ALiBi) or structured attention biases to avoid this pathology.
Robustness and transfer: The DVT methodology does not rely on dataset-specific assumptions and generalizes across self-supervised, supervised, and distilled backbones, indicating a widespread nature of this artifact in transformer-based vision models.
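To illustrate why relative encodings are attractive here, the sketch below implements a one-dimensional rotary embedding (RoPE) and checks that the attention score between a query and a key depends only on their position difference. It is a simplified illustration: real ViT variants use 2D positions, and the half-split pairing of channels and the base of 10000 are conventional choices assumed here.

```python
import numpy as np

def rope_rotate(vec, position, base=10000.0):
    """Rotate channel pairs (j, j + half) by an angle proportional to the position."""
    half = vec.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    first, second = vec[:half], vec[half:]
    return np.concatenate([first * cos - second * sin,
                           first * sin + second * cos])

def rope_score(query, key, query_pos, key_pos):
    """Dot-product attention logit after applying RoPE to both vectors."""
    return float(rope_rotate(query, query_pos) @ rope_rotate(key, key_pos))

rng = np.random.default_rng(0)
query = rng.normal(size=8)
key = rng.normal(size=8)

score_a = rope_score(query, key, query_pos=3, key_pos=1)   # offset of 2
score_b = rope_score(query, key, query_pos=7, key_pos=5)   # same offset, shifted
print(score_a, score_b)
```

Because the score is unchanged when both positions shift together, no absolute-position term enters the attention logits, in contrast to additive absolute embeddings.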

A plausible implication is that current state-of-the-art dense vision benchmarks likely underestimate the true potential of ViT representations due to entangled spatial artifacts (Yang et al., 2024).

6. Related Transformer-Based Denoising Networks

The DVT paradigm has motivated a broader class of transformer-based denoising networks:

Dense Residual Denoising Transformers (e.g., DenSformer) leverage deep window-based self-attention with dense skip connections for local-global fusion, optimizing for pixel-level $\ell_1$ or $\ell_2$ error on noisy images (Yao et al., 2022).
Domain-specific DVTs: In medical imaging, convolution-free architectures (e.g., CTformer) employ token-to-token dilations and cyclic shifts, avoiding fixed local convolutions entirely while using transformer attention to integrate context (Wang et al., 2022).
Denoising in Generative Diffusion Models: Denoising Vision Transformers such as DiffiT employ time-conditioned self-attention mechanisms to recover clean data from iteratively noised signals, establishing that denoising and token affinity can be directly generative if instrumented carefully (Hatamizadeh et al., 2023).
Causal Regularization and Feature Disentanglement: Recent advances (e.g., TCD-Net) involve explicit environmental bias adjustment modules and orthogonal subspace constraints to geometrically and causally separate content from noise within ViT feature space, leading to improved robustness under distribution shift (Jiang et al., 1 Mar 2026).
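For the diffusion setting mentioned in this list, the notion of "iteratively noised signals" can be made concrete with the standard DDPM-style forward process. This sketch is generic and is not DiffiT's implementation: the linear beta schedule and its endpoints are common default assumptions, and the time-conditioned attention network that would predict the noise is replaced by the true noise.

```python
import numpy as np

def forward_noise(clean, noise, alpha_bar_t):
    """Closed-form noising: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps."""
    return np.sqrt(alpha_bar_t) * clean + np.sqrt(1.0 - alpha_bar_t) * noise

def recover_clean(noisy, predicted_noise, alpha_bar_t):
    """Invert the noising step given a noise estimate."""
    return (noisy - np.sqrt(1.0 - alpha_bar_t) * predicted_noise) / np.sqrt(alpha_bar_t)

# Assumed linear schedule over 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
clean = rng.normal(size=(4, 4))
noise = rng.normal(size=(4, 4))
step = 500

noisy = forward_noise(clean, noise, alpha_bars[step])
# A trained denoiser would predict the noise; here the true noise stands in for it.
recovered = recover_clean(noisy, noise, alpha_bars[step])
print(np.abs(recovered - clean).max())
```

With a perfect noise estimate the clean signal is recovered exactly up to floating-point error; the quality of a learned denoiser determines how close a real model gets.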

In sum, DVT remains an active research area, encompassing both architectural corrections for artifact removal and fundamental inquiries into transformer-based spatial representation learning.

7. Open Problems and Future Directions

Key challenges and prospective future work identified include:

End-to-end artifact-free pretraining: Integrating denoising constraints or artifact suppression directly into the ViT pretraining objective, eliminating the need for post-hoc denoising.
Zero-parameter denoising: Designing positional encoding mechanisms or model reparameterizations that remove grid artifacts without additional inference time computations or denoising heads.
Artifact origin analysis: Further investigation of the interactions between supervised, self-supervised, and distilled objectives in learning or suppressing spatial artifacts, with the goal of informing new transformer architectures resilient against feature-space noise.
Extension beyond images: The principle of DVTs is being extended to video, medical imaging, and coordinate regression tasks (e.g., object tracking via "in-model" diffusion in DeTrack (Zhou et al., 5 Jan 2025)), indicating broad applicability.

The cumulative evidence underscores that denoising is not merely a superficial patch, but an essential ingredient for harnessing the full discriminative and generative power of transformer-based vision models (Yang et al., 2024, Yao et al., 2022, Jiang et al., 1 Mar 2026).