Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers

Published 15 Apr 2026 in cs.CV and cs.LG | (2604.14433v1)

Abstract: Zero-ablation -- replacing token activations with zero vectors -- is widely used to probe token function in vision transformers. Register zeroing in DINOv2+registers and DINOv3 produces large drops (up to $-36.6$\,pp classification, $-30.9$\,pp segmentation), suggesting registers are functionally indispensable. However, three replacement controls -- mean-substitution, noise-substitution, and cross-image register-shuffling -- preserve performance across classification, correspondence, and segmentation, remaining within ${\sim}1$\,pp of the unmodified baseline. Per-patch cosine similarity shows these replacements genuinely perturb internal representations, while zeroing causes disproportionately large perturbations, consistent with why it alone degrades tasks. We conclude that zero-ablation overstates dependence on exact register content. In the frozen-feature evaluations we test, performance depends on plausible register-like activations rather than on exact image-specific values. Registers nevertheless buffer dense features from \texttt{[CLS]} dependence and are associated with compressed patch geometry. These findings, including the replacement-control results, replicate at ViT-B scale.

Summary

  • The paper shows that zero-ablation gives a misleading picture of how much the models depend on register content, because zero vectors introduce a destructive distributional shift.
  • It compares register zeroing against mean, noise, and shuffle replacements, all of which keep task performance within about 1 pp of the unmodified baseline.
  • The study highlights registers' role in buffering dense feature extraction and compressing patch geometry, with consistent effects across scales.

Introduction

The paper "Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers" (2604.14433) examines the methodology of token ablation in Vision Transformers (ViTs), focusing on register tokens in DINO-family models. The prevailing interpretation, based on zero-ablation (replacing activations with zero vectors), has been that register tokens are functionally indispensable for downstream vision tasks. This work systematically demonstrates that this interpretation is confounded by the out-of-distribution effects of zero vectors rather than reflecting an intrinsic dependence on register content (Figure 1).

Figure 1: Approach overview—ViT-S/B models, register ablation protocols, and replacement controls for probing register dependence in global and dense tasks.

Methodology

Three ViT model configurations are analyzed: DINOv2 (no registers), DINOv2+registers, and DINOv3 (registers plus Gram-anchored distillation), each evaluated at ViT-S and ViT-B scales. Ablation is applied by zeroing either [CLS] or register token hidden states after each transformer block, and compared against three distributionally plausible replacements:

  • Mean-substitution: Layerwise dataset mean activations
  • Noise-substitution: Gaussian noise matched to mean/variance
  • Register-shuffling: Cross-image register permutation within batch
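
The zeroing baseline and the three controls above can be sketched on plain Python lists. This is a minimal illustration, not the authors' code: the function names, the nested-list layout (image, register slot, hidden dimension), and the use of batch statistics as a stand-in for dataset statistics are all assumptions made for the example.

```python
import random
from statistics import fmean, pstdev

# Assumed layout for one layer: regs[image][slot][dim].
Batch = list[list[list[float]]]


def zero_ablate(regs: Batch) -> Batch:
    """Zero-ablation: every register activation becomes a zero vector."""
    return [[[0.0] * len(vec) for vec in image] for image in regs]


def slot_stats(regs: Batch):
    """Per-slot, per-dimension mean and std over the batch.

    Assumption: the batch stands in for the dataset used to estimate statistics.
    """
    n_slots, dim = len(regs[0]), len(regs[0][0])
    means = [[fmean(img[s][d] for img in regs) for d in range(dim)]
             for s in range(n_slots)]
    stds = [[pstdev([img[s][d] for img in regs]) for d in range(dim)]
            for s in range(n_slots)]
    return means, stds


def mean_substitute(regs: Batch, means) -> Batch:
    """Mean-substitution: every image receives the same mean register per slot."""
    return [[list(means[s]) for s in range(len(image))] for image in regs]


def noise_substitute(regs: Batch, means, stds, rng: random.Random) -> Batch:
    """Noise-substitution: Gaussian samples matched to the per-slot mean and std."""
    return [[[rng.gauss(m, sd) for m, sd in zip(means[s], stds[s])]
             for s in range(len(image))] for image in regs]


def shuffle_registers(regs: Batch, rng: random.Random) -> Batch:
    """Register-shuffling: permute whole register sets across images in the batch."""
    order = list(range(len(regs)))
    rng.shuffle(order)
    return [[list(vec) for vec in regs[j]] for j in order]


# Toy batch: 3 images, 2 register slots, hidden dimension 2 (made-up values).
rng = random.Random(0)
regs = [[[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0], [7.0, 8.0]],
        [[9.0, 10.0], [11.0, 12.0]]]
means, stds = slot_stats(regs)
zeroed = zero_ablate(regs)
averaged = mean_substitute(regs, means)
noisy = noise_substitute(regs, means, stds, rng)
shuffled = shuffle_registers(regs, rng)
```

A uniformly random permutation can leave an image paired with its own registers; a cyclic shift of the batch is one simple way to guarantee that every image receives registers from a different image.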

Evaluation spans both global (classification, retrieval) and dense (correspondence, segmentation) tasks. All interventions are injected at every block, forcing the network to process altered token trajectories throughout.
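
Injecting an intervention at every block, rather than once at the output, can be pictured as a loop that overwrites the register positions after each block. The sketch below is illustrative only: `run_with_intervention`, `toy_block`, and the token layout are hypothetical names and stand-ins, and `toy_block` is a trivial mixing function rather than a real transformer block.

```python
from typing import Callable, Optional

Tokens = list[list[float]]                 # tokens[i] is one hidden-state vector
Block = Callable[[Tokens], Tokens]
Replace = Callable[[int, list[float]], list[float]]


def run_with_intervention(tokens: Tokens, blocks: list[Block],
                          register_idx: set[int],
                          replace: Optional[Replace] = None) -> Tokens:
    """Apply each block, then overwrite register positions before the next block."""
    for layer, block in enumerate(blocks):
        tokens = block(tokens)
        if replace is not None:
            tokens = [replace(layer, vec) if i in register_idx else vec
                      for i, vec in enumerate(tokens)]
    return tokens


def toy_block(tokens: Tokens) -> Tokens:
    """Hypothetical stand-in for a block: adds the mean token to every token."""
    dim = len(tokens[0])
    mean = [sum(vec[d] for vec in tokens) / len(tokens) for d in range(dim)]
    return [[x + m for x, m in zip(vec, mean)] for vec in tokens]


def zero(layer: int, vec: list[float]) -> list[float]:
    """Zero-ablation applied at a given layer."""
    return [0.0] * len(vec)


# Assumed token order: index 0 is [CLS], 1 is a patch, 2 is a register.
tokens = [[1.0], [2.0], [3.0]]
blocks = [toy_block, toy_block]
baseline = run_with_intervention(tokens, blocks, register_idx={2})
ablated = run_with_intervention(tokens, blocks, register_idx={2}, replace=zero)
```

Because the register is overwritten after the first block, the second block already mixes the altered value into the patch token, which is why the patch output differs between `baseline` and `ablated` even though only the register was touched.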

Results

Zero-Ablation Observations

Zero-ablation leads to severe performance degradation. In DINOv3, zeroing registers yields a classification drop of $-36.6$\,pp and a segmentation drop of $-30.9$\,pp. In DINOv2, zeroing [CLS] severely impairs all tasks and exposes patch reliance on [CLS]. With registers present, dense-task sensitivity to [CLS] zeroing vanishes; the attention artifacts are effectively absorbed by registers (Figure 2).

Figure 2: Task $\times$ Ablation heatmap and patch geometry compression. Zeroing registers causes large drops; plausible replacements preserve task performance.

Plausible Replacements and Distributional Shift

In contrast to zero-ablation, all three plausible replacements (mean, noise, shuffle) preserve classification and segmentation within ${\leq}1$\,pp of baseline, even though they measurably perturb internal states ($0.95$--$0.999$ per-patch cosine similarity). Only zeroing causes catastrophic divergence in attention flow (Jensen–Shannon divergence up to $0.18$ at the last layer vs. $0.001$ for mean-substitution), consistent with the observed deficits being a consequence of distributional shift (Figure 3).
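
The two diagnostics named here, per-patch cosine similarity and Jensen–Shannon divergence between attention distributions, can be sketched as follows. This is a generic illustration under stated assumptions: the function names are hypothetical, the natural logarithm is assumed (so the divergence is bounded by $\ln 2$), and the summary does not say which logarithm base or aggregation the paper uses.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between one patch feature before and after an intervention."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for a zero vector, where the cosine is undefined
    return dot / (norm_a * norm_b)


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence between two attention distributions (nats)."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy values, made up for illustration only.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
identical_attention = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint_attention = js_divergence([1.0, 0.0], [0.0, 1.0])
```

In this sketch, identical attention rows give a divergence of 0 and rows with disjoint support give the maximum of $\ln 2$, so the reported values of $0.001$ and $0.18$ sit at very different points of that range if the same base applies.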

Figure 3: PCA projection of patch features under ablation. Zero-register ablation drastically reorganizes feature space; zero-CLS ablation is buffered by registers.

Figure 4: Attention flow across layers. Register attention builds gradually, but functional dependence for classification emerges abruptly in late layers.

Qualitative and Quantitative Effects

The qualitative correspondence analysis shows that zeroing registers destroys spatial consistency and correspondence accuracy, while plausible replacements and zeroing [CLS] maintain correspondence integrity (Figure 5).

Figure 5: Qualitative patch correspondence. Zero-register ablation collapses correspondence accuracy; zero-CLS with registers is robust.

Structural Role of Registers

Registers strongly compress patch geometry: effective rank is reduced by $36\%$ in DINOv2+reg and by $54\%$ in DINOv3, relative to DINOv2. Compression is established early, but register dependence for classification emerges only in late layers. Attention routing corroborates this structural separation: in the final layer, DINOv3 [CLS] directs part of its attention to registers (Figure 6).
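
This summary does not state how effective rank is computed. As one illustrative proxy only, the sketch below uses the participation ratio of the patch-feature covariance spectrum, $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$, which can be evaluated without an eigendecomposition because $\sum_i \lambda_i = \mathrm{tr}(C)$ and $\sum_i \lambda_i^2 = \lVert C \rVert_F^2$ for a symmetric covariance matrix $C$. The function name and the choice of this measure are assumptions, not the paper's definition.

```python
def participation_ratio(features: list[list[float]]) -> float:
    """Participation ratio of the feature covariance: (tr C)^2 / ||C||_F^2.

    Assumption: used here as a stand-in for "effective rank"; the paper may
    use a different definition (for example an entropy-based one).
    """
    n, dim = len(features), len(features[0])
    mean = [sum(x[d] for x in features) / n for d in range(dim)]
    centered = [[x[d] - mean[d] for d in range(dim)] for x in features]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
           for i in range(dim)]
    trace = sum(cov[i][i] for i in range(dim))
    frob_sq = sum(v * v for row in cov for v in row)
    if frob_sq == 0.0:
        return 0.0  # all features identical: no variance to measure
    return trace * trace / frob_sq


# Toy patch features, made up for illustration only.
isotropic = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
collinear = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]
rank_isotropic = participation_ratio(isotropic)
rank_collinear = participation_ratio(collinear)
```

Features spread evenly over two directions score 2, while features lying on a single line score 1, so a lower value corresponds to more compressed patch geometry in the sense used above.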

Figure 6: Patch compression and register dependence across layers. Compression develops early; task dependence is layer-specific.

Task Performance Across Layers and Scales

Register tokens buffer dense features from [CLS] dependence. The findings replicate at ViT-B scale, with consistent ablation and replacement behavior (Figure 7).

Figure 7: Scale comparison of ViT-S vs. ViT-B. Ablation delta patterns consistent across scales.

Figure 8: Task performance layerwise. Classification emerges late, patch correspondence peaks mid-network.

Figure 9: CLS attention per token type. Routing to registers robust under plausible replacements.

Discussion

Zero-ablation produces outsized deficits because zero vectors occupy a distinct activation region, leading to distributional shift and cascading network disruption. The presence of register tokens reorganizes computation such that dense tasks become insensitive to [CLS] zeroing and robust to plausible perturbations of register content. This challenges the zero-ablation paradigm and shifts the focus to in-distribution controls (mean, noise, shuffling) as the appropriate baseline for functional assessment.

Methodologically, the work rigorously applies mechanistic interpretability principles from NLP (activation patching, resample ablation) to vision models, setting new standards for causal probing in ViTs. Future research should extend these controls to adapted or fine-tuned regimes and non-register token types.

Implications and Future Directions

The demonstrated disconnect between zero-ablation outcomes and genuine content dependence necessitates re-evaluation of functional interpretation protocols. Practically, register addition reliably buffers dense feature quality, and structural compression facilitates improved spatial task performance. Theoretically, the study motivates controlled ablation of global computation pathways and refinement of token routing analysis.

Potential avenues include probing the functional role of registers under transfer learning, adaptation, or generative settings; dissecting contributions of Gram anchoring, patch size, and positional encoding; and extending in-distribution ablation methodology to broader Transformer architectures.

Conclusion

Zero-ablation overstates register dependence in DINO-family ViTs by injecting destructive out-of-distribution vectors. In the frozen-feature evaluations tested, performance depends on plausible register-like activations rather than on exact image-specific content. Registers still serve a structural role, buffering dense features from [CLS] dependence and compressing patch geometry, but the tasks do not require their exact content. These conclusions hold across the downstream tasks tested and replicate at ViT-B scale.
