- The paper demonstrates that zero-ablation gives a misleading picture of how much DINO models depend on register content, because zero vectors introduce a destructive distributional shift.
- It compares zeroing registers against mean, noise, and shuffle replacements, and shows that these plausible replacements cause minimal task performance loss.
- The study highlights registers' role in buffering dense features from [CLS] dependence and in compressing patch geometry, with consistent effects across scales.
Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers
Introduction
The paper "Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers" (2604.14433) examines the methodology of token ablation in Vision Transformers (ViTs), focusing on register tokens in DINO-family models. The prevailing interpretation, based on zero-ablation (replacing activations with zero vectors), has been that register tokens are functionally indispensable for downstream vision tasks. This work systematically demonstrates that this interpretation is confounded: the observed deficits come from the out-of-distribution effects of zero vectors, not from an intrinsic dependence on register content.
Figure 1: Approach overview—ViT-S/B models, register ablation protocols, and replacement controls for probing register dependence in global and dense tasks.
Methodology
Three ViT model configurations are analyzed: DINOv2 (no registers), DINOv2+registers, and DINOv3 (registers plus Gram-anchored distillation), each evaluated at ViT-S and ViT-B scales. Zero-ablation replaces the hidden states of either [CLS] or the register tokens with zero vectors after each transformer block. It is compared against three distributionally plausible replacements:
- Mean-substitution: Layerwise dataset mean activations
- Noise-substitution: Gaussian noise matched to mean/variance
- Register-shuffling: Cross-image register permutation within batch
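The four interventions (zeroing plus the three replacements above) can be sketched in NumPy as follows. This is a minimal illustration, not the paper's code: the function name `replace_registers`, the token layout (one [CLS] token, then four registers, then patches), and the use of per-register mean and standard deviation estimated from the batch itself are all assumptions made for the example.

```python
import numpy as np

def replace_registers(tokens, reg_slice, mode, layer_mean=None, layer_std=None, rng=None):
    """Return a copy of `tokens` (batch, n_tokens, dim) with register tokens replaced.

    Hypothetical helper: `layer_mean` / `layer_std` have shape (n_reg, dim) and
    stand in for dataset statistics of the register activations at one layer.
    """
    rng = np.random.default_rng() if rng is None else rng
    if mode in ("mean", "noise") and layer_mean is None:
        raise ValueError("mean/noise modes need layer_mean")
    if mode == "noise" and layer_std is None:
        raise ValueError("noise mode needs layer_std")

    out = tokens.copy()
    regs = out[:, reg_slice, :]  # basic slicing gives a view into `out`
    if mode == "zero":
        regs[...] = 0.0
    elif mode == "mean":
        regs[...] = layer_mean  # broadcast over the batch dimension
    elif mode == "noise":
        regs[...] = layer_mean + layer_std * rng.standard_normal(regs.shape)
    elif mode == "shuffle":
        # Each image receives the registers of another image in the batch.
        # A random permutation may leave some images unchanged.
        perm = rng.permutation(tokens.shape[0])
        regs[...] = tokens[perm][:, reg_slice, :]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out

# Toy data: 8 images, 1 [CLS] + 4 registers + 16 patches, width 32 (assumed layout).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 1 + 4 + 16, 32))
reg = slice(1, 5)
mu = tokens[:, reg, :].mean(axis=0)
sigma = tokens[:, reg, :].std(axis=0)

zeroed = replace_registers(tokens, reg, "zero")
meaned = replace_registers(tokens, reg, "mean", layer_mean=mu)
noised = replace_registers(tokens, reg, "noise", layer_mean=mu, layer_std=sigma, rng=rng)
shuffled = replace_registers(tokens, reg, "shuffle", rng=rng)
```

Only the zero mode places the registers at a point the network is unlikely to have seen in training; the other three keep them inside the empirical activation distribution while removing image-specific content.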
Evaluation metrics span both global (classification, retrieval) and dense (correspondence, segmentation) tasks. All interventions are injected at every block, forcing the network to process altered token trajectories throughout.
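Applying an intervention at every block, rather than once, can be illustrated with a toy forward loop. The sketch below is an assumption-laden stand-in: `toy_block` is a made-up token-mixing function rather than a transformer block, and `run_with_intervention` and `zero_registers` are hypothetical names. In a PyTorch model the same effect would typically be obtained with forward hooks on each block.

```python
import numpy as np

REG = slice(1, 5)  # assumed layout: [CLS], 4 registers, then patches

def toy_block(x):
    """Stand-in for a transformer block: every token mixes in the token mean."""
    return x + 0.1 * x.mean(axis=1, keepdims=True)

def zero_registers(x, layer_idx):
    """Intervention: overwrite register hidden states with zeros."""
    out = x.copy()
    out[:, REG, :] = 0.0
    return out

def run_with_intervention(x, blocks, intervene):
    """Apply `intervene` to the hidden states after each block's output."""
    for layer_idx, block in enumerate(blocks):
        x = block(x)
        x = intervene(x, layer_idx)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 1 + 4 + 6, 8))
blocks = [toy_block] * 3

baseline = run_with_intervention(x0, blocks, lambda x, i: x)
ablated = run_with_intervention(x0, blocks, zero_registers)

# Patch tokens drift from the baseline because every later block
# reads the altered registers.
patch_shift = np.abs(ablated[:, 5:, :] - baseline[:, 5:, :]).max()
```

Because the replacement is re-applied after each block, downstream layers never see the original register trajectory, which is what makes the comparison between zeroing and plausible replacements a test of the whole forward pass rather than of a single layer.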
Results
Zero-Ablation Observations
Zero-ablation leads to severe performance degradation. In DINOv3, zeroing registers yields a classification drop of −36.6 pp and a segmentation drop of −30.9 pp. In DINOv2, zeroing [CLS] severely impairs all tasks and exposes patch reliance on [CLS]. With registers present, dense-task sensitivity to [CLS] zeroing vanishes; the attention artifacts are effectively absorbed by registers.
Figure 2: Task × Ablation heatmap and patch geometry compression. Zeroing registers causes large drops; plausible replacements preserve task performance.
Plausible Replacements and Distributional Shift
In contrast to zero-ablation, all three plausible replacements (mean, noise, shuffle) keep classification and segmentation within 1 pp of baseline, even though they measurably perturb internal states ($0.95$--$0.999$ per-patch cosine similarity). Only zeroing causes catastrophic divergence in attention flow (Jensen–Shannon divergence up to $0.18$ at the last layer vs. $0.001$ for mean-substitution), indicating that the observed deficits are a consequence of distributional shift rather than of lost register content.
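The two diagnostics quoted above can be computed along the following lines. This is a sketch under stated assumptions: the summary does not say which logarithm base or averaging scheme was used for the Jensen–Shannon divergence, so base 2 (which bounds the divergence in $[0, 1]$) is chosen here for illustration, and the function names are hypothetical.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing; b > 0 wherever a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def per_patch_cosine(a, b):
    """Cosine similarity between matching patch features, shape (..., dim)."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / den

# Toy attention rows over 6 tokens: a clean row and a mildly perturbed one.
clean_attn = np.array([0.40, 0.20, 0.10, 0.10, 0.10, 0.10])
perturbed_attn = np.array([0.35, 0.25, 0.10, 0.10, 0.10, 0.10])
jsd_small = js_divergence(clean_attn, perturbed_attn)
jsd_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])

rng = np.random.default_rng(0)
clean_feats = rng.standard_normal((16, 32))
noisy_feats = clean_feats + 0.1 * rng.standard_normal((16, 32))
cos = per_patch_cosine(clean_feats, noisy_feats)
```

A divergence near zero means the intervention leaves the attention distribution essentially unchanged, while larger values indicate that attention has been rerouted.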
Figure 3: PCA projection of patch features under ablation. Zero-register ablation drastically reorganizes feature space; zero-CLS ablation is buffered by registers.
Figure 4: Attention flow across layers. Register attention builds gradually, but functional dependence for classification emerges abruptly in late layers.
Qualitative and Quantitative Effects
The qualitative correspondence analysis shows that zeroing registers destroys spatial consistency and correspondence accuracy, while plausible replacements and zeroing CLS maintain correspondence integrity.
Figure 5: Qualitative patch correspondence. Zero-register ablation collapses correspondence accuracy; zero-CLS with registers is robust.
Structural Role of Registers
Registers strongly compress patch geometry: effective rank falls by 36% in DINOv2+reg and by 54% in DINOv3, both relative to DINOv2. Compression is established early, but register dependence for classification emerges only in late layers. Attention routing corroborates this structural separation: in the final layer, DINOv3 [CLS] routes part of its attention to registers.
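Effective rank has more than one definition in the literature, and this summary does not state which one the paper uses. The sketch below assumes the entropy-based variant (the exponential of the Shannon entropy of the normalized singular values) applied to mean-centered patch features; the function names and toy data are illustrative only.

```python
import numpy as np

def effective_rank(features):
    """Entropy-based effective rank of a (n_patches, dim) feature matrix."""
    centered = features - features.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so log is defined
    return float(np.exp(-np.sum(p * np.log(p))))

def relative_reduction(rank_model, rank_reference):
    """Fractional drop in effective rank relative to a reference model."""
    return 1.0 - rank_model / rank_reference

rng = np.random.default_rng(0)
# Toy stand-ins: "spread" features use all 32 dimensions, while
# "compressed" features live in a 4-dimensional subspace.
spread = rng.standard_normal((200, 32))
compressed = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 32))

er_spread = effective_rank(spread)
er_compressed = effective_rank(compressed)
reduction = relative_reduction(er_compressed, er_spread)
```

Under this definition, a reported reduction of 36% or 54% corresponds to `relative_reduction` returning 0.36 or 0.54 when the reference is the register-free DINOv2 model.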
Figure 6: Patch compression and register dependence across layers. Compression develops early; task dependence is layer-specific.
Register tokens buffer dense features from [CLS] dependence. The findings replicate at ViT-B scale, where ablation and replacement show the same pattern as at ViT-S.
Figure 7: Scale comparison of ViT-S vs. ViT-B. Ablation delta patterns consistent across scales.
Figure 8: Task performance layerwise. Classification emerges late, patch correspondence peaks mid-network.
Figure 9: CLS attention per token type. Routing to registers robust under plausible replacements.
Discussion
Zero-ablation produces outsized deficits because zero vectors occupy a distinct activation region, leading to distributional shift and cascading network disruption. The presence of register tokens reorganizes computation such that dense tasks become insensitive to [CLS] and robust to plausible content perturbation. This challenges the zero-ablation paradigm and shifts the focus to in-distribution controls (mean, noise, shuffling) as the appropriate baseline for functional assessment.
Methodologically, the work applies mechanistic interpretability principles from NLP (activation patching, resample ablation) to vision models, offering a template for causal probing in ViTs. Future research should extend these controls to adapted or fine-tuned regimes and to non-register token types.
Implications and Future Directions
The demonstrated disconnect between zero-ablation outcomes and genuine content dependence necessitates re-evaluation of functional interpretation protocols. Practically, register addition reliably buffers dense feature quality, and structural compression facilitates improved spatial task performance. Theoretically, the study motivates controlled ablation of global computation pathways and refinement of token routing analysis.
Potential avenues include probing the functional role of registers under transfer learning, adaptation, or generative settings; dissecting contributions of Gram anchoring, patch size, and positional encoding; and extending in-distribution ablation methodology to broader Transformer architectures.
Conclusion
Zero-ablation overstates register dependence in DINO-family ViTs by injecting destructive out-of-distribution vectors. Performance in frozen-feature evaluations depends only on plausible register activations, not on exact image-specific content. Registers serve a robust structural role, buffering dense features from [CLS] and compressing patch geometry, but the evaluated tasks do not require their exact content. These conclusions are validated across model scales and downstream tasks.