Vision Transformers Need Registers
Introduction
The paper investigates artifacts in Vision Transformer (ViT) feature maps and proposes a solution: adding register tokens to the input sequence. Vision Transformers have demonstrated efficacy in learning visual representations, outperforming traditional Convolutional Neural Networks (CNNs) in certain visual tasks. However, both supervised and self-supervised ViT networks produce high-norm outlier tokens during inference, primarily in low-informative image regions.
Key Findings
- Artifact Identification: High-norm tokens appear mainly in low-informative background areas; the model repurposes these patches as scratch space for internal, global computations.
- Proposed Solution: Introducing additional tokens in the input sequence significantly reduces the emergence of high-norm artifacts, thus enhancing the quality of feature and attention maps.
- Generalizability: This solution consistently improves performance across supervised and self-supervised models.
Artifact Characterization
ViT networks, from supervised models such as DeiT-III to self-supervised models such as DINOv2, exhibit artifacts in the form of high-norm tokens in their feature maps; the phenomenon becomes more pronounced with model size and training duration (a sketch of how such outlier tokens can be surfaced follows the list below). These high-norm tokens:
- Emerge predominantly in the middle layers of the ViT.
- Hold significantly less local information but more global information, which is counterproductive for tasks requiring detailed spatial understanding.
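A minimal sketch of how these outlier tokens can be surfaced, assuming a PyTorch backbone that already exposes per-patch features; the `find_high_norm_tokens` helper and the norm threshold are illustrative choices, not values taken from the paper:

```python
import torch

def find_high_norm_tokens(patch_features: torch.Tensor, threshold: float = 150.0):
    """Flag patch tokens whose L2 norm exceeds a threshold.

    patch_features: (batch, num_patches, dim) last-layer patch tokens.
    threshold: illustrative cutoff; artifact tokens have norms far above
               the bulk of the distribution.
    """
    norms = patch_features.norm(dim=-1)   # (batch, num_patches)
    outlier_mask = norms > threshold      # True where high-norm tokens sit
    return norms, outlier_mask

# Example with random features standing in for a real ViT's patch tokens.
feats = torch.randn(2, 196, 768)
norms, mask = find_high_norm_tokens(feats)
print(f"{int(mask.sum())} outlier tokens out of {mask.numel()}")
```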
Methodology
The paper introduces register tokens into the input sequence. These act as placeholders for internal computations and prevent the high-norm artifacts from appearing in regular image patch tokens (a minimal sketch of the mechanism follows the list below). Both qualitative and quantitative evaluations demonstrate that:
- Attention Maps: The inclusion of registers results in smoother and more interpretable attention maps.
- Feature Maps: Principal component analysis of feature maps from models with registers shows markedly fewer artifacts.
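As referenced above, here is a minimal sketch of the register mechanism, built on a standard PyTorch transformer encoder; the `ViTWithRegisters` class, its dimensions, and the number of registers are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Toy ViT front-end: [CLS] + patch tokens + learnable register tokens.

    Registers take part in self-attention like any other token but are
    discarded at the output, so downstream heads see the usual layout.
    """
    def __init__(self, dim=768, num_patches=196, num_registers=4, depth=2, heads=12):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.num_registers = num_registers

    def forward(self, patch_tokens):                      # (B, num_patches, dim)
        b = patch_tokens.size(0)
        x = torch.cat([self.cls_token.expand(b, -1, -1), patch_tokens], dim=1)
        x = x + self.pos_embed                            # positions for [CLS] + patches only
        x = torch.cat([x, self.registers.expand(b, -1, -1)], dim=1)  # append registers
        x = self.encoder(x)
        x = x[:, : -self.num_registers]                   # drop registers at the output
        return x[:, 0], x[:, 1:]                          # [CLS] feature, patch features

model = ViTWithRegisters()
cls_feat, patch_feats = model(torch.randn(2, 196, 768))
print(cls_feat.shape, patch_feats.shape)  # (2, 768) and (2, 196, 768)
```

In this sketch the registers receive no positional embedding, participate in every attention layer, and are sliced off before the output, so any downstream head sees exactly the same token layout as a register-free model.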
Experimental Setup
The paper runs experiments across ViT models from three training paradigms:
- Supervised Learning: DeiT-III trained on ImageNet-22k.
- Text-Supervised Learning: OpenCLIP trained on a text-image-aligned corpus.
- Self-Supervised Learning: DINOv2 using ImageNet-22k as the dataset.
Performance assessments included linear probing on ImageNet classification, ADE20k segmentation, and monocular depth estimation on NYUd.
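For the ImageNet evaluation, linear probing trains only a linear classifier on top of frozen features. A minimal sketch, assuming a frozen `backbone` that maps images to a single feature vector; the feature dimension, class count, and optimizer settings are illustrative:

```python
import torch
import torch.nn as nn

def linear_probe_step(backbone, probe, optimizer, images, labels):
    """One optimization step of a linear probe on top of a frozen backbone."""
    backbone.eval()
    with torch.no_grad():                 # the backbone stays frozen
        feats = backbone(images)          # assumed to return (B, dim) features
    logits = probe(feats)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()                       # only the linear head receives gradients
    optimizer.step()
    return loss.item()

# Illustrative setup: 768-d features and 1000 ImageNet classes.
probe = nn.Linear(768, 1000)
optimizer = torch.optim.SGD(probe.parameters(), lr=0.01, momentum=0.9)
```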
Results
Quantitative Analysis
- Dense Prediction Tasks: The introduction of registers effectively removes norm outliers and improves downstream task performance.
- Object Discovery: In unsupervised object discovery with LOST, models with registers significantly outperform those without.
Comparison Across Models
For supervised and self-supervised models, the following table summarizes improvements:
| Model | Metric | Without Registers | With Registers |
|---|---|---|---|
| DeiT-III | ImageNet Top-1 (%) | 84.7 | 84.7 |
| DeiT-III | ADE20k (mIoU) | 38.9 | 39.1 |
| DeiT-III | NYUd (RMSE) | 0.511 | 0.512 |
| OpenCLIP | ImageNet Top-1 (%) | 78.2 | 78.1 |
| OpenCLIP | ADE20k (mIoU) | 26.6 | 26.7 |
| OpenCLIP | NYUd (RMSE) | 0.702 | 0.661 |
| DINOv2 | ImageNet Top-1 (%) | 84.3 | 84.8 |
| DINOv2 | ADE20k (mIoU) | 46.6 | 47.9 |
| DINOv2 | NYUd (RMSE) | 0.378 | 0.366 |

Higher is better for Top-1 and mIoU; lower is better for RMSE.
Implications and Future Work
This research demonstrates the importance of giving ViTs dedicated capacity for their internal, global computations so that the spatial and local information carried by patch tokens, essential for many visual tasks, is preserved. The introduction of register tokens offers a straightforward yet effective mechanism to improve the interpretability and performance of ViT models without altering the output structure or incurring significant computational overhead.
Future directions might include:
- Granular Analysis: Understanding the specific conditions under which high-norm artifacts form, especially in larger datasets and different training regimes.
- Advanced Register Mechanisms: Exploring dynamic registers that can adapt based on the input complexity or other heuristic criteria, potentially enhancing model flexibility.
- Broader Applications: Testing the efficacy of registers in other transformer-based architectures and diverse visual tasks.
Conclusion
The identification and mitigation of high-norm artifacts in ViT models mark a pivotal step towards enhancing the robustness and versatility of these architectures in visual tasks. The proposed introduction of register tokens not only resolves artifact issues but also enhances performance in dense prediction and object discovery tasks, making it a valuable contribution to the optimization of Vision Transformers.