Vision Transformers Need Registers
Introduction
The paper investigates artifacts in Vision Transformer (ViT) feature maps and proposes a solution: adding register tokens to the input sequence. Vision Transformers have demonstrated efficacy in learning visual representations, outperforming traditional Convolutional Neural Networks (CNNs) in certain visual tasks. However, both supervised and self-supervised ViT networks produce high-norm outlier tokens during inference, primarily in low-informative image regions.
Key Findings
- Artifact Identification: High-norm tokens appear mainly in low-informative background areas; the model repurposes these patches as scratch space for internal, global computations.
- Proposed Solution: Introducing additional tokens in the input sequence significantly reduces the emergence of high-norm artifacts, thus enhancing the quality of feature and attention maps.
- Generalizability: This solution consistently improves performance across supervised and self-supervised models.
Artifact Characterization
ViT networks, from supervised models such as DeiT-III to self-supervised models such as DINOv2, exhibit artifacts in the form of high-norm tokens in their feature maps; the phenomenon becomes more pronounced with model size and training duration (a sketch of how such outlier tokens can be surfaced follows the list below). These high-norm tokens:
- Emerge predominantly in the middle layers of the ViT.
- Hold significantly less local information but more global information, which is counterproductive for tasks requiring detailed spatial understanding.
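A minimal sketch of how these outlier tokens can be surfaced, assuming a PyTorch backbone that already exposes per-patch features; the `find_high_norm_tokens` helper and the norm threshold are illustrative choices, not values taken from the paper:

```python
import torch

def find_high_norm_tokens(patch_features: torch.Tensor, threshold: float = 150.0):
    """Flag patch tokens whose L2 norm exceeds a threshold.

    patch_features: (batch, num_patches, dim) last-layer patch tokens.
    threshold: illustrative cutoff; artifact tokens have norms far above
               the bulk of the distribution.
    """
    norms = patch_features.norm(dim=-1)   # (batch, num_patches)
    outlier_mask = norms > threshold      # True where high-norm tokens sit
    return norms, outlier_mask

# Example with random features standing in for a real ViT's patch tokens.
feats = torch.randn(2, 196, 768)
norms, mask = find_high_norm_tokens(feats)
print(f"{int(mask.sum())} outlier tokens out of {mask.numel()}")
```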
Methodology
The paper introduces register tokens into the input sequence. These act as placeholders for internal computations and prevent the high-norm artifacts from appearing in regular image patch tokens (a minimal sketch of the mechanism follows the list below). Both qualitative and quantitative evaluations demonstrate that:
- Attention Maps: The inclusion of registers results in smoother and more interpretable attention maps.
- Feature Maps: Principal component analysis of feature maps from models with registers shows markedly fewer artifacts.
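As referenced above, here is a minimal sketch of the register mechanism, built on a standard PyTorch transformer encoder; the `ViTWithRegisters` class, its dimensions, and the number of registers are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Toy ViT front-end: [CLS] + patch tokens + learnable register tokens.

    Registers take part in self-attention like any other token but are
    discarded at the output, so downstream heads see the usual layout.
    """
    def __init__(self, dim=768, num_patches=196, num_registers=4, depth=2, heads=12):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.num_registers = num_registers

    def forward(self, patch_tokens):                      # (B, num_patches, dim)
        b = patch_tokens.size(0)
        x = torch.cat([self.cls_token.expand(b, -1, -1), patch_tokens], dim=1)
        x = x + self.pos_embed                            # positions for [CLS] + patches only
        x = torch.cat([x, self.registers.expand(b, -1, -1)], dim=1)  # append registers
        x = self.encoder(x)
        x = x[:, : -self.num_registers]                   # drop registers at the output
        return x[:, 0], x[:, 1:]                          # [CLS] feature, patch features

model = ViTWithRegisters()
cls_feat, patch_feats = model(torch.randn(2, 196, 768))
print(cls_feat.shape, patch_feats.shape)  # (2, 768) and (2, 196, 768)
```

In this sketch the registers receive no positional embedding, participate in every attention layer, and are sliced off before the output, so any downstream head sees exactly the same token layout as a register-free model.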
Experimental Setup
The paper runs experiments across ViT models from three training paradigms:
- Supervised Learning: DeiT-III trained on ImageNet-22k.
- Text-Supervised Learning: OpenCLIP trained on a text-image-aligned corpus.
- Self-Supervised Learning: DINOv2 using ImageNet-22k as the dataset.
Performance assessments included linear probing on ImageNet classification, ADE20k segmentation, and monocular depth estimation on NYUd.
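For the ImageNet evaluation, linear probing trains only a linear classifier on top of frozen features. A minimal sketch, assuming a frozen `backbone` that maps images to a single feature vector; the feature dimension, class count, and optimizer settings are illustrative:

```python
import torch
import torch.nn as nn

def linear_probe_step(backbone, probe, optimizer, images, labels):
    """One optimization step of a linear probe on top of a frozen backbone."""
    backbone.eval()
    with torch.no_grad():                 # the backbone stays frozen
        feats = backbone(images)          # assumed to return (B, dim) features
    logits = probe(feats)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()                       # only the linear head receives gradients
    optimizer.step()
    return loss.item()

# Illustrative setup: 768-d features and 1000 ImageNet classes.
probe = nn.Linear(768, 1000)
optimizer = torch.optim.SGD(probe.parameters(), lr=0.01, momentum=0.9)
```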
Results
Quantitative Analysis
- Dense Prediction Tasks: The introduction of registers effectively removes norm outliers and improves downstream task performance.
- Object Discovery: In unsupervised object discovery with LOST, models with registers significantly outperform those without.
Comparison Across Models
For supervised and self-supervised models, the following table summarizes improvements:
| Model | Metric | Without Registers | With Registers |
|---|---|---|---|
| DeiT-III | ImageNet Top-1 (%) | 84.7 | 84.7 |
| DeiT-III | ADE20k (mIoU) | 38.9 | 39.1 |
| DeiT-III | NYUd (RMSE) | 0.511 | 0.512 |
| OpenCLIP | ImageNet Top-1 (%) | 78.2 | 78.1 |
| OpenCLIP | ADE20k (mIoU) | 26.6 | 26.7 |
| OpenCLIP | NYUd (RMSE) | 0.702 | 0.661 |
| DINOv2 | ImageNet Top-1 (%) | 84.3 | 84.8 |
| DINOv2 | ADE20k (mIoU) | 46.6 | 47.9 |
| DINOv2 | NYUd (RMSE) | 0.378 | 0.366 |

Higher is better for Top-1 and mIoU; lower is better for RMSE.
Implications and Future Work
This research demonstrates the importance of giving ViTs dedicated capacity for their internal, global computations so that the spatial and local information carried by patch tokens, essential for many visual tasks, is preserved. The introduction of register tokens offers a straightforward yet effective mechanism to improve the interpretability and performance of ViT models without altering the output structure or incurring significant computational overhead.
Future directions might include:
- Granular Analysis: Understanding the specific conditions under which high-norm artifacts form, especially in larger datasets and different training regimes.
- Advanced Register Mechanisms: Exploring dynamic registers that can adapt based on the input complexity or other heuristic criteria, potentially enhancing model flexibility.
- Broader Applications: Testing the efficacy of registers in other transformer-based architectures and diverse visual tasks.
Conclusion
The identification and mitigation of high-norm artifacts in ViT models mark a pivotal step towards enhancing the robustness and versatility of these architectures in visual tasks. The proposed introduction of register tokens not only resolves artifact issues but also enhances performance in dense prediction and object discovery tasks, making it a valuable contribution to the optimization of Vision Transformers.