Analysis of k-NN Attention for Vision Transformers
In their paper, Wang et al. propose a novel attention mechanism named k-NN attention to address notable limitations associated with fully-connected self-attention in vision transformers. Traditional self-attention excels at modeling long-range dependencies but often sacrifices locality and is susceptible to including irrelevant information, such as noisy tokens from cluttered backgrounds and occlusions. The k-NN attention mechanism aims to bridge the properties of both Convolutional Neural Networks (CNNs) and vision transformers by incorporating local bias without convolutional operations and filtering out irrelevant image patches.
Methodology Overview
The essence of k-NN attention lies in selecting, for each query, only the top-k most similar tokens when computing the attention map. In the fast version, the full query-key dot products are obtained with a single matrix multiplication, exactly as in standard attention, and only the top-k entries of each row are kept; this avoids the expensive per-query gathering of keys and values while adding little overhead over conventional self-attention.
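To make the row-wise selection concrete, here is a minimal PyTorch sketch of this fast-version computation, assuming a single attention head with queries, keys, and values of shape (batch, tokens, dim); the function name knn_attention and the tensor sizes are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def knn_attention(q, k, v, top_k):
    """Row-wise top-k attention: each query attends only to its top_k most similar keys."""
    d = q.size(-1)
    # Full query-key dot products with one matrix multiplication, as in standard attention.
    scores = q @ k.transpose(-2, -1) / d ** 0.5        # (B, N, N)
    # Keep the top_k scores in every row; the rest become -inf so that softmax
    # assigns those (irrelevant or noisy) tokens exactly zero weight.
    topk_vals, topk_idx = scores.topk(top_k, dim=-1)   # (B, N, top_k)
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)
    attn = F.softmax(masked, dim=-1)                   # (B, N, N) with sparse rows
    return attn @ v                                    # (B, N, dim)

# Illustrative shapes: 2 images, 197 tokens (a ViT-style patch grid plus the
# class token), head dimension 64, keeping the 100 most similar tokens per query.
q = torch.randn(2, 197, 64)
k = torch.randn(2, 197, 64)
v = torch.randn(2, 197, 64)
out = knn_attention(q, k, v, top_k=100)
print(out.shape)  # torch.Size([2, 197, 64])
```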
Theoretical Justifications
The authors support the efficacy of k-NN attention through substantial theoretical analysis. Key observations from their theoretical investigations include:
- Convergence Speed: The k-NN attention mechanism converges faster during training than the fully-connected self-attention model. The authors attribute this to the reduced variance that comes from excluding less relevant tokens, which shrinks the gradient scale and allows optimization to stabilize more quickly.
- Noise Distillation: Through rigorous analysis, the authors establish that k-NN attention isolates noisy tokens more effectively, with evidence of a smaller distance between the computed token representations and their ground-truth mean; a toy simulation after this list illustrates the effect.
- Patch Selection: The ability of k-NN attention to select significant tokens for computations while discarding irrelevant ones is further validated through their analytical framework. This selective attention enhances robustness and accuracy in scenarios with cluttered or occluded image regions.
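The following toy simulation is a hand-constructed illustration, not an experiment from the paper: a handful of two-dimensional tokens cluster around a ground-truth mean while two outliers play the role of noisy background patches, and the script compares how far the attended output drifts from that mean with and without row-wise top-k selection.

```python
import torch
import torch.nn.functional as F

mean = torch.tensor([1.0, 0.0])                        # ground-truth token mean
keys = torch.tensor([
    [1.0,  0.1], [1.0, -0.1], [0.9, 0.0], [1.1, 0.0],  # relevant tokens near the mean
    [0.5,  6.0], [0.4,  7.0],                          # noisy outliers (e.g. background)
])
values = keys                        # values equal keys, as with raw token features
query = torch.tensor([2.0, 0.0])     # a query aligned with the relevant cluster

scores = keys @ query                # similarity of every token to the query, shape (6,)

def attend(scores, values, top_k=None):
    if top_k is not None:
        masked = torch.full_like(scores, float("-inf"))
        vals, idx = scores.topk(top_k)
        masked[idx] = vals           # only the top_k tokens keep their scores
        scores = masked
    return F.softmax(scores, dim=0) @ values

full_out = attend(scores, values)            # fully-connected attention
knn_out = attend(scores, values, top_k=4)    # k-NN attention with k = 4

print((full_out - mean).norm())  # visibly pulled away from the mean by the outliers
print((knn_out - mean).norm())   # stays close to the mean: outliers get zero weight
```

Because softmax never assigns exactly zero weight, the two outliers still drag the fully-connected output away from the cluster mean, whereas the top-k output ignores them entirely, which is the behavior the theory above formalizes.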
Empirical Validation
The researchers applied the k-NN attention mechanism across eleven vision transformer architectures. Performance improvements on ImageNet-1K classification ranged from 0.2% to 0.8%. Notably, the gains are most pronounced in the early phases of training, supporting the theoretical claim of accelerated convergence.
Moreover, the impact of the size of k (the number of selected tokens) was thoroughly analyzed across different architectures, informing configuration strategies that balance local and global attention.
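As a rough sketch of how such a study could be set up in code, the snippet below sweeps k as a fraction of the token count using the knn_attention function from the earlier sketch; the ratios are placeholders and not the values ablated in the paper.

```python
import torch

num_tokens = 197
q = torch.randn(2, num_tokens, 64)
k = torch.randn(2, num_tokens, 64)
v = torch.randn(2, num_tokens, 64)

for ratio in (0.25, 0.5, 0.75, 1.0):           # ratio = 1.0 recovers full attention
    top_k = max(1, int(ratio * num_tokens))
    out = knn_attention(q, k, v, top_k=top_k)  # knn_attention defined above
    print(f"k = {top_k:3d} ({ratio:.0%} of tokens), output shape {tuple(out.shape)}")
    # In a real study, one would train or fine-tune with each setting and track
    # accuracy and convergence to pick the best local/global trade-off.
```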
Practical and Theoretical Implications
The k-NN attention setup promises significant benefits in practical applications where faster training and enhanced robustness to noisy inputs are desirable. The authors demonstrate applicability to downstream tasks such as object detection and semantic segmentation, achieving notable improvements with minimal computational overhead.
The theoretical insights provide a fundamental understanding of attention dynamics, encouraging further exploration of selective attention mechanisms within AI models. As the intricacies of vision data challenge conventional methods, k-NN attention points toward dynamically optimizing which tokens each query attends to.
Future Prospects
Future work may delve into adaptive mechanisms for determining the optimal k based on the variability of image data and task complexity. Additionally, integrating k-NN attention into multi-modal architectures may unveil novel cross-domain capabilities, potentially extending its impact into areas such as video understanding and 3D perception.
Overall, Wang et al.'s paper provides a rich blend of theoretical rationale and empirical success, paving the way for substantial advancements in vision transformer methodologies. This work establishes a solid foundation upon which future transformer models can build more nuanced and computationally efficient attention layers, driving the field of computer vision forward.