Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice
The paper "Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice" explores the scalability challenges associated with Vision Transformers (ViTs) and proposes novel methodologies to mitigate these challenges through Fourier domain analysis. This research stems from the observation that ViTs, unlike Convolutional Neural Networks (CNNs), quickly encounter performance saturation as their depth increases. This is attributed to attention collapse and patch uniformity issues. ViTs' unique architecture—relying heavily on the self-attention mechanism—necessitates a deeper understanding of the inherent characteristics of attention layers, particularly in terms of their filtering effects on input signals.
Theoretical Framework
The authors provide a rigorous theoretical analysis of ViT features in the Fourier spectrum domain. They show that the self-attention mechanism acts as a low-pass filter: as depth increases, the model increasingly retains only the Direct-Current (DC) component of its feature maps and loses the higher-frequency components. This points to a fundamental limitation when scaling ViTs, since deeper architectures can suffer oversmoothing, a state in which distinct patches lose their unique characteristics and become indistinguishable.
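To make the low-pass argument concrete, the feature matrix can be split into a DC band (the token-wise mean replicated over tokens) and a high-frequency residual, and the residual's energy can be tracked as a row-stochastic attention map is applied repeatedly. The sketch below is a minimal NumPy illustration of this effect under our own simplifying assumptions (an idealized attention layer with no value projection, MLP, or residual connection); it is not the authors' code.

```python
import numpy as np

def dc_hc_split(X):
    """Split token features X (n_tokens x dim) into a DC band and a
    high-frequency residual. The DC band replicates the token-wise mean;
    the residual carries everything above the zero frequency."""
    dc = np.broadcast_to(X.mean(axis=0, keepdims=True), X.shape)
    return dc, X - dc

rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
X = rng.normal(size=(n_tokens, dim))

# A positive row-stochastic matrix stands in for a softmax attention map.
A = np.exp(rng.normal(size=(n_tokens, n_tokens)))
A /= A.sum(axis=1, keepdims=True)

for layer in range(6):
    _, hc = dc_hc_split(X)
    print(f"layer {layer}: high-frequency energy = {np.linalg.norm(hc):.4f}")
    X = A @ X  # one idealized self-attention pass (no MLP, no residual)
```

Over successive applications the high-frequency energy collapses toward zero, which is the oversmoothing behavior the paper formalizes.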
Proposed Solutions
To address the limitations of low-pass filtering in self-attention layers, two techniques are introduced: AttnScale and FeatScale.
- AttnScale: This technique decomposes the softmax attention matrix into a low-pass component, whose effect on its own is a predictable collapse toward uniformity, and a high-pass residual, which carries the diversifying information. By re-amplifying the high-pass residual with a learnable weight, AttnScale turns self-attention into an approximately all-pass filter, preserving a broader range of frequency signals and thereby maintaining patch diversity (a minimal sketch appears after this list).
- FeatScale: This technique operates directly on feature maps, re-weighting different frequency bands to amplify high-frequency signals. It uses trainable coefficients to adjust the strength of each band, allowing ViTs to maintain rich feature representations even as depth increases (a companion sketch appears further below).
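Below is a minimal PyTorch sketch of the AttnScale idea as described above, assuming it is applied to the post-softmax attention matrix. The module name, parameter shapes, and the exact way the learnable weight enters the sum are our assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class AttnScale(nn.Module):
    """Re-weight the high-pass residual of a softmax attention matrix.

    The attention matrix is split into a low-pass part (uniform averaging
    over tokens) and a high-pass residual; the residual is re-amplified by
    a learnable per-head weight so the combined filter passes more than
    the DC component. The exact parameterization may differ from the paper.
    """
    def __init__(self, num_heads: int):
        super().__init__()
        # one learnable scale per attention head, starting at zero (no change)
        self.omega = nn.Parameter(torch.zeros(num_heads, 1, 1))

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (batch, heads, tokens, tokens), rows already softmax-normalized
        n = attn.size(-1)
        low_pass = attn.new_full(attn.shape, 1.0 / n)  # uniform averaging filter
        high_pass = attn - low_pass                    # residual above DC
        return low_pass + (1.0 + self.omega) * high_pass
```

A module like this would sit in each attention layer right after the softmax, before the attention matrix is applied to the value vectors.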
Both techniques are designed to be computationally efficient and require minimal hyperparameter tuning, making them practical for application across various ViT configurations.
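In the same spirit as the AttnScale sketch above, here is a minimal sketch of a FeatScale-style module, assuming the two bands are the token-mean (DC) component and the residual above it, each with one learnable gain per channel; the parameter shapes and initial values are our assumptions.

```python
import torch
import torch.nn as nn

class FeatScale(nn.Module):
    """Re-weight frequency bands of an attention block's output features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gain_dc = nn.Parameter(torch.zeros(dim))  # per-channel DC adjustment
        self.gain_hf = nn.Parameter(torch.zeros(dim))  # per-channel high-frequency adjustment

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), e.g. the output of multi-head self-attention
        dc = x.mean(dim=1, keepdim=True)  # zero-frequency (DC) band
        hf = x - dc                        # everything above DC
        return (1.0 + self.gain_dc) * dc + (1.0 + self.gain_hf) * hf
```

Because both modules add only a small number of extra parameters per layer, they are consistent with the paper's claim of negligible parameter and compute overhead.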
Empirical Results
Applying these techniques to multiple ViT variants, including DeiT, CaiT, and Swin Transformer, demonstrates their efficacy. The paper reports up to a 1.1% performance improvement for deeper architectures, achieved with negligible additional computational overhead. These results suggest that the proposed methods can counteract attention collapse and patch uniformity without extensive hyperparameter tuning or added architectural complexity.
Implications and Future Work
This research provides a new lens through which the scalability of ViTs can be examined, highlighting the significance of addressing oversmoothing issues from a spectral perspective. The implications are twofold:
- Theoretical: Grounding attention mechanisms in Fourier analysis opens pathways for applying signal processing principles to neural network design more broadly, potentially influencing future architectures beyond vision applications.
- Practical: The proposed methods could see broader adoption for training deeper models without the performance degradation typically associated with increased depth.
Future work may explore extending these techniques to other domains where self-attention mechanisms are critical, such as NLP transformers, and continue refining the approach to spectral-domain feature scaling for enhanced model robustness across diverse datasets and tasks.