- The paper proves that a multi-head self-attention layer with sufficiently many heads can express the function of any convolutional layer.
- It leverages relative positional encoding to mirror convolutional receptive fields within attention mechanisms.
- Empirical results reveal that early self-attention layers naturally adopt grid-like patterns, achieving competitive performance on visual tasks.
Introduction
The emergence of self-attention as the core component of Transformer architectures has revolutionized NLP, driven largely by the success of models such as BERT and GPT-2. These attention-based methods have traditionally been viewed as distinct from Convolutional Neural Networks (CNNs). However, recent studies suggest that the capabilities of self-attention extend beyond sequence modeling, challenging the dominance of convolution in computer vision. Notably, replacing convolutional layers with self-attention layers has achieved competitive and, in some cases, state-of-the-art results on visual tasks.
Theoretical Insights
A significant contribution of this research is the theoretical result that self-attention layers are not merely a substitute for convolutional mechanisms but can, under certain conditions, express them exactly. The authors present a constructive proof that a multi-head self-attention layer with sufficiently many heads can compute the function of any convolutional layer; in particular, K² heads suffice to express a K × K convolution. The construction hinges on relative positional encoding, which lets the attention score depend on the offset between query and key pixels, playing a role analogous to the receptive field of a convolutional filter.
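To make the construction concrete, the following minimal NumPy sketch (our illustration, not the authors' code) captures the idea behind the proof: one attention head per relative shift in the kernel, each hard-attending to exactly one neighboring pixel, with a per-head output projection applying the matching slice of the convolution kernel. The function and variable names are assumptions for exposition.

```python
import numpy as np

def conv_as_attention(X, W_kernel):
    """X: (H, W, C_in) image; W_kernel: (K, K, C_in, C_out), K odd.

    Emulates a stride-1, same-padded K x K convolution as a sum over K^2
    "heads", each of which puts all of its attention mass on a single fixed
    relative shift (the degenerate attention pattern used in the constructive
    proof) and applies its own output projection W_kernel[dy, dx].
    """
    H, W, C_in = X.shape
    K, _, _, C_out = W_kernel.shape
    pad = K // 2
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((H, W, C_out))
    for dy in range(K):
        for dx in range(K):
            # Head (dy, dx): hard attention on the pixel at this shift.
            head_value = Xp[dy:dy + H, dx:dx + W, :]
            out += head_value @ W_kernel[dy, dx]
    return out
```

For odd K this reproduces the cross-correlation that deep-learning frameworks call "convolution"; in the theorem itself, the hard attention pattern emerges from a suitably sharp relative positional encoding rather than from explicit indexing.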
Empirical Analysis
Complementing the theoretical groundwork, the authors' empirical analysis asks whether self-attention layers realize this expressive power in practice. The experiments show that early self-attention layers of an attention-only architecture learn to attend in grid-like patterns around each query pixel, much like convolutional filters. This suggests that self-attention architectures naturally rediscover the inductive bias of localized convolution. To support reproducibility, the code for these experiments has been made public.
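One mechanism behind such localized patterns is the quadratic relative positional encoding used in the paper's experiments: head h scores a key pixel at offset δ from the query as −α_h‖δ − Δ_h‖² (up to a constant), so the learned center Δ_h selects a preferred shift and α_h controls how tightly attention concentrates around it. The sketch below (illustrative names, not the released code) computes the resulting attention distribution over a window of offsets.

```python
import numpy as np

def positional_attention(center, alpha, radius=3):
    """Softmax over relative offsets of the score -alpha * ||delta - center||^2.

    center: the shift Delta_h the head prefers; alpha: sharpness. Large alpha
    collapses the distribution onto the single pixel at `center`, which is
    exactly the convolution-like behavior observed in early layers.
    """
    offsets = [(dy, dx) for dy in range(-radius, radius + 1)
                        for dx in range(-radius, radius + 1)]
    scores = np.array([-alpha * ((dy - center[0]) ** 2 + (dx - center[1]) ** 2)
                       for dy, dx in offsets])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs.reshape(2 * radius + 1, 2 * radius + 1)

# A sharp head (large alpha) attends almost entirely to offset (1, -1):
print(np.round(positional_attention(center=(1, -1), alpha=10.0), 3))
```

With one such head per offset in a K × K neighborhood, the grid-like attention maps observed in the experiments follow naturally.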
Implications on Vision Task Processing
Deepening our understanding of how self-attention operates in vision applications, the research also offers an interactive platform for probing the behavior of self-attention models. The analysis highlights a split in attention strategy: lower layers favor position-based local attention, akin to classical convolutional processing, while deeper layers exhibit content-based global attention, exploiting the full flexibility of the attention mechanism.
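A simple way to quantify this split, sketched below, is to measure each head's mean attention distance: the attention-weighted average spatial distance between a query pixel and the keys it attends to. This diagnostic is our illustration under assumed shapes, not a routine from the paper's code.

```python
import numpy as np

def mean_attention_distance(attn, height, width):
    """attn: (H*W, H*W) row-stochastic attention matrix for one head.

    Returns the average Euclidean distance (in pixels) between each query
    pixel and the pixels it attends to, weighted by attention probability.
    Small values indicate convolution-like local heads (lower layers);
    large values indicate content-driven global heads (deeper layers).
    """
    coords = np.array([(i // width, i % width)
                       for i in range(height * width)], dtype=float)
    # dists[q, k] = distance between pixel q and pixel k on the image grid
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return float((attn * dists).sum(axis=-1).mean())

# Example: a uniform head spreads attention globally, giving a large distance.
H = W = 8
uniform = np.full((H * W, H * W), 1.0 / (H * W))
print(mean_attention_distance(uniform, H, W))
```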
Conclusion
This paper marks a critical intersection in the understanding of self-attention and convolutional layers, prompting the machine learning community to reassess the distinct roles of these mechanisms in vision tasks. By proving that the expressive power of self-attention layers is at least that of convolutional layers, the work opens the door to a landscape in which self-attention could serve as a universal building block across a wide spectrum of tasks, a prospect with promising implications for the design of future models.