Stand-Alone Self-Attention in Vision Models (1906.05909v1)

Published 13 Jun 2019 in cs.CV

Abstract: Convolutions are a fundamental building block of modern computer vision systems. Recent approaches have argued for going beyond convolutions in order to capture long-range dependencies. These efforts focus on augmenting convolutional models with content-based interactions, such as self-attention and non-local means, to achieve gains on a number of vision tasks. The natural question that arises is whether attention can be a stand-alone primitive for vision models instead of serving as just an augmentation on top of convolutions. In developing and testing a pure self-attention vision model, we verify that self-attention can indeed be an effective stand-alone layer. A simple procedure of replacing all instances of spatial convolutions with a form of self-attention applied to ResNet model produces a fully self-attentional model that outperforms the baseline on ImageNet classification with 12% fewer FLOPS and 29% fewer parameters. On COCO object detection, a pure self-attention model matches the mAP of a baseline RetinaNet while having 39% fewer FLOPS and 34% fewer parameters. Detailed ablation studies demonstrate that self-attention is especially impactful when used in later layers. These results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox.

Authors (6)
  1. Prajit Ramachandran (11 papers)
  2. Niki Parmar (17 papers)
  3. Ashish Vaswani (23 papers)
  4. Irwan Bello (12 papers)
  5. Anselm Levskaya (8 papers)
  6. Jonathon Shlens (58 papers)
Citations (1,123)

Summary

Stand-Alone Self-Attention in Vision Models

The paper "Stand-Alone Self-Attention in Vision Models" by Ramachandran et al. explores the viability of self-attention as a primary mechanism in vision models, specifically focusing on replacing traditional convolutional neural networks (CNNs) with fully attentional architectures. The central premise is that while convolutions have been the cornerstone of computer vision tasks, they struggle to capture long-range dependencies effectively. The authors propose a model that entirely substitutes spatial convolutions with self-attention mechanisms, thus presenting a method to address this limitation.

Key Findings

The authors present substantial evidence that stand-alone self-attention can fully replace convolutional layers in vision models. Extensive experiments on prominent benchmarks such as ImageNet and COCO show that these attentional models match or surpass their convolutional counterparts in accuracy while requiring fewer parameters and FLOPs (floating-point operations).

  1. ImageNet Classification:
    • A fully attentional model derived from a ResNet-50 architecture achieves 0.5% higher accuracy than the convolutional baseline.
    • It does so with 12% fewer FLOPs and 29% fewer parameters.
    • These gains hold across depth and width variants of the ResNet family.
  2. COCO Object Detection:
    • A pure self-attention backbone in RetinaNet matches the mean Average Precision (mAP) of the convolutional baseline while using 22% fewer parameters.
    • Extending self-attention to the Feature Pyramid Network (FPN) and detection heads yields models with 34% fewer parameters and 39% fewer FLOPs, without compromising mAP.

Ablation Studies

The paper explores a range of configurations to isolate the impact of each component of the self-attention mechanism:

  • Layer Group Utilization: Convolutions proved more effective in the early stages of the network (the stem and initial layers), while stand-alone self-attention layers performed better in later stages. This hybrid arrangement yielded the best overall performance.
  • Positional Encoding: Relative positional encodings (sketched below) significantly outperformed both absolute positional encodings and models with no positional information, highlighting the importance of spatial context in vision tasks.
  • Spatial Extent: Increasing the spatial extent k of self-attention consistently improved performance, but the benefits plateaued around k = 11, indicating diminishing returns beyond a certain receptive-field size.
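
As a concrete illustration of the relative encoding the ablation favors, the sketch below builds factorized relative positional logits: each offset in the k x k window contributes a row embedding and a column embedding, each of dimension d/2, and the positional logit is the query dotted with their concatenation. The function name, table shapes, and sizes here are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of factorized relative positional logits (illustrative only):
# each offset (a-i, b-j) gets a row half and a column half, each of size d//2.
import torch

def relative_logits(q, row_emb, col_emb, k):
    # q: (B, d, HW) queries; row_emb, col_emb: (2k-1, d//2) learned tables
    d = q.shape[1]
    offsets = torch.arange(k) - k // 2                 # offsets -(k//2) .. k//2
    r_row = row_emb[offsets + k - 1]                   # (k, d//2)
    r_col = col_emb[offsets + k - 1]                   # (k, d//2)
    # r[(a,b)] = concat(r_row[a], r_col[b]) for all k*k neighborhood cells
    r = torch.cat([
        r_row[:, None, :].expand(k, k, d // 2),
        r_col[None, :, :].expand(k, k, d // 2),
    ], dim=-1).reshape(k * k, d)                       # (k*k, d)
    return torch.einsum("bdl,nd->bnl", q, r)           # (B, k*k, HW)

# These logits are added to the content logits q . k before the softmax,
# matching the "relative" row of the positional-encoding ablation.
k, d = 7, 64
logits = relative_logits(torch.randn(2, d, 32 * 32),
                         torch.randn(2 * k - 1, d // 2),
                         torch.randn(2 * k - 1, d // 2), k)  # (2, 49, 1024)
```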

Practical and Theoretical Implications

This research has both practical and theoretical implications:

  • Practical Advantages:
    • The reduction in FLOPs and parameters suggests that fully attentional networks could be more efficient for model training and inference. However, the paper also notes that current hardware optimizations favor convolutional operations, leading to slower wall-clock times for self-attention models.
    • This opens avenues for future work aimed at optimizing self-attention operations on hardware accelerators.
  • Theoretical Insights:
    • Self-attention's ability to model long-distance interactions more effectively than convolutions provides a compelling argument for its wider adoption in vision architectures.
    • The demonstrated advantage of stand-alone self-attention in capturing global dependencies hints at a paradigm shift in how architectures are designed for vision tasks.

Future Directions

Given the robust performance of fully attentional models, future research could explore more advanced and hybrid architectures that leverage both self-attention and convolutional layers at different network stages. There is also potential for:

  • Architecture Search: Optimizing architectural design specifically for self-attention components rather than adapting convolutional networks.
  • New Attention Mechanisms: Developing novel forms of self-attention that capture fine spatial detail in early layers, mimicking the edge-detector-like features that convolutions learn there.
  • Extension to Other Tasks: Applying stand-alone self-attention to other computer vision tasks like semantic segmentation, human pose estimation, and instance segmentation.

Conclusion

The paper by Ramachandran et al. provides a thorough evaluation of stand-alone self-attention mechanisms in computer vision, establishing their efficacy and potential benefits. The findings encourage the vision community to reconsider the foundational building blocks of their models, balancing the strengths of both attention and convolution to develop more efficient, effective, and versatile vision systems. The paper's insights could pave the way for innovative architectures that leverage the best of both worlds, potentially redefining standards in the field of computer vision.
