
Scaling Local Self-Attention for Parameter Efficient Visual Backbones (2103.12731v3)

Published 23 Mar 2021 in cs.CV

Abstract: Self-attention has the promise of improving computer vision systems due to parameter-independent scaling of receptive fields and content-dependent interactions, in contrast to parameter-dependent scaling and content-independent interactions of convolutions. Self-attention models have recently been shown to have encouraging improvements on accuracy-parameter trade-offs compared to baseline convolutional models such as ResNet-50. In this work, we aim to develop self-attention models that can outperform not just the canonical baseline models, but even the high-performing convolutional models. We propose two extensions to self-attention that, in conjunction with a more efficient implementation of self-attention, improve the speed, memory usage, and accuracy of these models. We leverage these improvements to develop a new self-attention model family, HaloNets, which reach state-of-the-art accuracies on the parameter-limited setting of the ImageNet classification benchmark. In preliminary transfer learning experiments, we find that HaloNet models outperform much larger models and have better inference performance. On harder tasks such as object detection and instance segmentation, our simple local self-attention and convolutional hybrids show improvements over very strong baselines. These results mark another step in demonstrating the efficacy of self-attention models on settings traditionally dominated by convolutional models.

Authors (6)
  1. Ashish Vaswani (23 papers)
  2. Prajit Ramachandran (11 papers)
  3. Aravind Srinivas (20 papers)
  4. Niki Parmar (17 papers)
  5. Blake Hechtman (12 papers)
  6. Jonathon Shlens (58 papers)
Citations (366)

Summary

Scaling Local Self-Attention for Parameter Efficient Visual Backbones

The paper "Scaling Local Self-Attention for Parameter Efficient Visual Backbones" investigates the potential of self-attention mechanisms in visual systems, focusing on improving parameter efficiency relative to traditional convolutional networks. The authors introduce a new self-attention model family, HaloNets, which serves as a competitive alternative to state-of-the-art convolutional models, reaching state-of-the-art accuracy in the parameter-limited setting of ImageNet classification and strong results on other vision tasks.

Contextual Challenges and Approach

Historically, convolutions have dominated computer vision, primarily due to their local processing efficiency and ability to learn spatial features such as edges and textures. By contrast, self-attention, prominent in NLP, offers parameter-independent receptive-field scaling and content-dependent interactions, attributes that remain underexploited in vision. The work identifies that although earlier self-attention models such as SASA improved parameter efficiency, they fell short of the strongest convolutional counterparts because of scalability concerns.

This research explores scaling self-attention with the aim of surpassing leading convolutional architectures. A central challenge is efficiently implementing 2D self-attention on hardware tailored to convolutions, such as TPUs and GPUs, which lack optimized local-attention primitives. To address this, the authors use block-wise local attention, in which non-overlapping blocks of the image each attend to a slightly larger "haloed" neighborhood of pixels, striking a compromise between memory usage and computational intensity.

HaloNets Architecture and Implementation

The HaloNet architectures incorporate several innovations:

  • Blocked Local Attention: By dividing the image into non-overlapping blocks that each attend to a shared haloed neighborhood, the model avoids extracting a separate neighborhood for every pixel, conserving memory while retaining large receptive fields.
  • Non-Centered Attention: This approach relaxes strict translational equivariance in favor of better hardware performance, trading slight theoretical fidelity for practical speed and efficiency.
  • Strided Downsampling: A novel self-attentive downsampling layer reduces computational load by embedding stride functionality directly within the attention operation, improving the model's scaling to larger images.
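The blocked local attention pattern can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a single attention head, identity query/key/value projections, and zero-padded borders, and it keeps only the neighborhood-gathering logic that distinguishes halo attention from per-pixel local attention.

```python
import numpy as np

def halo_attention(x, block=4, halo=1):
    """Minimal single-head blocked local self-attention with halos.

    x: feature map of shape (H, W, C), H and W divisible by `block`.
    Each non-overlapping `block x block` query block attends to its
    (block + 2*halo)^2 haloed neighborhood. Learned q/k/v projections
    are omitted (identity) to keep the sketch short.
    """
    H, W, C = x.shape
    assert H % block == 0 and W % block == 0
    # Zero-pad so edge blocks also see a full halo.
    xp = np.pad(x, ((halo, halo), (halo, halo), (0, 0)))
    win = block + 2 * halo
    out = np.empty_like(x)
    for i in range(0, H, block):
        for j in range(0, W, block):
            q = x[i:i+block, j:j+block].reshape(-1, C)    # (b*b, C)
            kv = xp[i:i+win, j:j+win].reshape(-1, C)      # (win*win, C)
            logits = q @ kv.T / np.sqrt(C)                # (b*b, win*win)
            attn = np.exp(logits - logits.max(-1, keepdims=True))
            attn /= attn.sum(-1, keepdims=True)           # softmax
            out[i:i+block, j:j+block] = (attn @ kv).reshape(block, block, C)
    return out
```

Because every pixel in a block shares the same haloed key/value window, the memory cost grows with the number of blocks rather than the number of pixels, which is the trade-off the authors exploit on TPUs and GPUs.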

These architectural components allow HaloNets to efficiently manage the trade-offs between spatial coverage and computational overhead, reaching an 84.9% top-1 accuracy on the ImageNet benchmark — notable for a parameter-efficient model.
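The strided downsampling idea can also be sketched in isolation. Under the same simplifying assumptions as before (single head, identity projections, zero-padded borders, names of our choosing), the key point is that queries inside each block are subsampled with the stride, so attention for the skipped positions is never computed and no separate pooling layer is needed.

```python
import numpy as np

def halo_attention_downsample(x, block=4, halo=1, stride=2):
    """Sketch of attention with built-in strided downsampling.

    Instead of attending at full resolution and then pooling, the
    queries within each block are subsampled with `stride`, producing
    an output of shape (H//stride, W//stride, C) directly.
    """
    H, W, C = x.shape
    assert H % block == 0 and W % block == 0 and block % stride == 0
    xp = np.pad(x, ((halo, halo), (halo, halo), (0, 0)))
    win = block + 2 * halo
    bq = block // stride                      # queries per block side
    out = np.empty((H // stride, W // stride, C), dtype=x.dtype)
    for i in range(0, H, block):
        for j in range(0, W, block):
            # Strided query subsampling inside the block.
            q = x[i:i+block:stride, j:j+block:stride].reshape(-1, C)
            kv = xp[i:i+win, j:j+win].reshape(-1, C)
            logits = q @ kv.T / np.sqrt(C)
            attn = np.exp(logits - logits.max(-1, keepdims=True))
            attn /= attn.sum(-1, keepdims=True)
            out[i//stride:i//stride + bq,
                j//stride:j//stride + bq] = (attn @ kv).reshape(bq, bq, C)
    return out
```

With stride 2 this computes a quarter of the attention rows of the full-resolution version, which is where the claimed savings on larger images come from.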

Empirical Results and Performance

Experiments underscore that HaloNets match and, in certain settings, outperform EfficientNets, a widely adopted baseline in parameter efficiency and accuracy. HaloNets benefit significantly from increased image sizes due to underlying architectural support for larger receptive fields, maintaining performance superiority over parameter-equivalent ResNets.

In transfer learning scenarios, HaloNets trained on ImageNet-21k maintained a competitive edge over both ViT and BiT, signaling robust adaptability across data regimes. Furthermore, initial tests on COCO object detection showed that even simple HaloNet-convolution hybrid models demonstrated mAP improvements over comparably strong baselines.

Theoretical and Practical Implications

Theoretically, this work demonstrates the viability of self-attention models in vision, traditionally a convolutional stronghold. It indicates that with appropriately designed architectures, self-attention can match or exceed convolutional performance while maintaining parameter efficiency. Practically, these architectural advances could lead to self-attention models being more widely adopted in real-time and large-scale vision applications, offering better resource utilization with competitive accuracy and speed trade-offs.

In future endeavors, optimizing the speed of pure self-attention models remains imperative. Hybrid models that balance convolution and self-attention layers may exemplify the best of both worlds, facilitating efficiency while leveraging the flexible modeling capacity of attention mechanisms. Continuing research will need to address these optimizations, potentially incorporating automated architecture search to further refine models like HaloNets. As such developments unfold, the application of self-attention in vision will likely expand, challenging and potentially reshaping paradigms within the field.
