Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention (2304.04237v1)

Published 9 Apr 2023 in cs.CV

Abstract: Self-attention mechanism has been a key factor in the recent progress of Vision Transformer (ViT), which enables adaptive feature extraction from global contexts. However, existing self-attention methods either adopt sparse global attention or window attention to reduce the computation complexity, which may compromise the local feature learning or subject to some handcrafted designs. In contrast, local attention, which restricts the receptive field of each query to its own neighboring pixels, enjoys the benefits of both convolution and self-attention, namely local inductive bias and dynamic feature selection. Nevertheless, current local attention modules either use inefficient Im2Col function or rely on specific CUDA kernels that are hard to generalize to devices without CUDA support. In this paper, we propose a novel local attention module, Slide Attention, which leverages common convolution operations to achieve high efficiency, flexibility and generalizability. Specifically, we first re-interpret the column-based Im2Col function from a new row-based perspective and use Depthwise Convolution as an efficient substitution. On this basis, we propose a deformed shifting module based on the re-parameterization technique, which further relaxes the fixed key/value positions to deformed features in the local region. In this way, our module realizes the local attention paradigm in both efficient and flexible manner. Extensive experiments show that our slide attention module is applicable to a variety of advanced Vision Transformer models and compatible with various hardware devices, and achieves consistently improved performances on comprehensive benchmarks. Code is available at https://github.com/LeapLabTHU/Slide-Transformer.

Authors (5)
  1. Xuran Pan (14 papers)
  2. Tianzhu Ye (9 papers)
  3. Zhuofan Xia (12 papers)
  4. Shiji Song (103 papers)
  5. Gao Huang (178 papers)
Citations (36)

Summary

An In-depth Examination of "Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention"

The paper introduces the "Slide-Transformer," a novel solution for enhancing the efficiency and flexibility of Vision Transformer (ViT) models through a newly proposed local attention mechanism named Slide Attention. The authors address intrinsic challenges in current Transformer-based methodologies, particularly the computational overhead associated with conventional self-attention mechanisms.

Core Contributions

The Slide Attention module is the central contribution of this research, offering an alternative to conventional sparse global and window attention mechanisms. The proposed method combines local attention's convolution-like local inductive bias with the adaptive feature selection of self-attention, while improving computational efficiency and flexibility. Key innovations include:

  1. Reinterpretation of Im2Col Functionality: The paper reconceptualizes the Im2Col function traditionally used for local attention by adopting a row-based perspective instead of a column-based one. This is a pivotal step, as it recasts the gathering of key/value neighborhoods as a set of feature shifts analogous to standard convolution operations.
  2. Integration with Depthwise Convolution: Replacing the inefficient explicit feature shifts with depthwise convolutions whose fixed kernels act as shifts turns the traditionally costly Im2Col process into a streamlined set of standard operations. This substitution makes Slide Attention hardware-friendly while keeping it highly efficient (see the first sketch after this list).
  3. Deformed Shifting Module: A novel deformed shifting module adds learnable convolution kernels in parallel to the fixed shift kernels, increasing the model's capacity to capture diverse features without compromising efficiency. Re-parameterization techniques merge these parallel paths into a single kernel at inference time, so the added flexibility carries no extra cost (see the second sketch below).
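
To make the row-based reinterpretation concrete, the following is a minimal PyTorch sketch of single-head local attention in which each query's k x k key/value neighborhood is gathered by a depthwise convolution with fixed one-hot kernels (each kernel shifts the feature map toward one neighbor), rather than by Im2Col. The module and helper names (`LocalAttentionViaShift`, `shift_kernels`) are illustrative and not taken from the authors' repository; multi-head projection, relative position bias, and other details of the paper's module are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def shift_kernels(kernel_size: int, channels: int) -> torch.Tensor:
    """Build k*k fixed one-hot depthwise kernels; kernel i shifts the
    feature map toward the i-th position of the k x k neighborhood
    (the row-based view of Im2Col: one shifted copy per key/value slot)."""
    k = kernel_size
    kernels = torch.zeros(k * k, 1, k, k)
    for idx in range(k * k):
        kernels[idx, 0, idx // k, idx % k] = 1.0
    # tile per channel so one grouped conv shifts every channel
    return kernels.repeat(channels, 1, 1, 1)  # (channels * k*k, 1, k, k)


class LocalAttentionViaShift(nn.Module):
    """Illustrative single-head local attention: keys/values are gathered
    by fixed-shift depthwise convolutions instead of Im2Col. Not the
    official Slide Attention implementation."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.dim, self.k = dim, kernel_size
        self.scale = dim ** -0.5
        self.q = nn.Conv2d(dim, dim, 1)
        self.kv = nn.Conv2d(dim, 2 * dim, 1)
        self.register_buffer("shift_weight", shift_kernels(kernel_size, 2 * dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        k2 = self.k * self.k
        q = self.q(x) * self.scale                       # (B, C, H, W)
        kv = self.kv(x)                                  # (B, 2C, H, W)
        # one depthwise conv produces all k*k shifted copies of keys/values
        kv = F.conv2d(kv, self.shift_weight, padding=self.k // 2, groups=2 * C)
        kv = kv.view(B, 2 * C, k2, H, W)
        k_, v_ = kv[:, :C], kv[:, C:]                    # (B, C, k2, H, W)
        attn = (q.unsqueeze(2) * k_).sum(dim=1)          # (B, k2, H, W)
        attn = attn.softmax(dim=1)                       # softmax over the k*k neighbors
        out = (attn.unsqueeze(1) * v_).sum(dim=2)        # (B, C, H, W)
        return out
```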

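The deformed shifting idea can be sketched in the same spirit: during training, a fixed one-hot shift kernel runs in parallel with a small learnable depthwise kernel, and at inference the two branches are merged into a single kernel by exploiting the linearity of convolution. This is a hypothetical, simplified single-offset illustration of structural re-parameterization, not the authors' implementation; the class name `DeformedShift` and its parameters are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformedShift(nn.Module):
    """Sketch of a deformed shifting branch: fixed shift kernel + learnable
    depthwise kernel in parallel during training, merged into one depthwise
    conv at inference via re-parameterization. Illustrative only."""

    def __init__(self, channels: int, kernel_size: int = 3, shift: tuple = (1, 0)):
        super().__init__()
        k = kernel_size
        # fixed branch: one-hot kernel that moves features toward the neighbor at `shift`
        fixed = torch.zeros(channels, 1, k, k)
        fixed[:, 0, k // 2 + shift[0], k // 2 + shift[1]] = 1.0
        self.register_buffer("fixed_kernel", fixed)
        # learnable branch: small-magnitude depthwise kernel
        self.learnable_kernel = nn.Parameter(1e-3 * torch.randn(channels, 1, k, k))
        self.channels, self.pad = channels, k // 2
        self.merged = None  # filled by reparameterize(); plain attribute for simplicity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.merged is not None:  # inference: a single merged depthwise conv
            return F.conv2d(x, self.merged, padding=self.pad, groups=self.channels)
        # training: two parallel depthwise convs, summed
        out = F.conv2d(x, self.fixed_kernel, padding=self.pad, groups=self.channels)
        out = out + F.conv2d(x, self.learnable_kernel, padding=self.pad, groups=self.channels)
        return out

    @torch.no_grad()
    def reparameterize(self):
        # convolution is linear in the kernel, so the two kernels can simply be added
        self.merged = self.fixed_kernel + self.learnable_kernel.detach()
```

Because convolution is linear in its kernel, the merged branch produces the same output as the two parallel branches, so the extra training-time flexibility carries no inference cost.
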
Empirical Validation and Results

The empirical analysis demonstrates the efficacy of the Slide Attention module across several advanced ViT architectures, including adaptations of PVT, PVTv2, Swin Transformer, CSwin Transformer, and NAT. Comprehensive evaluations on standard benchmarks such as ImageNet-1K for image classification, ADE20K for semantic segmentation, and COCO for object detection substantiate the model's superior accuracy-efficiency trade-offs. When deployed on resource-constrained hardware such as the iPhone 12 and on Metal Performance Shaders backends, Slide Attention shows strong compatibility and performance, achieving up to 3.9x speedups over existing methods.

Implications and Future Directions

The advancements presented suggest significant potential for changing how self-attention models operate within the domain of computer vision. By removing the dependency on specialized CUDA kernels while maintaining competitive performance, Slide Attention lowers the barrier to applying high-capacity ViT models in diverse deployment scenarios.

Theoretically, the foundational improvements in local receptive field modeling may pave the way for further research into adaptable and efficient attention mechanisms. These could include the exploration of hybridized models that blend convolutional inductive biases with data-driven attention patterns.

Anticipating subsequent work, it is plausible that Slide Attention could inspire novel architectural designs that exploit similar methods of re-interpreting established operations, yielding variants with even greater generalizability and efficiency. This line of inquiry could lead to breakthroughs in mobile AI applications, real-time processing systems, and scalable deployment of Transformer models.

In summary, the Slide-Transformer stands as a significant advance in utilizing local self-attention paradigms for vision tasks, offering both theoretical contributions and practical performance gains. While addressing a key challenge within the Transformer architecture for vision tasks, it solidifies a foundation for future explorations in optimizing attention mechanisms.
