An Examination of the Connection Between Local Attention and Dynamic Depth-wise Convolution
The paper explores the intriguing relationship between local attention mechanisms and dynamic depth-wise convolution, particularly within the context of Vision Transformers (ViT) and their variants, such as the Local Vision Transformer. The authors reframe local attention as a channel-wise locally-connected layer and examine it through three lenses: sparse connectivity, weight sharing, and dynamic weight computation.
Theoretical Perspectives
Local attention, a core component of Local Vision Transformers, partitions the input into small local windows and computes attention independently within each window. Restricting attention in this way reduces memory and computation relative to global self-attention, allowing deep networks to reach state-of-the-art performance on visual tasks. Notably, it resembles the depth-wise convolution used in modern convolutional neural networks (CNNs), which processes each channel independently with spatially local connections.
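The following is a minimal sketch of window-based local attention, assuming a single head and square non-overlapping windows; the separate query/key/value projections and relative position bias of a real attention layer are omitted for brevity, and all names are illustrative rather than taken from any reference implementation.

```python
import torch
import torch.nn.functional as F

def local_attention(x, window_size):
    """x: (B, H, W, C) feature map; attention is computed independently
    inside each window_size x window_size window."""
    B, H, W, C = x.shape
    ws = window_size
    # Partition into non-overlapping windows: (B * num_windows, ws*ws, C).
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
    # Dot-product attention restricted to each window.
    attn = F.softmax(windows @ windows.transpose(1, 2) / C ** 0.5, dim=-1)
    out = attn @ windows  # (B * num_windows, ws*ws, C)
    # Reverse the partition back to (B, H, W, C).
    out = out.view(B, H // ws, W // ws, ws, ws, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

x = torch.randn(2, 8, 8, 32)
print(local_attention(x, window_size=4).shape)  # torch.Size([2, 8, 8, 32])
```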
Sparse Connectivity and Similarities:
Both local attention and depth-wise convolution exhibit sparse connectivity: each output position is connected only to inputs within a small spatial window, and there are no connections across channels. This sparsity reduces model complexity while preserving strong feature-learning capacity.
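As a minimal PyTorch illustration of the same connectivity pattern, a depth-wise convolution sets the number of groups equal to the channel count, so each output depends only on a small spatial window of a single channel; the 7x7 kernel here is chosen to match a typical 7x7 attention window.

```python
import torch
import torch.nn as nn

C = 32
# groups=C makes the convolution depth-wise: no cross-channel connections.
dw_conv = nn.Conv2d(C, C, kernel_size=7, padding=3, groups=C)
x = torch.randn(2, C, 56, 56)
print(dw_conv(x).shape)  # torch.Size([2, 32, 56, 56])
```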
Weight Sharing Dynamics:
The weight-sharing patterns diverge between the two methods: depth-wise convolution reuses each channel's kernel at every spatial position, whereas local attention shares its weights across channels (within each head) and computes them dynamically per position through dot products.
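The contrast can be made concrete by comparing weight shapes; the sketch below assumes a single attention head and uses illustrative sizes.

```python
import torch

C, k = 32, 7      # channels; window / kernel size
H = W = 56        # spatial resolution

# Depth-wise convolution: one k x k kernel per channel, reused at every
# spatial position (shared across space, distinct across channels).
dw_weights = torch.randn(C, 1, k, k)

# Local attention: one k*k weight vector per query position, reused for
# every channel (distinct across space, shared across channels).
attn_weights = torch.randn(H * W, k * k)

print(dw_weights.numel(), attn_weights.numel())  # 1568 vs. 153664
```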
Dynamic Weights:
In local attention, dynamic weights are obtained by softmax-normalizing dot-product similarity scores, so the weights adapt to each input instance, which improves generalization. Dynamic depth-wise convolution is studied in two variants: homogeneous, where the predicted per-channel kernels are shared across all spatial positions, and inhomogeneous, where each position receives its own kernel, bringing the operation closer to local attention; the homogeneous form is the computationally cheaper of the two.
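A homogeneous dynamic depth-wise convolution can be sketched as follows: per-channel kernels are predicted from the input instance (here via global average pooling and a linear layer, one plausible choice) and then shared across all spatial positions. The module name and predictor design are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicDepthwiseConv(nn.Module):
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        self.c, self.k = channels, kernel_size
        self.predictor = nn.Linear(channels, channels * kernel_size ** 2)

    def forward(self, x):                # x: (B, C, H, W)
        B, C, H, W = x.shape
        # Predict one k x k kernel per channel per instance.
        pooled = x.mean(dim=(2, 3))      # (B, C)
        kernels = self.predictor(pooled).view(B * C, 1, self.k, self.k)
        # Apply per-instance kernels via a grouped-convolution trick:
        # fold the batch into the channel dimension so groups=B*C.
        x = x.reshape(1, B * C, H, W)
        out = F.conv2d(x, kernels, padding=self.k // 2, groups=B * C)
        return out.view(B, C, H, W)

layer = DynamicDepthwiseConv(32)
print(layer(torch.randn(2, 32, 56, 56)).shape)  # torch.Size([2, 32, 56, 56])
```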
Empirical Evaluation
The authors conduct extensive empirical evaluations to confirm their theoretical analysis. They replace the local attention layers in the Swin Transformer, a prevalent instance of the Local Vision Transformer, with both static and dynamic depth-wise convolution layers, yielding a new architecture termed DWNet.
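The replacement can be pictured as a Swin-style residual block whose window-attention sub-layer is swapped for a (static) depth-wise convolution sandwiched between point-wise projections, mirroring the attention sub-layer's input and output projections. The block layout, normalization choices, and names below are a simplified sketch, not the paper's exact DWNet definition.

```python
import torch
import torch.nn as nn

class DWBlock(nn.Module):
    def __init__(self, dim, kernel_size=7, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(dim)
        # Point-wise / depth-wise / point-wise: the depth-wise convolution
        # performs the spatial mixing that window attention did.
        self.spatial = nn.Sequential(
            nn.Conv2d(dim, dim, 1),
            nn.Conv2d(dim, dim, kernel_size,
                      padding=kernel_size // 2, groups=dim),
            nn.Conv2d(dim, dim, 1),
        )
        self.norm2 = nn.BatchNorm2d(dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
        )

    def forward(self, x):                # x: (B, C, H, W)
        x = x + self.spatial(self.norm1(x))
        return x + self.mlp(self.norm2(x))

block = DWBlock(96)
print(block(torch.randn(2, 96, 56, 56)).shape)  # torch.Size([2, 96, 56, 56])
```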
Performance Comparisons:
Experiments on ImageNet classification, COCO object detection, and ADE20K semantic segmentation show that the depth-wise convolution-based DWNets reach accuracy comparable to that of local attention-based architectures, often with slight improvements, especially when dynamic weights are employed. This finding underscores the practical robustness of depth-wise convolution in visual recognition tasks.
Implications and Future Directions
By establishing equivalences between these seemingly distinct layer operations, the paper broadens the understanding of how architectural choices affect performance and efficiency across neural networks. It points toward more computationally efficient networks that preserve, or even improve, accuracy through a careful combination of sparse connectivity, weight sharing, and dynamic adaptation.
Potential Areas for Further Exploration:
- Exploration in Different Contexts: While this paper focuses on vision tasks, the principles could be adapted for NLP and other domains leveraging transformer architectures.
- Enhancement of Dynamic Weight Schemes: Further study of more refined dynamic weight prediction mechanisms could unlock additional performance gains.
- Architectural Fusion: The integration of depth-wise convolution’s computational efficiency with attention's adaptive capacity could herald a new class of hybrid architectures.
In conclusion, this paper provides a compelling case for revisiting the components of local attention and depth-wise convolution, offering pathways for more efficient and versatile transformer and CNN architectures.