- The paper introduces a novel method of embedding depth-wise convolution into transformers to incorporate local features, enhancing image classification.
- It replaces fully connected layers with 1×1 convolutions and supplements them with depth-wise convolutions, inspired by MobileNet's inverted residual blocks.
- Experimental results show accuracy improvements of up to 3.1% with minimal computational cost, emphasizing the benefits of early local feature integration.
Vision Transformers (ViTs) represent a significant advance in computer vision, leveraging the transformer's ability to model long-range dependencies. However, standard ViTs lack a mechanism for exploiting local features, which are crucial in image tasks because they capture structures such as edges, lines, and shapes. The paper "LocalViT: Bringing Locality to Vision Transformers" by Li et al. addresses this by integrating a depth-wise convolution into the feed-forward network of the transformer.
Methodology
The authors draw inspiration from the inverted residual blocks used in MobileNets, whose structure parallels the feed-forward network of a transformer. Replacing the two fully connected layers with 1×1 convolutions and inserting a depth-wise convolution between them introduces locality at little extra cost, in the same spirit as the inverted residual design. The resulting block couples local information processing with the transformer's global self-attention layers.
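The idea can be illustrated with a minimal PyTorch sketch, assuming a 2D feature-map input; the class name `LocalityFeedForward`, the 3×3 kernel size, and the GELU activations are illustrative choices rather than the authors' exact configuration:

```python
import torch
import torch.nn as nn

class LocalityFeedForward(nn.Module):
    """Sketch of a locality-enhanced feed-forward block: the two fully
    connected layers of a transformer FFN become 1x1 convolutions, with a
    depth-wise 3x3 convolution inserted between them (inverted-residual style)."""

    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden, kernel_size=1),      # expand (replaces first FC layer)
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3,
                      padding=1, groups=hidden),        # depth-wise conv: local spatial mixing
            nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1),      # project back (replaces second FC layer)
        )

    def forward(self, x):
        # x: (B, C, H, W) feature map
        return self.net(x)
```

Because the depth-wise convolution mixes information only within a small spatial neighbourhood of each channel, its parameter and FLOP cost stays small relative to the 1×1 (pointwise) convolutions.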
A key design decision is reshaping the sequence of image tokens into a 2D feature map so that the depth-wise convolution can operate spatially. This lets the network capture local context without altering the transformer's overall architecture and with little additional computational cost.
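Concretely, the tokens are rearranged into an image-like grid before the convolutional feed-forward block and flattened back into a sequence afterwards. The helpers below are a minimal sketch of that round trip, assuming PyTorch and a known patch-grid size; handling of the class token (which has no spatial position and is typically split off before the reshape) is omitted for brevity:

```python
import torch

def tokens_to_map(tokens, h, w):
    """Reshape a (B, N, C) patch-token sequence into a (B, C, H, W) feature map
    so a depth-wise convolution can act on spatial neighbourhoods."""
    b, n, c = tokens.shape
    assert n == h * w, "token count must match the spatial grid"
    return tokens.transpose(1, 2).reshape(b, c, h, w)

def map_to_tokens(feat):
    """Flatten a (B, C, H, W) feature map back into a (B, N, C) token sequence."""
    b, c, h, w = feat.shape
    return feat.reshape(b, c, h * w).transpose(1, 2)

# Example round trip inside a transformer block (14x14 patch grid assumed):
# x = map_to_tokens(LocalityFeedForward(dim)(tokens_to_map(x, 14, 14)))
```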
Results and Analysis
Experimental results show that the proposed LocalViT models outperform baseline transformers such as DeiT-T and PVT-T, with accuracy gains of 2.6% and 3.1% respectively, at minimal additional computational cost. The authors also ablate the choice of non-linear activation function, the layers at which locality is introduced, and the expansion ratio of the feed-forward network.
The study of activation functions shows that h-swish, combined with channel-attention modules such as SE and ECA (sketched below), yields notable accuracy gains without a significant increase in computational cost, highlighting the role the activation function plays in the locality block. Further analysis shows that introducing locality in the lower layers of the network is particularly effective, underlining the importance of capturing local context early in the processing pipeline.
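For reference, the two ingredients mentioned above can be sketched as follows in PyTorch; the reduction ratio of 4 and the placement of these modules after the depth-wise convolution are assumptions for illustration, not necessarily the paper's exact settings:

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-excitation channel attention: pool spatially, learn
    per-channel gates, and rescale the feature map."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze to (B, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1),  # reduce
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # expand
            nn.Sigmoid(),                                   # per-channel gate in [0, 1]
        )

    def forward(self, x):
        return x * self.gate(x)

# h-swish is available out of the box as nn.Hardswish(); swapping it in for
# GELU and appending SEModule(hidden) after the depth-wise convolution in the
# locality-enhanced feed-forward sketch mirrors the kind of variants ablated.
```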
Implications and Future Directions
The findings indicate that the integration of locality can significantly enhance the performance of vision transformers, making them competitive with or superior to existing convolutional neural networks in image classification tasks. This approach not only bridges a conceptual gap between CNNs and transformers but also lays the groundwork for further exploration into hybrid models that leverage both local and global processing capabilities.
Future work could extend the LocalViT framework to vision tasks beyond classification, such as detection, segmentation, and video processing, where both local and global features are essential. Fusing the distinctive strengths of transformers and CNNs could lead to more robust, versatile networks. Moreover, a better understanding of the trade-offs between local and global capabilities in transformers could open the way to more efficient designs tailored to specific applications or hardware constraints.
The work of Li et al. underscores the transformative potential of marrying locality with global self-attention, paving the way for a nuanced enhancement of vision transformer architectures.