Fast Vision Transformers with HiLo Attention (2205.13213v5)

Published 26 May 2022 in cs.CV, cs.AI, and cs.LG

Abstract: Vision Transformers (ViTs) have triggered the most recent and significant breakthroughs in computer vision. Their efficient designs are mostly guided by the indirect metric of computational complexity, i.e., FLOPs, which however has a clear gap with the direct metric such as throughput. Thus, we propose to use the direct speed evaluation on the target platform as the design principle for efficient ViTs. Particularly, we introduce LITv2, a simple and effective ViT which performs favourably against the existing state-of-the-art methods across a spectrum of different model sizes with faster speed. At the core of LITv2 is a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the insight that high frequencies in an image capture local fine details and low frequencies focus on global structures, whereas a multi-head self-attention layer neglects the characteristic of different frequencies. Therefore, we propose to disentangle the high/low frequency patterns in an attention layer by separating the heads into two groups, where one group encodes high frequencies via self-attention within each local window, and another group encodes low frequencies by performing global attention between the average-pooled low-frequency keys and values from each window and each query position in the input feature map. Benefiting from the efficient design for both groups, we show that HiLo is superior to the existing attention mechanisms by comprehensively benchmarking FLOPs, speed and memory consumption on GPUs and CPUs. For example, HiLo is 1.4x faster than spatial reduction attention and 1.6x faster than local window attention on CPUs. Powered by HiLo, LITv2 serves as a strong backbone for mainstream vision tasks including image classification, dense detection and segmentation. Code is available at https://github.com/ziplab/LITv2.

Fast Vision Transformers with HiLo Attention

The paper introduces LITv2, an efficient and effective vision Transformer that improves the speed and accuracy of ViTs across a range of tasks through a novel self-attention mechanism, HiLo. Vision Transformers (ViTs) have become a cornerstone of computer vision, yet designing them to be both fast and accurate remains challenging. Indirect measures of computational complexity such as FLOPs (floating-point operations) do not fully capture actual runtime across different platforms, motivating more direct design metrics such as throughput on the target hardware.

Core Contribution: HiLo Attention

The primary innovation in LITv2 is the HiLo attention mechanism. HiLo improves upon standard multi-head self-attention by exploiting the frequency characteristics inherent in visual data. Specifically, HiLo separates the attention heads into two groups: one dedicated to high-frequency components and one to low-frequency components. High frequencies, which capture fine local details, are processed with self-attention within small non-overlapping windows, while low frequencies, which encode broader global structures, are handled by attending from every query position to average-pooled keys and values from each window. This division reduces memory footprint and computational cost while maintaining performance.
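
The PyTorch sketch below illustrates this head-splitting idea. It is an illustrative simplification under assumed settings (split ratio `alpha`, window size 2, channels-last tensor layout), not the authors' implementation, which is available at https://github.com/ziplab/LITv2; it also assumes PyTorch 2.x for `F.scaled_dot_product_attention`.

```python
# A minimal sketch of HiLo-style attention: high-frequency heads attend within
# local windows, low-frequency heads attend globally to pooled keys/values.
# Illustrative only; see https://github.com/ziplab/LITv2 for the official code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HiLoSketch(nn.Module):
    def __init__(self, dim, num_heads=8, alpha=0.5, window=2):
        super().__init__()
        self.window = window
        self.dim_head = dim // num_heads
        # Assign a fraction `alpha` of the heads to the low-frequency (global)
        # branch and the rest to the high-frequency (local window) branch.
        self.lo_heads = int(num_heads * alpha)
        self.hi_heads = num_heads - self.lo_heads
        self.lo_dim = self.lo_heads * self.dim_head
        self.hi_dim = self.hi_heads * self.dim_head
        self.lo_q = nn.Linear(dim, self.lo_dim)
        self.lo_kv = nn.Linear(dim, 2 * self.lo_dim)
        self.lo_proj = nn.Linear(self.lo_dim, self.lo_dim)
        self.hi_qkv = nn.Linear(dim, 3 * self.hi_dim)
        self.hi_proj = nn.Linear(self.hi_dim, self.hi_dim)

    def lofi(self, x):
        # Low-frequency branch: queries from every position, keys/values from
        # average-pooled windows, followed by global attention.
        B, H, W, _ = x.shape
        q = self.lo_q(x).reshape(B, H * W, self.lo_heads, self.dim_head).transpose(1, 2)
        pooled = F.avg_pool2d(x.permute(0, 3, 1, 2), self.window).permute(0, 2, 3, 1)
        kv = self.lo_kv(pooled).reshape(B, -1, 2, self.lo_heads, self.dim_head)
        k, v = kv.permute(2, 0, 3, 1, 4)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.lo_proj(out.transpose(1, 2).reshape(B, H, W, self.lo_dim))

    def hifi(self, x):
        # High-frequency branch: standard self-attention restricted to
        # non-overlapping window x window patches.
        B, H, W, _ = x.shape
        s = self.window
        qkv = self.hi_qkv(x).reshape(B, H // s, s, W // s, s, 3, self.hi_heads, self.dim_head)
        qkv = qkv.permute(5, 0, 1, 3, 6, 2, 4, 7).reshape(3, -1, self.hi_heads, s * s, self.dim_head)
        out = F.scaled_dot_product_attention(qkv[0], qkv[1], qkv[2])
        out = out.reshape(B, H // s, W // s, self.hi_heads, s, s, self.dim_head)
        out = out.permute(0, 1, 4, 2, 5, 3, 6).reshape(B, H, W, self.hi_dim)
        return self.hi_proj(out)

    def forward(self, x):
        # x: (B, H, W, C) with H and W divisible by the window size.
        # The two branch outputs are concatenated along the channel dimension.
        return torch.cat([self.hifi(x), self.lofi(x)], dim=-1)


x = torch.randn(1, 14, 14, 64)
print(HiLoSketch(dim=64)(x).shape)  # torch.Size([1, 14, 14, 64])
```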

Technical Efficacy

The proposed HiLo mechanism offers substantial improvements in speed and resource use. Empirical results show that HiLo is more efficient than existing attention paradigms: on CPUs, it is 1.4× faster than spatial reduction attention and 1.6× faster than local window attention. These findings are confirmed by comprehensive benchmarks of FLOPs, throughput, and memory consumption on both GPUs and CPUs.

Architectural and Practical Implications

LITv2 leverages HiLo at its core and delivers strong performance across a range of vision tasks, including image classification, object detection, instance segmentation, and semantic segmentation. In experiments on standard datasets such as ImageNet, COCO, and ADE20K, LITv2 consistently shows superior speed-accuracy trade-offs.

Practically, these results suggest that using throughput as a design principle, complemented by the HiLo mechanism, leads to more efficient and readily deployable vision models. This is particularly beneficial for applications requiring low-latency processing, such as on-device computation in autonomous systems or on mobile platforms.
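
As a concrete illustration of measuring speed directly on the target platform rather than relying on FLOPs alone, the sketch below times forward passes and reports images per second. The batch size, resolution, and placeholder model are assumptions; a LITv2 model from the official repository could be substituted.

```python
# A minimal sketch of direct throughput measurement (images/sec) on the
# target device, in the spirit of the paper's speed-first design principle.
import time
import torch


@torch.no_grad()
def throughput(model, device="cpu", batch_size=32, resolution=224, iters=30, warmup=10):
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, resolution, resolution, device=device)
    for _ in range(warmup):          # warm-up to stabilise clocks and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()     # wait for queued GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return iters * batch_size / elapsed   # images processed per second


# Example with a tiny placeholder model; swap in LITv2 from the official repo.
print(f"{throughput(torch.nn.Conv2d(3, 16, 3, stride=2)):.1f} images/sec")
```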

Speculation on Future Research

Future research could further explore the nuances of high- and low-frequency separation in attention mechanisms, potentially integrating more sophisticated frequency analyses or adaptive windowing techniques. Additionally, extending the HiLo attention paradigm to other domains, such as audio or hybrid modalities in multi-sensory learning, could be a promising direction and could lead to a more unified view of attention mechanisms across different data types.

Furthermore, the introduction of depthwise convolution layers as an implicit positional encoding hints at the interplay between attention and convolutional architectures and addresses another aspect of optimizing Transformer models. This convergence of convolutional and Transformer paradigms could inspire novel hybrid architectures.
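
The sketch below illustrates how a zero-padded depthwise convolution inside a feed-forward block can act as an implicit positional encoding. The hidden ratio and kernel size are illustrative assumptions rather than the exact LITv2 block.

```python
# A minimal sketch of a feed-forward block with a depthwise convolution whose
# zero padding leaks absolute position information into the features,
# removing the need for an explicit positional encoding. Illustrative only.
import torch
import torch.nn as nn


class ConvFFN(nn.Module):
    def __init__(self, dim, hidden_ratio=4, kernel_size=3):
        super().__init__()
        hidden = dim * hidden_ratio
        self.fc1 = nn.Linear(dim, hidden)
        # groups=hidden makes this a depthwise convolution.
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size,
                                padding=kernel_size // 2, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):                  # x: (B, H, W, C)
        x = self.act(self.fc1(x))
        x = x.permute(0, 3, 1, 2)          # to (B, C_hidden, H, W) for the conv
        x = self.act(self.dwconv(x))
        x = x.permute(0, 2, 3, 1)          # back to (B, H, W, C_hidden)
        return self.fc2(x)


print(ConvFFN(dim=64)(torch.randn(1, 14, 14, 64)).shape)  # torch.Size([1, 14, 14, 64])
```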

Conclusion

The paper "Fast Vision Transformers with HiLo Attention" contributes significantly to efficient ViT design, emphasizing the importance of developing attention mechanisms that are not only theoretically but practically efficient. LITv2, with its HiLo attention, paves the way for more resource-efficient vision Transformers, ensuring that they remain applicable in environments with constrained computational capabilities. This work is expected to influence the design of future vision architectures and extends the applicability of Transformers in time-sensitive applications.

Authors (3)
  1. Zizheng Pan (23 papers)
  2. Jianfei Cai (163 papers)
  3. Bohan Zhuang (79 papers)
Citations (120)