
FLatten Transformer: Vision Transformer using Focused Linear Attention (2308.00442v2)

Published 1 Aug 2023 in cs.CV

Abstract: The quadratic computation complexity of self-attention has been a persistent challenge when applying Transformer models to vision tasks. Linear attention, on the other hand, offers a much more efficient alternative with its linear complexity by approximating the Softmax operation through carefully designed mapping functions. However, current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead from the mapping functions. In this paper, we propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness. Specifically, we first analyze the factors contributing to the performance degradation of linear attention from two perspectives: the focus ability and feature diversity. To overcome these limitations, we introduce a simple yet effective mapping function and an efficient rank restoration module to enhance the expressiveness of self-attention while maintaining low computation complexity. Extensive experiments show that our linear attention module is applicable to a variety of advanced vision Transformers, and achieves consistently improved performances on multiple benchmarks. Code is available at https://github.com/LeapLabTHU/FLatten-Transformer.

Overview of "FLatten Transformer: Vision Transformer using Focused Linear Attention"

The paper "FLatten Transformer: Vision Transformer using Focused Linear Attention" addresses the critical issue of computational complexity in self-attention mechanisms within Vision Transformers. It introduces a novel Focused Linear Attention (FLatten) module as a solution that maintains the expressiveness of self-attention while improving computational efficiency.

Core Problem and Approach

Vision Transformers have shown significant potential across various tasks. However, their application is hindered by the quadratic complexity associated with traditional Softmax-based self-attention. Linear attention models offer an alternative with linear complexity but often suffer from performance degradation or incur additional computational burdens due to complex mapping functions.
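
To make the complexity difference concrete, the sketch below contrasts the two formulations in plain PyTorch. This is an illustrative toy, not the paper's implementation: Softmax attention must materialize an N x N similarity matrix, while a kernelized linear attention reorders the matrix products so the cost grows linearly with the token count N. The ReLU kernel used here is just a common placeholder mapping function.

```python
import torch

def softmax_attention(q, k, v):
    # q, k, v: (N, d). The N x N attention matrix makes cost and memory
    # grow quadratically with the number of tokens N.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (N, N)
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, phi=torch.relu, eps=1e-6):
    # Approximating exp(q . k) by phi(q) . phi(k) lets the matmuls be
    # reordered: phi(q) @ (phi(k)^T v) costs O(N * d^2) instead of O(N^2 * d).
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v                                   # (d, d)
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + eps    # (N, 1) normalizer
    return (q @ kv) / z
```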

The authors propose a Focused Linear Attention module that achieves a balance between efficiency and effectiveness. Two primary issues are addressed: the attenuated focus ability and reduced feature diversity in existing linear attention models.

Innovative Contributions

  1. Focused Function: To counteract the overly smooth attention distributions of linear attention, a mapping function called the Focused Function is introduced. It sharpens the focus ability by adjusting the direction of query and key features, making the resulting attention weights more distinct.
  2. Rank Restoration Module: To restore feature diversity, a simple depthwise convolution (DWC) is added on the value features. This raises the rank of the equivalent attention matrix and enables more diverse feature representations without significant computational overhead. A simplified sketch combining both components follows this list.
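
The following minimal, self-contained PyTorch sketch shows how these two ideas can be combined in a single attention block. It is written from the description above rather than taken from the official FLatten code: the single-head layout, the default focusing power, and the 3x3 depthwise kernel are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class FocusedLinearAttention(nn.Module):
    """Illustrative single-head focused linear attention block (not the official code)."""

    def __init__(self, dim, focusing_factor=3.0, eps=1e-6):
        super().__init__()
        self.p = focusing_factor
        self.eps = eps
        self.qkv = nn.Linear(dim, dim * 3)
        # Depthwise 3x3 convolution acting as the rank restoration module.
        self.dwc = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def focus(self, x):
        # Focused mapping: raise non-negative features to a power p, then
        # rescale so the vector norm is preserved. Larger p concentrates each
        # feature on its dominant channels, sharpening the attention weights.
        x = torch.relu(x) + self.eps
        x_p = x ** self.p
        return x_p * (x.norm(dim=-1, keepdim=True) / (x_p.norm(dim=-1, keepdim=True) + self.eps))

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W tokens.
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = self.focus(q), self.focus(k)

        # Linear attention: compute K^T V once (C x C), then apply Q -> O(N * C^2).
        kv = k.transpose(-2, -1) @ v                                        # (B, C, C)
        z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + self.eps    # (B, N, 1)
        out = (q @ kv) / z

        # Rank restoration: a depthwise convolution over V adds a local term
        # that helps recover the feature diversity lost by linear attention.
        v_img = v.transpose(1, 2).reshape(B, C, H, W)
        out = out + self.dwc(v_img).flatten(2).transpose(1, 2)
        return self.proj(out)
```

A block of this kind can stand in for Softmax attention inside a standard Transformer layer, with the surrounding layer norm, MLP, and positional encoding left unchanged.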

Experimental Validation

The effectiveness of the proposed approach is validated across various tasks, including image classification, semantic segmentation, and object detection. The experiments demonstrate consistent performance improvements over baseline models and other linear attention techniques. Notably, consistent accuracy gains are reported when the module is plugged into models such as DeiT, PVT, and Swin Transformer variants.

Implications and Future Directions

The proposed Focused Linear Attention module not only reduces computational complexity but can also be integrated easily into different architectures, making it suitable for a wide range of vision tasks. By achieving comparable or superior performance to Softmax attention, this work paves the way for more efficient Vision Transformers.

Implications:

  • Practical: This method allows the deployment of Vision Transformers in resource-constrained environments, potentially expanding their applicability in real-time and mobile applications.
  • Theoretical: The introduction of a rank restoration mechanism and focused function enriches the design space for attention mechanisms in transformers, offering a framework that could be further optimized or extended.

Future Directions:

  • Optimization and Adaptation: Further fine-tuning of the Focused Function for different tasks and model architectures could enhance performance.
  • Broader Applications: Applying this linear attention mechanism to multi-modal and sequential vision tasks could reveal its adaptability and usability across domains.
  • Collaborative Architectures: Exploring hybrid models that combine convolutional networks with improved linear attention could yield even better efficiency-performance trade-offs.

In conclusion, the FLatten module represents a significant step in the ongoing evolution of efficient Transformer architectures, marking a promising avenue for future research and development in computer vision tasks.

Authors (5)
  1. Dongchen Han (12 papers)
  2. Xuran Pan (14 papers)
  3. Yizeng Han (33 papers)
  4. Shiji Song (103 papers)
  5. Gao Huang (178 papers)
Citations (107)