Focused Linear Attention (FLatten)

Updated 21 April 2026
  • Focused Linear Attention (FLatten) is a mechanism that combines linear attention's efficiency with a focused nonlinear mapping and depthwise convolution to restore rank and enhance feature diversity.
  • It sharpens attention distributions by increasing angular separation among key features, addressing the rank deficiency inherent in vanilla linear attention.
  • Empirical results show FLatten achieves up to 2× speedup with improved accuracy in high-resolution vision, local feature matching, and speech separation tasks.

Focused Linear Attention (FLatten) refers to a family of attention mechanisms designed to combine the computational efficiency of linear attention with the sharp focus and feature diversity characteristic of classical softmax-based self-attention. FLatten mechanisms have been applied in computer vision and sequence modeling, most notably in transformers for high-resolution vision tasks, local feature matching, and speech separation. The core innovation of FLatten is a focused nonlinear mapping that sharpens attention distributions and a rank restoration module based on depthwise convolution, resulting in linear time and memory complexity while mitigating the rank deficiency and smoothness typically observed in vanilla linear attention. This approach is exemplified by the FLatten Transformer (Han et al., 2023), LoFLAT (Cao et al., 2024), and FLASepformer (Wang et al., 27 Aug 2025).

1. Mathematical Foundations

Standard self-attention with softmax is given by

$$\operatorname{Att}_{\text{soft}}(Q, K, V) = \operatorname{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right)V,$$

which incurs $\mathcal{O}(N^2 d)$ time and $\mathcal{O}(N^2)$ memory for the attention map, for $N$ tokens of dimension $d$. Linear attention approximates softmax using a positive mapping $\phi(\cdot)$:

$$\operatorname{Att}_{L}(Q, K, V) = \phi(Q) \left[\phi(K)^\top V\right],$$

where computing $\phi(K)^\top V$ first gives a cost of $\mathcal{O}(N d^2)$, which is linear in $N$ for fixed $d$.
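The reordering that makes linear attention cheap can be checked numerically. The following is a minimal NumPy sketch, not taken from any of the cited implementations. ReLU as the positive map $\phi$, the helper names, and the sizes `N` and `d` are assumptions made for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax_attention(Q, K, V):
    """Quadratic attention: materializes the full N x N score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # (N, N)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=relu):
    """Unnormalized linear attention, phi(Q) [phi(K)^T V].

    phi defaults to ReLU, which is an assumption of this sketch.
    """
    kv = phi(K).T @ V      # (d, d) summary of keys and values, O(N d^2)
    return phi(Q) @ kv     # (N, d), O(N d^2); no N x N matrix is formed

rng = np.random.default_rng(0)
N, d = 256, 16
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

out_fast = linear_attention(Q, K, V)
# Same quantity computed the quadratic way, forming the N x N map explicitly.
out_slow = (relu(Q) @ relu(K).T) @ V
print(np.allclose(out_fast, out_slow))
```

Both orderings give the same output by associativity of matrix multiplication; only the second one allocates an $N \times N$ intermediate.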

FLatten introduces a focused kernel $\phi_p$, defined as follows. For $x \in \mathbb{R}^d$:

$$\phi_p(x) = f_p(\operatorname{ReLU}(x)), \qquad f_p(x) = \frac{\lVert x \rVert}{\lVert x^{**p} \rVert}\, x^{**p},$$

where $x^{**p}$ denotes elementwise exponentiation and $\operatorname{ReLU}$ ensures nonnegativity. The resulting focused linear attention computes:

$$\operatorname{Att}_{\text{FL}}(Q, K, V) = \phi_p(Q) \left[\phi_p(K)^\top V\right] + \operatorname{DWC}(V),$$

where $\operatorname{DWC}$ is a depthwise convolution applied channel-wise to $V$ for rank restoration (Han et al., 2023, Cao et al., 2024, Wang et al., 27 Aug 2025).
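The focused kernel and the convolutional branch can be sketched for a single head over a 1D token sequence as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' code: the function names, the exponent `p=3.0`, the 3-tap kernel, and the zero padding are all choices made here for demonstration. Linear attention is commonly paired with a row normalization term, which is omitted to mirror the formula as written.

```python
import numpy as np

def focused_map(x, p=3.0, eps=1e-12):
    """phi_p: ReLU, elementwise power p, then rescale each row so that it
    keeps the norm of the ReLU'd input (only the direction is sharpened)."""
    x = np.maximum(x, 0.0)
    xp = x ** p
    norm_x = np.linalg.norm(x, axis=-1, keepdims=True)
    norm_xp = np.linalg.norm(xp, axis=-1, keepdims=True)
    return xp * norm_x / np.maximum(norm_xp, eps)

def depthwise_conv1d(V, kernel):
    """Per-channel 1D convolution along the token axis with zero padding.

    V: (N, d) values; kernel: (k, d) with k odd, one filter per channel.
    """
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(V, ((pad, pad), (0, 0)))
    out = np.zeros_like(V)
    for j in range(k):
        out += padded[j:j + V.shape[0]] * kernel[j]
    return out

def focused_linear_attention(Q, K, V, kernel, p=3.0):
    """phi_p(Q) [phi_p(K)^T V] + DWC(V), without row normalization."""
    kv = focused_map(K, p).T @ V           # (d, d)
    return focused_map(Q, p) @ kv + depthwise_conv1d(V, kernel)

rng = np.random.default_rng(0)
N, d = 128, 16
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
kernel = 0.1 * rng.standard_normal((3, d))   # hypothetical 3-tap filters
out = focused_linear_attention(Q, K, V, kernel)
print(out.shape)
```

For images, the same structure applies with a 2D depthwise convolution over the spatial grid in place of `depthwise_conv1d`.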

2. Focus Sharpening and Feature Diversity

Softmax-based attention yields highly concentrated maps, allowing queries to attend selectively to informative keys. Vanilla linear kernels tend to produce diffuse, low-rank attention, making them suboptimal in high-resolution or long-sequence regimes. The $\phi_p$ mapping in FLatten increases the angular separation between query and key vectors whose dominant dimensions differ as $p$ grows, enhancing focus on dominant features and suppressing irrelevant dimensions. Proposition 1 in (Han et al., 2023) formalizes that $\phi_p$ increases cosine similarity for shared dominant indices, leading to sharper, more discriminative attention.
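A small numerical check makes the sharpening effect concrete. The sketch below uses hand-picked 4-dimensional vectors (an assumption for illustration, not data from the cited papers) and compares cosine similarities after the focused map for a key that shares the query's dominant index and one that does not.

```python
import numpy as np

def focused_map(x, p):
    """phi_p for a single vector: ReLU, elementwise power, norm-preserving rescale."""
    x = np.maximum(x, 0.0)
    xp = x ** p
    return xp * np.linalg.norm(x) / np.linalg.norm(xp)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([3.0, 1.0, 1.0, 0.5])        # dominant index 0
k_same = np.array([2.5, 0.5, 1.2, 1.0])   # dominant index 0
k_diff = np.array([0.5, 3.0, 1.0, 1.0])   # dominant index 1

for p in (1, 2, 3):
    fq = focused_map(q, p)
    sim_same = cosine(fq, focused_map(k_same, p))
    sim_diff = cosine(fq, focused_map(k_diff, p))
    print(f"p={p}: shared dominant index {sim_same:.3f}, different {sim_diff:.3f}")
```

With these vectors, the similarity to `k_same` rises and the similarity to `k_diff` falls when moving from `p=1` to `p=3`, while the norm of each mapped vector is unchanged.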

The rank of the linear attention map is bounded by $d$ (typically $d \ll N$), causing the effective attention map to lack diversity. Injection of a small depthwise convolution (2D for images or 1D for sequential data) over $V$ is mathematically equivalent to adding a sparse, full-rank correction, thus restoring per-token expressiveness and breaking the degenerate rank constraint (Han et al., 2023, Cao et al., 2024, Wang et al., 27 Aug 2025).
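The rank argument can be illustrated by writing the depthwise convolution for one channel as a banded $N \times N$ matrix and adding it to the implicit linear attention map. This is a sketch under assumptions: ReLU stands in for the feature map, the 3-tap filter `[0.25, 1.0, 0.25]` is hypothetical, and the sizes are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
N, d = 64, 8
Q, K = rng.standard_normal((2, N, d))

# Implicit N x N map of linear attention: a product of (N, d) and (d, N) factors.
A = relu(Q) @ relu(K).T

# A 1D depthwise convolution on one channel acts on that channel of V as a
# banded Toeplitz matrix. The taps below make it strictly diagonally dominant,
# hence invertible.
taps = (0.25, 1.0, 0.25)
M = np.zeros((N, N))
for offset, w in zip((-1, 0, 1), taps):
    M += w * np.eye(N, k=offset)

rank_linear = np.linalg.matrix_rank(A)
rank_restored = np.linalg.matrix_rank(A + M)
print(rank_linear, rank_restored)
```

The first rank cannot exceed `d` regardless of `N`, whereas the combined map is no longer confined to a `d`-dimensional subspace. Each channel has its own filter in practice, so the equivalent correction differs per channel.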

3. Computational Complexity and Efficiency

The computational and memory complexity of different attention schemes is summarized below:

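| Attention Type | Time Complexity | Memory Complexity | Expressiveness |
|---|---|---|---|
| Softmax (MHSA) | $\mathcal{O}(N^2 d)$ | Quadratic in $N$ | High |
| Linear (kernel) | $\mathcal{O}(N d^2)$ | Linear in $N$ | Low (rank $\le d$) |
| Focused Linear (FLatten) | $\mathcal{O}(N d^2)$ | Linear in $N$ | High (full rank) |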

The depthwise convolution's cost ($\mathcal{O}(N k^2 d)$ for images and $\mathcal{O}(N k d)$ for sequences, with kernel size $k$) is minimal relative to the quadratic component eliminated by linearization. Empirically, FLatten achieves up to $2\times$ speedup with comparable or better accuracy on high-resolution tasks (Han et al., 2023). In speech separation, FLASepformer attains faster inference and uses only 20.9% of the GPU memory of SepReformer while closely matching its SI-SNRi (Wang et al., 27 Aug 2025). LoFLAT demonstrates similar efficiency with increased accuracy over LoFTR for local feature matching (Cao et al., 2024).
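To see how the terms compare in magnitude, the following back-of-the-envelope sketch counts multiply-accumulate operations for one attention head. The function name, the head dimension of 64, and the kernel size of 3 are assumptions chosen for illustration. The counts cover only the attention products, so the printed ratios should not be read as end-to-end speedups.

```python
def attention_macs(n_tokens, dim, kernel_size=3, conv_dims=1):
    """Rough multiply-accumulate counts for one attention head.

    Ignores projections, softmax, and normalization. conv_dims is 1 for
    sequences and 2 for images (kernel_size ** conv_dims taps per output).
    """
    softmax = 2 * n_tokens * n_tokens * dim            # Q K^T, then scores @ V
    linear = 2 * n_tokens * dim * dim                  # phi(K)^T V, then phi(Q) @ (...)
    dwc = n_tokens * kernel_size ** conv_dims * dim    # depthwise conv over V
    return {"softmax": softmax, "linear": linear, "focused": linear + dwc}

for n in (256, 1024, 4096):
    cost = attention_macs(n, dim=64, kernel_size=3, conv_dims=2)
    ratio = cost["softmax"] / cost["focused"]
    print(f"N={n}: softmax={cost['softmax']:,} focused={cost['focused']:,} ratio={ratio:.1f}")
```

The softmax count grows quadratically with `n_tokens`, while the focused linear count grows linearly, so the gap widens with resolution or sequence length.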

4. Architectural Integration and Variants

FLatten has been integrated into diverse transformer-based architectures:

  • Vision Transformers: FLatten replaces the softmax attention in early ViT stages and is compatible with DeiT, PVT, Swin, and similar models (Han et al., 2023). The depthwise convolution operates spatially.
  • Local Feature Matching (LoFLAT): The Feature Transformer Module applies FLatten with a 2D depthwise convolution per channel. Feature Extraction uses ResNet + FPN. A coarse-to-fine Matching Module exploits the improved map focus and diversity for robust matching (Cao et al., 2024).
  • Speech Separation (FLASepformer): FLASepformer (Wang et al., 27 Aug 2025) deploys FLatten (with 1D depthwise convolution) in the global attention modules of SepReformer and in the temporal blocks of TF-Locoformer, in both cases replacing quadratic MHSA. A gated MLP module with LayerNorm and channel-wise gating further enhances per-token modulation. The focus exponent $p$ and the convolution kernel size are fixed hyperparameters chosen for robust performance.

5. Empirical Results and Comparisons

Extensive benchmarks validate FLatten's efficacy:

  • Image Classification (Han et al., 2023): DeiT-Tiny accuracy improves from 72.2% (softmax) to 74.1% (FLatten); Swin-Tiny improves from 81.3% to 82.1%. Performance gains are consistent across segmentation (ADE20K mIoU) and object detection (COCO AP).
  • Speech Separation (Wang et al., 27 Aug 2025): FLA-SepReformer matches or nearly matches SepReformer SI-SNRi while enabling faster inference and linear memory scaling. FLA-TFLocoformer achieves similar SI-SNRi with a fraction of the original GPU memory budget.
  • Local Feature Matching (Cao et al., 2024): On MegaDepth, LoFLAT increases AUC @5°, @10°, and @20° by 2.7%, 1.9%, and 0.9% respectively over LoFTR, yielding denser and more robust matches.

Ablation studies confirm that the focused mapping improves accuracy significantly over vanilla linear attention, and that further gains accrue when the depthwise convolution rank-restoration module is added. The choice of exponent $p$ is robust, with a single default value typically performing best (Han et al., 2023, Wang et al., 27 Aug 2025).

6. Implementation Guidelines and Limitations

Key implementation suggestions include:

  • Focus Exponent: Start from the default $p$ reported in the cited papers; variation in $p$ has minor effects on accuracy unless it is set very high or very low (Han et al., 2023, Wang et al., 27 Aug 2025).
  • Depthwise Convolution: Small 2D kernels for images and small 1D kernels for sequential data are effective; size variations yield diminishing returns (Cao et al., 2024, Wang et al., 27 Aug 2025).
  • Gating (for FLASepformer): LayerNorm prior to gating and a single or two-layer linear projection with nonlinearities are effective.
  • Integration Points: Plugging FLatten into early, high-resolution transformer blocks yields maximum efficiency benefit (Han et al., 2023).

Limitations include the need to tune the additional focus parameter $p$, potential ineffectiveness of the depthwise convolution if local structure is weak, and the under-weighting of extremely long-range interactions, a generic limitation of all linear attention methods (Cao et al., 2024). In speech separation, moderate SI-SNRi drops of 0.2–0.3 dB are observed versus full MHSA, but with large gains in scalability and speed (Wang et al., 27 Aug 2025).

7. Research Directions and Outlook

Potential future research includes:

  • Learnable Mapping Exponents: Making $p$ learnable for task adaptation.
  • Multi-Kernel Rank Restoration: Using richer or adaptive convolutional kernels to further increase feature diversity.
  • Applicability to Cross-Attention: Extending FLatten to decoder or cross-attention mechanisms in advanced architectures.
  • Broader Modalities: Linear scaling makes FLatten practical for large-context, high-resolution, or long-sequence modeling across vision, audio, and multimodal domains (Han et al., 2023, Cao et al., 2024, Wang et al., 27 Aug 2025).

The Focused Linear Attention paradigm offers a unifying, resource-efficient framework that matches or outperforms prior linear attention methods and, in several regimes, surpasses softmax self-attention in accuracy-throughput trade-offs.
