Focused Linear Attention (FLatten)

Updated 21 April 2026

Focused Linear Attention (FLatten) is a mechanism that combines linear attention's efficiency with a focused nonlinear mapping and depthwise convolution to restore rank and enhance feature diversity.
It sharpens attention distributions by increasing angular separation among key features, addressing the rank deficiency inherent in vanilla linear attention.
Empirical results show FLatten achieves up to 2× speedup with improved accuracy in high-resolution vision, local feature matching, and speech separation tasks.

Focused Linear Attention (FLatten) refers to a family of attention mechanisms designed to combine the computational efficiency of linear attention with the sharp focus and feature diversity characteristic of classical softmax-based self-attention. FLatten mechanisms have been applied in computer vision and sequence modeling, most notably in transformers for high-resolution vision tasks, local feature matching, and speech separation. The core innovation of FLatten is a focused nonlinear mapping that sharpens attention distributions and a rank restoration module based on depthwise convolution, resulting in linear time and memory complexity while mitigating the rank deficiency and smoothness typically observed in vanilla linear attention. This approach is exemplified by the FLatten Transformer (Han et al., 2023), LoFLAT (Cao et al., 2024), and FLASepformer (Wang et al., 27 Aug 2025).

1. Mathematical Foundations

Standard self-attention with softmax is given by

$\operatorname{Att}_{\text{soft}}(Q, K, V) = \operatorname{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right)V,$

which incurs $\mathcal{O}(N^2 d)$ time and $\mathcal{O}(N^2)$ memory for the attention map, given $N$ tokens of dimension $d$. Linear attention approximates softmax using a positive mapping $\phi(\cdot)$:

$\operatorname{Att}_{L}(Q, K, V) = \phi(Q) [\phi(K)^\top V],$

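with cost $\mathcal{O}(N d^2)$, which is linear in $N$ for fixed $d$ because $\phi(K)^\top V$ is a $d \times d$ matrix that is computed once and shared by all queries.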
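The reordering behind this cost can be checked numerically. The sketch below is a minimal pure-Python illustration, assuming a plain ReLU as the positive mapping $\phi$ and omitting normalization, as in the formula above. The helper names (`matmul`, `transpose`, `phi`) and the toy inputs are illustrative and are not taken from the cited papers.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def phi(m):
    """Assumed positive mapping: elementwise ReLU (one common choice)."""
    return [[max(v, 0.0) for v in row] for row in m]


def linear_attention(q, k, v):
    """phi(Q) [phi(K)^T V]: builds a d x d summary, never an N x N map."""
    kv = matmul(transpose(phi(k)), v)   # d x d, computed once
    return matmul(phi(q), kv)           # N x d


def quadratic_reference(q, k, v):
    """[phi(Q) phi(K)^T] V: same result, but materializes the N x N map."""
    attn = matmul(phi(q), transpose(phi(k)))  # N x N
    return matmul(attn, v)


# Toy inputs with N = 4 tokens and d = 2 channels (hypothetical values).
Q = [[1.0, -2.0], [0.5, 3.0], [2.0, 1.0], [-1.0, 4.0]]
K = [[0.5, 1.0], [2.0, -1.0], [1.5, 0.5], [-0.5, 2.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
max_gap = max(abs(a - b) for ra, rb in zip(fast, slow) for a, b in zip(ra, rb))
print(max_gap)  # the two orderings agree up to floating-point rounding
```

Both functions return the same $N \times d$ output; only the order of the matrix products, and therefore the size of the intermediate result, differs.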

FLatten introduces a focused kernel $\phi_p$, defined as follows. For $x \in \mathbb{R}^d$:

$\phi_p(x) = f_p(\operatorname{ReLU}(x)), \qquad f_p(x) = \frac{\lVert x \rVert}{\lVert x^{**p} \rVert}\, x^{**p},$

where $x^{**p}$ denotes elementwise exponentiation and $\operatorname{ReLU}$ ensures nonnegativity. The resulting focused linear attention computes:

$\operatorname{Att}_{\text{FLA}}(Q, K, V) = \phi_p(Q) [\phi_p(K)^\top V] + \operatorname{DWC}(V),$

where $\operatorname{DWC}$ is a depthwise convolution applied channel-wise to $V$ for rank restoration (Han et al., 2023, Cao et al., 2024, Wang et al., 27 Aug 2025).
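As a rough illustration of how these pieces fit together, the following pure-Python sketch implements a focused mapping (ReLU, elementwise power, and a rescaling that preserves the norm of the ReLU output) and adds a 1D depthwise convolution over the values. It is a sketch under stated assumptions, not the reference implementation: normalization of the attention output is omitted, the convolution is the unflipped, zero-padded variant common in deep learning frameworks, and the function names, toy inputs, kernel taps, and the value `p=3.0` are all hypothetical.

```python
import math


def focused_map(x, p):
    """Assumed form of phi_p: ReLU, elementwise power p, then rescale so the
    output has the same Euclidean norm as ReLU(x)."""
    r = [max(v, 0.0) for v in x]
    powered = [v ** p for v in r]
    norm_r = math.sqrt(sum(v * v for v in r))
    norm_p = math.sqrt(sum(v * v for v in powered))
    if norm_p == 0.0:          # every entry was non-positive
        return r
    return [norm_r / norm_p * v for v in powered]


def depthwise_conv1d(v, kernels):
    """Zero-padded 1D convolution along the token axis, one kernel per channel.
    v is N x d; kernels[c] is an odd-length list of taps for channel c."""
    n, d = len(v), len(v[0])
    out = [[0.0] * d for _ in range(n)]
    for c in range(d):
        half = len(kernels[c]) // 2
        for i in range(n):
            for j, w in enumerate(kernels[c]):
                src = i + j - half
                if 0 <= src < n:
                    out[i][c] += w * v[src][c]
    return out


def focused_linear_attention(q, k, v, kernels, p):
    """phi_p(Q) [phi_p(K)^T V] + DWC(V), without output normalization."""
    n, d, d_v = len(q), len(q[0]), len(v[0])
    phi_q = [focused_map(row, p) for row in q]
    phi_k = [focused_map(row, p) for row in k]
    # d x d_v summary phi_p(K)^T V, accumulated over the tokens.
    kv = [[sum(phi_k[t][a] * v[t][b] for t in range(n)) for b in range(d_v)]
          for a in range(d)]
    local = depthwise_conv1d(v, kernels)
    return [[sum(phi_q[i][a] * kv[a][b] for a in range(d)) + local[i][b]
             for b in range(d_v)] for i in range(n)]


# Toy example: N = 3 tokens, d = 2 channels, hypothetical 3-tap kernels.
Q = [[1.0, 0.2], [0.1, 2.0], [1.5, 1.0]]
K = [[2.0, 0.5], [0.3, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernels = [[0.25, 0.5, 0.25], [0.0, 1.0, 0.0]]
out = focused_linear_attention(Q, K, V, kernels, p=3.0)
print(out)
```

The global term never forms an $N \times N$ matrix, and the convolution touches only a fixed number of neighbouring tokens per channel, so both parts scale linearly with the number of tokens.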

2. Focus Sharpening and Feature Diversity

Softmax-based attention yields highly concentrated maps, allowing queries to selectively attend to informative keys. Vanilla linear kernels tend to produce diffuse, low-rank attention, making them suboptimal in high-resolution or long-sequence regimes. The $\phi_p$ mapping in FLatten increases the angular separation between query and key vectors as the exponent $p$ grows, enhancing focus on dominant features and suppressing irrelevant dimensions. Proposition 1 of Han et al. (2023) formalizes that $\phi_p$ increases cosine similarity for shared dominant indices, leading to sharper, more discriminative attention.
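The sharpening effect can be seen on a small example. The sketch below, which uses hypothetical three-dimensional vectors and helper names chosen for illustration, compares cosine similarity before and after an elementwise power is applied to nonnegative features. Because cosine similarity ignores vector length, any norm-preserving rescaling is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def power_map(x, p):
    """ReLU followed by an elementwise power (rescaling omitted on purpose)."""
    return [max(v, 0.0) ** p for v in x]


query = [3.0, 1.0, 1.0]      # dominant feature at index 0
key_same = [2.0, 1.0, 0.5]   # shares the dominant index with the query
key_diff = [1.0, 3.0, 1.0]   # dominant feature at index 1

same, diff = [], []
for p in (1, 2, 3):
    q_p = power_map(query, p)
    same.append(cosine(q_p, power_map(key_same, p)))
    diff.append(cosine(q_p, power_map(key_diff, p)))

print(same)  # similarity to the aligned key rises with p
print(diff)  # similarity to the misaligned key falls with p
```

In this toy case the aligned key moves closer to the query while the misaligned key moves away, which is the behaviour the focused mapping is designed to produce.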

Linear attention's rank is bounded by $\min(N, d)$ (typically $d \ll N$), causing the effective attention map to lack diversity. Injecting a small depthwise convolution (a 2D kernel for images, or a 1D kernel for sequential data) over $V$ is equivalent to adding a sparse correction that can be full rank, thus restoring per-token expressiveness and breaking the degenerate rank constraint (Han et al., 2023, Cao et al., 2024, Wang et al., 27 Aug 2025).
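The rank argument can be written out briefly. The notation here is assumed for illustration: $\phi(Q), \phi(K) \in \mathbb{R}^{N \times d}$ are the mapped queries and keys, and $M_c \in \mathbb{R}^{N \times N}$ is a symbol introduced for the banded matrix by which the depthwise convolution acts on channel $c$ of the values. The standard inequality for the rank of a product gives

$\operatorname{rank}\left(\phi(Q)\phi(K)^\top\right) \le \min\{\operatorname{rank}(\phi(Q)), \operatorname{rank}(\phi(K))\} \le \min\{N, d\}.$

With the convolution added, the effective map for channel $c$ becomes

$\phi(Q)\phi(K)^\top + M_c,$

which is no longer constrained by the bound above, because $M_c$ itself can have rank $N$. For example, a kernel with only a nonzero center tap makes $M_c$ a nonzero multiple of the identity.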

3. Computational Complexity and Efficiency

The computational and memory complexity of different attention schemes is summarized below:

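Attention Type	Time Complexity	Memory Complexity	Expressiveness
Softmax (MHSA)	$\mathcal{O}(N^2 d)$	Quadratic in $N$	High
Linear (kernel)	$\mathcal{O}(N d^2)$	Linear in $N$	Low (rank $\le \min(N, d)$)
Focused Linear (FLatten)	$\mathcal{O}(N d^2)$ plus depthwise convolution	Linear in $N$	High (full rank)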

The cost of the depthwise convolution, which is linear in $N$ for a fixed kernel size on both images and sequences, is minimal relative to the quadratic component eliminated by linearization. Empirically, FLatten achieves up to $2\times$ speedup with comparable or better accuracy on high-resolution tasks (Han et al., 2023). In speech separation, FLASepformer attains faster inference than SepReformer while using only 20.9% of its GPU memory and closely matching its SI-SNRi (Wang et al., 27 Aug 2025). LoFLAT demonstrates similar efficiency with increased accuracy over LoFTR for local feature matching (Cao et al., 2024).
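To give a sense of scale, the sketch below counts multiply-accumulate operations for a single attention head under simple assumptions: two matrix products per attention variant and one multiply-accumulate per kernel tap, channel, and token for the depthwise convolution. The token count, head dimension, and kernel size are hypothetical, and the result is an operation count, not a measured speedup.

```python
def softmax_attention_macs(n, d):
    """Q K^T and the product with V: about 2 * N^2 * d multiply-accumulates."""
    return 2 * n * n * d


def linear_attention_macs(n, d):
    """phi(K)^T V and phi(Q) times the d x d summary: about 2 * N * d^2."""
    return 2 * n * d * d


def depthwise_conv_macs(n, d, taps):
    """One multiply-accumulate per token, channel, and kernel tap."""
    return n * d * taps


N_TOKENS = 64 * 64   # hypothetical 64 x 64 feature map
HEAD_DIM = 64        # hypothetical per-head channel dimension
KERNEL_TAPS = 5 * 5  # hypothetical 5 x 5 depthwise kernel

quadratic = softmax_attention_macs(N_TOKENS, HEAD_DIM)
linear = linear_attention_macs(N_TOKENS, HEAD_DIM)
local = depthwise_conv_macs(N_TOKENS, HEAD_DIM, KERNEL_TAPS)

print(quadratic, linear, local)
print(quadratic / (linear + local))  # ratio of quadratic to linear-plus-conv cost
```

Under these assumed sizes the linear path plus the convolution needs roughly fifty times fewer multiply-accumulates than the quadratic path. End-to-end speedups are smaller because attention is only one part of a full model.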

4. Architectural Integration and Variants

FLatten has been integrated into diverse transformer-based architectures:

Vision Transformers: FLatten replaces the softmax attention in early ViT stages and is compatible with DeiT, PVT, Swin, and similar models (Han et al., 2023). The depthwise convolution operates spatially.
Local Feature Matching (LoFLAT): The Feature Transformer Module applies FLatten with a $d$ 4 2D depthwise convolution per channel. Feature Extraction utilizes ResNet + FPN. A coarse-to-fine Matching Module exploits the improved map focus and diversity for robust matching (Cao et al., 2024).
Speech Separation (FLASepformer): FLASepformer (Wang et al., 27 Aug 2025) deploys FLatten (with 1D depthwise convolution) in the global attention modules of SepReformer and in the temporal blocks of TF-Locoformer, both replacing quadratic MHSA. A gated MLP module with LayerNorm and channel-wise gating further enhances per-token modulations. Hyperparameters such as focus exponent $d$ 5 and kernel $d$ 6 are used for robust performance.

5. Empirical Results and Comparisons

Extensive benchmarks validate FLatten's efficacy:

Image Classification (Han et al., 2023): DeiT-Tiny accuracy improves from 72.2% (softmax) to 74.1% (FLatten); Swin-Tiny improves from 81.3% to 82.1%. Performance gains are consistent across segmentation (ADE20K mIoU) and object detection (COCO AP).
Speech Separation (Wang et al., 27 Aug 2025): FLA-SepReformer matches or nearly matches SepReformer SI-SNRi while enabling $d$ 7 acceleration and linear memory scaling. FLA-TFLocoformer achieves similar SI-SNRi with $d$ 8 of the original GPU memory budget.
Local Feature Matching (Cao et al., 2024): On MegaDepth, LoFLAT increases AUC @5°, @10°, and @20° by 2.7%, 1.9%, and 0.9% respectively over LoFTR, yielding denser and more robust matches.

Ablation studies confirm that the focused mapping improves accuracy significantly over vanilla linear attention, and further gains accrue when the depthwise convolution rank-restoration module is added. The choice of exponent $p$ is robust, with a single moderate value typically performing best (Han et al., 2023, Wang et al., 27 Aug 2025).

6. Implementation Guidelines and Limitations

Key implementation suggestions include:

Focus Exponent: A single default value of $p$ is generally sufficient; varying $p$ has minor effects on accuracy unless it is set very high or very low (Han et al., 2023, Wang et al., 27 Aug 2025).
Depthwise Convolution: Kernel size $\phi(\cdot)$ 3 for images and $\phi(\cdot)$ 4 for sequential data are effective; size variations yield diminishing returns (Cao et al., 2024, Wang et al., 27 Aug 2025).
Gating (for FLASepformer): LayerNorm prior to gating and a single or two-layer linear projection with nonlinearities are effective.
Integration Points: Plugging FLatten into early, high-resolution transformer blocks yields maximum efficiency benefit (Han et al., 2023).

Limitations include the need to tune the additional focus parameter $p$, potential ineffectiveness of the depthwise convolution when local structure is weak, and the under-weighting of extremely long-range interactions, a generic limitation of all linear attention methods (Cao et al., 2024). In speech separation, moderate SI-SNRi drops of 0.2–0.3 dB are observed versus full MHSA, but with large gains in scalability and speed (Wang et al., 27 Aug 2025).

7. Research Directions and Outlook

Potential future research includes:

Learnable Mapping Exponents: Making $p$ learnable for task adaptation.
Multi-Kernel Rank Restoration: Using richer or adaptive convolutional kernels to further increase feature diversity.
Applicability to Cross-Attention: Extending FLatten to decoder or cross-attention mechanisms in advanced architectures.
Broader Modalities: Because of its linear scaling, FLatten is a practical candidate for large-context, high-resolution, or long-sequence modeling across vision, audio, and multimodal domains (Han et al., 2023, Cao et al., 2024, Wang et al., 27 Aug 2025).

The Focused Linear Attention paradigm offers a unifying, resource-efficient framework that matches or outperforms prior linear attention methods and, in several regimes, surpasses softmax self-attention in accuracy‐throughput trade-offs.