Performer Attention: Linear Transformer Efficiency

Updated 2 May 2026

Performer Attention is a linear self-attention mechanism that approximates softmax using kernel methods via the FAVOR+ framework.
It reduces quadratic complexity to linear time and memory, enabling scalable and efficient modeling for language, vision, speech, and biological sequences.
Extensions like PermuteFormer and SLiM further enhance performance by incorporating relative position encoding and block-wise memory optimization without compromising accuracy.

Performer Attention is a class of linear self-attention mechanisms that enable Transformer models to scale with linear time and memory complexity in sequence length, without relying on sparsity or low-rank constraints. At its core, Performer Attention replaces the quadratic-complexity softmax kernel of standard Transformers with a kernelizable approximation, primarily leveraging the Fast Attention Via positive Orthogonal Random features (FAVOR+) framework. This framework supports faithful and unbiased estimation of softmax (and other) kernels through positive random (or orthogonal random) feature maps. Performer Attention is widely applicable across modalities, showing competitive or superior empirical results in language, vision, speech, and biological sequence modeling. Notably, it addresses the quadratic time and memory bottleneck that limits standard attention on long sequences.

1. Mathematical Formulation and Theoretical Guarantees

Standard softmax attention computes

$\operatorname{Att}(Q, K, V) = \mathrm{softmax}(Q K^\top / \sqrt{d}) V$

where $Q, K, V \in \mathbb{R}^{N \times d}$ and $N$ is the sequence length. Runtime is $O(N^2 d)$ and memory is $O(N^2 + N d)$, because the $N \times N$ attention matrix is computed and stored explicitly.

Performer Attention exploits the observation that the softmax kernel $\exp(q^\top k)$ (with the $1/\sqrt{d}$ scaling absorbed into $q$ and $k$) is a positive definite kernel. FAVOR+ introduces a random feature map $\phi: \mathbb{R}^d \to \mathbb{R}^m$ so that

$\mathbb{E}_\omega[\phi(q)^\top \phi(k)] = \exp(q^\top k)$

In the Performer, attention is approximated as

$\operatorname{Att}(Q, K, V) \approx D^{-1} \; \phi(Q) (\phi(K)^\top V)$

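where $D = \operatorname{diag}(\phi(Q)(\phi(K)^\top \mathbf{1}_N))$ is a diagonal normalization matrix that preserves the row-sum property of softmax attention, and $\phi$ is applied row-wise. The components of $\phi$ in FAVOR+ leverage exponentials of orthogonal random projections:

$\phi(x) = \frac{\exp(-\|x\|^2 / 2)}{\sqrt{m}} \left[ \exp(\omega_1^\top x), \ldots, \exp(\omega_m^\top x) \right]^\top$

where $\omega_1, \ldots, \omega_m \in \mathbb{R}^d$ are Gaussian random vectors, drawn either independently or as an orthogonal ensemble.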
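The feature map can be checked numerically. The following is a minimal sketch, assuming i.i.d. Gaussian projections (the orthogonal variant is not shown) and a feature map of the form $\phi(x)_i = \exp(\omega_i^\top x - \|x\|^2/2)/\sqrt{m}$. The function names, the test vectors, and the choice $m = 20000$ are illustrative assumptions, and the standard library is used only to keep the example portable.

```python
import math
import random

def draw_projections(m, d, rng):
    """Draw m i.i.d. Gaussian projection vectors omega_i ~ N(0, I_d)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

def positive_features(x, omegas):
    """phi(x)_i = exp(omega_i . x - ||x||^2 / 2) / sqrt(m); every entry is positive."""
    m = len(omegas)
    half_sq_norm = 0.5 * sum(v * v for v in x)
    return [
        math.exp(sum(w * v for w, v in zip(omega, x)) - half_sq_norm) / math.sqrt(m)
        for omega in omegas
    ]

def estimate_kernel(q, k, omegas):
    """Monte Carlo estimate of exp(q . k) as the inner product phi(q) . phi(k)."""
    phi_q = positive_features(q, omegas)
    phi_k = positive_features(k, omegas)
    return sum(a * b for a, b in zip(phi_q, phi_k))

rng = random.Random(0)                      # fixed seed for reproducibility
q = [0.3, -0.2, 0.1, 0.4]                   # illustrative query (already rescaled)
k = [0.1, 0.5, -0.3, 0.2]                   # illustrative key
omegas = draw_projections(m=20000, d=4, rng=rng)

exact = math.exp(sum(a * b for a, b in zip(q, k)))
approx = estimate_kernel(q, k, omegas)
print(f"exact={exact:.4f}  estimate={approx:.4f}")
```

Because every feature is an exponential, the estimate is always positive, which is the property that keeps the normalizer $D$ well behaved.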

Key theorems provide:

Unbiasedness of the approximation.
Closed-form variance estimates for the positive random feature estimator and reductions from orthogonality.
Uniform convergence over bounded regions: for $m = \Theta(d \log d)$ random features, $\|\hat{A} - A\|_\infty \le \epsilon$ with high probability, where $A$ is the exact attention and $\hat{A}$ is the approximation (Choromanski et al., 2020).

This theoretical underpinning distinguishes Performers among linear-attention models.
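As a short worked check of the unbiasedness claim, assume Gaussian projections $\omega \sim \mathcal{N}(0, I_d)$ and a single positive feature of the form $\phi(x) = \exp(\omega^\top x - \|x\|^2 / 2)$; an $m$-feature map averages $m$ such terms, which leaves the expectation unchanged. Using the Gaussian moment generating function $\mathbb{E}_\omega[\exp(\omega^\top u)] = \exp(\|u\|^2 / 2)$:

$\mathbb{E}_\omega[\phi(q)\,\phi(k)] = \exp\left(-\tfrac{\|q\|^2 + \|k\|^2}{2}\right) \mathbb{E}_\omega[\exp(\omega^\top (q + k))] = \exp\left(-\tfrac{\|q\|^2 + \|k\|^2}{2}\right) \exp\left(\tfrac{\|q + k\|^2}{2}\right) = \exp(q^\top k)$

The last step uses $\|q + k\|^2 = \|q\|^2 + 2 q^\top k + \|k\|^2$.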

2. Algorithmic Implementation and Complexity

Performer Attention's algorithm circumvents the explicit construction of the $N \times N$ attention matrix, instead utilizing kernel feature algebra:

Compute random feature maps for all queries and keys: $\phi(Q), \phi(K) \in \mathbb{R}^{N \times m}$.
Compute the intermediate matrix $S = \phi(K)^\top V$ (shape $m \times d$).
Compute the output as $\phi(Q) S$ (shape $N \times d$).
Apply diagonal normalization to ensure each query attends to appropriately normalized values.
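The four steps above can be sketched as follows. This is an illustrative, non-optimized version in plain Python, assuming Gaussian positive random features; the helper names (`matmul`, `feature_map`, `quadratic_reference`) are hypothetical. It compares the linear-order computation against the same estimator evaluated through an explicit $N \times N$ matrix, which should agree up to floating-point error.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def feature_map(rows, omegas):
    """Row-wise positive random features; maps N x d to N x m."""
    m = len(omegas)
    out = []
    for x in rows:
        half_sq = 0.5 * sum(v * v for v in x)
        out.append([math.exp(sum(w * v for w, v in zip(om, x)) - half_sq) / math.sqrt(m)
                    for om in omegas])
    return out

def linear_attention(q, k, v, omegas):
    """D^{-1} phi(Q) (phi(K)^T V), never forming the N x N matrix."""
    q_f, k_f = feature_map(q, omegas), feature_map(k, omegas)        # step 1: N x m
    kv = matmul(transpose(k_f), v)                                    # step 2: m x d
    numer = matmul(q_f, kv)                                           # step 3: N x d
    k_sum = [sum(col) for col in zip(*k_f)]                           # phi(K)^T 1
    denom = [sum(a * b for a, b in zip(row, k_sum)) for row in q_f]   # diagonal of D
    return [[x / z for x in row] for row, z in zip(numer, denom)]     # step 4

def quadratic_reference(q, k, v, omegas):
    """Same estimator via the explicit N x N matrix phi(Q) phi(K)^T."""
    q_f, k_f = feature_map(q, omegas), feature_map(k, omegas)
    weights = matmul(q_f, transpose(k_f))
    weights = [[w / sum(row) for w in row] for row in weights]
    return matmul(weights, v)

def gaussian_matrix(rows, cols, scale, rng):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
N, d, m = 6, 4, 16                           # illustrative sizes
Q, K, V = (gaussian_matrix(N, d, 0.5, rng) for _ in range(3))
W = gaussian_matrix(m, d, 1.0, rng)          # projections omega_i ~ N(0, I_d)

fast = linear_attention(Q, K, V, W)
slow = quadratic_reference(Q, K, V, W)
max_gap = max(abs(x - y) for rf, rs in zip(fast, slow) for x, y in zip(rf, rs))
print(f"max difference between the two evaluation orders: {max_gap:.2e}")
```

The only change between the two functions is the order of multiplication, which is what moves the cost from quadratic to linear in $N$.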

For causal (autoregressive) attention, prefix sums are employed:

The prefix sums $S_i = \sum_{j \le i} \phi(k_j) v_j^\top$ enable $O(N m d)$ computation for sequential data.
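A minimal sketch of the causal variant follows, assuming the feature rows $\phi(q_i)$ and $\phi(k_i)$ have already been computed (random positive numbers stand in for them here). It keeps a running matrix `s` and a running normalizer `z`, both hypothetical names, and is checked against an explicit lower-triangular reference.

```python
import random

def causal_linear_attention(q_f, k_f, v):
    """Causal attention from feature rows phi(q_i), phi(k_i) via running prefix sums."""
    m, d = len(q_f[0]), len(v[0])
    s = [[0.0] * d for _ in range(m)]    # sum over j <= i of phi(k_j) v_j^T, shape m x d
    z = [0.0] * m                        # sum over j <= i of phi(k_j), length m
    out = []
    for q_i, k_i, v_i in zip(q_f, k_f, v):
        for a in range(m):               # rank-one update of both prefix sums
            z[a] += k_i[a]
            for b in range(d):
                s[a][b] += k_i[a] * v_i[b]
        numer = [sum(q_i[a] * s[a][b] for a in range(m)) for b in range(d)]
        denom = sum(q_i[a] * z[a] for a in range(m))
        out.append([x / denom for x in numer])
    return out

def causal_quadratic_reference(q_f, k_f, v):
    """Same quantity via explicit lower-triangular attention weights."""
    d = len(v[0])
    out = []
    for i, q_i in enumerate(q_f):
        w = [sum(a * b for a, b in zip(q_i, k_f[j])) for j in range(i + 1)]
        total = sum(w)
        out.append([sum(w[j] * v[j][b] for j in range(i + 1)) / total for b in range(d)])
    return out

rng = random.Random(0)
N, m, d = 7, 5, 3                        # illustrative sizes
q_f = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(N)]  # stand-in for phi(Q)
k_f = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(N)]  # stand-in for phi(K)
v = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(N)]

fast = causal_linear_attention(q_f, k_f, v)
slow = causal_quadratic_reference(q_f, k_f, v)
max_gap = max(abs(x - y) for rf, rs in zip(fast, slow) for x, y in zip(rf, rs))
```

Each token costs $O(m d)$ work against a fixed-size state, so the total is linear in the sequence length.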

The resulting time and memory complexity is $O(N m d)$ and $O(N m + N d + m d)$, respectively, where typically $m \ll N$, conferring linear scaling in $N$ (Choromanski et al., 2020).
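To make the scaling concrete, the leading-order terms can be compared for one set of sizes. The numbers below ($N = 8192$, $d = 64$, $m = 256$) are illustrative assumptions rather than values from the source, constant factors are ignored, and the exact-attention memory is taken as the $N \times N$ matrix plus the $N \times d$ inputs.

```python
def exact_attention_cost(n, d):
    """Leading-order cost of softmax attention, up to constant factors."""
    return {"time": n * n * d, "memory": n * n + n * d}

def performer_cost(n, d, m):
    """Leading-order cost of kernelized attention, up to constant factors."""
    return {"time": n * m * d, "memory": n * m + n * d + m * d}

n, d, m = 8192, 64, 256                  # illustrative sizes, not from the source
exact = exact_attention_cost(n, d)
fast = performer_cost(n, d, m)

time_ratio = exact["time"] / fast["time"]          # equals n / m
memory_ratio = exact["memory"] / fast["memory"]
print(f"time ratio: {time_ratio:.1f}, memory ratio: {memory_ratio:.1f}")
```

The time ratio is simply $N / m$, so the advantage grows with sequence length for a fixed number of random features.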

Performer Attention is fully compatible with the architecture and training routines of standard Transformers, supporting both bidirectional and causal variants.

3. Extensions: Relative Position Encoding and PermuteFormer

Classical relative-position encoding strategies inject biases or offsets into $Q K^\top$, incompatible with the kernel map approach since Performer operates purely on the level of transformed features rather than raw dot-products. PermuteFormer (Chen, 2021) achieves linear-time, pure relative-position encoding by parameterizing $\phi(q_i)$ and $\phi(k_j)$ with position-dependent operators:

$\tilde{\phi}(q_i) = r^{i} P^{i} \phi(q_i)$
$\tilde{\phi}(k_j) = r^{-j} P^{j} \phi(k_j)$

where $r$ is a trainable scalar and $P$ is a permutation matrix on the feature dimension. This construction ensures that after the inner product, the similarity depends only on the relative distance $i - j$, preserving shift-invariance, while maintaining $O(N)$ cost.
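The shift-invariance property can be illustrated with a small check. This sketch assumes the position-dependent features take the form $r^{i} P^{i} \phi(q_i)$ for queries and $r^{-j} P^{j} \phi(k_j)$ for keys, with the permutation stored as an index list. All names and values are hypothetical and the feature vectors are random stand-ins.

```python
import random

def permute(x, perm, times):
    """Apply the permutation matrix P to x `times` times, where (P x)[a] = x[perm[a]]."""
    for _ in range(times):
        x = [x[p] for p in perm]
    return x

def position_aware_query(phi_q, i, r, perm):
    """r^i P^i phi(q_i)"""
    return [r ** i * value for value in permute(phi_q, perm, i)]

def position_aware_key(phi_k, j, r, perm):
    """r^(-j) P^j phi(k_j)"""
    return [r ** (-j) * value for value in permute(phi_k, perm, j)]

def score(phi_q, phi_k, i, j, r, perm):
    """Inner product of the position-aware query and key features."""
    q_t = position_aware_query(phi_q, i, r, perm)
    k_t = position_aware_key(phi_k, j, r, perm)
    return sum(a * b for a, b in zip(q_t, k_t))

rng = random.Random(0)
m = 8                                    # illustrative feature dimension
perm = list(range(m))
rng.shuffle(perm)                        # a fixed random permutation P
phi_q = [rng.uniform(0.1, 1.0) for _ in range(m)]   # stand-in for phi(q)
phi_k = [rng.uniform(0.1, 1.0) for _ in range(m)]   # stand-in for phi(k)
r = 0.9

base = score(phi_q, phi_k, 5, 2, r, perm)       # positions (5, 2), offset 3
shifted = score(phi_q, phi_k, 12, 9, r, perm)   # positions (12, 9), offset 3
print(abs(base - shifted))
```

Because a permutation matrix is orthogonal, applying the same extra power of $P$ to both vectors leaves their inner product unchanged, and the scalar factors combine to $r^{i-j}$, so only the offset matters.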

Empirical evaluation demonstrates that PermuteFormer outperforms both Performer and vanilla Transformer on long-range tasks (e.g., Long-Range Arena: 65.6% accuracy for PermuteFormer vs. 64.9% for both Performer and Transformer; WikiText-103: PermuteFormer achieves 32.49 perplexity, closer to the Transformer's 30.18 than Performer’s 36.87), without additional computational overhead (Chen, 2021).

4. Memory Optimization: Sub-Linear Memory (SLiM)

Standard Performers require $O(N)$ memory, but SLiM (Likhosherstov et al., 2020) introduces a block-wise streaming regime that reduces memory to $O(C)$ for slice size $C$. During training or fine-tuning, the input sequence is divided into contiguous slices; only the states (prefix sums) at slice boundaries are stored and propagated.

Algorithmic highlights:

Forward pass streams over slices, carrying only current slice activations and prefix-sum states.
Backward pass recomputes local objectives per slice via autodifferentiation, accumulating gradients.
No approximation is introduced; the memory–parallelism trade-off can be tuned by $C$.

In the extreme, this allows $O(1)$ memory per attention layer (plus state), facilitating on-device and low-memory training or fine-tuning with no compromise in estimation accuracy (Likhosherstov et al., 2020).

Model Component	Serial Time	Memory
Softmax Attention	$O(N^2)$	$O(N^2)$
Performer (full prefix)	$O(N)$	$O(N)$
SLiM ($C$ tokens per slice)	$O(N)$	$O(C)$
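The slice-wise forward pass described in this section can be sketched as below. This is only an illustration of the streaming idea under stated assumptions, not the SLiM implementation: it covers the forward pass only, uses random positive stand-ins for the feature rows, and the names `process_slice` and `streamed_forward` are hypothetical. Outputs are collected in a list for comparison, whereas a memory-constrained setup would consume them slice by slice.

```python
import random

def process_slice(q_f, k_f, v, s, z):
    """Causal outputs for one slice given carried state (s, z); returns outputs and new state."""
    m, d = len(s), len(s[0])
    out = []
    for i, q_i in enumerate(q_f):
        # Contribution of all tokens before this slice, summarized by the state.
        numer = [sum(q_i[a] * s[a][b] for a in range(m)) for b in range(d)]
        denom = sum(q_i[a] * z[a] for a in range(m))
        # Contribution of tokens inside the slice, up to and including position i.
        for j in range(i + 1):
            w = sum(x * y for x, y in zip(q_i, k_f[j]))
            denom += w
            for b in range(d):
                numer[b] += w * v[j][b]
        out.append([x / denom for x in numer])
    # Fold the whole slice into the state carried to the next slice.
    new_s = [row[:] for row in s]
    new_z = z[:]
    for k_j, v_j in zip(k_f, v):
        for a in range(m):
            new_z[a] += k_j[a]
            for b in range(d):
                new_s[a][b] += k_j[a] * v_j[b]
    return out, new_s, new_z

def streamed_forward(q_f, k_f, v, slice_size):
    """Run causal attention slice by slice, keeping only the m x d state between slices."""
    m, d = len(q_f[0]), len(v[0])
    s = [[0.0] * d for _ in range(m)]
    z = [0.0] * m
    out = []
    for start in range(0, len(v), slice_size):
        stop = start + slice_size
        chunk, s, z = process_slice(q_f[start:stop], k_f[start:stop], v[start:stop], s, z)
        out.extend(chunk)
    return out

rng = random.Random(0)
N, m, d = 10, 4, 3                       # illustrative sizes
q_f = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(N)]  # stand-in for phi(Q)
k_f = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(N)]  # stand-in for phi(K)
v = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(N)]

full = streamed_forward(q_f, k_f, v, slice_size=N)     # one slice: no streaming
sliced = streamed_forward(q_f, k_f, v, slice_size=3)   # slices of 3, 3, 3, 1 tokens
max_gap = max(abs(x - y) for rf, rs in zip(full, sliced) for x, y in zip(rf, rs))
```

The slice size changes only how much is held in memory at once; the outputs agree up to floating-point rounding, which mirrors the claim that no approximation is introduced.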

5. Empirical Evaluation Across Modalities

Performer Attention demonstrates robust empirical performance:

Language modeling: On datasets such as LM1B and PG-19, Performers recover or match Transformer perplexities with less than 10% of the fine-tuning steps. On the Long-Range Arena, Performers attain top accuracy among linear-time approaches and near-Transformer-level results (Choromanski et al., 2020).
Vision: In pixel prediction/ImageNet64, unidirectional Performer with 6 layers matches a 12-layer Reformer, and 12-layer Performer matches a 24-layer Reformer, with a substantial speed improvement.
Biological sequences: On TrEMBL (L=1024), Performers achieve 36.1% bidirectional accuracy vs. 33.3% for Transformers; for L=8192, Performer retains 24% accuracy vs. Transformer’s drop to 19%.
Speech and spoken language identification: Performer-based models integrated into statistical pooling (Dhiman et al., 2025) yield accuracy and macro-F1 improvements over self-attention, e.g., on FLEURS (+18.2% accuracy), VoxLingua (+5.8%), VoxPopuli (+0.2%). Computational cost increases are modest relative to the gains.

A notable application in efficient end-to-end speech recognition demonstrates that replacing Transformer modules in Conformer architectures with Performers yields competitive performance on the LibriSpeech corpus with linear complexity and about 20% WER reduction over prior lightweight models (Wang et al., 2020).

6. Comparison, Limitations, and Future Directions

Advantages of Performer Attention include:

Provably unbiased or nearly unbiased softmax attention approximation with $O(N)$ runtime and memory in the sequence length.
Uniform error bounds independent of sequence length $N$.
Backwards compatibility: pre-trained Transformer weights can be fine-tuned via Performer layers.
Easily extensible: supports kernel variants, position-aware extensions, sublinear memory, and bidirectional and causal regimes.

Some limitations persist:

Required number of random features $m$ grows with the hidden dimension $d$ and the desired error tolerance, impacting constants for large $d$.
In rare cases, variance under certain tail draws of random features may necessitate redrawing or larger $m$.
Integration of standard relative position encodings or other dot-product-based attention tricks requires significant architectural modifications (Chen, 2021).

Ongoing research directions include adaptive or learned feature maps, hybrid sparse-kernel methods, quasi-Monte Carlo integration to reduce $m$, combination with memory-efficient architectures (e.g., reversible layers), and hardware-specific kernel optimization (Choromanski et al., 2020).

Performer Attention belongs to a class of linear-attention mechanisms that attempt to reduce the computational and memory overhead of the Transformer. Unlike sparse and low-rank models, Performers do not explicitly prune attention: instead, the random feature kernelization retains access to the full, global context. This property yields robust performance across modalities.

Other attention-based models, such as the Actor-Conditioned Attention Map (ACAM) model (Ulutan et al., 2018), use context- or instance-conditioned attention maps for fine-grained spatiotemporal relevance. They are architecturally distinct in purpose, focusing on attention in video action detection rather than computational scaling. Performer and its derivatives are instead directly concerned with faithfully approximating softmax attention or general kernels with a mathematically rigorous sketching approach that is scalable and flexible.

In summary, Performer Attention operates at the intersection of scalable kernel approximation, efficient Transformer architecture, and practical deployment for long-sequence and resource-constrained application domains. Its theoretical and empirical robustness underpins a significant class of state-of-the-art deep learning systems.