
FAVOR+ Attention in Scalable Transformers

Updated 5 January 2026
  • FAVOR+ Attention is a scalable mechanism that approximates softmax attention using positive orthogonal random features to transform kernelized queries and keys.
  • It leverages Monte Carlo methods and orthogonal random feature maps to achieve unbiased, low-variance approximations, reducing time and space complexities from quadratic to linear.
  • Empirical results in architectures like Performer and DF-Conformer demonstrate up to 8× speedup and effective modeling of long sequences in various tasks.

FAVOR+ Attention (Fast Attention Via positive Orthogonal Random Features) is a Monte Carlo kernel approximation scheme for scalable attention in neural sequence models, designed to approximate the softmax attention mechanism with provable accuracy while reducing time and space complexity from quadratic to linear in sequence length. This mechanism is central to Performer and subsequent architectures such as DF-Conformer, enabling practical modeling of long sequences in settings where explicit computation of the attention matrix is computationally prohibitive (Choromanski et al., 2020, Seki et al., 4 Nov 2025).

1. Motivation and Problem Setting

Traditional softmax attention in Transformers requires computation and storage of a full $L \times L$ attention matrix (where $L$ is the sequence length), resulting in $O(L^2 d)$ time and $O(L^2)$ memory complexity. For tasks with long contexts (e.g., sequence lengths $L \gg 10^3$), these requirements are prohibitively expensive. To address this, FAVOR+ constructs an unbiased, low-variance approximation to the softmax attention kernel, reducing complexity without introducing sparsity or low-rank constraints (Choromanski et al., 2020).

2. Mathematical Foundation: Kernelization of Softmax Attention

The central observation is that softmax attention is kernelizable: $A_{ij} = \exp\left( \frac{q_i \cdot k_j}{\sqrt{d}} \right) = K(q_i, k_j)$ with $K(x, y) = \exp(x^\top y)$, where the $1/\sqrt{d}$ scaling is absorbed by rescaling queries and keys by $d^{-1/4}$. For any positive-definite kernel $K(x, y)$ expressible as an expectation over features, $K(x, y) = \mathbb{E}_\omega[\phi(x; \omega)\phi(y; \omega)]$, the attention mechanism can be approximated in feature space. This reframing allows matrix products involving $A$ to be computed in linear rather than quadratic time, provided a suitable $\phi$ can be constructed (Choromanski et al., 2020, Seki et al., 4 Nov 2025).
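As a quick sanity check of the kernel view, the following sketch (illustrative, not code from either paper) verifies numerically that row-normalizing $\exp(x^\top y)$ over queries and keys rescaled by $d^{-1/4}$ reproduces standard softmax attention weights:

```python
# Numerical check that softmax attention equals the row-normalized kernel form,
# with the 1/sqrt(d) scaling absorbed into the queries and keys.
import numpy as np

rng = np.random.default_rng(0)
L, d = 5, 4
Q = rng.normal(size=(L, d))
K = rng.normal(size=(L, d))

# Standard softmax attention weights.
logits = Q @ K.T / np.sqrt(d)
softmax_A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Kernel view: rescale q, k by d^{-1/4}, apply K(x, y) = exp(x^T y), row-normalize.
Qs, Ks = Q / d**0.25, K / d**0.25
kernel_A = np.exp(Qs @ Ks.T)
kernel_A /= kernel_A.sum(axis=1, keepdims=True)

assert np.allclose(softmax_A, kernel_A)
```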

3. FAVOR+ Random Feature Maps

FAVOR+ uses positive orthogonal random features to approximate the exponential kernel. The construction proceeds as follows:

  • For each input $x \in \mathbb{R}^d$ and a set of $m$ orthogonal random vectors $\omega_1, \dots, \omega_m$, define

$\phi_{\omega_i}^{+}(x) = \exp(\omega_i^\top x - \|x\|^2/2), \quad i = 1, \dots, m$

  • For reduced variance, a "hyp+" variant uses pairs $(\omega_i, -\omega_i)$ and outputs

$\phi^{\mathrm{hyp+}}_{\omega_i}(x) = \frac{1}{\sqrt{2}} \left( \exp(\omega_i^\top x - \|x\|^2/2),\; \exp(-\omega_i^\top x - \|x\|^2/2) \right)$

  • Orthogonality of the $\omega_i$ is achieved by sampling a Gaussian matrix and orthonormalizing its rows (via Gram–Schmidt, Householder reflections, or Hadamard-based constructions).

These features are positive-valued, preserve unbiasedness, and crucially reduce variance relative to i.i.d. sampled features (Choromanski et al., 2020, Seki et al., 4 Nov 2025).

4. Linear-Time Attention via FAVOR+

Given matrices $Q, K, V \in \mathbb{R}^{L \times d}$, the transformation proceeds as follows:

  1. Compute random-feature projections:

$Q' = \phi(Q) \in \mathbb{R}^{L \times r}, \qquad K' = \phi(K) \in \mathbb{R}^{L \times r}$

  2. Approximate the attention output:

$\mathrm{approxAtt}(Q, K, V) = D^{-1} Q' (K'^\top V)$

where $D = \mathrm{diag}(Q' (K'^\top \mathbf{1}_L))$

Time and space complexity per block become $O(Lrd)$ and $O(Lr)$, with $r$ typically $256$–$1024$, a substantial reduction compared to the $O(L^2 d)$ and $O(L^2)$ baseline (Choromanski et al., 2020, Seki et al., 4 Nov 2025).
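The two steps above can be sketched as follows (an illustrative implementation, using i.i.d. rather than orthogonal features for brevity). The key point is that $K'^\top V$ is contracted first, so no $L \times L$ matrix is ever formed:

```python
# Linear-time FAVOR+ attention sketch: project Q, K through the positive
# feature map, then contract K'^T V before multiplying by Q'.
import numpy as np

def phi(X, W):
    """Row-wise positive features: exp(X W^T - ||x||^2 / 2) / sqrt(r)."""
    r = W.shape[0]
    sq = np.sum(X * X, axis=1, keepdims=True) / 2.0
    return np.exp(X @ W.T - sq) / np.sqrt(r)

def favor_attention(Q, K, V, W):
    d = Q.shape[1]
    Qp = phi(Q / d**0.25, W)      # absorb 1/sqrt(d) scaling into q, k
    Kp = phi(K / d**0.25, W)
    KV = Kp.T @ V                 # r x d contraction -- O(L r d)
    D = Qp @ Kp.sum(axis=0)       # row normalizers, Q'(K'^T 1_L)
    return (Qp @ KV) / D[:, None] # O(L r d) time, O(L r) memory overall

rng = np.random.default_rng(1)
L, d, r = 64, 16, 2048
Q, K, V = [0.5 * rng.normal(size=(L, d)) for _ in range(3)]
W = rng.normal(size=(r, d))       # i.i.d. features here; orthogonal in practice

# Compare against exact softmax attention.
A = np.exp(Q @ K.T / np.sqrt(d))
A /= A.sum(axis=1, keepdims=True)
exact = A @ V

approx = favor_attention(Q, K, V, W)
err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
assert err < 0.2  # close to exact attention for moderate r
```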

5. Theoretical Guarantees

FAVOR+ is a provably unbiased estimator: $\mathbb{E}_{\omega}[\phi_\omega(x)\phi_\omega(y)] = \exp(x^\top y)$. Variance bounds are established, with the mean-squared error of the approximation decreasing as $O(1/m)$. Orthogonal features further reduce variance, with

$\mathrm{Var}_{\mathrm{ORF+}} \leq \mathrm{Var}_{\mathrm{PRF+}} - \Delta(d, m)$

Uniform convergence guarantees ensure that, for sufficiently large $m = O\left(\frac{d}{\varepsilon^2} \log\left(\frac{\sigma \cdot \mathrm{diam}}{\varepsilon}\right)\right)$, the attention matrix approximation satisfies $\max_{i,j} |\hat{A}_{ij} - A_{ij}| \leq \varepsilon$ with high probability (Choromanski et al., 2020).
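The $O(1/m)$ mean-squared-error behavior can be checked empirically. The following sketch (assumed test vectors and i.i.d. features, not the paper's experiment) estimates the MSE of the kernel estimator at two feature counts and confirms that quadrupling $m$ cuts the MSE substantially:

```python
# Empirical check of the O(1/m) MSE behavior of the positive-feature estimator.
import numpy as np

def mse_of_estimator(m, x, y, trials, rng):
    """Average squared error of the m-feature estimate of exp(x^T y)."""
    true = np.exp(x @ y)
    errs = []
    for _ in range(trials):
        W = rng.normal(size=(m, len(x)))
        fx = np.exp(W @ x - x @ x / 2.0) / np.sqrt(m)
        fy = np.exp(W @ y - y @ y / 2.0) / np.sqrt(m)
        errs.append((fx @ fy - true) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(0)
d = 8
x, y = 0.3 * rng.normal(size=d), 0.3 * rng.normal(size=d)

mse_small = mse_of_estimator(64, x, y, 500, rng)
mse_large = mse_of_estimator(256, x, y, 500, rng)

# Quadrupling m should cut the MSE by roughly 4x; require at least 2x.
assert mse_large < mse_small / 2
```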

6. Implementation in Performer and DF-Conformer Architectures

FAVOR+ first appears as the core attention mechanism in Performer (Choromanski et al., 2020), replacing only the self-attention block and maintaining compatibility with the rest of the Transformer architecture (residuals, layer normalization, feed-forward network). In DF-Conformer (Seki et al., 4 Nov 2025), FAVOR+ is embedded in each block's attention sublayer as follows:

  • For $h$ heads and per-head dimension $d_h = d/h$, apply a random-feature projection $W_i \in \mathbb{R}^{r \times d_h}$
  • Compute random-feature matrices $\Phi_{Q_i}$, $\Phi_{K_i}$
  • Linearly aggregate values through $\Phi_{K_i}^\top V_i$, then combine using $\Phi_{Q_i}$
  • Per-row normalization with $D_{i,t} = \sum_{j=1}^{T} [\Phi_{Q_i}]_{t:} [\Phi_{K_i}]_{j:}^\top$
  • FAVOR+ is often sandwiched between local convolutional modules (e.g., depthwise dilated convolution) for joint modeling of local and global dependencies

Typical hyperparameters include $d \in \{256, 512\}$, $h \in \{4, 8\}$, per-head random features $r \approx 128$–$256$, and rotary embeddings applied prior to projection. All operations are amenable to efficient GPU/TPU acceleration because they reduce to batched matrix multiplications (Choromanski et al., 2020, Seki et al., 4 Nov 2025).
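The per-head computation above can be sketched as follows. Shapes, hyperparameters, and the plain i.i.d. projections are illustrative assumptions (rotary embeddings and convolutional modules are omitted), not the papers' code:

```python
# Multi-head FAVOR+ sketch following the per-head steps above.
import numpy as np

def phi(X, W):
    """Positive random features applied row-wise."""
    sq = np.sum(X * X, axis=1, keepdims=True) / 2.0
    return np.exp(X @ W.T - sq) / np.sqrt(W.shape[0])

def multihead_favor(Q, K, V, Ws):
    """Q, K, V: (T, d); Ws: list of h per-head projections, each (r, d_h)."""
    T, d = Q.shape
    h = len(Ws)
    dh = d // h
    outs = []
    for i, Wi in enumerate(Ws):
        sl = slice(i * dh, (i + 1) * dh)
        Qi, Ki, Vi = Q[:, sl] / dh**0.25, K[:, sl] / dh**0.25, V[:, sl]
        PQ, PK = phi(Qi, Wi), phi(Ki, Wi)       # Phi_{Q_i}, Phi_{K_i}
        D = PQ @ PK.sum(axis=0)                 # D_{i,t} = sum_j <PhiQ_t, PhiK_j>
        outs.append((PQ @ (PK.T @ Vi)) / D[:, None])
    return np.concatenate(outs, axis=1)         # concatenate heads -> (T, d)

rng = np.random.default_rng(2)
T, d, h, r = 32, 16, 4, 128
Q, K, V = [0.5 * rng.normal(size=(T, d)) for _ in range(3)]
Ws = [rng.normal(size=(r, d // h)) for _ in range(h)]
out = multihead_favor(Q, K, V, Ws)
assert out.shape == (T, d)
```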

7. Empirical Behavior, Limitations, and Successors

FAVOR+ enables scaling attention models to sequence lengths of $8\mathrm{k}$–$32\mathrm{k}$, where baseline Transformers are infeasible due to memory exhaustion. Empirical studies demonstrate linear scaling of compute and memory, up to $8\times$ speedup in large-$L$ regimes, and retention of modeling accuracy on diverse tasks, including image generation, text modeling, and protein sequence modeling. In domains such as speech enhancement (Genhancer), FAVOR+ underpins efficient variants like DF-Conformer, though recent research highlights potential gains from replacing FAVOR+ with state-space sequence models for global dependency modeling while retaining linear complexity (Seki et al., 4 Nov 2025).

A salient aspect is the trade-off between computational savings and approximation error: while FAVOR+ reduces computational burden and preserves full-rank attention properties, a finite number of random features introduces deviations from exact softmax attention. Tuning the number of random features and periodically redrawing projection matrices can mitigate some of these limitations. The integration with local convolutions is designed to maintain strong locality without loss of global context (Choromanski et al., 2020, Seki et al., 4 Nov 2025).

