FAVOR+ Attention in Scalable Transformers
- FAVOR+ Attention is a scalable mechanism that approximates softmax attention by mapping queries and keys through positive orthogonal random features into a kernel feature space.
- It leverages Monte Carlo methods and orthogonal random feature maps to achieve unbiased, low-variance approximations, reducing time and space complexities from quadratic to linear.
- Empirical results in architectures like Performer and DF-Conformer demonstrate up to 8× speedup and effective modeling of long sequences in various tasks.
FAVOR+ Attention (Fast Attention Via positive Orthogonal Random Features) is a Monte Carlo kernel approximation scheme for scalable attention in neural sequence models, designed to approximate the softmax attention mechanism with provable accuracy while reducing time and space complexity from quadratic to linear in sequence length. This mechanism is central to Performer and subsequent architectures such as DF-Conformer, enabling practical modeling of long sequences in settings where explicit computation of the attention matrix is computationally prohibitive (Choromanski et al., 2020, Seki et al., 4 Nov 2025).
1. Motivation and Problem Setting
Traditional softmax attention in Transformers requires computation and storage of a full attention matrix $A \in \mathbb{R}^{L \times L}$ (where $L$ is the sequence length), resulting in $O(L^2)$ time and memory complexity. For tasks with long contexts (e.g., sequence lengths on the order of $10^4$ or more), these requirements are prohibitively expensive. To address this, FAVOR+ constructs an unbiased, low-variance approximation to the softmax attention kernel, reducing complexity without introducing sparsity or low-rank constraints (Choromanski et al., 2020).
2. Mathematical Foundation: Kernelization of Softmax Attention
The central observation is that softmax attention is kernelizable: the unnormalized attention matrix has entries $A_{ij} = K(q_i, k_j)$ with $K(x, y) = \exp(x^\top y)$ (queries and keys pre-scaled by $d^{-1/4}$ to absorb the usual $1/\sqrt{d}$ factor). For any positive-definite kernel expressible as an expectation over features, $K(x, y) = \mathbb{E}_{\omega}\left[\phi(x)^\top \phi(y)\right]$, the attention mechanism can be approximated in feature space. This approach enables reframing matrix products involving $A$ in linear rather than quadratic time, provided a suitable feature map $\phi$ can be constructed (Choromanski et al., 2020, Seki et al., 4 Nov 2025).
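As a sanity check, the underlying identity $\exp(x^\top y) = \mathbb{E}_{\omega \sim \mathcal{N}(0, I_d)}\left[e^{\omega^\top x - \|x\|^2/2}\, e^{\omega^\top y - \|y\|^2/2}\right]$ can be verified numerically. The NumPy sketch below uses illustrative dimensions and i.i.d. (not yet orthogonal) features; none of its sizes come from either paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 100_000              # input dim, number of Monte Carlo features

x = rng.normal(size=d) * 0.2   # modest norms keep the estimator's
y = rng.normal(size=d) * 0.2   # variance small for this demo

# Positive random features: phi(x)_i = exp(w_i.x - ||x||^2 / 2) / sqrt(m)
W = rng.normal(size=(m, d))    # w_i ~ N(0, I_d), drawn i.i.d. here
phi = lambda v: np.exp(W @ v - v @ v / 2) / np.sqrt(m)

exact = np.exp(x @ y)          # the softmax kernel K(x, y) = exp(x.y)
approx = phi(x) @ phi(y)       # unbiased Monte Carlo estimate

assert abs(approx - exact) / exact < 0.05
```

With enough features the estimate tracks the exact kernel closely; the sections below reduce the number of features needed via positivity and orthogonality.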
3. FAVOR+ Random Feature Maps
FAVOR+ uses positive orthogonal random features to approximate the exponential kernel. The construction proceeds as follows:
- For each input $x \in \mathbb{R}^d$ and a set of random vectors $\omega_1, \ldots, \omega_m \sim \mathcal{N}(0, I_d)$ (orthogonalized as below), define $$\phi(x) = \frac{\exp\left(-\|x\|^2/2\right)}{\sqrt{m}} \left(\exp(\omega_1^\top x), \ldots, \exp(\omega_m^\top x)\right)^\top,$$ which satisfies $\mathbb{E}\left[\phi(x)^\top \phi(y)\right] = \exp(x^\top y)$
- For reduced variance, a "hyp+" variant uses the pairs $(\omega_i, -\omega_i)$ and outputs $$\phi^{\mathrm{hyp+}}(x) = \frac{\exp\left(-\|x\|^2/2\right)}{\sqrt{2m}} \left(\exp(\omega_1^\top x), \exp(-\omega_1^\top x), \ldots, \exp(\omega_m^\top x), \exp(-\omega_m^\top x)\right)^\top$$
- Orthogonality of $\{\omega_i\}$ is achieved by sampling a Gaussian matrix, orthonormalizing its rows (Gram–Schmidt, Householder reflections, or Hadamard-based constructions), and rescaling each row so its norm follows the $\chi(d)$ distribution of a Gaussian vector
These features are positive-valued, preserve unbiasedness, and crucially reduce variance relative to i.i.d. sampled features (Choromanski et al., 2020, Seki et al., 4 Nov 2025).
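These two ingredients can be sketched compactly in NumPy: block-orthogonal Gaussian sampling with $\chi(d)$-renormalized rows, plus the paired "hyp+" map. Sizes are illustrative, not taken from either paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 8, 4096   # input dim, number of (pre-pairing) random features

def orthogonal_gaussian(m, d, rng):
    """m x d matrix: rows orthogonal within each block of d, rescaled to
    chi(d)-distributed norms so each row is marginally Gaussian-like."""
    blocks = []
    for _ in range(-(-m // d)):                        # ceil(m / d) blocks
        Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # orthonormal rows
        norms = np.sqrt(rng.chisquare(d, size=d))
        blocks.append(Q * norms[:, None])              # rescale each row
    return np.vstack(blocks)[:m]

def phi_hyp(v, W):
    """'hyp+' positive feature map: uses the pairs (w_i, -w_i)."""
    proj = W @ v
    return (np.exp(-v @ v / 2) / np.sqrt(2 * W.shape[0])
            * np.concatenate([np.exp(proj), np.exp(-proj)]))

W = orthogonal_gaussian(m, d, rng)
x = rng.normal(size=d) * 0.2
y = rng.normal(size=d) * 0.2

fx, fy = phi_hyp(x, W), phi_hyp(y, W)
assert np.all(fx > 0) and np.all(fy > 0)   # features are strictly positive
assert abs(fx @ fy - np.exp(x @ y)) / np.exp(x @ y) < 0.05
```

Positivity matters downstream: it guarantees the approximate attention weights and row normalizers stay non-negative, avoiding the instabilities of trigonometric random features.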
4. Linear-Time Attention via FAVOR+
Given matrices $Q, K, V \in \mathbb{R}^{L \times d}$, the transformation proceeds as follows:
- Compute random-feature projections $Q' = \phi(Q)$ and $K' = \phi(K)$ (applied row-wise), with $Q', K' \in \mathbb{R}^{L \times m}$
- Approximate the attention output: $$\widehat{\mathrm{Att}}(Q, K, V) = \hat{D}^{-1}\left(Q'\left((K')^\top V\right)\right),$$
where $\hat{D} = \mathrm{diag}\left(Q'\left((K')^\top \mathbf{1}_L\right)\right)$.
Time and space complexity per block become $O(Lmd)$ and $O(Lm + Ld + md)$, with $m$ typically $256$–$1024$, representing a substantial reduction compared to the $O(L^2 d)$ time and $O(L^2 + Ld)$ space baseline (Choromanski et al., 2020, Seki et al., 4 Nov 2025).
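The factorization above fits in a few lines of NumPy. The following sketch (illustrative sizes, i.i.d. features for brevity) compares the linear-time estimate against exact softmax attention without ever materializing the $L \times L$ matrix on the FAVOR+ path:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, m = 128, 16, 4096    # sequence length, head dim, random features

Q = rng.normal(size=(L, d)) * 0.25   # modest norms keep the estimator's
K = rng.normal(size=(L, d)) * 0.25   # variance small for this demo
V = rng.normal(size=(L, d))

W = rng.normal(size=(m, d))          # random projections (i.i.d. for brevity)
def phi(X):                          # row-wise positive feature map, L x m
    return np.exp(X @ W.T - np.sum(X**2, axis=-1, keepdims=True) / 2) / np.sqrt(m)

Qp, Kp = phi(Q), phi(K)
num = Qp @ (Kp.T @ V)                # Q'((K')^T V): O(Lmd), never L x L
den = Qp @ Kp.sum(axis=0)            # Q'((K')^T 1_L): the row normalizers
favor_out = num / den[:, None]

A = np.exp(Q @ K.T)                  # exact (unnormalized) attention matrix
exact_out = (A / A.sum(axis=1, keepdims=True)) @ V

assert np.max(np.abs(favor_out - exact_out)) < 0.15
```

The key design choice is the association order: computing $(K')^\top V$ first yields an $m \times d$ summary of the whole sequence, which every query then reads in $O(md)$ time.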
5. Theoretical Guarantees
FAVOR+ is a provably unbiased estimator of the softmax kernel: $\mathbb{E}\left[\phi(q)^\top \phi(k)\right] = \exp(q^\top k)$. Variance bounds are established, with the mean-squared error of the approximation decreasing as $O(1/m)$. Orthogonal features further reduce variance, with $$\mathrm{MSE}\left(\widehat{\mathrm{SM}}_m^{\mathrm{ort}}(q, k)\right) \leq \mathrm{MSE}\left(\widehat{\mathrm{SM}}_m^{\mathrm{iid}}(q, k)\right)$$ for $m \leq d$ (where exact orthogonality is possible). Uniform convergence guarantees ensure that, for sufficiently large $m = \Theta(d \log d)$, the attention-matrix approximation satisfies $\|\hat{A} - A\|_\infty \leq \epsilon$ with high probability (Choromanski et al., 2020).
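The $O(1/m)$ rate is straightforward to check empirically. The small Monte Carlo experiment below (illustrative, with i.i.d. features) estimates the MSE of the kernel estimator at two feature counts and confirms the roughly fourfold drop from $m=8$ to $m=32$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 8, 20_000
x = rng.normal(size=d) * 0.25
y = rng.normal(size=d) * 0.25
exact = np.exp(x @ y)            # target kernel value

def mse(m):
    """Empirical MSE of the m-feature positive random-feature estimator."""
    W = rng.normal(size=(trials, m, d))           # all trials at once
    fx = np.exp(W @ x - x @ x / 2) / np.sqrt(m)   # trials x m
    fy = np.exp(W @ y - y @ y / 2) / np.sqrt(m)
    est = np.sum(fx * fy, axis=1)                 # one estimate per trial
    return np.mean((est - exact) ** 2)

m8, m32 = mse(8), mse(32)
assert m32 < m8                   # more features => lower error
assert abs(m8 / m32 - 4.0) < 1.0  # MSE scales as ~1/m (exactly Var/m here)
```

Because the i.i.d. estimator is unbiased, its MSE equals the single-feature variance divided by $m$, so the fourfold ratio is exact in expectation; orthogonal sampling shifts both curves downward.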
6. Implementation in Performer and DF-Conformer Architectures
FAVOR+ first appears as the core attention mechanism in Performer (Choromanski et al., 2020), replacing only the self-attention block and maintaining compatibility with the rest of the Transformer architecture (residuals, layer normalization, feed-forward network). In DF-Conformer (Seki et al., 4 Nov 2025), FAVOR+ is embedded in each block's attention sublayer as follows:
- For $H$ heads and per-head dimension $d_h = d/H$, apply the random-feature projection $\phi$ to the per-head queries and keys
- Compute random-feature matrices $Q' = \phi(Q)$, $K' = \phi(K)$
- Linearly aggregate values through $(K')^\top V$, then combine using $Q'\left((K')^\top V\right)$
- Apply per-row normalization with $\hat{D}^{-1} = \mathrm{diag}\left(Q'\left((K')^\top \mathbf{1}_L\right)\right)^{-1}$
- FAVOR+ is often sandwiched between local convolutional modules (e.g., depthwise dilated convolution) for joint modeling of local and global dependencies
Typical hyperparameters include the number of heads $H$, the per-head dimension $d_h$, a per-head random-feature count $m$ of up to $256$, and rotary embeddings applied prior to projection. All operations are amenable to efficient GPU/TPU acceleration due to batching and matrix multiplications (Choromanski et al., 2020, Seki et al., 4 Nov 2025).
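Putting the per-head steps together, a minimal per-block sketch might look as follows. Sizes are hypothetical, and the learned query/key/value projections and the surrounding convolutional modules are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
L, H, dh, m = 64, 4, 16, 256   # illustrative sizes, not the papers' configs

def favor_head(Q, K, V, W):
    """Single-head FAVOR+ with positive random features W (m x dh)."""
    def phi(X):
        return (np.exp(X @ W.T - np.sum(X**2, axis=-1, keepdims=True) / 2)
                / np.sqrt(W.shape[0]))
    Qp, Kp = phi(Q), phi(K)
    num = Qp @ (Kp.T @ V)            # value aggregation: Q'((K')^T V)
    den = Qp @ Kp.sum(axis=0)        # row normalizer: Q'((K')^T 1_L)
    return num / den[:, None]

X = rng.normal(size=(L, H * dh)) * 0.2   # block input (after local conv)
heads = []
for h in range(H):                        # split channels into H heads
    sl = slice(h * dh, (h + 1) * dh)
    W = rng.normal(size=(m, dh))          # per-head random projections
    heads.append(favor_head(X[:, sl], X[:, sl], X[:, sl], W))
out = np.concatenate(heads, axis=1)       # L x (H*dh), ready for output proj

assert out.shape == (L, H * dh) and np.all(np.isfinite(out))
```

Each head carries its own projection matrix $W$, so the per-block cost is $O(L m d)$ summed over heads, preserving the linear scaling in $L$.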
7. Empirical Behavior, Limitations, and Successors
FAVOR+ enables scaling attention models to sequence lengths at which baseline Transformers are infeasible due to memory exhaustion. Empirical studies demonstrate linear scaling of compute and memory, up to $8\times$ speedup in large-$L$ regimes, and retention of modeling accuracy on diverse tasks—image generation, text modeling, and protein sequence modeling. In domains such as speech enhancement (Genhancer), FAVOR+ underpins efficient variants like DF-Conformer, though recent research highlights potential gains from replacing FAVOR+ with state-space sequence models for global dependency modeling while retaining linear complexity (Seki et al., 4 Nov 2025).
A salient aspect is the trade-off between unbiased approximation and approximation error: while FAVOR+ reduces computational burden and preserves full-rank attention properties, a finite number of random features can introduce deviations from softmax attention. Fine-tuning the number of random features and redrawing projection matrices can mitigate some limitations. The integration with local convolutions is designed to maintain strong locality without loss of global context (Choromanski et al., 2020, Seki et al., 4 Nov 2025).
References
- "Rethinking Attention with Performers" (Choromanski et al., 2020)
- "Improving DF-Conformer Using Hydra For High-Fidelity Generative Speech Enhancement on Discrete Codec Token" (Seki et al., 4 Nov 2025)