
FAVOR+ Attention in Scalable Transformers

Updated 5 January 2026
  • FAVOR+ Attention is a scalable mechanism that approximates softmax attention using positive orthogonal random features to transform kernelized queries and keys.
  • It leverages Monte Carlo methods and orthogonal random feature maps to achieve unbiased, low-variance approximations, reducing time and space complexities from quadratic to linear.
  • Empirical results in architectures like Performer and DF-Conformer demonstrate up to 8× speedup and effective modeling of long sequences in various tasks.

FAVOR+ Attention (Fast Attention Via positive Orthogonal Random Features) is a Monte Carlo kernel approximation scheme for scalable attention in neural sequence models, designed to approximate the softmax attention mechanism with provable accuracy while reducing time and space complexity from quadratic to linear in sequence length. This mechanism is central to Performer and subsequent architectures such as DF-Conformer, enabling practical modeling of long sequences in settings where explicit computation of the attention matrix is computationally prohibitive (Choromanski et al., 2020, Seki et al., 4 Nov 2025).

1. Motivation and Problem Setting

Traditional softmax attention in Transformers requires computation and storage of a full $L \times L$ attention matrix (where $L$ is the sequence length), resulting in $O(L^2 d)$ time and $O(L^2)$ memory complexity. For tasks with long contexts (e.g., sequence lengths $L \gg 10^3$), these requirements are prohibitively expensive. To address this, FAVOR+ constructs an unbiased, low-variance approximation to the softmax attention kernel, reducing complexity without introducing sparsity or low-rank constraints (Choromanski et al., 2020).

2. Mathematical Foundation: Kernelization of Softmax Attention

The central observation is that softmax attention is kernelizable: $A_{ij} = \exp\left( \frac{q_i \cdot k_j}{\sqrt{d}} \right) = K(q_i, k_j)$ with $K(x, y) = \exp(x^\top y)$, where the $1/\sqrt{d}$ scaling is absorbed by rescaling queries and keys by $d^{-1/4}$. For any positive-definite kernel $K(x, y)$ expressible as an expectation over features, $K(x, y) = \mathbb{E}_\omega[\phi(x; \omega)\phi(y; \omega)]$, the attention mechanism can be approximated in feature space. This reframing allows matrix products involving $A$ to be computed in linear rather than quadratic time, provided a suitable $\phi$ can be constructed (Choromanski et al., 2020, Seki et al., 4 Nov 2025).
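As a quick sanity check of the kernel view, the following sketch (illustrative, not code from either paper) verifies numerically that row-normalizing $\exp(x^\top y)$ over queries and keys rescaled by $d^{-1/4}$ reproduces standard softmax attention weights:

```python
# Numerical check that softmax attention equals the row-normalized kernel form,
# with the 1/sqrt(d) scaling absorbed into the queries and keys.
import numpy as np

rng = np.random.default_rng(0)
L, d = 5, 4
Q = rng.normal(size=(L, d))
K = rng.normal(size=(L, d))

# Standard softmax attention weights.
logits = Q @ K.T / np.sqrt(d)
softmax_A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Kernel view: rescale q, k by d^{-1/4}, apply K(x, y) = exp(x^T y), row-normalize.
Qs, Ks = Q / d**0.25, K / d**0.25
kernel_A = np.exp(Qs @ Ks.T)
kernel_A /= kernel_A.sum(axis=1, keepdims=True)

assert np.allclose(softmax_A, kernel_A)
```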

3. FAVOR+ Random Feature Maps

FAVOR+ uses positive orthogonal random features to approximate the exponential kernel. The construction proceeds as follows:

  • For each input $x \in \mathbb{R}^d$ and a set of $m$ orthogonal random vectors $\omega_1, \dots, \omega_m$, define

$\phi_{\omega_i}^{+}(x) = \exp(\omega_i^\top x - \|x\|^2/2), \quad i = 1, \dots, m$

  • For reduced variance, a "hyp+" variant uses pairs $(\omega_i, -\omega_i)$ and outputs

$\phi^{\mathrm{hyp+}}_{\omega_i}(x) = \frac{1}{\sqrt{2}} \left( \exp(\omega_i^\top x - \|x\|^2/2),\; \exp(-\omega_i^\top x - \|x\|^2/2) \right)$

  • Orthogonality of the $\omega_i$ is achieved by sampling a Gaussian matrix and orthonormalizing its rows (via Gram–Schmidt, Householder reflections, or Hadamard-based constructions).

These features are positive-valued, preserve unbiasedness, and crucially reduce variance relative to i.i.d. sampled features (Choromanski et al., 2020, Seki et al., 4 Nov 2025).

4. Linear-Time Attention via FAVOR+

Given matrices $Q, K, V \in \mathbb{R}^{L \times d}$, the transformation proceeds as follows:

  1. Compute random-feature projections:

$Q' = \phi(Q) \in \mathbb{R}^{L \times r}, \qquad K' = \phi(K) \in \mathbb{R}^{L \times r}$

  2. Approximate the attention output:

$\mathrm{approxAtt}(Q, K, V) = D^{-1} Q' (K'^\top V)$

where $D = \mathrm{diag}(Q' (K'^\top \mathbf{1}_L))$

Time and space complexity per block become $O(Lrd)$ and $O(Lr)$, with $r$ typically $256$–$1024$, a substantial reduction compared to the $O(L^2 d)$ and $O(L^2)$ baseline (Choromanski et al., 2020, Seki et al., 4 Nov 2025).
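The two steps above can be sketched as follows (an illustrative implementation, using i.i.d. rather than orthogonal features for brevity). The key point is that $K'^\top V$ is contracted first, so no $L \times L$ matrix is ever formed:

```python
# Linear-time FAVOR+ attention sketch: project Q, K through the positive
# feature map, then contract K'^T V before multiplying by Q'.
import numpy as np

def phi(X, W):
    """Row-wise positive features: exp(X W^T - ||x||^2 / 2) / sqrt(r)."""
    r = W.shape[0]
    sq = np.sum(X * X, axis=1, keepdims=True) / 2.0
    return np.exp(X @ W.T - sq) / np.sqrt(r)

def favor_attention(Q, K, V, W):
    d = Q.shape[1]
    Qp = phi(Q / d**0.25, W)      # absorb 1/sqrt(d) scaling into q, k
    Kp = phi(K / d**0.25, W)
    KV = Kp.T @ V                 # r x d contraction -- O(L r d)
    D = Qp @ Kp.sum(axis=0)       # row normalizers, Q'(K'^T 1_L)
    return (Qp @ KV) / D[:, None] # O(L r d) time, O(L r) memory overall

rng = np.random.default_rng(1)
L, d, r = 64, 16, 2048
Q, K, V = [0.5 * rng.normal(size=(L, d)) for _ in range(3)]
W = rng.normal(size=(r, d))       # i.i.d. features here; orthogonal in practice

# Compare against exact softmax attention.
A = np.exp(Q @ K.T / np.sqrt(d))
A /= A.sum(axis=1, keepdims=True)
exact = A @ V

approx = favor_attention(Q, K, V, W)
err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
assert err < 0.2  # close to exact attention for moderate r
```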

5. Theoretical Guarantees

FAVOR+ is a provably unbiased estimator: $\mathbb{E}_{\omega}[\phi_\omega(x)\phi_\omega(y)] = \exp(x^\top y)$. Variance bounds are established, with the mean-squared error of the approximation decreasing as $O(1/m)$. Orthogonal features further reduce variance, with

$\mathrm{Var}_{\mathrm{ORF+}} \leq \mathrm{Var}_{\mathrm{PRF+}} - \Delta(d, m)$

Uniform convergence guarantees ensure that, for sufficiently large $m = O\left(\frac{d}{\varepsilon^2} \log\left(\frac{\sigma \cdot \mathrm{diam}}{\varepsilon}\right)\right)$, the attention matrix approximation satisfies $\max_{i,j} |\hat{A}_{ij} - A_{ij}| \leq \varepsilon$ with high probability (Choromanski et al., 2020).
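The $O(1/m)$ mean-squared-error behavior can be checked empirically. The following sketch (assumed test vectors and i.i.d. features, not the paper's experiment) estimates the MSE of the kernel estimator at two feature counts and confirms that quadrupling $m$ cuts the MSE substantially:

```python
# Empirical check of the O(1/m) MSE behavior of the positive-feature estimator.
import numpy as np

def mse_of_estimator(m, x, y, trials, rng):
    """Average squared error of the m-feature estimate of exp(x^T y)."""
    true = np.exp(x @ y)
    errs = []
    for _ in range(trials):
        W = rng.normal(size=(m, len(x)))
        fx = np.exp(W @ x - x @ x / 2.0) / np.sqrt(m)
        fy = np.exp(W @ y - y @ y / 2.0) / np.sqrt(m)
        errs.append((fx @ fy - true) ** 2)
    return np.mean(errs)

rng = np.random.default_rng(0)
d = 8
x, y = 0.3 * rng.normal(size=d), 0.3 * rng.normal(size=d)

mse_small = mse_of_estimator(64, x, y, 500, rng)
mse_large = mse_of_estimator(256, x, y, 500, rng)

# Quadrupling m should cut the MSE by roughly 4x; require at least 2x.
assert mse_large < mse_small / 2
```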

6. Implementation in Performer and DF-Conformer Architectures

FAVOR+ first appears as the core attention mechanism in Performer (Choromanski et al., 2020), replacing only the self-attention block and maintaining compatibility with the rest of the Transformer architecture (residuals, layer normalization, feed-forward network). In DF-Conformer (Seki et al., 4 Nov 2025), FAVOR+ is embedded in each block's attention sublayer as follows:

  • For $h$ heads and per-head dimension $d_h = d/h$, apply a random-feature projection $W_i \in \mathbb{R}^{r \times d_h}$
  • Compute random-feature matrices $\Phi_{Q_i}$, $\Phi_{K_i}$
  • Linearly aggregate values through $\Phi_{K_i}^\top V_i$, then combine using $\Phi_{Q_i}$
  • Per-row normalization with $D_{i,t} = \sum_{j=1}^{T} [\Phi_{Q_i}]_{t:} [\Phi_{K_i}]_{j:}^\top$
  • FAVOR+ is often sandwiched between local convolutional modules (e.g., depthwise dilated convolution) for joint modeling of local and global dependencies

Typical hyperparameters include $d \in \{256, 512\}$, $h \in \{4, 8\}$, per-head random features $r \approx 128$–$256$, and rotary embeddings applied prior to projection. All operations are amenable to efficient GPU/TPU acceleration because they reduce to batched matrix multiplications (Choromanski et al., 2020, Seki et al., 4 Nov 2025).
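The per-head computation above can be sketched as follows. Shapes, hyperparameters, and the plain i.i.d. projections are illustrative assumptions (rotary embeddings and convolutional modules are omitted), not the papers' code:

```python
# Multi-head FAVOR+ sketch following the per-head steps above.
import numpy as np

def phi(X, W):
    """Positive random features applied row-wise."""
    sq = np.sum(X * X, axis=1, keepdims=True) / 2.0
    return np.exp(X @ W.T - sq) / np.sqrt(W.shape[0])

def multihead_favor(Q, K, V, Ws):
    """Q, K, V: (T, d); Ws: list of h per-head projections, each (r, d_h)."""
    T, d = Q.shape
    h = len(Ws)
    dh = d // h
    outs = []
    for i, Wi in enumerate(Ws):
        sl = slice(i * dh, (i + 1) * dh)
        Qi, Ki, Vi = Q[:, sl] / dh**0.25, K[:, sl] / dh**0.25, V[:, sl]
        PQ, PK = phi(Qi, Wi), phi(Ki, Wi)       # Phi_{Q_i}, Phi_{K_i}
        D = PQ @ PK.sum(axis=0)                 # D_{i,t} = sum_j <PhiQ_t, PhiK_j>
        outs.append((PQ @ (PK.T @ Vi)) / D[:, None])
    return np.concatenate(outs, axis=1)         # concatenate heads -> (T, d)

rng = np.random.default_rng(2)
T, d, h, r = 32, 16, 4, 128
Q, K, V = [0.5 * rng.normal(size=(T, d)) for _ in range(3)]
Ws = [rng.normal(size=(r, d // h)) for _ in range(h)]
out = multihead_favor(Q, K, V, Ws)
assert out.shape == (T, d)
```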

7. Empirical Behavior, Limitations, and Successors

FAVOR+ enables scaling attention models to sequence lengths of $8\mathrm{k}$–$32\mathrm{k}$, where baseline Transformers are infeasible due to memory exhaustion. Empirical studies demonstrate linear scaling of compute and memory, up to $8\times$ speedup in large-$L$ regimes, and retention of modeling accuracy on diverse tasks, including image generation, text modeling, and protein sequence modeling. In domains such as speech enhancement (Genhancer), FAVOR+ underpins efficient variants like DF-Conformer, though recent research highlights potential gains from replacing FAVOR+ with state-space sequence models for global dependency modeling while retaining linear complexity (Seki et al., 4 Nov 2025).

A salient aspect is the trade-off between computational savings and approximation error: while FAVOR+ reduces computational burden and preserves full-rank attention properties, a finite number of random features introduces deviations from exact softmax attention. Tuning the number of random features and periodically redrawing projection matrices can mitigate some of these limitations. The integration with local convolutions is designed to maintain strong locality without loss of global context (Choromanski et al., 2020, Seki et al., 4 Nov 2025).

