FAVOR+: Positive Random Fourier Features

Updated 2 May 2026

FAVOR+ is a positive random feature map that exponentiates Gaussian projections to give an unbiased approximation of the softmax kernel with nonnegative features.
It provides an unbiased Monte Carlo estimator for exponential kernels, which makes it useful for scalable self-attention in Transformer models.
Despite its computational efficiency, FAVOR+ has high variance compared to later parameterized methods, which motivates tunable alternatives such as GERFs and DERFs.

Positive Random Fourier Features (FAVOR+), introduced in the Performer framework and analyzed in FAVOR# (Likhosherstov et al., 2023), are a class of positive, non-trigonometric random feature maps designed for efficient and unbiased approximation of exponential kernels, notably the softmax kernel $K(x, y) = \exp(x^\top y)$. Unlike classical Random Fourier Features (RFFs), which use sine and cosine transforms, FAVOR+ constructs strictly positive random features by exponentiating random linear projections. This ensures entrywise nonnegativity in kernel and attention approximations and keeps row-wise renormalization stable, an essential property for scalable linear attention mechanisms in Transformer models.

1. Definition and Construction of FAVOR+

FAVOR+ is defined by the positive random feature map

$f_{\mathrm{pos}}(\omega, x) = \exp(\omega^\top x - \|x\|^2/2)$

where $\omega \sim \mathcal{N}(0, I_d)$ and $x \in \mathbb{R}^d$. Given $M$ independent samples $\omega^{(1)},\ldots,\omega^{(M)}$, the $M$-dimensional feature vector is

$\varphi(x) = \frac{1}{\sqrt{M}} \left[ \exp\left((\omega^{(m)})^\top x - \|x\|^2/2\right) \right]_{m=1}^M.$

The approximate kernel value between $x$ and $y$ is then computed as

$\hat{K}(x, y) = \varphi(x)^\top \varphi(y) = \frac{1}{M} \sum_{m=1}^M f_{\mathrm{pos}}(\omega^{(m)}, x)\, f_{\mathrm{pos}}(\omega^{(m)}, y).$

Unlike the original Rahimi–Recht RFFs, which use trigonometric functions and can produce negative values, FAVOR+ features are always positive. The resulting approximate Gram matrix is therefore entrywise nonnegative, so the standard row-normalization step yields a valid row-stochastic matrix, as required when approximating softmax attention.
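The construction above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the reference implementation; the helper names (`sample_omegas`, `positive_features`, `kernel_estimate`), the test vectors, and the choice of 20000 features are assumptions made for the example.

```python
import math
import random


def sample_omegas(num_features, dim, rng):
    """Draw `num_features` i.i.d. projection vectors omega ~ N(0, I_dim)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]


def positive_features(x, omegas):
    """phi(x): entries exp(omega^T x - ||x||^2 / 2) / sqrt(M), all strictly positive."""
    half_sq_norm = 0.5 * sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(omegas))
    return [
        scale * math.exp(sum(w * v for w, v in zip(omega, x)) - half_sq_norm)
        for omega in omegas
    ]


def kernel_estimate(x, y, omegas):
    """Approximate exp(x^T y) by the inner product phi(x)^T phi(y)."""
    phi_x = positive_features(x, omegas)
    phi_y = positive_features(y, omegas)
    return sum(a * b for a, b in zip(phi_x, phi_y))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
x, y = [0.3, -0.2, 0.1], [0.1, 0.4, -0.3]
omegas = sample_omegas(20000, 3, rng)

estimate = kernel_estimate(x, y, omegas)
exact = math.exp(sum(a * b for a, b in zip(x, y)))
print(f"estimate={estimate:.4f}  exact={exact:.4f}")
```

The same `omegas` must be shared between $x$ and $y$; drawing separate projections for each argument would break the estimator.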

2. Unbiasedness and Recovery of the Softmax Kernel

The construction of FAVOR+ yields an unbiased Monte Carlo estimator for the softmax kernel:

$\mathbb{E}_{\omega}\left[ f_{\mathrm{pos}}(\omega, x)\, f_{\mathrm{pos}}(\omega, y) \right] = \exp(x^\top y) = K(x, y).$

This is established by noting:

$\mathbb{E}_{\omega}\left[ \exp(\omega^\top (x + y)) \right] = \exp(\|x + y\|^2/2)$

by direct computation of the Gaussian integral after completing the square in the exponent.
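The completing-the-square step can be written out as follows. The shorthand $s = x + y$ is introduced here for brevity and is not taken from the source.

```latex
\begin{align*}
\mathbb{E}_{\omega}\!\left[\exp(\omega^\top s)\right]
  &= (2\pi)^{-d/2} \int_{\mathbb{R}^d}
     \exp\!\left(\omega^\top s - \tfrac{1}{2}\|\omega\|^2\right) d\omega \\
  &= \exp\!\left(\tfrac{1}{2}\|s\|^2\right) (2\pi)^{-d/2} \int_{\mathbb{R}^d}
     \exp\!\left(-\tfrac{1}{2}\|\omega - s\|^2\right) d\omega \\
  &= \exp\!\left(\tfrac{1}{2}\|s\|^2\right), \\
\mathbb{E}_{\omega}\!\left[f_{\mathrm{pos}}(\omega, x)\, f_{\mathrm{pos}}(\omega, y)\right]
  &= \exp\!\left(\tfrac{1}{2}\|x + y\|^2 - \tfrac{1}{2}\|x\|^2 - \tfrac{1}{2}\|y\|^2\right)
   = \exp(x^\top y).
\end{align*}
```

The remaining integral in the second line is that of a shifted Gaussian density, so it equals $(2\pi)^{d/2}$ and cancels the prefactor.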

The same methodology extends to the Gaussian kernel $\exp(-\|x - y\|^2/2) = \exp(-\|x\|^2/2)\, \exp(x^\top y)\, \exp(-\|y\|^2/2)$, by using a simple re-weighting of $f_{\mathrm{pos}}(\omega, x)$ by the factor $\exp(-\|x\|^2/2)$.
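As a rough sketch of that re-weighting, assuming the unit-bandwidth Gaussian kernel $\exp(-\|x - y\|^2/2)$, each positive feature can be multiplied by an extra $\exp(-\|x\|^2/2)$. The function name `gaussian_features` and the sample inputs are hypothetical choices for illustration.

```python
import math
import random


def gaussian_features(x, omegas):
    """Re-weighted positive features: exp(-||x||^2 / 2) * f_pos(omega, x) / sqrt(M)."""
    sq_norm = sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(omegas))
    # f_pos already carries exp(-||x||^2 / 2); the re-weighting factor adds
    # another exp(-||x||^2 / 2), so the exponent is omega^T x - ||x||^2.
    return [
        scale * math.exp(sum(w * v for w, v in zip(omega, x)) - sq_norm)
        for omega in omegas
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
x, y = [0.3, -0.2], [0.1, 0.4]
omegas = [[rng.gauss(0.0, 1.0) for _ in x] for _ in range(20000)]

phi_x = gaussian_features(x, omegas)
phi_y = gaussian_features(y, omegas)
gauss_estimate = sum(a * b for a, b in zip(phi_x, phi_y))
gauss_exact = math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))
print(f"estimate={gauss_estimate:.4f}  exact={gauss_exact:.4f}")
```

Because the extra factor is deterministic and positive, the re-weighted features stay strictly positive and the estimator stays unbiased.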

3. Variance Properties and Comparisons

FAVOR+ exhibits high variance compared to alternative random feature methods. The variance of the single-sample estimate $f_{\mathrm{pos}}(\omega, x)\, f_{\mathrm{pos}}(\omega, y)$ is

$\mathrm{Var}_{\omega}\left[ f_{\mathrm{pos}}(\omega, x)\, f_{\mathrm{pos}}(\omega, y) \right] = \exp(2 x^\top y) \left( \exp(\|x + y\|^2) - 1 \right).$

There are no tunable parameters in $f_{\mathrm{pos}}$, so analytic variance reduction is not possible in this method. In empirical benchmarks (Section 5.1, Figure 1), the per-pair relative variance

$\mathrm{RV}(x, y) = \mathrm{Var}\left[ \hat{K}(x, y) \right] / K(x, y)^2,$

where $\hat{K}(x, y)$ is the random feature estimate of $K(x, y)$, is orders of magnitude higher for FAVOR+ on CIFAR-10–derived samples than for the newly introduced SDERF. This large variance underlies the motivation for richer, parameterized feature maps such as GERFs and DERFs, which admit closed-form variance minimization.
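The single-sample variance can be checked numerically. The sketch below compares the closed form $\exp(2 x^\top y)(\exp(\|x + y\|^2) - 1)$, which follows from the Gaussian moment-generating function, against a Monte Carlo estimate. Relative variance is taken here to mean variance divided by the squared kernel value, which is an assumption about the normalization; the function names and inputs are illustrative only.

```python
import math
import random


def f_pos(omega, x):
    """Positive random feature exp(omega^T x - ||x||^2 / 2)."""
    return math.exp(sum(w * v for w, v in zip(omega, x)) - 0.5 * sum(v * v for v in x))


def closed_form_variance(x, y):
    """Var[f_pos(w, x) f_pos(w, y)] = exp(2 x^T y) * (exp(||x + y||^2) - 1)."""
    dot = sum(a * b for a, b in zip(x, y))
    sq_norm_sum = sum((a + b) ** 2 for a, b in zip(x, y))
    return math.exp(2.0 * dot) * (math.exp(sq_norm_sum) - 1.0)


def empirical_variance(x, y, num_samples, rng):
    """Unbiased sample variance of the single-sample product estimate."""
    products = []
    for _ in range(num_samples):
        omega = [rng.gauss(0.0, 1.0) for _ in x]
        products.append(f_pos(omega, x) * f_pos(omega, y))
    mean = sum(products) / num_samples
    return sum((p - mean) ** 2 for p in products) / (num_samples - 1)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
x, y = [0.3, -0.2], [0.1, 0.4]
var_exact = closed_form_variance(x, y)
var_mc = empirical_variance(x, y, 100000, rng)

kernel_value = math.exp(sum(a * b for a, b in zip(x, y)))
relative_variance = var_exact / kernel_value ** 2  # equals exp(||x + y||^2) - 1
print(f"closed form={var_exact:.4f}  monte carlo={var_mc:.4f}  RV={relative_variance:.4f}")
```

Under this normalization the relative variance depends only on $\|x + y\|^2$ and grows exponentially with it, which is consistent with the absence of any tunable parameter to counteract that growth.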

4. Self-Attention Approximation in Transformers via FAVOR+

In standard Transformers, the $L \times L$ self-attention matrix is

$A_{ij} = \frac{\exp(q_i^\top k_j)}{\sum_{l=1}^{L} \exp(q_i^\top k_l)}$

where $\{q_i\}_{i=1}^{L}$, $\{k_j\}_{j=1}^{L}$ are the sequences of queries and keys. FAVOR+ replaces explicit softmax computation with a low-rank approximation:

For each query $q_i$ and key $k_j$, compute $\varphi(q_i)$ and $\varphi(k_j)$.
Construct matrices $\Phi_Q \in \mathbb{R}^{L \times M}$ and $\Phi_K \in \mathbb{R}^{L \times M}$ whose rows are these feature vectors.
Form the rank-$M$ estimate $\Phi_Q \Phi_K^\top$ of the unnormalized attention matrix.
Compute the layer output as $D^{-1} \Phi_Q (\Phi_K^\top V)$, where $V$ is the value matrix and $D = \mathrm{diag}(\Phi_Q \Phi_K^\top \mathbf{1}_L)$ holds the row sums.

This approach reduces self-attention complexity from $O(L^2 d)$ to $O(L M d)$ while guaranteeing a valid stochastic matrix after normalization due to entrywise positivity.
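The procedure above can be sketched as follows, assuming unmasked (bidirectional) attention and plain Python lists. The function names, sizes, and random inputs are illustrative assumptions; `quadratic_reference` exists only to check the linear-time path against the explicit row-normalized matrix.

```python
import math
import random


def positive_features(x, omegas):
    """phi(x): entries exp(omega^T x - ||x||^2 / 2) / sqrt(M)."""
    half_sq_norm = 0.5 * sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(omegas))
    return [scale * math.exp(sum(w * v for w, v in zip(omega, x)) - half_sq_norm)
            for omega in omegas]


def favor_plus_attention(queries, keys, values, omegas):
    """Compute D^{-1} Phi_Q (Phi_K^T V) without forming the L x L matrix."""
    phi_q = [positive_features(q, omegas) for q in queries]
    phi_k = [positive_features(k, omegas) for k in keys]
    num_features, value_dim = len(omegas), len(values[0])

    # Accumulate Phi_K^T V (M x d_v) and Phi_K^T 1 (length M) in one pass over keys.
    kv = [[0.0] * value_dim for _ in range(num_features)]
    k_sum = [0.0] * num_features
    for phi, v in zip(phi_k, values):
        for m, p in enumerate(phi):
            k_sum[m] += p
            for c, v_c in enumerate(v):
                kv[m][c] += p * v_c

    outputs = []
    for phi in phi_q:
        normalizer = sum(p * s for p, s in zip(phi, k_sum))  # row sum of Phi_Q Phi_K^T
        outputs.append([sum(phi[m] * kv[m][c] for m in range(num_features)) / normalizer
                        for c in range(value_dim)])
    return outputs


def quadratic_reference(queries, keys, values, omegas):
    """Same estimator via the explicit row-normalized L x L matrix (for checking only)."""
    phi_q = [positive_features(q, omegas) for q in queries]
    phi_k = [positive_features(k, omegas) for k in keys]
    attn = []
    for pq in phi_q:
        row = [sum(a * b for a, b in zip(pq, pk)) for pk in phi_k]
        row_sum = sum(row)
        attn.append([a / row_sum for a in row])
    out = [[sum(w * v[c] for w, v in zip(row, values)) for c in range(len(values[0]))]
           for row in attn]
    return attn, out


def random_matrix(rows, cols, std, rng):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
seq_len, dim, num_features = 6, 4, 64
queries = random_matrix(seq_len, dim, 0.5, rng)
keys = random_matrix(seq_len, dim, 0.5, rng)
values = random_matrix(seq_len, dim, 0.5, rng)
omegas = random_matrix(num_features, dim, 1.0, rng)

fast = favor_plus_attention(queries, keys, values, omegas)
attn, slow = quadratic_reference(queries, keys, values, omegas)
```

The saving comes from the order of multiplication: contracting over keys first costs time linear in the sequence length, whereas the reference path builds every query-key pair. Both paths give the same output up to floating-point rounding.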

5. Performance Benchmarks and Empirical Results

FAVOR+ demonstrates consistently higher empirical variance and lower accuracy compared to richer random feature classes such as GERFs, SADERF, and SDERF. In kernel regression tasks on UCI classification benchmarks (Section 5.2), FAVOR+ attains accuracy several percentage points below FAVOR#, which employs SDERF. In long-sequence Transformer applications:

On the LibriSpeech Conformer-Transducer, the normalized word error rate (NWER) for FAVOR+ is higher than that of FAVOR++ at small $M$, while FAVOR# achieves a further absolute NWER reduction versus FAVOR++.
On GLUE natural language tasks, FAVOR+ underperforms other linear attention methods (ELU, ReLU, FAVOR++, FAVOR#) by 3–5 points on average, while FAVOR# recovers within 0.2–0.5 points of the full softmax baseline.

A summary of variance and accuracy comparisons across methods:

Method	Relative Variance (RV)	Test Accuracy (UCI)	GLUE Average
FAVOR+	Orders of magnitude above SDERF	Several points below FAVOR#	3–5 points below other linear variants
SDERF (FAVOR#)	Orders of magnitude below FAVOR+	Highest	Within 0.2–0.5 points of softmax baseline

FAVOR# (using SDERF) achieves variance reductions of orders of magnitude and substantial performance gains relative to FAVOR+.

6. Limitations and Motivations for Generalized Random Features

Despite its unbiasedness and computational advantages, FAVOR+ is limited by excessive variance, a consequence of the lack of tunable parameters in $f_{\mathrm{pos}}$. This restricts its practical accuracy in both classical (kernel regression) and modern (Transformer) settings. The introduction of Generalized Exponential RFs (GERFs) and Dense-Exponential RFs (DERFs), which enable variance minimization by learning scalar or matrix parameters in the exponent, addresses the principal shortcomings of FAVOR+ and underlies the advances realized in FAVOR# (Likhosherstov et al., 2023).


References

Likhosherstov et al. (2023). FAVOR#: Sharp Attention Kernel Approximations via New Classes of Positive Random Features.
