FAVOR+: Positive Random Fourier Features
- FAVOR+ is a positive random feature map that uses exponentiated Gaussian projections to unbiasedly approximate the softmax kernel while ensuring nonnegative features.
- It provides an unbiased Monte Carlo estimator for exponential kernels, making it valuable for scalable self-attention in Transformer models.
- Despite its computational efficiency, FAVOR+ exhibits high variance compared to advanced methods, motivating tunable alternatives like GERFs and DERFs.
Positive Random Fourier Features (FAVOR+), introduced in the Performer framework and analyzed in FAVOR# (Likhosherstov et al., 2023), are a class of positive, non-trigonometric random feature maps designed for efficient and unbiased approximation of exponential kernels, notably the softmax kernel $K(x, y) = \exp(x^\top y)$. Unlike classical Random Fourier Features (RFFs), which use sine and cosine transforms, FAVOR+ constructs strictly positive random features by exponentiating random linear projections. This ensures entrywise nonnegativity in kernel and attention approximations and keeps row-wise renormalization stable, an essential property for scalable linear attention mechanisms in Transformer models.
1. Definition and Construction of FAVOR+
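FAVOR+ is defined by the positive random feature map
$$\phi_\omega(x) = \exp\left(\omega^\top x - \frac{\|x\|^2}{2}\right),$$
where $\omega \sim \mathcal{N}(0, I_d)$ and $x \in \mathbb{R}^d$. Given independent samples $\omega_1, \dots, \omega_M$, the $M$-dimensional feature vector is
$$\Phi(x) = \frac{1}{\sqrt{M}}\left(\phi_{\omega_1}(x), \dots, \phi_{\omega_M}(x)\right)^\top.$$
The approximate kernel value between $x$ and $y$ is then computed as
$$\hat{K}(x, y) = \Phi(x)^\top \Phi(y) = \frac{1}{M}\sum_{m=1}^{M} \phi_{\omega_m}(x)\,\phi_{\omega_m}(y).$$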
Unlike the original Rahimi–Recht RFFs, which use trigonometric functions and can generate negative values, FAVOR+'s features are always positive. This ensures the resulting Gram matrix is entrywise nonnegative, so the standard row-normalization step preserves the stochastic property required in softmax approximations.
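The construction can be illustrated with a minimal sketch. It assumes the standard FAVOR+ form, in which each feature is $\exp(\omega^\top x - \|x\|^2/2)$ with $\omega \sim \mathcal{N}(0, I_d)$ and the feature vector is scaled by $1/\sqrt{M}$. The function names (`sample_projections`, `favor_plus_features`, `approx_kernel`) and the toy inputs are hypothetical and chosen only for illustration.

```python
import math
import random


def sample_projections(num_features, dim, rng):
    """Draw num_features independent vectors omega ~ N(0, I_dim)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]


def favor_plus_features(x, omegas):
    """Feature vector Phi(x); every entry is exp(.) / sqrt(M), hence positive."""
    half_sq_norm = 0.5 * sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(omegas))
    return [
        scale * math.exp(sum(w * v for w, v in zip(omega, x)) - half_sq_norm)
        for omega in omegas
    ]


def approx_kernel(x, y, omegas):
    """Estimate exp(x . y) as the inner product Phi(x) . Phi(y)."""
    phi_x = favor_plus_features(x, omegas)
    phi_y = favor_plus_features(y, omegas)
    return sum(a * b for a, b in zip(phi_x, phi_y))


# Toy inputs (assumed values, not taken from any benchmark).
rng = random.Random(0)
x = [0.3, -0.2, 0.1]
y = [0.1, 0.4, -0.3]
omegas = sample_projections(num_features=2000, dim=3, rng=rng)

estimate = approx_kernel(x, y, omegas)
exact = math.exp(sum(a * b for a, b in zip(x, y)))
```

Comparing `estimate` with `exact` gives a quick sanity check. Because the same `omegas` are reused for both arguments, the estimate is an inner product of two fixed-length positive vectors, which is what makes the later linear-attention factorization possible.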
2. Unbiasedness and Recovery of the Softmax Kernel
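The construction of FAVOR+ yields an unbiased Monte Carlo estimator for the softmax kernel:
$$\mathbb{E}_{\omega \sim \mathcal{N}(0, I_d)}\left[\exp\left(\omega^\top x - \frac{\|x\|^2}{2}\right)\exp\left(\omega^\top y - \frac{\|y\|^2}{2}\right)\right] = \exp(x^\top y).$$
This is established by noting:
$$\mathbb{E}_{\omega}\left[\exp\left(\omega^\top (x + y)\right)\right] = \exp\left(\frac{\|x + y\|^2}{2}\right), \qquad \frac{\|x + y\|^2}{2} - \frac{\|x\|^2}{2} - \frac{\|y\|^2}{2} = x^\top y,$$
by direct computation of the Gaussian integral after completing the square in the exponent.
The same methodology extends to the Gaussian kernel $\exp(-\|x - y\|^2/2) = \exp(-\|x\|^2/2)\,\exp(x^\top y)\,\exp(-\|y\|^2/2)$, by using a simple re-weighting of the feature map: each feature of $x$ is multiplied by $\exp(-\|x\|^2/2)$.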
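The unbiasedness claim and the Gaussian-kernel re-weighting can be checked numerically. The sketch below assumes the feature $\exp(\omega^\top x - \|x\|^2/2)$ for the softmax kernel and the same feature multiplied by $\exp(-\|x\|^2/2)$ for the Gaussian kernel $\exp(-\|x - y\|^2/2)$. The helper names `phi` and `phi_gauss` and the sample count are illustrative assumptions.

```python
import math
import random


def phi(x, omega):
    """Single FAVOR+ feature: exp(omega . x - ||x||^2 / 2)."""
    return math.exp(sum(w * v for w, v in zip(omega, x)) - 0.5 * sum(v * v for v in x))


def phi_gauss(x, omega):
    """Re-weighted feature for the Gaussian kernel: phi(x) * exp(-||x||^2 / 2)."""
    return phi(x, omega) * math.exp(-0.5 * sum(v * v for v in x))


rng = random.Random(1)
x = [0.5, -0.1]
y = [0.2, 0.3]
num_samples = 20000

# Average the per-sample products over many independent Gaussian projections.
softmax_sum = 0.0
gauss_sum = 0.0
for _ in range(num_samples):
    omega = [rng.gauss(0.0, 1.0) for _ in x]
    softmax_sum += phi(x, omega) * phi(y, omega)
    gauss_sum += phi_gauss(x, omega) * phi_gauss(y, omega)

softmax_estimate = softmax_sum / num_samples
gauss_estimate = gauss_sum / num_samples

# Exact kernel values for comparison.
softmax_exact = math.exp(sum(a * b for a, b in zip(x, y)))
gauss_exact = math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))
```

Both sample means should land close to their exact counterparts, and the gap shrinks as `num_samples` grows, as expected of an unbiased Monte Carlo estimator.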
3. Variance Properties and Comparisons
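FAVOR+ exhibits high variance compared to alternative random feature methods. The variance of the $M$-feature estimate $\hat{K}(x, y)$ of $K(x, y) = \exp(x^\top y)$ is
$$\mathrm{Var}\left[\hat{K}(x, y)\right] = \frac{1}{M}\exp\left(2 x^\top y\right)\left(\exp\left(\|x + y\|^2\right) - 1\right).$$
There are no tunable parameters in the feature map, so analytic variance reduction is not possible in this method. In empirical benchmarks (Section 5.1, Figure 1), the per-pair relative variance
$$\mathrm{RV}(x, y) = \frac{\mathrm{Var}\left[\hat{K}(x, y)\right]}{K(x, y)^2}$$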
for FAVOR+ on CIFAR-10–derived samples is orders of magnitude higher than that of the newly introduced SDERF. This large variance underlies the motivation for richer, parameterized feature maps such as GERFs and DERFs, which admit closed-form variance minimization and achieve substantial variance reduction.
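To make the variance behaviour concrete, the following sketch evaluates the closed form $\frac{1}{M}\exp(2x^\top y)\left(\exp(\|x + y\|^2) - 1\right)$, which follows from the log-normal moments of $\exp(\omega^\top(x + y))$, and compares it with an empirical variance for $M = 1$. The function names and toy inputs are assumptions made for illustration; the numbers are not the benchmark values discussed above.

```python
import math
import random


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def favor_plus_variance(x, y, num_features):
    """Closed-form variance of the M-feature FAVOR+ estimate of exp(x . y)."""
    s = [a + b for a, b in zip(x, y)]
    return math.exp(2.0 * dot(x, y)) * (math.exp(dot(s, s)) - 1.0) / num_features


def relative_variance(x, y, num_features):
    """Variance divided by the squared kernel value exp(x . y)^2."""
    return favor_plus_variance(x, y, num_features) / math.exp(2.0 * dot(x, y))


# Empirical variance of the single-feature estimate (M = 1) for comparison.
rng = random.Random(2)
x = [0.3, -0.2, 0.1]
y = [0.1, 0.4, -0.3]
offset = 0.5 * (dot(x, x) + dot(y, y))
s = [a + b for a, b in zip(x, y)]
samples = []
for _ in range(50000):
    omega = [rng.gauss(0.0, 1.0) for _ in x]
    samples.append(math.exp(dot(omega, s) - offset))
mean = sum(samples) / len(samples)
empirical = sum((z - mean) ** 2 for z in samples) / (len(samples) - 1)
closed_form = favor_plus_variance(x, y, num_features=1)

# The relative variance depends on the inputs only through ||x + y||^2,
# so it grows exponentially as the inputs are scaled up.
rv_small = relative_variance(x, y, num_features=64)
rv_large = relative_variance([3 * v for v in x], [3 * v for v in y], num_features=64)
```

The only lever available here is `num_features`: the relative variance falls as $1/M$ but cannot be reduced by tuning the feature map itself, which is the limitation the parameterized alternatives target.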
4. Self-Attention Approximation in Transformers via FAVOR+
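In standard Transformers, the $L \times L$ self-attention matrix is
$$A = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) = D^{-1}\exp\left(\frac{QK^\top}{\sqrt{d}}\right), \qquad D = \mathrm{diag}\left(\exp\left(\frac{QK^\top}{\sqrt{d}}\right)\mathbf{1}_L\right),$$
where $Q \in \mathbb{R}^{L \times d}$, $K \in \mathbb{R}^{L \times d}$ are the sequences of queries and keys and the exponential is applied entrywise. FAVOR+ replaces explicit softmax computation with a low-rank approximation:
- For each query $q_i$ and key $k_j$, compute $\Phi(d^{-1/4} q_i)$ and $\Phi(d^{-1/4} k_j)$.
- Construct matrices $Q' \in \mathbb{R}^{L \times M}$ and $K' \in \mathbb{R}^{L \times M}$ whose rows are these feature vectors.
- Form the rank-$M$ estimate $\hat{A} = \hat{D}^{-1} Q' K'^\top$, with $\hat{D} = \mathrm{diag}\left(Q'(K'^\top \mathbf{1}_L)\right)$.
- Compute the layer output as $\hat{D}^{-1}\left(Q'(K'^\top V)\right)$.
This approach reduces self-attention complexity from $O(L^2 d)$ to $O(L M d)$ while guaranteeing a valid stochastic matrix after normalization due to entrywise positivity.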
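The steps above can be sketched in plain Python. This is an illustrative, unoptimized version under stated assumptions: inputs are scaled by $d^{-1/4}$ so that the features target $\exp(q^\top k / \sqrt{d})$, the $1/\sqrt{M}$ factor is dropped because it cancels in the row normalization, and all function names and toy dimensions are hypothetical.

```python
import math
import random


def feature_map(x, omegas):
    """FAVOR+ features of x, without the 1/sqrt(M) factor (it cancels
    in the row normalization below)."""
    half_sq_norm = 0.5 * sum(v * v for v in x)
    return [math.exp(sum(w * v for w, v in zip(om, x)) - half_sq_norm) for om in omegas]


def favor_plus_attention(queries, keys, values, omegas):
    """Approximate softmax(Q K^T / sqrt(d)) V without forming the L x L matrix."""
    d = len(queries[0])
    scale = d ** -0.25  # (scale * q) . (scale * k) = q . k / sqrt(d)
    q_feats = [feature_map([scale * v for v in q], omegas) for q in queries]
    k_feats = [feature_map([scale * v for v in k], omegas) for k in keys]
    num_features, d_v = len(omegas), len(values[0])

    # Accumulate K'^T V (M x d_v) and K'^T 1 (length M) in one pass over the keys.
    kv = [[0.0] * d_v for _ in range(num_features)]
    k_sum = [0.0] * num_features
    for k_feat, v in zip(k_feats, values):
        for m, f in enumerate(k_feat):
            k_sum[m] += f
            for c in range(d_v):
                kv[m][c] += f * v[c]

    outputs = []
    for q_feat in q_feats:
        # Row normalizer; strictly positive because all features are positive.
        normalizer = sum(f * s for f, s in zip(q_feat, k_sum))
        outputs.append([
            sum(q_feat[m] * kv[m][c] for m in range(num_features)) / normalizer
            for c in range(d_v)
        ])
    return outputs


def exact_attention(queries, keys, values):
    """Reference O(L^2) softmax attention for comparison."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        weights = [math.exp(sum(a * b for a, b in zip(q, k)) / math.sqrt(d)) for k in keys]
        total = sum(weights)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))
        ])
    return outputs


# Toy problem with assumed sizes: sequence length L, head dimension d, M features.
rng = random.Random(3)
L, d, M = 6, 4, 4000
Q = [[rng.gauss(0.0, 0.3) for _ in range(d)] for _ in range(L)]
K = [[rng.gauss(0.0, 0.3) for _ in range(d)] for _ in range(L)]
V = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(L)]
omegas = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(M)]

approx = favor_plus_attention(Q, K, V, omegas)
exact = exact_attention(Q, K, V)
max_error = max(abs(a - e) for ra, re in zip(approx, exact) for a, e in zip(ra, re))
```

Because every approximate attention weight is positive and each row is normalized, each output row is a convex combination of the rows of `V`. Trigonometric features, which can produce negative weights, do not give this guarantee.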
5. Performance Benchmarks and Empirical Results
FAVOR+ demonstrates consistently higher empirical variance and lower accuracy compared to richer random feature classes such as GERFs, SADERF, and SDERF. In kernel regression tasks on UCI classification benchmarks (Section 5.2), FAVOR+ attains accuracy several percentage points below FAVOR#, which employs SDERF. In long-sequence Transformer applications:
- On the LibriSpeech Conformer-Transducer, the normalized word error rate (NWER) for FAVOR+ is higher than FAVOR++ at small numbers of random features $M$, while FAVOR# achieves a further absolute NWER reduction versus FAVOR++.
- On GLUE natural language tasks, FAVOR+ underperforms other linear attention methods (ELU, ReLU, FAVOR++, FAVOR#) by 3–5 points on average, while FAVOR# recovers within 0.2–0.5 points of the full softmax baseline.
A summary of variance and accuracy comparisons across methods:
| Method | Relative Variance (RV) | Test Accuracy (UCI) | GLUE Average |
|---|---|---|---|
| FAVOR+ | Orders of magnitude above SDERF | Several % below FAVOR# | 3–5 pts below other variants |
| SDERF | Lowest | Highest | 0.2–0.5 pts from baseline |
FAVOR# (using SDERF) achieves large variance reductions and substantial performance gains relative to FAVOR+.
6. Limitations and Motivations for Generalized Random Features
Despite its unbiasedness and computational advantages, FAVOR+ is limited by excessive variance, a consequence of the lack of tunable parameters in its feature map. This restricts its practical accuracy in both classical (kernel regression) and modern (Transformer) settings. The introduction of Generalized Exponential RFs (GERFs) and Dense-Exponential RFs (DERFs), which enable variance minimization by learning scalar or matrix parameters in the exponent, addresses the principal shortcomings of FAVOR+ and underlies the advances realized in FAVOR# (Likhosherstov et al., 2023).