A Comprehensive Overview of "Rethinking Attention with Performers"
Introduction
Transformers have established themselves as a dominant architecture across fields of machine learning such as NLP and bioinformatics. The key innovation behind Transformers is the attention mechanism, which allows the model to dynamically weigh the relevance of different tokens in a sequence. However, a significant limitation of traditional Transformers is their quadratic time and space complexity, O(L^2), in the sequence length L. This makes them computationally prohibitive for long sequences. The paper "Rethinking Attention with Performers" introduces Performers, a novel Transformer variant designed to overcome this scalability issue while preserving the expressivity and accuracy of the original architecture.
Key Contributions
- Scalable Attention via the FAVOR+ Algorithm:
- The paper introduces the Fast Attention Via positive Orthogonal Random features (FAVOR+) mechanism, which approximates the softmax attention mechanism with O(Lrd) time and O(Lr + Ld + rd) space complexity (linear in the sequence length L), where r, the number of random features, is a hyperparameter and d is the token dimensionality.
- FAVOR+ relies on positive random features (PRFs) to guarantee non-negative attention estimates, avoiding the instability issues of earlier approximations based on standard trigonometric (random Fourier) features; a minimal sketch of this linearization follows the list below.
- Theoretical Guarantees:
- The authors provide rigorous theoretical guarantees for the FAVOR+ mechanism, including unbiased estimation of the attention matrix, uniform convergence, and low estimation variance.
- Orthogonal Random Features (ORFs) are employed to further reduce the variance of the approximation, which is especially valuable for high-dimensional spaces.
- Experimental Validation:
- Performers are tested on a diverse set of tasks including text modeling, image generation, and protein sequence modeling.
- The results demonstrate competitive performance against state-of-the-art sparse and dense attention methods, underlining the practical viability of FAVOR+.
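To make the linearization concrete, here is a minimal NumPy sketch of FAVOR+-style bidirectional attention. It is written under simplifying assumptions: it draws i.i.d. Gaussian projections instead of the orthogonal blocks used in the full algorithm, ignores batching, multiple heads, and the causal (unidirectional) case, and the names `favor_plus_attention` and `num_features` are illustrative rather than taken from the paper's codebase.

```python
import numpy as np

def positive_random_features(x, W):
    """Positive (exponential) random feature map: phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m).
    Dot products of these features give an unbiased estimate of the softmax kernel exp(q^T k)."""
    m = W.shape[0]
    projection = x @ W.T                                   # (L, m)
    half_sq_norm = np.sum(x ** 2, axis=-1, keepdims=True) / 2.0
    return np.exp(projection - half_sq_norm) / np.sqrt(m)

def favor_plus_attention(Q, K, V, num_features=256, seed=0):
    """Approximate softmax attention in O(L * m * d) time instead of O(L^2 * d)."""
    L, d = Q.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, d))             # i.i.d. Gaussian; the paper orthogonalizes these rows
    # Fold the usual 1/sqrt(d) softmax temperature into Q and K.
    Q_prime = positive_random_features(Q / d ** 0.25, W)   # (L, m)
    K_prime = positive_random_features(K / d ** 0.25, W)   # (L, m)
    # Key trick: contract K'^T with V first, so the L x L attention matrix is never formed.
    KV = K_prime.T @ V                                     # (m, d)
    numerator = Q_prime @ KV                               # (L, d)
    denominator = Q_prime @ K_prime.sum(axis=0)            # (L,) row normalization
    return numerator / denominator[:, None]

# Example usage with toy shapes
L, d = 1024, 64
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
out = favor_plus_attention(Q, K, V)                        # (1024, 64)
```

Every intermediate has shape (L, m), (m, d), or (L, d), so both memory and time grow linearly with the sequence length L for fixed m and d.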
Numerical Results and Significant Observations
- Efficiency: Performers exhibit nearly linear time complexity for large input sequences. For instance, on a V100 GPU with 16 GB of memory, Performers handle sequences of length up to 2^15 (32,768) tokens, while traditional Transformers fail far below this threshold; a back-of-envelope cost comparison follows this list.
- Accuracy: When tested on protein datasets, Performers matched or even outperformed traditional Transformers in both unidirectional and bidirectional settings. Performers achieved a test set accuracy of 31.58% for unidirectional modeling compared to 30.80% for traditional Transformers, highlighting the robustness of the FAVOR+ mechanism.
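A quick back-of-envelope comparison makes the gap concrete. The sketch below assumes a head dimension d = 64 and m = 256 random features; these are illustrative values, not the paper's exact configuration.

```python
# Approximate per-layer, per-head attention cost at the longest sequence length reported above.
L, d, m = 2 ** 15, 64, 256

dense_flops = L * L * d                  # forming and applying the L x L attention matrix
favor_flops = L * m * d                  # FAVOR+: feature maps plus Q'(K'^T V), up to constant factors
attn_matrix_gib = L * L * 4 / 2 ** 30    # float32 storage for one L x L attention matrix

print(f"dense attention : ~{dense_flops:.2e} FLOPs, ~{attn_matrix_gib:.1f} GiB attention matrix")
print(f"FAVOR+ attention: ~{favor_flops:.2e} FLOPs, speedup ~{dense_flops / favor_flops:.0f}x")
```

At L = 32,768 a single float32 attention matrix already occupies about 4 GiB, so a 16 GB card is exhausted after only a few heads even before activations and gradients are stored, whereas the FAVOR+ intermediates stay at size L x m and m x d.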
Implications and Future Directions
Practical Implications:
- Scalability: By reducing the computational burden associated with the attention mechanism, Performers can democratize the use of Transformers in scenarios with limited computational resources. This could lead to broader adoption in industry settings and allow for more extensive use in research areas requiring long sequence modeling.
- Energy Efficiency: The reduced computational requirements directly translate to lower energy consumption, contributing to more sustainable AI practices.
Theoretical Implications:
- Approximation Theory: The paper extends the theoretical understanding of kernel approximation techniques, particularly the use of PRFs in high-dimensional spaces; the key identity behind PRFs is recalled after this list. This can inspire further research into other forms of non-linear, kernel-based approximation within neural network architectures.
- Variability and Robustness: Through detailed analysis, the paper shows that ORFs provide robust estimations that are less susceptible to outliers—a valuable property for machine learning models that need to process noisy or variable data.
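The positive random-feature identity that underlies these results expresses the softmax kernel as an expectation of products of strictly positive terms; with omega drawn from a standard Gaussian in R^d, it reads:

```latex
\mathrm{SM}(x, y) \;=\; \exp\!\big(x^{\top} y\big)
\;=\; \mathbb{E}_{\omega \sim \mathcal{N}(0,\, I_d)}
\left[
  \exp\!\Big(\omega^{\top} x - \tfrac{\lVert x \rVert^{2}}{2}\Big)\,
  \exp\!\Big(\omega^{\top} y - \tfrac{\lVert y \rVert^{2}}{2}\Big)
\right]
```

Replacing the expectation with an average over m sampled (and, in the paper, orthogonalized) vectors omega yields the unbiased, always-positive estimator whose variance properties the paper analyzes.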
Future Research Directions:
- Extension to Other Architectures: While the paper focuses on Transformers, the FAVOR+ mechanism could potentially be adapted to other architectures that utilize attention mechanisms, such as Graph Neural Networks (GNNs).
- Hybrid Models: Combining FAVOR+ with other techniques like reversible layers or cluster-based attention could lead to new hybrid models that leverage the strengths of multiple approximation methods.
- Further Optimization: Although PRFs offer significant benefits, further optimization and exploration of different random feature maps could yield even more efficient and accurate attention mechanisms; a small comparison of trigonometric and positive feature maps is sketched below.
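As a small illustration of why the choice of feature map matters, the sketch below compares the classical trigonometric (random Fourier) estimator of the softmax kernel with the positive estimator used in FAVOR+ on a single pair of vectors. The dimensions, feature counts, and test vectors are arbitrary choices for this example, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 16, 128, 2000

# A pair of vectors whose softmax-kernel value exp(x^T y) is well below 1:
# the regime in which trigonometric features are known to misbehave.
x = rng.standard_normal(d) / np.sqrt(d)
y = -x + 0.3 * rng.standard_normal(d) / np.sqrt(d)
true_value = np.exp(x @ y)

trig_estimates, pos_estimates = [], []
for _ in range(trials):
    W = rng.standard_normal((m, d))
    # Trigonometric (random Fourier) estimator of exp(x^T y)
    trig = np.exp((x @ x + y @ y) / 2) * np.mean(np.cos(W @ (x - y)))
    # Positive random-feature estimator, as used by FAVOR+
    pos = np.mean(np.exp(W @ x - x @ x / 2) * np.exp(W @ y - y @ y / 2))
    trig_estimates.append(trig)
    pos_estimates.append(pos)

print(f"true value            : {true_value:.4f}")
print(f"trig mean / std / min : {np.mean(trig_estimates):.4f} / "
      f"{np.std(trig_estimates):.4f} / {np.min(trig_estimates):.4f}")
print(f"pos  mean / std / min : {np.mean(pos_estimates):.4f} / "
      f"{np.std(pos_estimates):.4f} / {np.min(pos_estimates):.4f}")
```

Both estimators are unbiased, but for small kernel values the trigonometric estimate fluctuates far more and can even dip below zero, while the positive estimate is non-negative by construction; this is exactly the instability issue PRFs were designed to remove.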
Conclusion
The paper "Rethinking Attention with Performers" makes substantial strides towards addressing the scalability issues associated with traditional Transformers. By introducing the FAVOR+ mechanism, the authors not only enhance the practical applicability of Transformers to long sequences but also provide significant theoretical contributions to the understanding of scalable attention mechanisms. This work paves the way for future research and has the potential to significantly impact various fields reliant on sequence modeling.