FLuRKA: Fast and accurate unified Low-Rank & Kernel Attention (2306.15799v2)
Abstract: Many efficient $\textit{approximate}$ self-attention techniques have become prevalent since the inception of the transformer architecture. Two popular classes of these techniques are low-rank and kernel methods. Each of these methods has its strengths. We observe that these strengths complement each other synergistically, and we exploit this complementarity to fuse low-rank and kernel methods, producing a new class of transformers: FLuRKA ($\textbf{F}$ast $\textbf{L}$ow-$\textbf{R}$ank & $\textbf{K}$ernel $\textbf{A}$ttention). FLuRKA are highly $\textit{training-efficient}$, achieving faster model speeds $\textit{and}$ model quality comparable to that of their constituent low-rank and kernel methods. We evaluate the speed and quality of FLuRKA both theoretically and empirically. Our model-speed analysis identifies a variety of parameter configurations in which FLuRKA exhibit speedups over low-rank and kernel approximations, and our model-quality analysis bounds the error of FLuRKA with respect to full attention. Empirically, we instantiate three FLuRKA variants, which achieve speedups of up to 3.3x and 1.7x over low-rank and kernel methods respectively; this translates to speedups of up to 20x over models with flash-attention. Across a diverse set of tasks spanning language modeling, language understanding, long-sequence modeling, machine translation, and image classification, FLuRKA achieve accuracy comparable to the underlying low-rank and kernel approximations, occasionally surpassing both.
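To make the fusion idea concrete, here is a minimal NumPy sketch of one plausible way to combine the two approximation families: a Linformer-style low-rank projection of the keys and values, followed by Performer-style positive random-feature (kernel) attention on the reduced sequence. All function names, projection matrices (`E`, `F`, `W`), and the specific composition order are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def kernel_feature_map(x, proj):
    # Performer-style positive random features: exp(x @ w - ||x||^2 / 2)
    return np.exp(x @ proj - np.sum(x**2, axis=-1, keepdims=True) / 2)

def fused_lowrank_kernel_attention(Q, K, V, E, F, W):
    """Hypothetical sketch of fused low-rank + kernel attention.

    Q, K, V: (n, d) queries, keys, values.
    E, F:    (k, n) low-rank projections for K and V (Linformer-style), k << n.
    W:       (d, m) random-feature projection (Performer-style).
    """
    K_low, V_low = E @ K, F @ V          # (k, d): down-project keys/values
    Qp = kernel_feature_map(Q, W)        # (n, m) query features
    Kp = kernel_feature_map(K_low, W)    # (k, m) key features
    # Associativity gives cost linear in n: Qp @ (Kp.T @ V_low)
    num = Qp @ (Kp.T @ V_low)            # (n, d)
    den = Qp @ Kp.sum(axis=0, keepdims=True).T  # (n, 1) softmax normalizer
    return num / den
```

Because the kernelized attention runs over only `k` down-projected key/value rows instead of `n`, the dominant matrix products shrink from the kernel method's `O(n·m·d)` key-side cost to `O(k·m·d)`, which is one intuition for how the fused method can outpace either constituent.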