Softmax Attention with Constant Cost per Token (2404.05843v2)
Published 8 Apr 2024 in cs.LG and cs.CL
Abstract: We propose a simple modification to the conventional attention mechanism applied by Transformers: Instead of quantifying pairwise query-key similarity with scaled dot-products, we quantify it with the logarithms of scaled dot-products of exponentials. Our modification linearizes attention with exponential kernel feature maps, whose corresponding feature function is infinite dimensional. We show that our modification is expressible as a composition of log-sums of exponentials, with a latent space of constant size, enabling application with constant time and space complexity per token. We implement our modification, verify that it works in practice, and conclude that it is a promising alternative to conventional attention.
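A minimal sketch of the idea the abstract describes, not the paper's implementation: if pairwise similarity is log(exp(q) · exp(k)), the softmax weights reduce to a linear-attention form with the elementwise feature map phi(x) = exp(x), so a causal pass needs only a constant-size running state per token. The function name `causal_exp_kernel_attention`, the (seq_len, d) shapes, and the omission of the scaling factor (any constant scale cancels under softmax normalization) are assumptions for illustration; the paper additionally accumulates these sums in log space via log-sum-exp, which this direct-accumulation sketch does not do.

```python
import numpy as np

def causal_exp_kernel_attention(Q, K, V):
    """Hypothetical sketch: causal attention with similarity log(exp(q) . exp(k)).

    Equivalent to linear attention with feature map phi(x) = exp(x) applied
    elementwise. Q, K, V have shape (seq_len, d); the running state per token
    is a (d, d) matrix plus a (d,) vector, i.e. constant in sequence length.
    """
    seq_len, d = Q.shape
    phi_Q = np.exp(Q)          # feature-mapped queries
    phi_K = np.exp(K)          # feature-mapped keys

    S = np.zeros((d, d))       # running sum of outer(phi(k_j), v_j)
    z = np.zeros(d)            # running sum of phi(k_j)
    out = np.zeros_like(V)
    for i in range(seq_len):
        S += np.outer(phi_K[i], V[i])
        z += phi_K[i]
        # Softmax-weighted average of past values; the paper evaluates this
        # ratio via log-sums of exponentials to avoid overflow in exp(.).
        out[i] = (phi_Q[i] @ S) / (phi_Q[i] @ z)
    return out
```

For short sequences this matches ordinary softmax attention computed with the modified scores, since exp(log(exp(q) · exp(k))) = exp(q) · exp(k); the log-space formulation in the paper serves to keep the accumulated sums numerically stable.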