Softmax Attention with Constant Cost per Token (2404.05843v2)

Published 8 Apr 2024 in cs.LG and cs.CL

Abstract: We propose a simple modification to the conventional attention mechanism applied by Transformers: Instead of quantifying pairwise query-key similarity with scaled dot-products, we quantify it with the logarithms of scaled dot-products of exponentials. Our modification linearizes attention with exponential kernel feature maps, whose corresponding feature function is infinite dimensional. We show that our modification is expressible as a composition of log-sums of exponentials, with a latent space of constant size, enabling application with constant time and space complexity per token. We implement our modification, verify that it works in practice, and conclude that it is a promising alternative to conventional attention.
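
The similarity described in the abstract is, up to the paper's scaling factor, log(exp(q) . exp(k)): taking a softmax over such scores gives weights proportional to exp(q) . exp(k), i.e. linear attention with an elementwise-exponential feature map, which is what permits a fixed-size running state and constant cost per token. Below is a minimal NumPy sketch of that idea, not the paper's implementation: the function and variable names are illustrative, the scaling factor is omitted, and the state is accumulated in linear space rather than with the composed log-sum-exps the paper uses for numerical stability.

```python
import numpy as np

def constant_cost_attention(Q, K, V):
    """Causal attention with elementwise-exponential feature maps.

    Softmax over similarities log(exp(q) . exp(k)) yields weights that are
    linear in exp(k), so each output can be produced from a fixed-size
    running state. This sketch accumulates that state in linear space; the
    paper instead composes log-sum-exps for numerical stability.
    """
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))   # running sum of phi(k_j) v_j^T
    z = np.zeros(d_k)          # running sum of phi(k_j)
    outputs = []
    for q_t, k_t, v_t in zip(Q, K, V):
        phi_k = np.exp(k_t)            # feature map phi(x) = exp(x), elementwise
        S += np.outer(phi_k, v_t)      # constant-time, constant-space state update
        z += phi_k
        phi_q = np.exp(q_t)
        outputs.append(phi_q @ S / (phi_q @ z))  # normalized attention output
    return np.stack(outputs)

# Sanity check: the recurrent form matches explicit masked softmax attention
# over the proposed similarity scores log(exp(q) . exp(k)).
rng = np.random.default_rng(0)
T, d = 6, 4
Q, K, V = rng.normal(size=(3, T, d)) * 0.5
scores = np.log(np.exp(Q) @ np.exp(K).T)           # proposed similarity (unscaled)
mask = np.tril(np.ones((T, T), dtype=bool))        # causal mask
scores = np.where(mask, scores, -np.inf)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
assert np.allclose(weights @ V, constant_cost_attention(Q, K, V), atol=1e-6)
```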
