Latte: Latent Attention for Linear Time Transformers (2402.17512v4)

Published 27 Feb 2024 in cs.CL and stat.ML

Abstract: The time complexity of the standard attention mechanism in transformers scales quadratically with sequence length. We propose a probabilistic framework for attention, enabling us to derive a novel low-rank linear re-parameterisation of both bidirectional and causal cases, based on defining a latent variable model. Our method can be seamlessly integrated as a drop-in replacement for the standard attention mechanism. Additionally, this framework provides a natural extension for combining local standard attention with our global linear attention. This approach allows us to extend the context length of existing large pre-trained models with only a few additional training steps. The resulting ``Latte Transformer'' achieves performance comparable to standard attention and other state-of-the-art models, while maintaining linear time and memory complexity, along with constant-time next-token prediction during inference.

Latent Attention for Efficient Transformer Models

Overview of Latent Attention Mechanism

The paper introduces a novel attention mechanism, termed Latte (Latent Attention), designed to reduce the computational cost of the standard attention mechanism used in transformer models. The central challenge it addresses is the quadratic scaling of time and space complexity with sequence length in standard attention, which limits the practical application of transformers to long sequences. Latte achieves linear scaling with sequence length by introducing latent vectors that mediate the attention process. The mechanism supports both bidirectional and unidirectional (causal) applications, with the causal variant being especially suited to language generation because of its efficient inference.

Latte Attention Explained

Latte redefines the attention mechanism by comparing sequence elements (tokens) against a fixed set of learned latent tokens, rather than performing all pairwise comparisons between tokens in the sequence. This change reduces both computation and memory while preserving the intuitive reading of attention: focusing on different parts of the input according to their similarity to the concepts represented by the latent tokens.
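
In the paper's probabilistic framing, the attention distribution over positions factorises through the latent states. As a sketch (notation chosen here for illustration, with L latent states, value vectors v_s, and output y_t):

    p(s \mid t) = \sum_{l=1}^{L} p(s \mid l)\, p(l \mid t),
    \qquad
    y_t = \sum_{s} p(s \mid t)\, v_s = \sum_{l=1}^{L} p(l \mid t) \sum_{s} p(s \mid l)\, v_s .

The T x T attention matrix is thus replaced by two T x L factors, which is the source of the linear scaling in sequence length.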

For the non-causal (bidirectional) variant, the mechanism projects input tokens into a latent space where the attention weights are computed as a mixture over latent states, effectively summarizing the input sequence's information. The causal version, crucial for tasks such as language generation, operates similarly but respects the ordering of the sequence, ensuring that only past information is used at each step.
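
The bidirectional case can be written in a few lines. The following NumPy sketch is an illustrative re-implementation of that factorisation, not the authors' code; the weight names, shapes, and the single-head simplification are assumptions made here:

    import numpy as np

    def softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def latte_bidirectional(X, Wq, Wk, Wv):
        """Bidirectional latent attention (sketch).

        X:  [T, D] token embeddings
        Wq: [D, L] token -> scores over the L latent states
        Wk: [D, L] token -> scores over the L latent states
        Wv: [D, D] value projection
        """
        q = softmax(X @ Wq, axis=-1)   # p(l | t): each token attends over latents, [T, L]
        k = softmax(X @ Wk, axis=0)    # p(s | l): each latent attends over positions, [T, L]
        v = X @ Wv                     # values, [T, D]
        summary = k.T @ v              # one [L, D] summary of the whole sequence
        return q @ summary             # y_t = sum_l p(l|t) sum_s p(s|l) v_s, [T, D]

The full T x T attention matrix is never materialised: the sequence is summarised once into an L x D matrix and then mixed per token, which is where the linear time and memory scaling comes from.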

Computational Complexity

One of the central achievements of Latte is that it maintains linear complexity in both time and memory. For bidirectional tasks, this efficiency enables handling significantly longer sequences than standard attention permits. The paper contrasts the complexity of Latte with that of standard attention, demonstrating its efficiency without a substantial loss in performance.
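
As a rough accounting, with sequence length T, L latent states, and model dimension D (exact constants, and the paper's precise per-layer figures, may differ):

    Standard attention:      O(T^2 D) time,  O(T^2 + T D) memory
    Latte, bidirectional:    O(T L D) time,  O(T (L + D)) memory
    Latte, causal decoding:  O(L D) time and memory per generated token

Since L is a fixed hyperparameter that is much smaller than T for long sequences, the latent formulation scales linearly in T.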

In the causal setting, Latte's design allows the attention weights and latent summaries to be computed recursively, which makes it efficient and scalable for generative tasks. Notably, the next token can be predicted from a fixed-size summary of the current and past representations rather than from the full history, departing from the usual quadratic-cost decoding and making real-time language generation more practical.
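
A minimal sketch of one decoding step, assuming the same weight matrices as in the bidirectional sketch above and a running per-latent accumulator as state (a practical implementation would also carry a running maximum for numerical stability, omitted here):

    import numpy as np

    def latte_causal_step(x_t, state, Wq, Wk, Wv):
        """One constant-time decoding step of causal latent attention (sketch).

        x_t:   [D] embedding of the current token
        state: (alpha [L], beta [L, D]) running sums over all positions s <= t
        """
        alpha, beta = state
        k_t = np.exp(x_t @ Wk)                     # unnormalised p(s = t | l), [L]
        v_t = x_t @ Wv                             # value of the current token, [D]
        alpha = alpha + k_t                        # per-latent normaliser
        beta = beta + k_t[:, None] * v_t[None, :]  # per-latent value summary, [L, D]
        q_t = np.exp(x_t @ Wq)
        q_t = q_t / q_t.sum()                      # p(l | t), [L]
        y_t = q_t @ (beta / alpha[:, None])        # sum_l p(l|t) * beta_l / alpha_l, [D]
        return y_t, (alpha, beta)

    # Decoding starts from an empty summary, e.g.:
    # state = (np.zeros(L), np.zeros((L, D)))

The state has a fixed size of L + L*D numbers no matter how many tokens have been generated, which is what gives constant-time next-token prediction during inference.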

Experimental Evaluation

The paper reports empirical evaluations of Latte on a suite of tasks designed to test both bidirectional and unidirectional capabilities. For bidirectional tasks, it uses the Long-Range Arena (LRA) benchmark, showing competitive or superior performance compared to both the standard transformer and other efficient transformer variants. In language generation, Latte is evaluated on the OpenWebText and Enwik8 datasets, showing effectiveness comparable to standard transformers while significantly reducing the computational burden.

Implications and Future Directions

Latte offers a promising direction for designing efficient transformer models, both for theoretical exploration and for practical applications. Its ability to substantially reduce computational requirements without sacrificing performance paves the way for deploying more capable NLP models in resource-constrained environments. Future work could explore applying Latte to a broader range of tasks, including those outside NLP, and further optimising the latent attention mechanism for performance and efficiency.

References (26)
  1. ETC: Encoding Long and Structured Inputs in Transformers. arXiv preprint arXiv:2004.08483, 2020.
  2. Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150, 2020.
  3. JAX: Composable Transformations of Python+NumPy Programs, 2018. URL http://github.com/google/jax.
  4. Rethinking Attention with Performers. arXiv preprint arXiv:2009.14794, 2020.
  5. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv preprint arXiv:1901.02860, 2019.
  6. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.
  7. Hungry Hungry Hippos: Towards Language Modeling with State Space Models. arXiv preprint arXiv:2212.14052, 2022.
  8. OpenWebText Corpus, 2019. URL http://Skylion007.github.io/OpenWebTextCorpus.
  9. Efficiently Modeling Long Sequences with Structured State Spaces. arXiv preprint arXiv:2111.00396, 2021.
  10. Hutter, M. The Human Knowledge Compression Prize, 2002. URL https://www.kurzweilai.net/hutter-prize-for-lossless-compression-of-human-knowledge.
  11. Perceiver: General Perception with Iterative Attention. In International Conference on Machine Learning, pp.  4651–4664. PMLR, 2021.
  12. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. In International Conference on Machine Learning, pp.  5156–5165. PMLR, 2020.
  13. Transformers in Vision: A Survey. ACM Computing Surveys (CSUR), 54(10s):1–41, 2022.
  14. Reformer: The Efficient Transformer. arXiv preprint arXiv:2001.04451, 2020.
  15. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.
  16. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.  142–150, 2011.
  17. Self-attention Does Not Need O(n^2) Memory. arXiv preprint arXiv:2112.05682, 2021.
  18. Language Models are Unsupervised Multitask Learners. OpenAI blog, 1(8):9, 2019.
  19. Simplified State Space Layers for Sequence Modeling. arXiv preprint arXiv:2208.04933, 2022.
  20. Long Range Arena: A Benchmark for Efficient Transformers. arXiv preprint arXiv:2011.04006, 2020a.
  21. Efficient Transformers: A Survey. arXiv preprint arXiv:2009.06732, 2020b.
  22. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971, 2023.
  23. Attention Is All You Need. Advances In Neural Information Processing Systems, 30, 2017.
  24. ClusterFormer: Neural Clustering Attention for Efficient and Effective Transformer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  2390–2402, 2022.
  25. Linformer: Self-Attention with Linear Complexity. arXiv preprint arXiv:2006.04768, 2020.
  26. Big Bird: Transformers for Longer Sequences. Advances In Neural Information Processing Systems, 33:17283–17297, 2020.
Authors (4)
  1. Rares Dolga (5 papers)
  2. Marius Cobzarenco (3 papers)
  3. David Barber (54 papers)
  4. Lucas Maystre (18 papers)