
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (2404.07143v2)

Published 10 Apr 2024 in cs.CL, cs.AI, cs.LG, and cs.NE

Abstract: This work introduces an efficient method to scale Transformer-based LLMs to infinitely long inputs with bounded memory and computation. A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.

Efficient Scaling of Transformer LLMs for Infinitely Long Inputs via Infini-attention

Introduction

Transformers, since their inception, have significantly advanced the capabilities of LLMs. However, their quadratic complexity in memory and computation poses challenges when scaling to longer input sequences. This work introduces an efficient method to address this limitation, presenting a novel attention mechanism named Infini-attention. By integrating a compressive memory into the standard Transformer architecture, Infini-attention enables the processing of infinitely long inputs with a bounded memory footprint and computational cost. The approach demonstrates superior performance on long-context language modeling benchmarks, showcasing its potential for broader application in tasks requiring extensive context understanding.

Infini-attention Mechanism

The crux of this advancement is Infini-attention, which combines local and global context within a single Transformer block, enabling the model to handle input sequences of arbitrary length under a fixed memory and compute budget (a minimal code sketch follows the list below). This is achieved by:

  • Embedding a Compressive Memory: The mechanism efficiently encodes long-term context into a compact, fixed-size memory, which persists across processing segments.
  • Maintaining Efficient Attention: By reusing the attention layer's key-value (KV) states to update the compressive memory, the model processes arbitrarily long inputs without its memory requirement growing with sequence length.
  • Enabling Recurrence in Attention Layers: By updating the associative memory matrix incrementally, Infini-attention facilitates a recurrence mechanism within each attention layer, thereby allowing the model to retain a coherent understanding of extended contexts.
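
To make this concrete, below is a minimal single-head NumPy sketch of one segment step, following the retrieval, gating, and linear memory-update rules described in the paper. The function name, the ELU+1 feature map, and the 1e-6 stabilizer are written out for illustration only; the paper also describes a delta-rule variant of the memory update that is omitted here.

```python
import numpy as np

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1: keeps features positive for the linear-attention kernel
    return np.where(x > 0, x + 1.0, np.exp(x))

def infini_attention_segment(Q, K, V, M, z, beta):
    """One Infini-attention segment step (single head, no batching).

    Q, K: (L, d_k) and V: (L, d_v) projections for the current segment.
    M:    (d_k, d_v) compressive memory carried over from earlier segments.
    z:    (d_k,) normalization term carried over from earlier segments.
    beta: scalar gating parameter (learned in the full model).
    """
    d_k = Q.shape[-1]

    # 1) Retrieve long-term context from the compressive memory (linear attention).
    sQ = elu_plus_one(Q)
    A_mem = (sQ @ M) / (sQ @ z + 1e-6)[:, None]

    # 2) Standard masked (causal) dot-product attention within the segment.
    scores = Q @ K.T / np.sqrt(d_k)
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    A_local = weights @ V

    # 3) Blend long-term and local attention with a learned sigmoid gate.
    g = 1.0 / (1.0 + np.exp(-beta))
    A = g * A_mem + (1.0 - g) * A_local

    # 4) Update the fixed-size memory with this segment's KV pairs (linear update rule).
    sK = elu_plus_one(K)
    return A, M + sK.T @ V, z + sK.sum(axis=0)
```

Across segments, M and z start at zero and are threaded through successive calls; because their shapes are fixed, the per-segment cost stays constant no matter how many segments have been streamed through the model.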

Experimental Validation

This paper substantiates its claims through rigorous evaluation, achieving state-of-the-art results on challenging tasks:

  • Long-context Language Modeling: The model achieved lower perplexity than existing baseline models on the PG19 and Arxiv-math benchmarks.
  • Passkey Context Block Retrieval: With continual pre-training, the model solved passkey retrieval tasks over contexts as long as 1M tokens (an illustrative construction of the task appears after this list).
  • Book Summarization: On the 500K-length book summarization task, the model set a new state of the art, outperforming prior models, including those explicitly designed for summarization.
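
For readers unfamiliar with the passkey task, the sketch below constructs an illustrative instance: a short "pass key" statement hidden at a random position inside long filler text, followed by a question asking the model to recall it. The wording and filler here are illustrative and not necessarily the paper's exact template.

```python
import random

def make_passkey_example(n_filler_lines=2_000, seed=0):
    """Build an illustrative passkey-retrieval prompt (wording is illustrative)."""
    rng = random.Random(seed)
    passkey = rng.randint(10_000, 99_999)
    filler = ("The grass is green. The sky is blue. The sun is yellow. "
              "Here we go. There and back again.")
    lines = [filler] * n_filler_lines
    # Hide the passkey statement at a random depth inside the filler text.
    lines.insert(rng.randrange(n_filler_lines), f"The pass key is {passkey}. Remember it.")
    prompt = "\n".join(lines) + "\nWhat is the pass key? The pass key is"
    return prompt, passkey

prompt, answer = make_passkey_example()
print(f"~{len(prompt.split())} words of context; expected answer: {answer}")
```

Scaling n_filler_lines controls the context length; the paper evaluates retrieval with the passkey hidden at varying depths in contexts of up to 1M tokens.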

Comparative Analysis with Existing Models

Infini-Transformers significantly outperform existing segment-level memory models on long-context tasks. In the comparison of memory footprint and effective context length across models, Infini-Transformers operate with a dramatically lower memory requirement while offering an unbounded context window. This efficiency is further underscored by the model's ability to compress context more than 100x relative to Memorizing Transformers, without a loss in modeling quality.
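
As a rough, illustrative accounting of why the footprint stays bounded (all dimensions below are assumptions, not the paper's configuration): Infini-attention carries one fixed-size memory matrix per head and layer, whereas a conventional KV cache grows linearly with the number of cached tokens.

```python
# Illustrative memory accounting; all dimensions are assumptions, not the paper's exact setup.
d_k, d_v = 128, 128          # per-head key/value dimensions (assumed)
n_heads, n_layers = 8, 12    # model shape (assumed)

# Infini-attention: one (d_k x d_v) memory matrix plus a d_k normalization vector
# per head and layer, independent of how many tokens have been processed.
infini_state = n_layers * n_heads * (d_k * d_v + d_k)

# Conventional KV cache: (d_k + d_v) floats per head, per layer, per cached token.
def kv_cache_size(n_tokens):
    return n_layers * n_heads * n_tokens * (d_k + d_v)

print(f"Infini-attention state (any context length): {infini_state:,} floats")
for n_tokens in (32_768, 131_072, 1_000_000):
    print(f"KV cache at {n_tokens:>9,} tokens: {kv_cache_size(n_tokens):,} floats")
```

The exact compression ratio reported in the paper depends on its specific baseline (Memorizing Transformers with a 65K-token memory) and model dimensions; the point of the sketch is simply that the Infini-attention state does not grow with context length.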

Implications and Future Directions

The development of Infini-attention and its integration into Transformer LLMs presents significant theoretical and practical advancements in the field of generative AI. By demonstrating the feasibility of processing infinitely long inputs with bounded resources, this work opens new avenues for research and application in areas where understanding extensive contextual information is paramount. Future explorations could extend this framework to other domains, improve memory compression techniques further, and optimize the architecture for more extensive datasets and more complex tasks.

Conclusion

In summary, this paper presents a significant leap in the efficiency and applicability of Transformer-based LLMs for handling long input sequences. By introducing Infini-attention, it showcases a method to scale these models effectively, ensuring computational and memory efficiency without compromising performance. The demonstrated improvements in long-context modeling tasks further establish the potential of this approach to fundamentally enhance the capabilities of LLMs in dealing with extensive sequential data.

Authors (3)
  1. Tsendsuren Munkhdalai
  2. Manaal Faruqui
  3. Siddharth Gopal