
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (2404.07143v2)

Published 10 Apr 2024 in cs.CL, cs.AI, cs.LG, and cs.NE

Abstract: This work introduces an efficient method to scale Transformer-based LLMs to infinitely long inputs with bounded memory and computation. A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.

Efficient Scaling of Transformer LLMs for Infinitely Long Inputs via Infini-attention

Introduction

Transformers, since their inception, have significantly advanced the capabilities of LLMs. However, their quadratic complexity in memory and computation poses challenges when scaling to longer input sequences. This work introduces an efficient method to address this limitation, presenting a novel attention mechanism named Infini-attention. By integrating a compressive memory into the standard Transformer architecture, Infini-attention enables the processing of infinitely long inputs with a bounded memory footprint and computational cost. The approach demonstrates superior performance on long-context language modeling benchmarks, showcasing its potential for broader application in tasks requiring extensive context understanding.

Infini-attention Mechanism

The crux of this advancement is Infini-attention, which combines local and global context within a single Transformer block, enabling the model to handle input sequences of arbitrary length under a fixed memory and compute budget (a minimal code sketch follows the list below). This is achieved by:

  • Embedding a Compressive Memory: The mechanism efficiently encodes long-term context into a compact, fixed-size memory, which persists across processing segments.
  • Maintaining Efficient Attention: By reusing the attention layer's key-value (KV) states to update the compressive memory, the model processes arbitrarily long inputs without its memory requirement growing with sequence length.
  • Enabling Recurrence in Attention Layers: By updating the associative memory matrix incrementally, Infini-attention facilitates a recurrence mechanism within each attention layer, thereby allowing the model to retain a coherent understanding of extended contexts.
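
To make this concrete, below is a minimal single-head NumPy sketch of one segment step, following the retrieval, gating, and linear memory-update rules described in the paper. The function name, the ELU+1 feature map, and the 1e-6 stabilizer are written out for illustration only; the paper also describes a delta-rule variant of the memory update that is omitted here.

```python
import numpy as np

def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1: keeps features positive for the linear-attention kernel
    return np.where(x > 0, x + 1.0, np.exp(x))

def infini_attention_segment(Q, K, V, M, z, beta):
    """One Infini-attention segment step (single head, no batching).

    Q, K: (L, d_k) and V: (L, d_v) projections for the current segment.
    M:    (d_k, d_v) compressive memory carried over from earlier segments.
    z:    (d_k,) normalization term carried over from earlier segments.
    beta: scalar gating parameter (learned in the full model).
    """
    d_k = Q.shape[-1]

    # 1) Retrieve long-term context from the compressive memory (linear attention).
    sQ = elu_plus_one(Q)
    A_mem = (sQ @ M) / (sQ @ z + 1e-6)[:, None]

    # 2) Standard masked (causal) dot-product attention within the segment.
    scores = Q @ K.T / np.sqrt(d_k)
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    A_local = weights @ V

    # 3) Blend long-term and local attention with a learned sigmoid gate.
    g = 1.0 / (1.0 + np.exp(-beta))
    A = g * A_mem + (1.0 - g) * A_local

    # 4) Update the fixed-size memory with this segment's KV pairs (linear update rule).
    sK = elu_plus_one(K)
    return A, M + sK.T @ V, z + sK.sum(axis=0)
```

Across segments, M and z start at zero and are threaded through successive calls; because their shapes are fixed, the per-segment cost stays constant no matter how many segments have been streamed through the model.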

Experimental Validation

This paper substantiates its claims through rigorous evaluation, achieving state-of-the-art results on challenging tasks:

  • Long-context Language Modeling: The model achieved lower perplexity than existing baseline models on the PG19 and Arxiv-math benchmarks.
  • Passkey Context Block Retrieval: With continual pre-training, the model solved passkey retrieval tasks over contexts as long as 1M tokens (an illustrative construction of the task appears after this list).
  • Book Summarization: On the 500K-length book summarization task, the model set a new state of the art, outperforming prior models, including those explicitly designed for summarization.
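
For readers unfamiliar with the passkey task, the sketch below constructs an illustrative instance: a short "pass key" statement hidden at a random position inside long filler text, followed by a question asking the model to recall it. The wording and filler here are illustrative and not necessarily the paper's exact template.

```python
import random

def make_passkey_example(n_filler_lines=2_000, seed=0):
    """Build an illustrative passkey-retrieval prompt (wording is illustrative)."""
    rng = random.Random(seed)
    passkey = rng.randint(10_000, 99_999)
    filler = ("The grass is green. The sky is blue. The sun is yellow. "
              "Here we go. There and back again.")
    lines = [filler] * n_filler_lines
    # Hide the passkey statement at a random depth inside the filler text.
    lines.insert(rng.randrange(n_filler_lines), f"The pass key is {passkey}. Remember it.")
    prompt = "\n".join(lines) + "\nWhat is the pass key? The pass key is"
    return prompt, passkey

prompt, answer = make_passkey_example()
print(f"~{len(prompt.split())} words of context; expected answer: {answer}")
```

Scaling n_filler_lines controls the context length; the paper evaluates retrieval with the passkey hidden at varying depths in contexts of up to 1M tokens.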

Comparative Analysis with Existing Models

Infini-Transformers significantly outperform existing segment-level memory models on long-context tasks. In the comparison of memory footprint and effective context length across models, Infini-Transformers operate with a dramatically lower memory requirement while offering an unbounded context window. This efficiency is further underscored by the model's ability to compress context more than 100x relative to Memorizing Transformers, without a loss in modeling quality.
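
As a rough, illustrative accounting of why the footprint stays bounded (all dimensions below are assumptions, not the paper's configuration): Infini-attention carries one fixed-size memory matrix per head and layer, whereas a conventional KV cache grows linearly with the number of cached tokens.

```python
# Illustrative memory accounting; all dimensions are assumptions, not the paper's exact setup.
d_k, d_v = 128, 128          # per-head key/value dimensions (assumed)
n_heads, n_layers = 8, 12    # model shape (assumed)

# Infini-attention: one (d_k x d_v) memory matrix plus a d_k normalization vector
# per head and layer, independent of how many tokens have been processed.
infini_state = n_layers * n_heads * (d_k * d_v + d_k)

# Conventional KV cache: (d_k + d_v) floats per head, per layer, per cached token.
def kv_cache_size(n_tokens):
    return n_layers * n_heads * n_tokens * (d_k + d_v)

print(f"Infini-attention state (any context length): {infini_state:,} floats")
for n_tokens in (32_768, 131_072, 1_000_000):
    print(f"KV cache at {n_tokens:>9,} tokens: {kv_cache_size(n_tokens):,} floats")
```

The exact compression ratio reported in the paper depends on its specific baseline (Memorizing Transformers with a 65K-token memory) and model dimensions; the point of the sketch is simply that the Infini-attention state does not grow with context length.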

Implications and Future Directions

The development of Infini-attention and its integration into Transformer LLMs presents significant theoretical and practical advancements in the field of generative AI. By demonstrating the feasibility of processing infinitely long inputs with bounded resources, this work opens new avenues for research and application in areas where understanding extensive contextual information is paramount. Future explorations could extend this framework to other domains, improve memory compression techniques further, and optimize the architecture for more extensive datasets and more complex tasks.

Conclusion

In summary, this paper presents a significant leap in the efficiency and applicability of Transformer-based LLMs for handling long input sequences. By introducing Infini-attention, it showcases a method to scale these models effectively, ensuring computational and memory efficiency without compromising performance. The demonstrated improvements in long-context modeling tasks further establish the potential of this approach to fundamentally enhance the capabilities of LLMs in dealing with extensive sequential data.

Authors (3)
  1. Tsendsuren Munkhdalai
  2. Manaal Faruqui
  3. Siddharth Gopal