Introduction
The deployment of LLMs in streaming applications must address two challenges: extensive memory consumption during decoding and limited generalization to sequences longer than those seen in training. Existing workarounds, window attention and the sliding window with re-computation, each have drawbacks. Window attention fails as soon as the text length exceeds the cache size, while sliding window with re-computation, despite its strong performance, incurs latency that is impractical for live applications because it re-encodes the entire window for every new token, giving quadratic attention complexity.
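As a rough illustration of that complexity gap (the cost model below is a simplification for intuition, not taken from the paper), consider how many attention operations each strategy spends per generated token for a window of L tokens:

```python
# Rough back-of-the-envelope comparison of per-token cost for a window of L
# tokens. Window attention scores one new query against at most L cached keys,
# roughly O(L) per token. Sliding window with re-computation rebuilds the KV
# states of all L window tokens for every new token, so the i-th token in the
# window attends to i keys, roughly O(L^2) per token.

def window_attention_ops(L: int) -> int:
    # One query attends to at most L cached keys.
    return L

def recompute_window_ops(L: int) -> int:
    # Re-encoding the window: the i-th token attends to i keys (causal mask).
    return sum(range(1, L + 1))

if __name__ == "__main__":
    L = 1024
    print(window_attention_ops(L))    # 1024
    print(recompute_window_ops(L))    # 524800 -> grows quadratically with L
```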
Attention Sink Phenomenon
The researchers behind StreamingLLM investigated the underlying issue with window attention and identified a key phenomenon they term the "attention sink": the allocation of large attention scores to initial tokens that carry little semantic relevance. Their analysis shows that, because the softmax operation forces attention weights to sum to one, queries that match no key strongly must still place their surplus attention somewhere, and the always-visible initial tokens become a stable 'sink' for that mass regardless of their meaning. Reintroducing just four initial tokens as attention sinks is enough to stabilize LLM performance, indicating that these tokens function primarily as positionally biased anchors for the attention distribution.
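The softmax constraint behind this effect can be seen in a toy example (the logits below are made up for illustration, not drawn from any model): even when no key is a good match for the query, the attention weights still sum to one, so the surplus mass has to land somewhere; in trained LLMs it tends to land on the globally visible initial tokens.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical attention logits for one query over 8 context tokens.
# No key is a strong match (all logits are small), yet softmax still
# distributes a full unit of attention mass across them.
logits = np.array([0.2, 0.0, -0.1, 0.1, -0.2, 0.05, 0.0, -0.05])
weights = softmax(logits)
print(weights.round(3))   # near-uniform weights
print(weights.sum())      # always 1.0: the surplus mass has to go somewhere
```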
StreamingLLM Framework
To address these challenges, StreamingLLM proposes a simple framework that maintains efficient performance over inputs of effectively unlimited length without any additional fine-tuning. By retaining the Key and Value (KV) states of a small, fixed set of attention sink tokens alongside a finite window of the most recent tokens, StreamingLLM sidesteps the collapse that plain window attention suffers once the cache overflows. The authors further show that pre-training LLMs with a dedicated placeholder sink token lets a single token serve as the attention anchor, streamlining models for streaming deployment.
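A minimal sketch of that cache policy is shown below. The class and method names are illustrative rather than the authors' API, and the sketch omits a detail of the actual method: StreamingLLM assigns positional encodings by position within the cache rather than by position in the original text.

```python
from collections import deque

class SinkWindowKVCache:
    """Keep the KV entries of the first `num_sinks` tokens plus a rolling
    window of the most recent tokens; everything in between is evicted."""

    def __init__(self, num_sinks: int = 4, window_size: int = 1020):
        self.num_sinks = num_sinks
        self.sinks = []                           # KV pairs of the initial tokens
        self.recent = deque(maxlen=window_size)   # rolling window; oldest auto-evicted

    def append(self, key, value):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append((key, value))       # first few tokens become attention sinks
        else:
            self.recent.append((key, value))

    def keys_values(self):
        # Attention for the next token is computed over sinks + recent window only.
        entries = self.sinks + list(self.recent)
        return [k for k, _ in entries], [v for _, v in entries]
```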
Evaluation and Performance
Empirical results reinforce the efficacy of StreamingLLM across a range of model families, including Llama-2, MPT, Falcon, and Pythia. The framework can perform stable language modeling over texts of up to 4 million tokens and beyond, achieving up to a 22.2x speedup over the sliding window with re-computation baseline. In simulated streaming question-answering settings, StreamingLLM matches the accuracy of standard, non-streaming baselines while operating on a continuous input stream. Additionally, pre-training LLMs with a sink token was shown to preserve, and in some cases marginally improve, model performance in streaming settings. Together, these findings offer a practical path to deploying LLMs in real-time applications that require long-duration interaction and must process large volumes of text efficiently.
Conclusion
StreamingLLM decouples an LLM's pre-training attention window size from the length of text it can actually process, enabling efficient streaming over prolonged text without fine-tuning the model. It represents a significant step toward making continuous deployment of LLMs practical across a breadth of platforms and applications, and the insights and methodology behind it can serve as a foundation for future research and implementation in streaming LLMs.