
Efficient Streaming Language Models with Attention Sinks (2309.17453v4)

Published 29 Sep 2023 in cs.CL and cs.AI

Abstract: Deploying LLMs in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach -- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. Code and datasets are provided at https://github.com/mit-han-lab/streaming-LLM.

Introduction

Deploying LLMs in streaming applications demands an approach that addresses two challenges at once: the extensive memory consumed by the KV cache during decoding, and the limited ability of models to generalize beyond their training sequence length. Existing methods each fall short. Window attention, which caches only the most recent KVs, breaks down as soon as the text length exceeds the cache size, while the sliding window with re-computation baseline performs well but incurs impractical latency for live applications because its attention cost grows quadratically with the window size.
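The latency gap between the two baselines can be made concrete with a rough per-token cost model. The sketch below is illustrative only (the function names and cost units are ours, not the paper's); it counts query-key interactions per generated token for a cache or window of size L.

```python
# Rough per-token attention cost for the two baselines, assuming a
# cache/window of size L. Costs are counted in query-key dot products,
# ignoring constant factors -- an illustrative model, not a benchmark.

def window_attention_cost(L: int) -> int:
    # Window attention: the single new query attends to the L cached KVs.
    return L

def sliding_window_recompute_cost(L: int) -> int:
    # Sliding window with re-computation: the most recent L tokens are
    # re-encoded from scratch for each new token, so every query in the
    # window attends to every earlier key -- quadratic in L.
    return L * (L + 1) // 2

if __name__ == "__main__":
    for L in (256, 1024, 4096):
        print(f"L={L}: window={window_attention_cost(L)}, "
              f"recompute={sliding_window_recompute_cost(L)}")
```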

Attention Sink Phenomenon

The researchers behind StreamingLLM investigated why window attention fails and identified a phenomenon they term the "attention sink": initial tokens receive disproportionately large attention scores even when they carry little semantic content. Their analysis attributes this to the softmax operation in the attention mechanism, which forces each query's attention weights to sum to one; when no key is strongly relevant, the surplus mass tends to land on the initial tokens, which are visible to every subsequent position and therefore serve as a stable "sink". Keeping the KV of just four initial tokens is enough to largely stabilize LLM performance, indicating that these tokens act primarily as positionally-biased anchors for the attention distribution rather than as semantically important context.
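A small numerical sketch makes the mechanism concrete: softmax guarantees that every row of attention weights sums to one, so the probability mass must land somewhere, and under a causal mask only the earliest keys are visible to every query. The snippet uses random logits rather than a trained model, so it only illustrates the normalization and visibility structure, not a measured sink.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 8                                 # sequence length
scores = rng.normal(size=(T, T))      # stand-in attention logits (no real model)

# Causal mask: query t may only attend to keys 0..t.
future = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)

# Softmax over keys: each row sums to exactly 1, so the mass has to go somewhere.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Key 0 is visible to all T queries; key t is visible to only T - t of them.
visibility = (~future).sum(axis=0)
print("row sums:", weights.sum(axis=-1))    # all 1.0 by construction
print("key visibility:", visibility)        # [8 7 6 5 4 3 2 1]
```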

StreamingLLM Framework

To address these challenges, StreamingLLM proposes a framework that maintains efficient performance over effectively unbounded input without any additional fine-tuning. It retains the Key and Value (KV) states of a finite window of recent tokens together with a small, fixed set of attention-sink tokens (the initial tokens), which sidesteps the model collapse experienced by plain window attention. The research further shows that pre-training LLMs with a dedicated placeholder sink token improves streaming performance, allowing a single learned token to act as the attention anchor and thereby simplifying streaming deployment.
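The cache policy at the heart of this design is straightforward to sketch. The class below is a minimal, illustrative version of the idea (retain a few sink positions plus a rolling window of recent KV entries); the class name, defaults, and tensor layout are our assumptions, and the authors' implementation additionally re-assigns positions within the rolled cache (e.g. for rotary embeddings), which this toy version omits.

```python
import torch

class SinkKVCache:
    """Toy per-layer KV cache: keep `n_sink` initial entries plus the most
    recent `window` entries, evicting everything in between. Illustrative
    only -- positional re-indexing of the cache is deliberately omitted."""

    def __init__(self, n_sink: int = 4, window: int = 2000):
        self.n_sink = n_sink
        self.window = window
        self.k = None  # (seq_len, num_heads, head_dim)
        self.v = None

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor) -> None:
        # Append the new token's key/value states, then evict if needed.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=0)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=0)
        self._evict()

    def _evict(self) -> None:
        limit = self.n_sink + self.window
        if self.k.shape[0] > limit:
            # Keep the attention-sink tokens and the most recent window.
            self.k = torch.cat([self.k[: self.n_sink], self.k[-self.window:]], dim=0)
            self.v = torch.cat([self.v[: self.n_sink], self.v[-self.window:]], dim=0)
```

Setting n_sink to 0 reduces this policy to plain window attention, which is exactly the configuration the paper shows collapsing once the text outgrows the cache.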

Evaluation and Performance

Empirical results reinforce the efficacy of StreamingLLM across a variety of model families, including Llama-2, MPT, Falcon, and Pythia. The framework performs stable language modeling over texts of 4 million tokens and more, achieving up to a 22.2x speedup over the sliding window with re-computation baseline. In simulated streaming question-answering settings, StreamingLLM matches the accuracy of standard, non-streaming baselines while operating on a continuous input stream. In addition, pre-training LLMs with a dedicated sink token was shown to preserve, and in some cases marginally improve, performance in streaming settings. Together, these findings offer a compelling path to deploying LLMs in real-time applications that involve long-duration interactions and substantial text volumes.

Conclusion

StreamingLLM decouples an LLM's pre-training attention window size from the length of text it can process, enabling efficient streaming over prolonged inputs without fine-tuning the model. It represents a significant stride toward making continuous LLM deployment practical across a breadth of platforms and applications, and its analysis of attention sinks provides a useful foundation for future research and implementation in streaming LLMs.

Authors (5)
  1. Guangxuan Xiao (16 papers)
  2. Yuandong Tian (128 papers)
  3. Beidi Chen (61 papers)
  4. Song Han (155 papers)
  5. Mike Lewis (78 papers)
Citations (403)