
SparQ Attention: Bandwidth-Efficient LLM Inference (2312.04985v6)

Published 8 Dec 2023 in cs.LG

Abstract: The computational difficulties of LLM inference remain a significant obstacle to their widespread deployment. The need for many applications to support long input sequences and process them in large batches typically causes token-generation to be bottlenecked by data transfer. For this reason, we introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by utilising memory bandwidth more efficiently within the attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show that SparQ Attention brings up to 8x savings in attention data transfers without substantial drops in accuracy, by evaluating Llama 2 and 3, Mistral, Gemma and Pythia models on a wide range of downstream tasks.

Introduction

Transformer models have become increasingly effective at solving complex language processing tasks by pre-training on extensive text corpora, and LLMs built this way offer versatile capabilities across a wide range of text-based applications. A persistent obstacle, however, is their high computational demand during inference. This is particularly pronounced when processing large batches of samples with extended contexts, which places heavy demands on memory capacity and bandwidth. SparQ Attention addresses this challenge, aiming to improve inference efficiency by selectively fetching only the relevant portions of the cached history within the attention mechanism.

SparQ Attention Algorithm

SparQ Attention improves the efficiency of LLM inference by reducing the memory bandwidth consumed by the attention mechanism. The technique works in three sequential steps. First, it identifies the largest-magnitude components of the incoming query vector and uses only those components of the cached keys to compute approximate attention scores. Second, it fetches the full key and value vectors only for the top-scoring tokens and computes attention over that subset. Finally, it interpolates this partial attention output with a running mean of the cached value vectors, weighted by how much of the approximate score mass the selected tokens cover. This approach reduces attention data transfers by up to 8x without substantial loss in accuracy, and it can be applied directly to existing LLMs without altering the pre-training setup or requiring additional fine-tuning.
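
The following is a minimal single-head NumPy sketch of this three-step procedure, written from the description above rather than from the authors' reference implementation; the function signature, the default values of r and k, and the softmax temperature adjustment are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sparq_attention(q, K, V, v_mean, r=32, k=64):
    """Illustrative single-head SparQ-style attention step (not reference code).

    q      : (d,)   incoming query vector
    K, V   : (n, d) cached keys and values for the n previous tokens
    v_mean : (d,)   running mean of the cached value vectors
    r, k   : number of query components / tokens whose data is fetched
    """
    d = q.shape[0]

    # Step 1: approximate the attention scores, reading only the r
    # largest-magnitude query components and the matching key components.
    top_r = np.argpartition(np.abs(q), -r)[-r:]
    # Rescale the softmax temperature to compensate for the dropped
    # components (an assumption in the spirit of the method).
    temp = np.sqrt(d * np.abs(q[top_r]).sum() / np.abs(q).sum())
    s_hat = softmax(q[top_r] @ K[:, top_r].T / temp)

    # Step 2: fetch full key and value rows only for the k top-scoring
    # tokens and run exact attention over that subset.
    top_k = np.argpartition(s_hat, -k)[-k:]
    s = softmax(q @ K[top_k].T / np.sqrt(d))
    y_top = s @ V[top_k]

    # Step 3: interpolate with the running value mean, weighted by the
    # approximate score mass covered by the selected tokens.
    alpha = s_hat[top_k].sum()
    return alpha * y_top + (1.0 - alpha) * v_mean
```

In a real deployment the indexing in steps 1 and 2 corresponds to sparse reads from the key-value cache, which is where the bandwidth saving comes from; here it is simply applied to dense in-memory arrays.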

Experiment and Results

The practical efficacy of SparQ Attention was tested across a variety of downstream tasks, including question answering, summarization, language modeling, and textual repetition. These tasks were designed to assess model performance under reduced data transfer and to compare SparQ Attention against other sparse attention methods. Strong results were demonstrated with Llama 2 and 3, Mistral, Gemma and Pythia models on tasks requiring long-sequence context processing. The technique proved robust at attention data-transfer compression ratios from 2× to 8×, often with negligible degradation in task performance.
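
To make the compression ratios concrete, the back-of-the-envelope helper below estimates the per-step attention transfer ratio from the two algorithm parameters; the cost model (counting only key/value cache reads per head) and the example settings are assumptions for illustration, not figures reported in the paper.

```python
def sparq_transfer_ratio(seq_len, head_dim, r, k):
    """Rough per-head, per-step data-transfer ratio: dense attention vs. SparQ.

    Dense attention streams the whole cache: 2 * seq_len * head_dim elements.
    SparQ streams r components of every key plus full key and value rows for
    the k selected tokens (the single mean value vector is ignored here).
    """
    dense = 2 * seq_len * head_dim
    sparq = r * seq_len + 2 * k * head_dim
    return dense / sparq

# Example: a 4096-token cache with 128-dimensional heads.
print(round(sparq_transfer_ratio(4096, 128, r=64, k=128), 1))  # ~3.6x
print(round(sparq_transfer_ratio(4096, 128, r=32, k=64), 1))   # ~7.1x
```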

Discussion and Related Work

This paper sits within a broader line of research on making attention mechanisms more efficient, including work on sparse attention and on reducing the memory footprint of LLMs. Many previous approaches require modifications during pre-training and can trade away task performance. In contrast, SparQ Attention can be applied at inference time without adjusting a model's pre-trained weights. Despite the substantial reduction in memory bandwidth, it has limitations: it reduces data transfer rather than the memory footprint of the key-value cache itself, and its behaviour when combined with other transformer variants remains largely unexplored. Future research may extend its applicability or address these limitations, further strengthening its role in efficient LLM inference.

Authors (6)
  1. Luka Ribar
  2. Ivan Chelombiev
  3. Luke Hudlass-Galley
  4. Charlie Blake
  5. Carlo Luschi
  6. Douglas Orr