
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (2407.02490v2)

Published 2 Jul 2024 in cs.CL and cs.LG

Abstract: The computational challenges of LLM inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up prefilling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Million-tokens Inference), a sparse calculation method designed to accelerate pre-filling of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. Our code is available at https://aka.ms/MInference.

Overview of MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

The paper "MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention" addresses a critical bottleneck in the deployment of LLMs with extended context windows. The authors focus on optimizing the pre-filling stage of LLMs that process long sequences of tokens up to 1 million tokens, particularly mitigating the computational challenges posed by the quadratic complexity of the attention mechanism.

Key Contributions and Methodology

Identification of Attention Patterns

The paper identifies three characteristic patterns in the attention matrices of long-context LLMs: A-shape, Vertical-Slash (VS), and Block-Sparse. These patterns reveal spatial aggregations of sparse attention weights, which the authors exploit to perform efficient sparse computations on GPUs; a toy construction of the corresponding masks is sketched after the list below.

  1. A-shape Pattern: Concentrates on initial tokens and local windows.
  2. Vertical-Slash Pattern: Combines vertical attention lines and fixed-interval slash lines.
  3. Block-Sparse Pattern: Focuses on clusters of top attention weights grouped in blocks.
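Each pattern can be pictured as a boolean mask over the n x n causal score matrix. The toy construction below is illustrative only; the sizes, window widths, and selected indices are made up for the example and are not the paper's kernel configuration:

```python
import numpy as np

def a_shape_mask(n, sink=4, window=8):
    """A-shape: the first `sink` tokens plus a local sliding window."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & ((j < sink) | (i - j < window))

def vertical_slash_mask(n, verticals, slashes):
    """Vertical-Slash: selected key columns plus diagonals at fixed offsets."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    vert = np.isin(j, verticals)        # vertical lines: whole key columns
    slash = np.isin(i - j, slashes)     # slash lines: constant i - j offsets
    return (j <= i) & (vert | slash)

def block_sparse_mask(n, block, kept_blocks):
    """Block-Sparse: only the selected (query-block, key-block) pairs are kept."""
    mask = np.zeros((n, n), dtype=bool)
    for qb, kb in kept_blocks:
        mask[qb * block:(qb + 1) * block, kb * block:(kb + 1) * block] = True
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return mask & (j <= i)

if __name__ == "__main__":
    n = 32
    print(a_shape_mask(n).sum())   # number of score entries the A-shape mask keeps
    print(vertical_slash_mask(n, verticals=[0, 5, 17], slashes=[0, 4, 9]).sum())
    print(block_sparse_mask(n, block=8, kept_blocks=[(0, 0), (2, 1), (3, 3)]).sum())
```

In MInference itself, the A-shape masks are static, while the vertical/slash indices and the kept blocks are re-estimated for each input, as described in the following sections.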

The authors develop a kernel-aware search method to determine the optimal attention pattern for each head, balancing computational efficiency with retention of model accuracy. This search is performed offline to establish the most effective pattern configurations.
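The paper frames this as a kernel-aware search over pattern configurations under a target FLOP budget. The snippet below is only a minimal sketch of that idea, assuming recall of full-attention mass as the selection criterion and using a mask's nonzero count as a stand-in for kernel cost; the candidate masks, budget, and reference input here are hypothetical simplifications, not the authors' exact procedure:

```python
import numpy as np

def attention_recall(full_attn, mask):
    """Fraction of a head's full attention mass covered by a sparse mask."""
    return float((full_attn * mask).sum() / full_attn.sum())

def search_head_pattern(full_attn, candidates, budget):
    """Assign one head whichever candidate mask has the best recall within budget.

    `candidates` is a list of (name, boolean mask) pairs; the mask's nonzero
    count stands in for the FLOPs of the corresponding sparse kernel.
    """
    best_name, best_recall = None, -1.0
    for name, mask in candidates:
        if mask.sum() > budget:          # kernel-aware: stay within the compute budget
            continue
        r = attention_recall(full_attn, mask)
        if r > best_recall:
            best_name, best_recall = name, r
    return best_name, best_recall

if __name__ == "__main__":
    n = 64
    rng = np.random.default_rng(0)
    scores = np.tril(rng.random((n, n)))                 # stand-in attention weights
    full_attn = scores / np.maximum(scores.sum(axis=-1, keepdims=True), 1e-9)
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    candidates = [
        ("a_shape", (j <= i) & ((j < 4) | (i - j < 16))),
        ("local_only", (j <= i) & (i - j < 8)),
    ]
    print(search_head_pattern(full_attn, candidates, budget=24 * n))
```

Using recall under a fixed budget keeps the assignment aligned with what the sparse kernels can actually execute efficiently, which is the sense in which the search is "kernel-aware".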

Dynamic Sparse Attention Calculations

During inference, MInference dynamically builds sparse indices for attention heads based on the identified patterns. This adaptation considers the specific input to generate the most efficient sparse mask. For example, a partial computation using the last few query vectors aids in estimating the critical indices of vertical and slash lines for the VS pattern. Similarly, for block-sparse heads, mean pooling on query and key vectors approximates the most significant blocks to include in the sparse mask.
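As a concrete single-head illustration of that estimation step, the sketch below approximates the vertical/slash indices from the attention of the last few queries and the block-sparse blocks from mean-pooled queries and keys. The `last_q`, top-k counts, and block size are arbitrary example values, and the real implementation performs these steps inside fused GPU kernels rather than in NumPy:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def estimate_vertical_slash(q, k, last_q=64, top_v=16, top_s=16):
    """Estimate vertical (key) indices and slash (diagonal) offsets for a VS head
    using only the attention of the last `last_q` queries against all keys."""
    n, d = q.shape
    rows = np.arange(n - last_q, n)[:, None]
    cols = np.arange(n)[None, :]
    scores = q[-last_q:] @ k.T / np.sqrt(d)               # (last_q, n)
    scores = np.where(cols <= rows, scores, -np.inf)      # causal mask
    partial = softmax(scores, axis=-1)
    verticals = np.argsort(partial.sum(axis=0))[-top_v:]  # keys with the most mass
    offsets = np.maximum(rows - cols, 0)                  # i - j (masked entries -> bin 0)
    slash_scores = np.bincount(offsets.ravel(), weights=partial.ravel(), minlength=n)
    slashes = np.argsort(slash_scores)[-top_s:]           # diagonals with the most mass
    return verticals, slashes

def estimate_block_sparse(q, k, block=64, top_blocks=32):
    """Estimate the top (query-block, key-block) pairs via mean-pooled q and k."""
    n, d = q.shape
    nb = n // block
    q_pool = q[:nb * block].reshape(nb, block, d).mean(axis=1)
    k_pool = k[:nb * block].reshape(nb, block, d).mean(axis=1)
    block_scores = q_pool @ k_pool.T / np.sqrt(d)
    causal = np.tril(np.ones((nb, nb), dtype=bool))
    block_scores = np.where(causal, block_scores, -np.inf)
    flat = np.argsort(block_scores.ravel())[-top_blocks:]
    return [(int(idx) // nb, int(idx) % nb) for idx in flat]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k = rng.standard_normal((2, 4096, 64))
    print(estimate_vertical_slash(q, k)[0][:5])
    print(estimate_block_sparse(q, k)[:3])
```

Both estimates touch far fewer entries than the full n x n score matrix (roughly last_q * n scores for VS heads and (n/block)^2 pooled scores for block-sparse heads), which is why the online index building adds little overhead relative to the attention it replaces.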

The subsequent computation employs optimized GPU kernels that combine dynamic sparse compilation (PIT), the Triton language, and FlashAttention-style tiling to execute the sparse attention, significantly reducing latency during the pre-filling stage.
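The kernels themselves are not reproduced here, but the computation they implement can be written down as a plain, unoptimized reference. The sketch below handles a single block-sparse head and materializes the gathered scores explicitly, which the fused kernels avoid; the block size and the always-keep-the-diagonal-block rule are assumptions made for this example:

```python
import numpy as np

def block_sparse_attention(q, k, v, kept_blocks, block=64):
    """Reference attention restricted to the kept (query-block, key-block) pairs."""
    n, d = q.shape
    nb = n // block
    out = np.zeros_like(v)
    # Group kept key-blocks by query-block; always keep the diagonal block so that
    # every query attends to at least itself (an assumption for this sketch).
    per_qb = {qb: {qb} for qb in range(nb)}
    for qb, kb in kept_blocks:
        per_qb[qb].add(kb)
    for qb, kbs in per_qb.items():
        q_blk = q[qb * block:(qb + 1) * block]                       # (block, d)
        k_idx = np.concatenate([np.arange(kb * block, (kb + 1) * block) for kb in sorted(kbs)])
        scores = q_blk @ k[k_idx].T / np.sqrt(d)                     # (block, len(k_idx))
        rows = np.arange(qb * block, (qb + 1) * block)[:, None]
        scores = np.where(k_idx[None, :] <= rows, scores, -np.inf)   # causal mask
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[qb * block:(qb + 1) * block] = w @ v[k_idx]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = rng.standard_normal((3, 256, 64))
    print(block_sparse_attention(q, k, v, kept_blocks=[(3, 0), (2, 1)]).shape)   # (256, 64)
```

Vertical-Slash heads are handled analogously, gathering the selected columns and diagonals instead of blocks; in both cases the speedup comes from computing only the gathered entries rather than all n x n scores.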

Experimental Validation

The authors conduct extensive experiments on several state-of-the-art LLMs (LLaMA-3-8B, GLM-4-9B, and Yi-9B, among others) across diverse benchmarks, including InfiniteBench, RULER, and Needle In A Haystack, as well as language-modeling tasks on PG-19. Key findings include:

  • Accuracy Maintenance: MInference maintains or even slightly enhances the long-context capabilities of the LLMs compared to full attention baselines.
  • Significant Speedups: It achieves up to 10x speedup for 1M token contexts on an Nvidia A100 GPU, reducing pre-filling latency from 30 minutes to 3 minutes while sustaining model accuracy.
  • Generalization: The method exhibits robust performance across various tasks and datasets, demonstrating its applicability.

Implications and Future Directions

The practical implications of this research are profound. By substantially accelerating the pre-filling stage without compromising accuracy, MInference facilitates the deployment of long-context LLMs in real-world applications that require processing large contexts, such as legal document analysis, large-scale code understanding, and comprehensive textual queries.

This method also reduces the computational cost associated with LLMs, making them more accessible and feasible for a broader range of users and applications. Furthermore, the compatibility of MInference with existing LLM architectures without necessitating additional training adjustments highlights its practical utility.

Future developments in this domain could explore further optimizing the balance between computational overhead and inference efficiency. Additionally, integrating MInference with other inference optimization techniques, such as KV cache compression methods like SnapKV, could yield further improvements in both latency and efficiency.

Moreover, dynamic sparse attention techniques could be extended to other forms of neural networks beyond autoregressive models, such as encoder-decoder models or multi-modal LLMs, potentially revealing broader applications and efficiency improvements.

In conclusion, MInference represents a significant stride towards efficient long-context processing in LLMs, providing a scalable approach to handling the ever-expanding demands of modern AI applications. This work lays the groundwork for ongoing innovations in sparse computation and efficient inference, promising enhanced performance and reduced costs for future AI systems.

References (89)
  1. Keyformer: Kv cache reduction through key tokens selection for efficient generative inference. Proceedings of Machine Learning and Systems, 6:114–127, 2024.
  2. Phi-3 technical report: A highly capable language model locally on your phone. ArXiv, abs/2404.14219, 2024.
  3. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. 2023.
  4. Unlimiformer: Long-range transformers with unlimited length input. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  5. Qwen technical report. ArXiv preprint, abs/2309.16609, 2023.
  6. Longformer: The long-document transformer. ArXiv preprint, abs/2004.05150, 2020.
  7. Codeplan: Repository-level coding using LLMs and planning. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
  8. Generating long sequences with sparse transformers. ArXiv preprint, abs/1904.10509, 2019.
  9. Peek across: Improving multi-document modeling via cross-document question-answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1970–1989, 2023.
  10. Extending context window of large language models via positional interpolation. ArXiv preprint, abs/2306.15595, 2023.
  11. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
  12. DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024.
  13. Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024.
  14. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Forty-first International Conference on Machine Learning, 2024.
  15. Sequence can secretly tell you what to discard. ArXiv preprint, abs/2404.15949, 2024.
  16. Longnet: Scaling transformers to 1,000,000,000 tokens. ArXiv preprint, abs/2307.02486, 2023.
  17. Attention is naturally sparse with gaussian distributed input. ArXiv preprint, abs/2404.02690, 2024.
  18. Get more with LESS: Synthesizing recurrence with KV cache compression for efficient LLM inference. In Forty-first International Conference on Machine Learning, 2024.
  19. LongroPE: Extending LLM context window beyond 2 million tokens. In Forty-first International Conference on Machine Learning, 2024.
  20. Data engineering for scaling language models to 128k context. In Forty-first International Conference on Machine Learning, 2024.
  21. Yao Fu. Challenges in deploying long-context transformers: A theoretical peak performance analysis. ArXiv preprint, abs/2405.08944, 2024.
  22. Mamba: Linear-time sequence modeling with selective state spaces. ArXiv preprint, abs/2312.00752, 2023.
  23. Efficiently modeling long sequences with structured state spaces. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
  24. Gradient. Llama-3 8b instruct gradient 4194k (v0.1), 2024.
  25. Model tells you what to discard: Adaptive kv cache compression for llms. In The Twelfth International Conference on Learning Representations, 2024.
  26. Chatglm: A family of large language models from glm-130b to glm-4 all tools. ArXiv preprint, abs/2406.12793, 2024.
  27. Block transformer: Global-to-local language modeling for fast inference. ArXiv preprint, abs/2406.02657, 2024.
  28. Ruler: What’s the real context size of your long-context language models? ArXiv preprint, abs/2404.06654, 2024.
  29. LM-infinite: Zero-shot extreme length generalization for large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3991–4008, Mexico City, Mexico, 2024. Association for Computational Linguistics.
  30. Mistral 7b. ArXiv preprint, abs/2310.06825, 2023.
  31. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. ArXiv preprint, abs/2309.14509, 2023.
  32. Llmlingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358–13376, 2023.
  33. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2024.
  34. Swe-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2023.
  35. Greg Kamradt. Needle in a haystack - pressure testing llms, 2023.
  36. Reformer: The efficient transformer. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
  37. Transformers are rnns: Fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 5156–5165. PMLR, 2020.
  38. Block pruning for faster transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10619–10629, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics.
  39. On the expressive power of self-attention matrices. ArXiv preprint, abs/2106.03764, 2021.
  40. On the expressive flexibility of self-attention matrices. Proceedings of the AAAI Conference on Artificial Intelligence, 37(7):8773–8781, 2023.
  41. Compressing context to enhance inference efficiency of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
  42. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024.
  43. Snapkv: Llm knows what you are looking for before generation. ArXiv preprint, abs/2404.14469, 2024.
  44. Jamba: A hybrid transformer-mamba language model. ArXiv preprint, abs/2403.19887, 2024.
  45. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
  46. nnscaler: Constraint-guided parallelization plan generation for deep learning training. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). USENIX Association, 2024.
  47. Dynamic sparse attention for scalable transformer acceleration. IEEE Transactions on Computers, 71(12):3165–3178, 2022.
  48. Deja vu: Contextual sparsity for efficient LLMs at inference time. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2023.
  49. World model on million-length video and language with ringattention. ArXiv preprint, abs/2402.08268, 2024.
  50. Ring attention with blockwise transformers for near-infinite context. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
  51. Long-context llms struggle with long in-context learning. ArXiv preprint, abs/2404.02060, 2024.
  52. Iceformer: Accelerated inference with long-sequence transformers on CPUs. In The Twelfth International Conference on Learning Representations, 2024.
  53. Leave no context behind: Efficient infinite context transformers with infini-attention. ArXiv preprint, abs/2404.07143, 2024.
  54. Dynamic memory compression: Retrofitting LLMs for accelerated inference. In Forty-first International Conference on Machine Learning, 2024.
  55. Xgen-7b technical report. ArXiv preprint, abs/2309.03450, 2023.
  56. Transformers are multi-state rnns. ArXiv preprint, abs/2401.06104, 2024.
  57. RWKV: Reinventing RNNs for the transformer era. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077, Singapore, 2023. Association for Computational Linguistics.
  58. Generative agents: Interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023.
  59. Fast attention over long sequences with dynamic sparse flash attention. Advances in Neural Information Processing Systems, 36, 2024.
  60. Yarn: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024.
  61. Train short, test long: Attention with linear biases enables input length extrapolation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
  62. Llmlingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics, 2024.
  63. Sparq attention: Bandwidth-efficient LLM inference. In Forty-first International Conference on Machine Learning, 2024.
  64. Samba: Simple hybrid state space models for efficient unlimited context language modeling. ArXiv preprint, abs/2406.07522, 2024.
  65. Compressive transformers for long-range sequence modelling. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020.
  66. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
  67. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. ArXiv preprint, abs/2403.05530, 2024.
  68. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.
  69. Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. ArXiv preprint, abs/2404.11912, 2024.
  70. Retentive network: A successor to transformer for large language models. ArXiv preprint, abs/2307.08621, 2023.
  71. You only cache once: Decoder-decoder architectures for language models. ArXiv preprint, abs/2405.05254, 2024.
  72. Sparsebert: Rethinking the importance analysis in self-attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 9547–9557. PMLR, 2021.
  73. Noam Shazeer. Fast transformer decoding: One write-head is all you need. ArXiv preprint, abs/1911.02150, 2019.
  74. UL2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations, 2023.
  75. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019.
  76. Triton implementation of the flash attention v2 algorithm. Technical report, OpenAI, 2023.
  77. Focused transformer: Contrastive training for context scaling. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  78. QUEST: Query-aware sparsity for efficient long-context LLM inference. In Forty-first International Conference on Machine Learning, 2024.
  79. Lilian Weng. Llm-powered autonomous agents. lilianweng.github.io, 2023.
  80. Look-m: Look-once optimization in kv cache for efficient multimodal long-context inference. ArXiv preprint, abs/2406.18139, 2024.
  81. Spatten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 97–110. IEEE, 2021.
  82. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024.
  83. Infllm: Unveiling the intrinsic capacity of llms for understanding extremely long sequences with training-free memory. ArXiv preprint, abs/2402.04617, 2024.
  84. Yi: Open foundation models by 01.AI. ArXiv preprint, abs/2403.04652, 2024.
  85. A unified implicit attention formulation for gated-linear recurrent sequence models. ArXiv preprint, abs/2405.16504, 2024.
  86. ∞Bench: Extending long context evaluation beyond 100k tokens. ArXiv preprint, abs/2402.13718, 2024.
  87. Big bird: Transformers for longer sequences. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  88. Pit: Optimization of dynamic sparse deep learning models via permutation invariant transformation. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 331–347, 2023.
  89. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36, 2024.
Authors (12)
  1. Huiqiang Jiang (32 papers)
  2. Yucheng Li (31 papers)
  3. Chengruidong Zhang (11 papers)
  4. Qianhui Wu (19 papers)
  5. Xufang Luo (25 papers)
  6. Surin Ahn (7 papers)
  7. Zhenhua Han (18 papers)
  8. Amir H. Abdi (14 papers)
  9. Dongsheng Li (240 papers)
  10. Chin-Yew Lin (22 papers)
  11. Yuqing Yang (83 papers)
  12. Lili Qiu (50 papers)
Citations (30)