
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding (2404.11912v3)

Published 18 Apr 2024 in cs.CL and cs.LG

Abstract: With LLMs widely deployed in long content generation recently, there has emerged an increasing demand for efficient long-sequence inference support. However, the key-value (KV) cache, which is stored to avoid re-computation, has emerged as a critical bottleneck by growing linearly in size with the sequence length. Due to the auto-regressive nature of LLMs, the entire KV cache is loaded for every generated token, resulting in low utilization of computational cores and high latency. While various KV cache compression methods have been proposed to alleviate this issue, they suffer from degradation in generation quality. We introduce TriForce, a hierarchical speculative decoding system that is scalable for long sequence generation. This approach leverages the original model weights and a dynamic sparse KV cache via retrieval as a draft model, which serves as an intermediate layer in the hierarchy and is further speculated by a smaller model to reduce its drafting latency. TriForce not only facilitates impressive speedups for Llama2-7B-128K, achieving up to 2.31$\times$ on an A100 GPU, but also showcases scalability in handling even longer contexts. For the offloading setting on two RTX 4090 GPUs, TriForce achieves 0.108s/token, only half as slow as the auto-regressive baseline on an A100, and attains 7.78$\times$ on our optimized offloading system. Additionally, TriForce is 4.86$\times$ faster than DeepSpeed-Zero-Inference on a single RTX 4090 GPU. TriForce's robustness is highlighted by its consistently outstanding performance across various temperatures. The code is available at https://github.com/Infini-AI-Lab/TriForce.

Hierarchical Speculative Decoding System for Efficient Long-Sequence Inference in LLMs

Introduction

LLMs excel in diverse applications but face efficiency problems during long-sequence generation. A key challenge is managing the key-value (KV) cache, which stores intermediate attention states to avoid redundant computation but grows linearly with sequence length, creating a memory bottleneck. Recent solutions use speculative decoding, where a lightweight draft model proposes future tokens that are then verified in parallel by the target model. However, these solutions still suffer from various performance drawbacks when scaled to longer sequences.
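To make the draft-and-verify mechanism concrete, here is a minimal sketch of the standard lossless acceptance rule used in speculative sampling: accept a drafted token with probability min(1, p(x)/q(x)), otherwise resample from the residual distribution. The toy distributions and names are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of lossless speculative-sampling acceptance (illustrative only).
import numpy as np

def accept_or_resample(x, p, q, rng):
    """x: drafted token id; q: draft model's next-token distribution;
    p: target model's next-token distribution at the same position."""
    if rng.random() < min(1.0, p[x] / max(q[x], 1e-12)):
        return x, True                       # draft token accepted
    residual = np.maximum(p - q, 0.0)        # on rejection, resample from the
    residual /= residual.sum()               # normalized residual distribution
    return rng.choice(len(p), p=residual), False

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])                # target distribution (toy)
q = np.array([0.3, 0.5, 0.2])                # draft distribution (toy)
print(accept_or_resample(1, p, q, rng))      # accepted token or resampled id
```

Accepted tokens cost only a parallel verification pass of the target model, which is where the speed-up comes from; the acceptance rule guarantees the output distribution matches ordinary decoding.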

Dual Bottleneck Observations and Core Insights

The paper identifies two primary memory bottlenecks: model weights and the KV cache. Key insights from the paper indicate:

  • Attention Sparsity: A large portion of the KV cache is redundant; only a small subset of entries receives most of the attention mass, so a partial cache can be used for drafting without significant loss in quality (a minimal sketch of how this can be measured follows this list).
  • Contextual Locality: Adjacent tokens tend to attend to similar parts of the context, so a retrieved cache segment can be reused across several consecutive decoding steps, reducing retrieval overhead.
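As a concrete illustration of the sparsity observation, the short sketch below measures what fraction of a single query's attention mass falls on its top-k cached keys; the shapes, the value of k, and the function name are assumptions for illustration.

```python
# Illustrative sketch: how much attention mass do the top-k cached keys capture?
import torch

def topk_attention_coverage(q: torch.Tensor, K: torch.Tensor, k: int) -> float:
    """q: (d,) query of the current token; K: (seq_len, d) cached keys."""
    d = q.shape[-1]
    scores = (K @ q) / d ** 0.5               # raw attention logits
    probs = torch.softmax(scores, dim=-1)     # attention weights over the cache
    return probs.topk(k).values.sum().item()  # mass captured by the top-k keys

torch.manual_seed(0)
q, K = torch.randn(64), torch.randn(4096, 64)
# With random keys the mass is spread out; on real long-context attention maps
# a small k typically captures most of the mass, which is the sparsity claim.
print(f"top-256 coverage: {topk_attention_coverage(q, K, 256):.2f}")
```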

These insights motivate the design of TriForce, a hierarchical speculative decoding system that pairs a retrieval-based partial KV cache with a lightweight draft model to address the two bottlenecks.

TriForce System Overview

TriForce integrates retrieval-based drafting with hierarchical speculative decoding to address the KV cache and model weight bottlenecks effectively:

  • Retrieval-Based Drafting: Instead of permanently discarding KV pairs as traditional eviction methods do, TriForce uses a dynamic retrieval strategy that selects the most relevant KV chunks for the current query, preserving the information needed for a high acceptance rate.
  • Hierarchical Speculation: A small, lightweight model with a sparse cache drafts tokens for the target model equipped with the retrieved partial cache, and that intermediate draft is in turn verified by the target model with its full KV cache. This staged speculation tackles the model weight and KV cache bottlenecks in sequence while keeping the final output distribution unchanged; a simplified sketch of this two-level loop follows.
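A simplified, self-contained sketch of this two-level draft-and-verify loop is shown below. For brevity it uses greedy matching instead of the paper's lossless rejection-sampling verification, and the three callables (tiny_draft, target_partial_kv, target_full_kv) are toy stand-ins for the draft model, the target model with a retrieved partial cache, and the target model with its full cache.

```python
# Greedy two-level speculation sketch (toy stand-ins, not the paper's code).

def draft(step, prefix, n):
    """Autoregressively draft n tokens with the given next-token function."""
    ctx, out = list(prefix), []
    for _ in range(n):
        t = step(ctx)
        out.append(t)
        ctx.append(t)
    return out

def verify(step, prefix, drafted):
    """Keep the longest drafted prefix the verifier agrees with, then append
    the verifier's own token at the first mismatch (or one bonus token)."""
    ctx, accepted = list(prefix), []
    for t in drafted:
        v = step(ctx)
        if v != t:
            accepted.append(v)
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(step(ctx))
    return accepted

# Toy "models": progressively better approximations of the same next-token rule.
tiny_draft        = lambda ctx: (sum(ctx) * 3) % 50
target_partial_kv = lambda ctx: (sum(ctx) * 3 + len(ctx) % 2) % 50
target_full_kv    = lambda ctx: (sum(ctx) * 3 + len(ctx) % 2) % 50

prefix = [1, 2, 3]
# Level 1: the tiny model drafts; the target model with a retrieved partial
# cache verifies, producing an intermediate draft cheaply.
intermediate = verify(target_partial_kv, prefix, draft(tiny_draft, prefix, 4))
# Level 2: the intermediate draft is verified by the target model with its
# full KV cache, which fixes the final output.
print(verify(target_full_kv, prefix, intermediate))
```

In the real system each verification is a single batched forward pass over all drafted tokens, so the hierarchy benefits both from the small model's speed and from the partial cache's reduced memory traffic.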

Empirical Evaluation

TriForce was tested on NVIDIA A100 and RTX 4090 GPUs with models like Llama2-7B-128K:

  • Speed Improvement: Up to 2.31x speed-up for Llama2-7B-128K on a single A100 GPU and up to 7.78x speed-up on two RTX 4090 GPUs in the offloading setting.
  • Robustness and Scalability: High acceptance rates and consistent performance across a range of sampling temperatures and settings. Theoretical estimates suggest the speed-up grows further as the context length increases.

Observations on KV Cache Handling

In-depth analysis of KV cache management shows:

  • Optimal Cache Budget: TriForce performs best with a 4K KV cache budget, which balances drafting overhead against the acceptance rate.
  • Chunk Size Selection: Chunks that are too small risk selecting context that matches only a few tokens, while excessively large chunks dilute the most informative tokens with irrelevant neighbors; a balanced chunk size is therefore important in retrieval-based strategies (a retrieval sketch parameterized by chunk size and budget follows).
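To illustrate how the chunk size and cache budget interact, the sketch below partitions cached keys into fixed-size chunks, scores each chunk by the current query against its mean key, and keeps the highest-scoring chunks until the budget is filled. The function name, shapes, and mean-key scoring are illustrative assumptions rather than the paper's exact retrieval implementation.

```python
# Illustrative budgeted chunk retrieval over a long KV cache.
import torch

def retrieve_kv(q, K, V, chunk_size=32, budget=4096):
    """q: (d,) current query; K, V: (seq_len, d) cached keys/values.
    Returns the retained rows of K and V (at most `budget` of them)."""
    seq_len, d = K.shape
    n_chunks = seq_len // chunk_size
    chunk_keys = K[: n_chunks * chunk_size].view(n_chunks, chunk_size, d).mean(dim=1)
    scores = chunk_keys @ q / d ** 0.5                    # chunk relevance
    keep = scores.topk(min(budget // chunk_size, n_chunks)).indices
    idx = (keep[:, None] * chunk_size + torch.arange(chunk_size)).flatten()
    return K[idx], V[idx]

# Toy usage: a 120K-token cache reduced to a 4K retrieval budget.
torch.manual_seed(0)
q = torch.randn(64)
K, V = torch.randn(120_000, 64), torch.randn(120_000, 64)
K_sel, V_sel = retrieve_kv(q, K, V, chunk_size=32, budget=4096)
print(K_sel.shape)  # torch.Size([4096, 64])
```

Varying chunk_size under a fixed budget trades how precisely the kept chunks match the current query against how much unrelated context each kept chunk brings along, which is the balance the paper's analysis points to.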

Future Directions and Theoretical Implications

TriForce's architecture suggests significant potential for extending LLM applicability to real-world scenarios that require long-context generation, such as document summarization and long-running conversational agents. Integration with tree-based speculative decoding could further improve throughput and efficiency, pointing to a promising direction for future work on LLM inference optimization.

Conclusion

This paper presents a compelling approach to the efficiency problems of LLMs when processing long sequences. By combining hierarchical speculative decoding with retrieval-based KV cache management, TriForce improves inference speed while preserving generation quality, since the final verification step uses the full target model and cache. It thus represents a substantial advance in the practical deployment of long-context LLMs.

Authors (5)
  1. Hanshi Sun (9 papers)
  2. Zhuoming Chen (15 papers)
  3. Xinyu Yang (109 papers)
  4. Yuandong Tian (128 papers)
  5. Beidi Chen (61 papers)