KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation (2405.05329v2)

Published 8 May 2024 in cs.DC, cs.AI, and cs.CL

Abstract: Large language model (LLM) inference has two phases: the prompt (or prefill) phase, which outputs the first token, and the extension (or decoding) phase, which generates subsequent tokens. In this work, we propose an efficient parallelization scheme, KV-Runahead, to accelerate the prompt phase. The key observation is that the extension phase generates tokens faster than the prompt phase because of the key-value cache (KV-cache). Hence, KV-Runahead parallelizes the prompt phase by orchestrating multiple processes to populate the KV-cache, minimizing the time-to-first-token (TTFT). Dual-purposing the KV-cache scheme has two main benefits. First, since the KV-cache is designed to leverage the causal attention map, we minimize computation and communication automatically. Second, since the KV-cache already exists for the extension phase, KV-Runahead is easy to implement. We further propose context-level load-balancing to handle uneven KV-cache generation (due to causal attention) and to optimize TTFT. Compared with existing parallelization schemes such as tensor or sequential parallelization, where keys and values are locally generated and exchanged via all-gather collectives, our experimental results demonstrate that KV-Runahead can offer over 1.4x and 1.6x speedups for Llama 7B and Falcon 7B, respectively.

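The context-level load-balancing idea can be illustrated with a minimal sketch. Under causal attention, the chunk of the prompt assigned to a later process must attend to every token up to its own end, so equal-length chunks would overload the later processes. The Python sketch below is an assumption-laden illustration, not the paper's method: the function name and the quadratic cost model (chunk of length L starting at offset S costs roughly L*S + L*L/2) are hypothetical, and it ignores the cost of handing KV-cache between processes. It simply sizes contiguous chunks so that each process's estimated attention work is roughly equal, which is the intuition behind balancing the prefill across processes.

```python
import math


def balance_context(total_len: int, num_procs: int) -> list[int]:
    """Split a prompt of total_len tokens into num_procs contiguous chunks
    whose estimated causal-attention cost is roughly equal.

    Simplified cost model (an assumption, not the paper's scheme): a chunk
    of length L starting at offset S attends to about L*S + L*L/2 positions,
    so the total work over the whole prompt is about total_len^2 / 2.
    """
    target = total_len * total_len / (2 * num_procs)  # work budget per process
    lengths, start = [], 0
    for p in range(num_procs - 1):
        # Solve L*start + L^2/2 = target for this chunk's length L.
        length = int(round(-start + math.sqrt(start * start + 2 * target)))
        # Keep at least one token for each remaining process.
        length = max(1, min(length, total_len - start - (num_procs - 1 - p)))
        lengths.append(length)
        start += length
    lengths.append(total_len - start)  # last process takes the remainder
    return lengths


if __name__ == "__main__":
    # Example: a 4096-token prompt on 4 processes. Earlier processes receive
    # larger chunks because their tokens attend to less preceding context.
    print(balance_context(4096, 4))  # roughly [2048, 848, 651, 549]
```

A closed-form model like this only equalizes attention FLOPs; the paper's load-balancer optimizes TTFT end to end, so the actual partitions it chooses may differ.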
Authors (3)
  1. Minsik Cho (36 papers)
  2. Mohammad Rastegari (57 papers)
  3. Devang Naik (26 papers)
Citations (3)

