Stateful Large Language Model Serving with Pensieve (2312.05516v3)

Published 9 Dec 2023 in cs.LG and cs.DC

Abstract: LLMs are wildly popular today and it is important to serve them efficiently. Existing LLM serving systems are stateless across requests. Consequently, when LLMs are used in the common setting of multi-turn conversations, a growing log of the conversation history must be processed alongside any request by the serving system at each turn, resulting in repeated processing. In this paper, we design Pensieve, a system optimized for multi-turn conversation LLM serving. Pensieve maintains the conversation state across requests by caching previously processed history to avoid duplicate processing. Pensieve's multi-tier caching strategy can utilize both GPU and CPU memory to efficiently store and retrieve cached data. Pensieve also generalizes the recent PagedAttention kernel to support attention between multiple input tokens with a GPU cache spread over non-contiguous memory. Our evaluation shows that Pensieve can achieve 1.14-3.0× the throughput of vLLM and TensorRT-LLM and significantly reduce latency.
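
The caching strategy described in the abstract lends itself to a compact illustration. Below is a minimal Python sketch, not Pensieve's actual implementation: per-conversation KV state is kept in non-contiguous blocks that can be spilled from GPU to CPU memory and promoted back on reuse, so each new turn only prefills the tokens that were not processed in earlier turns. All names here (KVBlock, MultiTierKVCache, prefill_suffix, serve_turn) and the block size are illustrative assumptions; the real system operates on GPU tensors and a generalized PagedAttention kernel.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class KVBlock:
    """A fixed-size block of cached key/value state (tensor payload elided)."""
    tokens: List[int]        # token ids covered by this block
    device: str = "gpu"      # "gpu" or "cpu"

@dataclass
class ConversationState:
    token_ids: List[int] = field(default_factory=list)   # history already processed
    blocks: List[KVBlock] = field(default_factory=list)  # non-contiguous KV blocks

class MultiTierKVCache:
    """Keeps per-conversation KV blocks, spilling to CPU when GPU memory is tight."""
    def __init__(self) -> None:
        self.sessions: Dict[str, ConversationState] = {}

    def lookup(self, session_id: str) -> ConversationState:
        return self.sessions.setdefault(session_id, ConversationState())

    def promote_to_gpu(self, state: ConversationState) -> None:
        # A real system would issue asynchronous host-to-device copies here.
        for blk in state.blocks:
            blk.device = "gpu"

    def evict_to_cpu(self, state: ConversationState) -> None:
        # Under GPU memory pressure, spill blocks to CPU RAM instead of
        # discarding them, so later turns still avoid recomputation.
        for blk in state.blocks:
            blk.device = "cpu"

def prefill_suffix(cached_blocks: List[KVBlock], suffix: List[int],
                   block_size: int = 16) -> List[KVBlock]:
    """Stand-in for the model forward pass over the new suffix tokens: it would
    attend over the cached blocks plus the suffix and emit new KV blocks; here
    only the bookkeeping is materialized."""
    return [KVBlock(tokens=suffix[i:i + block_size])
            for i in range(0, len(suffix), block_size)]

def serve_turn(cache: MultiTierKVCache, session_id: str,
               history_plus_prompt: List[int]) -> ConversationState:
    """Process one conversation turn, recomputing only the unseen suffix."""
    state = cache.lookup(session_id)
    cache.promote_to_gpu(state)
    suffix = history_plus_prompt[len(state.token_ids):]   # new tokens only
    state.blocks.extend(prefill_suffix(state.blocks, suffix))
    state.token_ids.extend(suffix)
    return state

# Example: the second turn reuses the first turn's cached blocks and only
# prefills the newly appended tokens.
cache = MultiTierKVCache()
serve_turn(cache, "chat-1", list(range(40)))           # turn 1: 40 tokens prefilled
state = serve_turn(cache, "chat-1", list(range(70)))   # turn 2: only 30 new tokens
assert len(state.token_ids) == 70
```

The point of the suffix-only prefill is that each turn pays only for its newly appended tokens; the repeated reprocessing of the full conversation log that a stateless server would perform is replaced by attention over the already-cached blocks, which is what the abstract's throughput gains over stateless baselines come from.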

Authors (3)
  1. Lingfan Yu (4 papers)
  2. Jinyang Li (67 papers)
  3. Jinkun Lin (8 papers)
Citations (8)