vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving (2407.15309v1)

Published 22 Jul 2024 in cs.DC and cs.LG

Abstract: LLMs are widely used across various domains, processing millions of daily requests. This surge in demand poses significant challenges in optimizing throughput and latency while keeping costs manageable. The Key-Value (KV) cache, a standard method for retaining previous computations, makes LLM inference highly memory-bound. While batching strategies can enhance performance, they frequently lead to significant memory fragmentation. Even though cutting-edge systems like vLLM mitigate KV cache fragmentation using the PagedAttention mechanism, they still suffer from inefficient memory and computational operations because page management is tightly coupled with the computation kernels. This study introduces vTensor, an innovative tensor structure for LLM inference based on GPU virtual memory management (VMM). vTensor addresses these limitations by decoupling computation from memory defragmentation and offering dynamic extensibility. Our framework employs a CPU-GPU heterogeneous approach, ensuring efficient, fragmentation-free memory management while accommodating various computation kernels across different LLM architectures. Experimental results indicate that vTensor achieves an average speedup of 1.86x across different models, with up to 2.42x in multi-turn chat scenarios. Additionally, vTensor provides average speedups of 2.12x and 3.15x in kernel evaluation, reaching up to 3.92x and 3.27x compared to the SGLang Triton prefix-prefilling kernels and the vLLM PagedAttention kernel, respectively. Furthermore, it frees approximately 71.25% (57 GB) of memory on an NVIDIA A100 GPU compared to vLLM, enabling more memory-intensive workloads.
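The decoupling described in the abstract, where compute kernels see one contiguous virtual tensor while physical pages are managed separately, builds on the CUDA low-level virtual memory management (VMM) driver API cited in references 11 and 12. The following is a minimal sketch, not the authors' implementation, of that underlying mechanism: a large virtual range is reserved once, and physical chunks are created and mapped into it on demand so a KV-cache-like buffer can grow without copying or fragmenting memory. The chunk count, growth loop, and sizes are illustrative assumptions.

```cuda
// Minimal sketch of CUDA low-level VMM: reserve a virtual range once,
// then back it with physical chunks on demand. Build with: nvcc vmm_sketch.cu -lcuda
#include <cuda.h>
#include <cstdio>
#include <vector>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) {      \
    const char* msg; cuGetErrorString(r, &msg);                             \
    fprintf(stderr, "%s failed: %s\n", #call, msg); return 1; } } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev;  CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    // Physical allocations for VMM must be pinned device memory.
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    size_t granularity = 0;
    CHECK(cuMemGetAllocationGranularity(&granularity, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    // Reserve a large contiguous virtual range up front (e.g. the maximum
    // KV-cache size for one request); no physical memory is consumed yet.
    const size_t max_bytes = 64ull * granularity;  // illustrative cap
    CUdeviceptr base = 0;
    CHECK(cuMemAddressReserve(&base, max_bytes, 0, 0, 0));

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

    // Grow the buffer one physical chunk at a time; the virtual base pointer
    // handed to compute kernels never changes, so kernels stay oblivious.
    std::vector<CUmemGenericAllocationHandle> chunks;
    for (int step = 0; step < 4; ++step) {                 // 4 steps: illustrative
        CUmemGenericAllocationHandle h;
        CHECK(cuMemCreate(&h, granularity, &prop, 0));
        CHECK(cuMemMap(base + step * granularity, granularity, 0, h, 0));
        CHECK(cuMemSetAccess(base + step * granularity, granularity, &access, 1));
        chunks.push_back(h);
    }
    printf("mapped %zu bytes at virtual base %p\n",
           chunks.size() * granularity, (void*)base);

    // Teardown: unmap and release physical chunks, then free the reservation.
    for (size_t i = 0; i < chunks.size(); ++i) {
        CHECK(cuMemUnmap(base + i * granularity, granularity));
        CHECK(cuMemRelease(chunks[i]));
    }
    CHECK(cuMemAddressFree(base, max_bytes));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```

In the paper's framing, the mapping and unmapping side of this API is handled asynchronously by a CPU-side manager while GPU kernels operate on plain contiguous pointers, which is what removes the page-table indirection that PagedAttention-style kernels pay for.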

References (64)
  1. 01-ai. Yi-1.5-34b-32k model card, 2024.
  2. Taming throughput-latency tradeoff in llm inference with sarathi-serve, 2024.
  3. Yi: Open foundation models by 01.ai, 2024.
  4. AI@Meta. Llama 3 model card. 2024.
  5. Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023.
  6. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022.
  7. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  8. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  9. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023.
  10. NVIDIA Corporation. cuBLAS library. https://developer.nvidia.com/cublas.
  11. NVIDIA Corporation. CUDA Driver API documentation. https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html#group__CUDA__VA.
  12. NVIDIA Corporation. Introducing low-level GPU virtual memory management. https://developer.nvidia.com/blog/introducing-low-level-gpu-virtual-memory-management/.
  13. A survey on multimodal large language models for autonomous driving, 2023.
  14. M6-rec: Generative pretrained language models are open-ended recommender systems, 2022.
  15. Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. CoRR, abs/2307.08691, 2023.
  16. Flashattention: Fast and memory-efficient exact attention with io-awareness. In NeurIPS, 2022.
  17. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024.
  18. Hugging Face. Text Generation Inference. https://huggingface.co/text-generation-inference, 2024.
  19. Empathyear: An open-source avatar multimodal empathetic chatbot, 2024.
  20. flashInfer.ai. flashinfer. https://github.com/flashinfer-ai/flashinfer, 2023.
  21. Cost-efficient large language model serving for multi-turn conversations with cachedattention, 2024.
  22. Prompt cache: Modular attention reuse for low-latency inference, 2024.
  23. Fractal: Joint multi-level sparse pattern tuning of accuracy and performance for dnn pruning. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ASPLOS ’24, page 416–430, New York, NY, USA, 2024. Association for Computing Machinery.
  24. Accelerating sparse dnn models without hardware-support via tile-wise sparsity. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15. IEEE, 2020.
  25. SQuant: On-the-fly data-free quantization via diagonal hessian approximation. In International Conference on Learning Representations, 2022.
  26. OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization. In Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA). ACM, 2023.
  27. Accelerating sparse dnns based on tiled gemm. IEEE Transactions on Computers, 2024.
  28. Ant: Exploiting adaptive numerical data type for low-bit deep neural network quantization. In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1414–1433. IEEE, 2022.
  29. Gmlake: Efficient and transparent gpu memory defragmentation for large-scale dnn training with virtual memory stitching, 2024.
  30. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
  31. Inference without interference: Disaggregate llm inference for mixed downstream workloads, 2024.
  32. InternLM. LMDeploy: Toolkit for Compressing, Deploying, and Serving Large Language Models. https://github.com/InternLM/lmdeploy, 2024.
  33. Mixtral of experts, 2024.
  34. Efficient memory management for large language model serving with pagedattention, 2023.
  35. Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache, 2024.
  36. AWQ: activation-aware weight quantization for LLM compression and acceleration. CoRR, abs/2306.00978, 2023.
  37. Deja vu: Contextual sparsity for efficient llms at inference time, 2023.
  38. Enhancing educational efficiency: Generative ai chatbots and devops in education 4.0, 2024.
  39. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys, 56(2):1–40, 2023.
  40. NVIDIA. Unified memory in CUDA 6, 2013.
  41. NVIDIA. Pascal architecture GPU, 2016.
  42. NVIDIA. NVIDIA TensorRT. https://developer.nvidia.com/tensorrt, 2023.
  43. OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023.
  44. OpenLLMAI. vLLM v0.4.2. https://github.com/vllm-project/vllm/releases/tag/v0.4.2, 2024.
  45. Splitwise: Efficient generative llm inference using phase splitting, 2024.
  46. Efficiently scaling transformer inference, 2022.
  47. vattention: Dynamic memory management for serving llms without pagedattention, 2024.
  48. vdnn: virtualized deep neural networks for scalable, memory-efficient neural network design. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-49. IEEE Press, 2016.
  49. Noam Shazeer. Fast transformer decoding: One write-head is all you need, 2019.
  50. Llumnix: Dynamic scheduling for large language model serving, 2024.
  51. Evaluation of computational and energy performance in matrix multiplication algorithms on cpu and gpu using mkl, cublas and sycl, 2024.
  52. Llama: Open and efficient foundation language models, 2023.
  53. Llama 2: Open foundation and fine-tuned chat models, 2023.
  54. Attention is all you need, 2023.
  55. vLLM. Automatic prefix caching. https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html.
  56. Dual-side sparse tensor core. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 1083–1095. IEEE, 2021.
  57. vllm: Easy, fast, and cheap llm serving with pagedattention. https://vllm.ai/, 2023.
  58. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, 2023.
  59. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, Carlsbad, CA, July 2022. USENIX Association.
  60. Lv-eval: A balanced long-context benchmark with 5 length levels up to 256k, 2024.
  61. Opt: Open pre-trained transformer language models, 2022.
  62. Atom: Low-bit quantization for efficient and accurate llm serving. In P. Gibbons, G. Pekhimenko, and C. De Sa, editors, Proceedings of Machine Learning and Systems, volume 6, pages 196–209, 2024.
  63. Sglang: Efficient execution of structured language model programs, 2024.
  64. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving, 2024.
Authors (14)
  1. Jiale Xu (17 papers)
  2. Rui Zhang (1138 papers)
  3. Cong Guo (63 papers)
  4. Weiming Hu (91 papers)
  5. Zihan Liu (102 papers)
  6. Feiyang Wu (8 papers)
  7. Yu Feng (216 papers)
  8. Shixuan Sun (15 papers)
  9. Changxu Shao (3 papers)
  10. Yuhong Guo (52 papers)
  11. Junping Zhao (6 papers)
  12. Ke Zhang (264 papers)
  13. Minyi Guo (98 papers)
  14. Jingwen Leng (50 papers)
Citations (4)