In the rapidly evolving landscape of AI, generative large language models (LLMs) stand at the forefront, changing how we interact with our data. However, the computational intensity and memory consumption of these models make them hard to serve efficiently, particularly in scenarios that demand low latency and high throughput. This survey addresses the need for efficient LLM serving methodologies from a machine learning systems (MLSys) research perspective, at the intersection of advanced AI innovations and practical system optimizations. We provide an in-depth analysis covering a spectrum of solutions, ranging from algorithmic modifications to changes in system design. The survey aims to give a comprehensive understanding of the current state of, and future directions in, efficient LLM serving, offering insights that help researchers and practitioners overcome the barriers to effective LLM deployment.