Towards Pareto Optimal Throughput in Small Language Model Serving (2404.03353v1)

Published 4 Apr 2024 in cs.CL

Abstract: Large Language Models (LLMs) have revolutionized the state of the art in many natural language processing tasks. Although serving LLMs is computationally and memory demanding, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who are now able to serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of performance and energy. Our analysis provides a new perspective on serving, highlighting that the small memory footprint of SLMs allows the Pareto-optimal throughput to be reached within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization for serving SLMs.
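To make the model-replication idea concrete, below is a minimal sketch (not the paper's code) of running several independent copies of a small model on a single GPU, with each replica serving a shard of the request stream. The model name (facebook/opt-125m), replica count, batch of prompts, and generation length are illustrative assumptions only; a production serving stack would add batching and scheduling on top of this.

```python
# Minimal sketch of model replication on one accelerator (assumptions:
# PyTorch + Hugging Face transformers installed, a CUDA GPU available,
# and a small model whose weights fit several times in GPU memory).
import torch.multiprocessing as mp
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "facebook/opt-125m"   # illustrative small model
NUM_REPLICAS = 4              # hypothetical replica count; tune to GPU memory

def replica_worker(rank, prompts, results):
    """Each worker process holds its own copy of the model on the shared GPU."""
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL).to("cuda").eval()
    outputs = []
    for prompt in prompts:
        ids = tok(prompt, return_tensors="pt").to("cuda")
        gen = model.generate(**ids, max_new_tokens=32)
        outputs.append(tok.decode(gen[0], skip_special_tokens=True))
    results[rank] = outputs

if __name__ == "__main__":
    mp.set_start_method("spawn")  # required to use CUDA in subprocesses
    prompts = [f"Prompt {i}" for i in range(64)]
    # Round-robin shard of the request stream across replicas.
    shards = [prompts[i::NUM_REPLICAS] for i in range(NUM_REPLICAS)]
    with mp.Manager() as mgr:
        results = mgr.dict()
        procs = [mp.Process(target=replica_worker, args=(r, shards[r], results))
                 for r in range(NUM_REPLICAS)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        total = sum(len(v) for v in results.values())
        print(f"Completed {total} requests across {NUM_REPLICAS} replicas")
```

Timing this run against a single-replica baseline gives a rough view of the throughput gain from replication; dedicated serving engines such as vLLM or DeepSpeed-FastGen schedule requests far more efficiently, so the sketch only illustrates the resource-partitioning idea behind the paper's findings.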

Authors (8)
  1. Pol G. Recasens (5 papers)
  2. Yue Zhu (44 papers)
  3. Chen Wang (599 papers)
  4. Eun Kyung Lee (6 papers)
  5. Olivier Tardieu (6 papers)
  6. Alaa Youssef (7 papers)
  7. Jordi Torres (25 papers)
  8. Josep Ll. Berral (9 papers)
Citations (2)