Towards Pareto Optimal Throughput in Small Language Model Serving (2404.03353v1)
Abstract: Large language models (LLMs) have revolutionized the state of the art across many natural language processing tasks. Although serving LLMs is computationally and memory intensive, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who can now serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of both performance and energy. Our analysis offers a new perspective on serving, highlighting that the small memory footprint of SLMs allows Pareto-optimal throughput to be reached within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization when serving SLMs.
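To make the model-replication idea concrete, below is a minimal sketch of serving several replicas of a small model on a single GPU, with requests dispatched to whichever replica is free. This is not the paper's implementation: the model name (`facebook/opt-125m`), the replica count, and the queue-based dispatch are illustrative assumptions, and it uses HuggingFace Transformers rather than a dedicated serving engine.

```python
# Illustrative sketch only: N model replicas share one accelerator, each in its
# own process with its own weights and KV-cache memory. Assumes a CUDA GPU with
# enough free memory for NUM_REPLICAS copies of the (small) model.
import multiprocessing as mp

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-125m"   # placeholder SLM, small enough to replicate
NUM_REPLICAS = 2                   # replicas sharing a single GPU
MAX_NEW_TOKENS = 64


def replica_worker(replica_id, requests, results):
    """Each replica loads its own copy of the model onto the shared GPU."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype=torch.float16
    ).to("cuda")
    model.eval()

    while True:
        prompt = requests.get()
        if prompt is None:          # sentinel: shut this replica down
            break
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
        results.put((replica_id, tokenizer.decode(output[0], skip_special_tokens=True)))


if __name__ == "__main__":
    # CUDA in child processes requires the 'spawn' start method.
    mp.set_start_method("spawn")
    requests, results = mp.Queue(), mp.Queue()

    workers = [
        mp.Process(target=replica_worker, args=(i, requests, results))
        for i in range(NUM_REPLICAS)
    ]
    for w in workers:
        w.start()

    prompts = [f"Tell me a short story about topic {i}." for i in range(8)]
    for p in prompts:
        requests.put(p)             # replicas pull work as they become free
    for _ in prompts:
        replica_id, text = results.get()
        print(f"[replica {replica_id}] {text[:80]}...")

    for _ in workers:
        requests.put(None)
    for w in workers:
        w.join()
```

The sketch shows the intuition only: because an SLM leaves most of the accelerator's memory unused, running multiple independent replicas can raise aggregate throughput, at the cost of duplicating weights across replicas.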
Authors:
- Pol G. Recasens
- Yue Zhu
- Chen Wang
- Eun Kyung Lee
- Olivier Tardieu
- Alaa Youssef
- Jordi Torres
- Josep Ll. Berral