Towards Pareto Optimal Throughput in Small Language Model Serving (2404.03353v1)
Abstract: Large language models (LLMs) have revolutionized the state of the art across many natural language processing tasks. Although serving LLMs is computationally and memory intensive, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who can now serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of both performance and energy. Our analysis offers a new perspective on serving, highlighting that the small memory footprint of SLMs allows Pareto-optimal throughput to be reached within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization when serving SLMs.
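To make the model-replication idea concrete, below is a minimal sketch of serving several replicas of a small model on a single GPU, with requests dispatched to whichever replica is free. This is not the paper's implementation: the model name (`facebook/opt-125m`), the replica count, and the queue-based dispatch are illustrative assumptions, and it uses HuggingFace Transformers rather than a dedicated serving engine.

```python
# Illustrative sketch only: N model replicas share one accelerator, each in its
# own process with its own weights and KV-cache memory. Assumes a CUDA GPU with
# enough free memory for NUM_REPLICAS copies of the (small) model.
import multiprocessing as mp

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-125m"   # placeholder SLM, small enough to replicate
NUM_REPLICAS = 2                   # replicas sharing a single GPU
MAX_NEW_TOKENS = 64


def replica_worker(replica_id, requests, results):
    """Each replica loads its own copy of the model onto the shared GPU."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype=torch.float16
    ).to("cuda")
    model.eval()

    while True:
        prompt = requests.get()
        if prompt is None:          # sentinel: shut this replica down
            break
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
        results.put((replica_id, tokenizer.decode(output[0], skip_special_tokens=True)))


if __name__ == "__main__":
    # CUDA in child processes requires the 'spawn' start method.
    mp.set_start_method("spawn")
    requests, results = mp.Queue(), mp.Queue()

    workers = [
        mp.Process(target=replica_worker, args=(i, requests, results))
        for i in range(NUM_REPLICAS)
    ]
    for w in workers:
        w.start()

    prompts = [f"Tell me a short story about topic {i}." for i in range(8)]
    for p in prompts:
        requests.put(p)             # replicas pull work as they become free
    for _ in prompts:
        replica_id, text = results.get()
        print(f"[replica {replica_id}] {text[:80]}...")

    for _ in workers:
        requests.put(None)
    for w in workers:
        w.join()
```

The sketch shows the intuition only: because an SLM leaves most of the accelerator's memory unused, running multiple independent replicas can raise aggregate throughput, at the cost of duplicating weights across replicas.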
Authors:
- Pol G. Recasens
- Yue Zhu
- Chen Wang
- Eun Kyung Lee
- Olivier Tardieu
- Alaa Youssef
- Jordi Torres
- Josep Ll. Berral