- The paper introduces a novel learning-to-rank framework for scheduling large language model (LLM) requests based on the relative ordering of predicted output lengths, circumventing the need for precise length predictions.
- This learning-to-rank approach yields substantial gains: up to a 6.9x reduction in latency and a 6.5x increase in throughput compared to traditional first-come-first-serve (FCFS) scheduling.
- The proposed method utilizes a small, robust model that can be easily integrated into existing LLM serving systems with minimal overhead, making it a practical solution for optimizing LLM deployment efficiency.
Efficient LLM Scheduling by Learning to Rank
The paper, titled "Efficient LLM Scheduling by Learning to Rank," addresses a critical bottleneck in the deployment of LLMs: effective scheduling to enhance throughput and reduce latency. With LLMs becoming integral to numerous internet applications, optimizing their serving infrastructure is crucial. Rather than trying to predict exact output lengths, this research leverages the relative ordering of requests by predicted output length within a learning-to-rank framework.
Overview of Approach
Traditional LLM serving systems often rely on first-come-first-serve (FCFS) scheduling, which can lead to inefficiencies due to Head-Of-Line (HOL) blocking: long requests delay the processing of shorter ones, inflating average latency and reducing throughput. Alternative policies such as shortest-job-first (SJF) or shortest-remaining-time-first (SRTF) can theoretically minimize average latency, but they are rarely used in practice because a request's output length is not known in advance.
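To make the HOL-blocking effect concrete, here is a minimal sketch comparing average completion time under FCFS and SJF for a toy batch of requests whose service times are assumed to be known; the numbers are purely illustrative and not taken from the paper.

```python
# Toy comparison of average completion time under FCFS vs. SJF scheduling.
# Service times are assumed known here; in reality they must be predicted.

def avg_completion_time(service_times):
    """Average completion time when jobs run back-to-back in the given order."""
    elapsed, total = 0.0, 0.0
    for t in service_times:
        elapsed += t
        total += elapsed
    return total / len(service_times)

arrival_order = [10.0, 1.0, 1.0, 1.0]               # one long request arrives first
print(avg_completion_time(arrival_order))            # FCFS: (10+11+12+13)/4 = 11.5
print(avg_completion_time(sorted(arrival_order)))    # SJF:  (1+2+3+13)/4   = 4.75
```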
The authors propose an innovative solution that circumvents the need for precise length predictions. Instead, they focus on determining the relative order of request lengths using a model trained to rank. The Kendall rank correlation coefficient (Kendall's Tau) measures how well the predicted ordering aligns with the actual one; a higher Kendall's Tau between predicted and actual orderings translates into lower latency and higher throughput.
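As an illustration of the metric, the snippet below computes Kendall's Tau between a hypothetical set of predicted length scores and the true output lengths using scipy.stats.kendalltau; the sample values are invented for the example.

```python
# Kendall's Tau between predicted length scores and actual output lengths.
# The values are made-up examples, not data from the paper.
from scipy.stats import kendalltau

actual_lengths   = [120, 45, 300, 80, 15]   # true generation lengths (tokens)
predicted_scores = [100, 60, 250, 55, 20]   # predictor's relative length scores

tau, _ = kendalltau(predicted_scores, actual_lengths)
print(f"Kendall's Tau: {tau:.2f}")  # 0.80: one of the ten pairs is ordered wrongly
```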
Key Contributions and Findings
- Learning to Rank for Scheduling: The researchers train a small auxiliary model based on the OPT architecture to rank LLM requests by predicted generation length. This learning-to-rank framework approximates SRTF/SJF scheduling effectively and outperforms methods that predict lengths directly (a simplified scheduler sketch appears after this list).
- Kendall's Tau as a Performance Measure: By adopting Kendall's Tau to evaluate the effectiveness of scheduling order predictions, the paper highlights that improved rank prediction robustness leads to lower latency and higher throughput in real-world applications.
- Starvation Prevention: The paper addresses the potential starvation of longer requests by implementing a mechanism to elevate request priority after certain wait periods, balancing fairness with efficiency.
- Practical Integration: Because the ranking model is small, robust, and simple to deploy, it can be integrated into existing LLM serving infrastructure such as vLLM with minimal overhead (measured at approximately 2% during testing).
- Significant Performance Gains: The proposed method demonstrates substantial improvements over conventional and contemporary scheduling methods. Notably, it achieves up to a 6.9× latency reduction compared to FCFS and enhances throughput by as much as 6.5× in synthetic data generation tasks.
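To ground the bullets on rank-based scheduling and starvation prevention, here is a minimal sketch of how the two could be wired together; the score function, wait threshold, and promotion policy are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of a rank-based request scheduler with starvation prevention.
# predict_rank_score stands in for the small auxiliary ranking model;
# the threshold and promotion policy are assumed values for illustration.
import heapq
import time
from dataclasses import dataclass, field

STARVATION_THRESHOLD_S = 30.0  # assumed wait limit before a request is promoted

@dataclass(order=True)
class Request:
    priority: float                          # lower value = served sooner
    arrival_time: float = field(compare=False)
    prompt: str = field(compare=False)

def predict_rank_score(prompt: str) -> float:
    """Placeholder for the ranking model: any score whose ordering tracks
    the ordering of expected output lengths works here."""
    return float(len(prompt))                # naive proxy, for illustration only

def enqueue(queue: list[Request], prompt: str) -> None:
    heapq.heappush(queue, Request(predict_rank_score(prompt), time.monotonic(), prompt))

def next_request(queue: list[Request]) -> Request:
    """Pop the best-ranked request, promoting any request that has waited
    longer than the starvation threshold so it cannot be deferred forever."""
    now = time.monotonic()
    for req in queue:
        if now - req.arrival_time > STARVATION_THRESHOLD_S:
            req.priority = float("-inf")     # elevate long-waiting requests
    heapq.heapify(queue)                     # restore heap order after promotions
    return heapq.heappop(queue)
```

The key point the sketch illustrates is that only the relative order produced by the predictor matters; any score whose ordering tracks the true output lengths yields the same schedule.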
Implications and Future Directions
This research holds significant implications for the deployment of LLMs in high-demand environments. By scheduling on predicted rank rather than exact length, it improves resource utilization without excessive computational overhead. The methodology balances performance gains against operational simplicity, marking a step forward in LLM service design.
Future directions may explore further refinements in rank prediction models, potentially integrating more advanced features such as context awareness or adaptive learning based on real-time data shifts. Additionally, expanding the model's applicability to even larger and more varied datasets could test the limits and scalability of this approach.
Overall, this paper contributes a meaningful advancement in LLM infrastructure optimization, demonstrating the utility of learning-to-rank strategies in practical settings and setting the stage for subsequent innovations in this space.