Iterative Shortest Remaining Time First
- ISRTF is a scheduling strategy for large language model serving systems that predicts the remaining output tokens at each decoding iteration.
- It adapts the classical SRTF policy to fixed-size decoding windows to mitigate head-of-line blocking and optimize GPU utilization.
- Empirical evaluations in ELIS demonstrate up to a 19.6% reduction in average job completion time with minimal added overhead.
Iterative Shortest Remaining Time First (ISRTF) is a scheduling strategy designed for LLM serving systems with an auto-regressive, iteration-based inference process. ISRTF is a modification of the classical Shortest Remaining Time First (SRTF) approach and is integrated into batch-oriented LLM environments, such as those leveraging decoder-only architectures. The method’s core principle is to predict, at each decoding iteration, the number of output tokens remaining for every active job and to reorder jobs to serve those with the fewest tokens left. This approach directly targets and mitigates the "head-of-line blocking" phenomenon associated with conventional First-Come-First-Served (FCFS) schedulers, especially in multi-user, high-throughput contexts where job lengths are highly variable. ISRTF has been implemented as the central scheduler within ELIS, a cloud-native LLM serving system, and empirically demonstrates reductions in end-to-end average job completion time (JCT) by up to 19.6% with minimal additional overhead (Choi et al., 14 May 2025).
1. Motivation and Theoretical Foundations
ISRTF addresses distinctive characteristics of LLM serving workloads, where decoding is performed in fixed-size windows (e.g., 50 tokens per iteration) and prompt streams arrive asynchronously. Under FCFS, long-output jobs can monopolize resources within a batch, causing head-of-line blocking for subsequent short jobs. This results in suboptimal JCT, particularly in dynamic, high-concurrency settings. ISRTF adapts SRTF logic to the iteration-centric batching model by prioritizing jobs within each batch according to a real-time estimate of remaining output tokens. Minimizing expected queuing delay in this way helps maximize GPU utilization across variable-length jobs and is especially relevant given the stable per-token decoding cost and the relative unpredictability of output lengths in LLM workloads (Choi et al., 14 May 2025).
2. Scheduler Algorithm and Operational Workflow
The ISRTF scheduler operates in discrete rounds, each aligned with a fixed decoding window (50 tokens/job, as instantiated in ELIS). At each iteration:
- A global job pool maintains all active inference requests.
- For jobs lacking prior predictions, the Response Length Predictor (based on the BGE encoder) is initialized with the prompt; for in-progress jobs, the predictor considers the prompt concatenated with generated tokens to predict the remaining length.
- Jobs are enqueued into per-node priority queues (Priority Buffers) sorted by ascending predicted remaining tokens.
- Each backend worker node forms a batch by selecting the highest-priority jobs up to GPU capacity and processes a decoding window.
- Upon receiving partial outputs, completed jobs are finalized; unfinished jobs have their token histories updated and are re-queued.
- This loop continues until all jobs exit the system.
This iterative prioritization and batching approach ensures adaptive, real-time scheduling, dynamically decreasing batching-induced latency for jobs predicted to be closest to completion (Choi et al., 14 May 2025).
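The round-based loop above can be sketched in Python. This is a minimal illustration, not ELIS's actual implementation: the BGE-based predictor is stubbed out with a hypothetical oracle that reveals the true remaining length, and `WINDOW`, `Job`, and `isrtf_schedule` are all illustrative names.

```python
import heapq
from dataclasses import dataclass, field

WINDOW = 50  # fixed decoding window (tokens per job per iteration), as in ELIS

@dataclass(order=True)
class Job:
    predicted_remaining: int                  # priority key: fewest tokens left first
    job_id: str = field(compare=False)
    total_tokens: int = field(compare=False)  # hypothetical true output length
    generated: int = field(compare=False, default=0)

def predict_remaining(job):
    # Stand-in for the BGE-based Response Length Predictor; for this sketch
    # we simply reveal the (hypothetically known) true remaining length.
    return job.total_tokens - job.generated

def isrtf_schedule(jobs, batch_size):
    """Run ISRTF rounds until every job completes; return completion order."""
    queue = []  # per-node Priority Buffer, ordered by predicted remaining tokens
    for job in jobs:
        job.predicted_remaining = predict_remaining(job)
        heapq.heappush(queue, job)
    finished = []
    while queue:
        # Form a batch from the highest-priority jobs up to GPU capacity.
        batch = [heapq.heappop(queue) for _ in range(min(batch_size, len(queue)))]
        for job in batch:                       # one fixed decoding window
            job.generated += min(WINDOW, predict_remaining(job))
        for job in batch:
            if predict_remaining(job) == 0:     # finalize completed jobs
                finished.append(job.job_id)
            else:                               # re-queue with updated priority
                job.predicted_remaining = predict_remaining(job)
                heapq.heappush(queue, job)
    return finished
```

With a short, a medium, and a long job and a batch size of 2, the short job exits first even if it arrived last, illustrating how the priority buffer counteracts head-of-line blocking.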
3. Mathematical Models and Predictor Architecture
LLM per-job latency is formalized as $T_{\text{job}} = \alpha + \beta \cdot N$, where $\alpha$ and $\beta$ are stable constants and $N$ (total output tokens) is the primary source of variance. The Response Length Predictor, leveraging the BGE encoder, is trained on data collected from vLLM-backed LLM serving environments to minimize the prediction error for remaining tokens at arbitrary decoding iterations. At iteration $t$, the predictor infers the remaining length $\hat{N}_{\text{rem}}^{(t)}$ from the prompt concatenated with the tokens generated so far, with observed performance metrics: MAE = 19.92 tokens and RMSE = 34.33 tokens. Notably, as more tokens are generated, the prediction error (MAE) reliably declines with each window, enabling increasingly accurate scheduling as jobs progress (Choi et al., 14 May 2025).
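The linear latency model and the per-iteration remaining-length estimate can be written out directly. The constants below are hypothetical values chosen for illustration, not figures from the paper:

```python
# Hypothetical constants for illustration only (not values from the paper):
ALPHA = 0.35  # fixed per-job cost in seconds (e.g., prefill, batching overhead)
BETA = 0.02   # stable per-token decoding cost in seconds

def job_latency(n_output_tokens: int) -> float:
    """Per-job latency under the linear model: alpha + beta * N."""
    return ALPHA + BETA * n_output_tokens

def remaining_estimate(predicted_total: int, generated_so_far: int) -> int:
    """Remaining-token estimate at a decoding iteration: the predictor's
    total-length estimate (conditioned on prompt + generated prefix) minus
    tokens already produced, floored at zero."""
    return max(predicted_total - generated_so_far, 0)
```

Because the per-token cost is stable, the remaining-token estimate is effectively a remaining-latency estimate, which is what makes an SRTF-style ordering meaningful here.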
4. Integration with LLM Batching, Local Queuing, and Preemption
ISRTF is designed around a fixed window size for decoding (50 tokens/job) and is agnostic to the precise predictor model—the scheduler requires only a scalar “remaining token count,” facilitating future upgrades. Local priority queues on each GPU worker node govern the final in-batch selection, post-assignment from a global load balancer that distributes jobs to nodes with minimal in-flight requests. Communication between frontend and backend is optimized: prompt texts are transmitted only once, with all subsequent iterations involving only partial outputs and updated priority. Preemption is supported by extending vLLM to recognize dynamic per-job priorities; low-priority jobs may be preempted to prevent blocking when GPU resources are stressed. Tuning parameters exist to modulate preemption frequency and mitigate starvation risk within the system (Choi et al., 14 May 2025).
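A preemption decision of the kind described above can be sketched as follows. The `min_gain` threshold is a hypothetical tuning knob standing in for ELIS's parameters that modulate preemption frequency and mitigate starvation; the function name and `Job` shape are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_id: str
    predicted_remaining: int  # scalar priority supplied by the predictor

def maybe_preempt(running, candidate, min_gain=25):
    """Pick a preemption victim for `candidate` when the GPU is saturated:
    preempt the running job with the most predicted tokens left, but only if
    the candidate is shorter by at least `min_gain` tokens. The threshold
    damps preemption frequency and limits starvation of long jobs."""
    victim = max(running, key=lambda j: j.predicted_remaining)
    if candidate.predicted_remaining + min_gain < victim.predicted_remaining:
        return victim
    return None
```

Raising `min_gain` trades responsiveness for stability: fewer preemptions mean less wasted work, at the cost of occasionally keeping a long job in the batch ahead of a slightly shorter newcomer.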
5. Kubernetes-Native Deployment and System Architecture
ELIS, integrating ISRTF, is implemented natively within Kubernetes. The frontend scheduler is deployed as a pod (Deployment or StatefulSet), managing global state through multi-process Python shared-memory constructs. Backend workers, each running vLLM as a StatefulSet pod, are reachable at stable DNS endpoints. Communication between components exploits Kubernetes Services for gRPC/HTTP. Autoscaling is achieved through Horizontal Pod Autoscaler; as request intensity increases, new backend pods are spun up, each immediately assimilated into the scheduling topology. ELIS’s vLLM backend incorporates patches for (i) fixed-window iteration execution and (ii) configurable, per-job priorities overriding vLLM’s standard FCFS behavior (Choi et al., 14 May 2025).
6. Experimental Evaluation and Performance
Evaluation leverages real-world traces from a two-month deployment of Samsung’s “FabriX” LLM service, sampling arrival intervals from a Gamma distribution and prompt contents from the LMSYS-Chat-1M dataset. Tested LLMs include OPT-6.7B/13B, LLaMA2-7B/13B, and Vicuna-13B, deployed on NVIDIA A100 GPUs. Baseline schedulers are FCFS (the native ORCA-style strategy) and Oracle SJF (idealized shortest-job-first with perfect foresight). At a batch size of 4:
- ISRTF achieved an average JCT reduction of 7.36% against FCFS, reaching 21.40% for LLaMA2-13B at 5× load.
- Scheduling overhead (including batching and prediction) was ≈11 ms (0.13% of mean 8.6 s latency).
- Primary performance gains stemmed from 16.7% reduction in queuing delay in high-load scenarios.
- Batch size sensitivity analysis demonstrated that ISRTF outperforms FCFS for all measured batch sizes (1, 2, 4), with up to 19.58% JCT reduction for batch size 1 at base (1×) workload.
- Scalability tests on NVIDIA H100 clusters showed near-linear increase in peak throughput (RPS) as backend worker count scales from 10 to 50, with maintained queue delays below 0.5 s.
These results demonstrate ISRTF’s efficacy in real-world, production-level LLM serving and validate its suitability for high-concurrency, cloud-native deployments (Choi et al., 14 May 2025).
7. Summary and Broader Implications
ISRTF formalizes a lightweight, iteration-aware adaptation of SRTF scheduling for batched LLM inference systems. Through accurate, iterative prediction of remaining output length and dynamic job reordering, ISRTF achieves significant reductions in average job completion time without incurring substantial computational overhead or necessitating major architectural changes to existing LLM serving backends. Its modular design and integration with orchestration frameworks such as Kubernetes permit straightforward adoption in a variety of LLM service architectures. A plausible implication is that continued advances in response length prediction models will further enhance the effectiveness of ISRTF and related dynamic scheduling methodologies within AI-serving infrastructure (Choi et al., 14 May 2025).