MC-SF Scheduling: Memory-Constrained LLM Serving
- Memory-Constrained Shortest-First (MC-SF) is a scheduling heuristic that prioritizes jobs by their decode lengths while managing dynamically changing KV-cache memory during LLM serving.
- The algorithm builds batches greedily by adding jobs in ascending decode order, ensuring that the cumulative memory usage remains within set limits for future processing steps.
- MC-SF suffers a competitive ratio of Ω(√M) under prompt-size heterogeneity, which has motivated the development of improved strategies like the Sorted-F rule.
The memory-constrained Shortest-First (MC-SF) algorithm is a scheduling heuristic designed for online LLM serving with heterogeneous user requests. Each request consists of an input “prefill” phase (prompt tokens) and a sequential “decode” phase (output tokens), both of which consume dynamically changing key-value (KV) cache memory. MC-SF extends the classical Shortest-First (SF) principle to this modern, memory-bounded setting, in which batching and memory utilization pose significant trade-offs. The approach, initially analyzed in the context of LLM-serving optimization by Jaillet et al. (2025), is notable for its suboptimality under prompt-size heterogeneity, which motivates robust alternatives based on more nuanced selection metrics (Wang et al., 8 Aug 2025).
1. Formal Model of Memory-Constrained LLM Scheduling
Consider $n$ jobs, all arriving at time $0$, each specified by a prefill length $p_i$ (input prompt tokens) and a decode length $d_i$ (number of sequential output tokens to be generated), for $i = 1, \dots, n$. A single GPU worker provides limited KV-cache memory $M$. When request $i$ has generated $s$ output tokens ($0 \le s \le d_i$), its KV-cache memory footprint is
$$p_i + s.$$
At each discrete time $t$, a batch of pending jobs can be started or continued—performing exactly one parallel decode step per job—subject to the requirement that, at every future time $t' \ge t$, the total memory across all jobs (started and pending) does not exceed $M$:
$$\sum_{j \in S_{t'}} \bigl(p_j + (t' - \mathrm{start}_j)\bigr) \le M,$$
where $S_{t'}$ is the set of already-started, unfinished jobs at time $t'$. Request $i$ completes when its $d_i$-th output token is generated; total end-to-end latency (TEL) is
$$\mathrm{TEL}(\sigma) = \sum_{i=1}^{n} c_i(\sigma),$$
where $c_i(\sigma)$ is the completion time of request $i$ under schedule $\sigma$ (Wang et al., 8 Aug 2025).
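To make the footprint model concrete, here is a minimal Python sketch of the memory-feasibility condition. The names (`Job`, `peak_memory`, `feasible`) and the exact token-counting convention are my own, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class Job:
    prefill: int    # p_i: prompt tokens, resident in the KV-cache throughout
    decode: int     # d_i: output tokens to generate, one per step
    start: int = 0  # step at which the job was admitted

def peak_memory(jobs, t):
    """Worst-case total KV-cache usage over all steps t' >= t.

    Convention (an assumption of this sketch): a job admitted at step
    `start` has generated (t' - start + 1) tokens by the end of step t',
    so its footprint is p + (t' - start + 1); it completes, and frees its
    cache, after step start + decode - 1.
    """
    horizon = max((j.start + j.decode - 1 for j in jobs), default=t)
    return max(
        (sum(j.prefill + (tp - j.start + 1)
             for j in jobs
             if j.start <= tp <= j.start + j.decode - 1)
         for tp in range(t, horizon + 1)),
        default=0,
    )

def feasible(jobs, t, M):
    """True iff the jobs respect the memory bound M at every future step."""
    return peak_memory(jobs, t) <= M
```

A job with `prefill=5, decode=3` peaks at `5 + 3 = 8` cache entries just before completing, so `feasible` with `M = 7` rejects it while `M = 8` accepts it.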
2. The Memory-Constrained Shortest-First (MC-SF) Rule
The MC-SF scheduling rule prioritizes pending jobs in ascending order of decode length $d_i$, greedily filling each batch with the “shortest-decode” jobs as long as the cumulative worst-case memory across all jobs (currently running set plus candidate additions) remains feasible at every future step. Specifically, at time $t$:
- Define $R_t$ as the set of pending jobs.
- Sort $R_t$ by ascending $d_i$.
- Iteratively add jobs from this ordering to a candidate batch $U$ until including another job would violate
$$\max_{t' \in [t,\, t + \max_{j \in U} d_j]} \sum_{j \in S \cup U} \bigl(p_j + (t' - \mathrm{start}_j)\bigr) \le M,$$
where $S$ is the set of already-running jobs and $\mathrm{start}_j = t$ for candidates $j \in U$.
- All jobs in $U$ are launched in parallel for one decode token, and completed jobs are removed; the process repeats at $t + 1$ (Wang et al., 8 Aug 2025).
3. Competitive Ratio Lower Bound for MC-SF
The competitive ratio (CR) of a scheduling heuristic is defined as the worst-case ratio, over all instances, of its total end-to-end latency (TEL) to that of the offline optimal. For MC-SF, the following holds:
$$\mathrm{CR}(\text{MC-SF}) = \Omega(\sqrt{M}),$$
specifically demonstrated via a constructed instance with
$$\frac{\mathrm{TEL}_{\text{MC-SF}}}{\mathrm{TEL}_{\text{OPT}}} = \Omega(\sqrt{M})$$
as $M$ increases (Wang et al., 8 Aug 2025). Thus, MC-SF's relative inefficiency grows without bound in the memory-rich regime.
4. Construction and Analysis of the Ω(√M) Lower Bound
A worst-case scenario illustrating MC-SF’s suboptimality is constructed as follows:
- Type 1 jobs: $\Theta(\sqrt{M})$ requests with a short decode length (e.g., $d = 1$) but prompts nearly filling memory ($p = \Theta(M)$).
- Type 2 jobs: $\Theta(M)$ requests with small prompts ($p = O(1)$) and slightly longer decodes (e.g., $d = 2$).
Under MC-SF, all Type 1 jobs (with the shorter decode length) run first; each requires about $M$ memory, so only one can run per batch, necessitating approximately $\sqrt{M}$ batches (waves) and producing total TEL $\Theta(M)$ for Type 1. The subsequent Type 2 jobs (with the longer decode length) are processed in large batches, but only after waiting $\Theta(\sqrt{M})$ steps, giving total TEL $\Theta(M^{3/2})$. In contrast, the optimal schedule reverses the order: the Type 2 jobs are processed first in parallel batches of size $\Theta(M)$ and finish within $O(1)$ waves, after which the Type 1 jobs run one at a time, resulting in TEL $\Theta(M)$ over all jobs—yielding a competitive ratio $\Omega(\sqrt{M})$ (Wang et al., 8 Aug 2025).
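The wave-by-wave arithmetic behind this kind of construction can be checked numerically. The sketch below assumes one concrete instantiation (⌊√M⌋ Type 1 jobs with $p = M - 2$, $d = 1$; $M$ Type 2 jobs with $p = 1$, $d = 2$; peak footprint $p + d$ per job); the paper's exact parameters may differ:

```python
import math

def tel_mcsf(M):
    """TEL under MC-SF: Type 1 first (shorter decode), one job per step."""
    k = math.isqrt(M)                 # number of Type 1 jobs
    tel1 = k * (k + 1) // 2           # one per batch: completions at steps 1..k
    B = M // 3                        # Type 2 batch size (peak footprint 1+2=3)
    W = math.ceil(M / B)              # number of Type 2 waves, 2 steps each
    tel2 = sum(min(B, M - (w - 1) * B) * (k + 2 * w) for w in range(1, W + 1))
    return tel1 + tel2

def tel_reversed(M):
    """TEL when Type 2 runs first, then Type 1 (near-optimal order)."""
    k = math.isqrt(M)
    B = M // 3
    W = math.ceil(M / B)
    tel2 = sum(min(B, M - (w - 1) * B) * (2 * w) for w in range(1, W + 1))
    tel1 = sum(2 * W + j for j in range(1, k + 1))  # sequential Type 1 tail
    return tel1 + tel2

for M in (10_000, 1_000_000):
    # Ratio grows roughly like sqrt(M): about 22.8 at M=10^4, 222.8 at M=10^6.
    print(M, round(tel_mcsf(M) / tel_reversed(M), 1))
```

Scaling $M$ by 100 scales the latency ratio by roughly 10, consistent with the $\Omega(\sqrt{M})$ bound.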
5. Pseudocode Implementation and Computational Complexity
The MC-SF algorithm’s workflow and computational characteristics are as follows:
```
Algorithm MC-SF
Input: jobs i = 1..n with (p_i, d_i), memory M
Initialize t ← 0, S ← ∅, R ← {1, …, n}
while S ∪ R ≠ ∅ do
    t ← t + 1
    U ← ∅
    for i in R sorted by ascending d_i do
        if adding i to S ∪ U keeps
             max_{t' ∈ [t, t + max_{j∈U} d_j]} ∑_{j∈S∪U} (p_j + (t' − start_j)) ≤ M
        then U ← U ∪ {i}
        else break
    end for
    R ← R \ U;  S ← S ∪ U
    // Process one token of every job in S in parallel
    for j in S do
        if j has completed d_j tokens then remove j from S and record c_j ← t
    end for
end while
return the completion times {c_i}
```

Per decode step, the dominant costs are sorting the pending set ($O(n \log n)$) and the feasibility checks during greedy admission, so each step runs in low-order polynomial time in $n$.
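The pseudocode above can be translated into a short runnable Python sketch. The `(p_i, d_i)` tuple encoding, the token-counting convention, and the function name `mcsf_schedule` are my own; the sketch also assumes every job fits in memory on its own ($p_i + d_i \le M$), otherwise the loop never terminates:

```python
def mcsf_schedule(jobs, M):
    """Simulate MC-SF on jobs given as (p_i, d_i) pairs with KV-cache M.

    Returns completion steps c_i (1-indexed) in the original job order.
    Convention: a job admitted at step s decodes at steps s..s+d-1 and
    holds p + (tokens generated so far) cache entries while running.
    Assumes p_i + d_i <= M for every job.
    """
    n = len(jobs)
    pending = sorted(range(n), key=lambda i: jobs[i][1])  # ascending d_i
    running = {}                                          # index -> start step
    done = [None] * n
    t = 0
    while pending or running:
        t += 1
        # Greedily admit shortest-decode jobs while the worst-case future
        # memory over the trial batch's lifetime stays within M.
        admitted = []
        for i in pending:
            trial = {**running, **{j: t for j in admitted + [i]}}
            horizon = max(s + jobs[j][1] - 1 for j, s in trial.items())
            peak = max(
                sum(jobs[j][0] + (tp - s + 1)
                    for j, s in trial.items()
                    if s <= tp <= s + jobs[j][1] - 1)
                for tp in range(t, horizon + 1)
            )
            if peak <= M:
                admitted.append(i)
            else:
                break                    # greedy: stop at first misfit
        for i in admitted:
            pending.remove(i)
            running[i] = t
        # One parallel decode step for every running job; record completions.
        for i in [j for j, s in running.items() if t - s + 1 >= jobs[j][1]]:
            done[i] = t
            del running[i]
    return done
```

For example, with ample memory, `mcsf_schedule([(1, 2), (1, 1)], 10)` runs both jobs together from step 1, finishing the shorter job first.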
6. Regimes of Failure and Algorithmic Remedies
MC-SF fails when prompt sizes $p_i$ vary significantly relative to decode lengths $d_i$, as sorting by $d_i$ alone disregards the substantial KV-cache occupancy of jobs with large $p_i$. The aforementioned lower bound exploits exactly such heterogeneity. The principal remedies, as introduced in (Wang et al., 8 Aug 2025), include:
- Batch quality metric $F$: for any candidate batch $B$, a score $F(B)$ is defined that jointly accounts for the prompt lengths and decode lengths of the jobs in $B$; at each time step, the subset minimizing $F$ (subject to memory capacity) is selected. This “Sorted-F” rule balances prompt size against decode length.
- Constant competitive ratio: Sorted-F achieves a constant CR, independent of $M$.
- Accelerated approximations: Further speedups are enabled by dynamic programming, local search, quantile-greedy strategies, and LP-based heuristics.
The incorporation of both $p_i$ and $d_i$ in batch selection resolves MC-SF’s vulnerability, particularly under non-uniform prompt distributions (Wang et al., 8 Aug 2025).
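Since the exact form of $F$ is specific to the paper and not reproduced here, the selection framework can only be sketched with a pluggable score. Everything below (`best_batch`, `example_score`) is illustrative, not the paper's definition:

```python
from itertools import combinations

def best_batch(pending, used_mem, M, score_fn):
    """Choose the feasible subset of pending (p, d) jobs minimizing score_fn.

    Brute force over subsets for clarity; the paper's accelerated variants
    (dynamic programming, local search, quantile-greedy, LP-based) avoid
    this exponential search. A batch is treated as feasible when its peak
    footprint sum(p + d) fits into the memory not already in use.
    """
    best, best_score = [], float("inf")
    for r in range(1, len(pending) + 1):
        for batch in combinations(pending, r):
            if sum(p + d for p, d in batch) <= M - used_mem:
                s = score_fn(batch)
                if s < best_score:
                    best, best_score = list(batch), s
    return best

def example_score(batch):
    """Placeholder score (NOT the paper's F): penalizes long decodes
    weighted by the batch's total memory demand."""
    return max(d for _, d in batch) * sum(p + d for p, d in batch)
```

With `M = 4` and jobs `[(1, 1), (5, 5)]`, only the small job fits, so `best_batch` returns `[(1, 1)]` regardless of the score; the score only matters when several feasible subsets compete.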
7. Context and Significance
MC-SF formalizes a practical but fundamentally limited class of greedy scheduling policies for modern LLM service systems subject to severe KV-cache constraints. While effective for jobs with uniform or near-uniform prompt sizes, it demonstrates provable inefficiency under prompt heterogeneity, as established both analytically and with explicit lower bounds. These findings have prompted the adoption of more sophisticated metrics and batch selection strategies that incorporate both prefill and decode characteristics. The Sorted-F rule, as well as dynamically informed batch optimization, now represent the state-of-the-art in this scheduling regime, substantially outperforming MC-SF and related heuristics while retaining computational practicality (Wang et al., 8 Aug 2025).