
MC-SF Scheduling: Memory-Constrained LLM Serving

Updated 18 November 2025
  • Memory-Constrained Shortest-First (MC-SF) is a scheduling heuristic that prioritizes jobs by decode length while managing the dynamically growing KV-cache memory during LLM serving.
  • The algorithm builds batches greedily, adding jobs in ascending decode order while ensuring that cumulative memory usage stays within the budget $M$ at every future step.
  • MC-SF suffers a competitive-ratio lower bound of $\Omega(\sqrt{M})$ under memory heterogeneity, which has motivated improved strategies such as the Sorted-F rule.

The Memory-Constrained Shortest-First (MC-SF) algorithm is a scheduling heuristic for online LLM serving with heterogeneous user requests. Each request consists of an input "prefill" phase (prompt tokens) and a sequential "decode" phase (output tokens), both of which consume dynamically changing key-value (KV) cache memory. MC-SF extends the classical Shortest-First (SF) principle to this memory-bounded setting, in which batching and memory utilization involve significant trade-offs. The approach, initially analyzed in the context of LLM-serving optimization by Jaillet et al. (2025), is notable for its suboptimality under memory heterogeneity, which motivates robust alternatives based on more nuanced selection metrics (Wang et al., 8 Aug 2025).

1. Formal Model of Memory-Constrained LLM Scheduling

Consider $n$ jobs, all arriving at time $0$, each specified by a prefill length $p_i$ (input prompt tokens) and a decode length $d_i$ (number of sequential output tokens to be generated), for $i \in \{1, \dotsc, n\}$. A single GPU worker provides limited KV-cache memory $M$. When request $i$ has generated $a_i$ output tokens ($0 \leq a_i \leq d_i$), its KV-cache memory footprint is

\mathrm{mem}_i(a_i) = p_i + a_i.

At each discrete time $t$, a batch $\mathcal{U}_t$ of pending jobs can be started or continued, performing exactly one parallel decode step per job, subject to the requirement that, for all future times $t' \geq t$, the total memory across all jobs (already started and newly batched) does not exceed $M$:

\sum_{i \in S^{(t)}} \bigl(p_i + (t' - s_i)\bigr) \mathbf{1}_{\{t' - s_i \leq d_i\}} + \sum_{i \in \mathcal{U}_t} \bigl(p_i + (t' - t)\bigr) \mathbf{1}_{\{t' - t \leq d_i\}} \leq M,

where $S^{(t)}$ is the set of already-started, unfinished jobs and $s_i$ denotes the start time of job $i$. Request $i$ completes when its $d_i$-th output token is generated; the total end-to-end latency (TEL) of a schedule $\Lambda$ is

\mathrm{TEL}(\Lambda) = \sum_{i=1}^n c_i,

where $c_i$ is the completion time of request $i$ under schedule $\Lambda$ (Wang et al., 8 Aug 2025).
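As a minimal sketch of the model above (the class and function names are illustrative, not from the paper), the footprint and latency definitions translate directly to code:

```python
from dataclasses import dataclass

@dataclass
class Job:
    p: int  # prefill length (prompt tokens)
    d: int  # decode length (output tokens)

def mem(job: Job, a: int) -> int:
    """KV-cache footprint of a job that has generated a output tokens."""
    return job.p + a

def total_latency(completion_times) -> int:
    """Total end-to-end latency (TEL): sum of per-request completion times."""
    return sum(completion_times)

# A job with a 5-token prompt occupies 5 + 3 = 8 cache slots
# after generating 3 output tokens.
print(mem(Job(p=5, d=10), a=3))   # 8
print(total_latency([4, 7, 9]))   # 20
```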

2. The Memory-Constrained Shortest-First (MC-SF) Rule

The MC-SF scheduling rule prioritizes pending jobs in ascending order of decode length $d_i$, greedily filling each batch with the shortest-decode jobs as long as the cumulative worst-case memory across all jobs (currently running plus candidate additions) remains feasible at every future step. Specifically, at time $t$:

  • Define $R^{(t)}$ as the set of pending jobs.
  • Sort $R^{(t)}$ in ascending order of $d_i$.
  • Iteratively add jobs from this ordering to a candidate batch $U$ until including another job would violate:

\max_{t' \in [t,\, t + \max_{i \in U} d_i]} \left[ \sum_{i \in S^{(t)}} \bigl(p_i + (t' - s_i)\bigr) \mathbf{1}_{\{t' - s_i \leq d_i\}} + \sum_{i \in U} \bigl(p_i + (t' - t)\bigr) \mathbf{1}_{\{t' - t \leq d_i\}} \right] \leq M,

where $s_i$ is the start time of job $i$.

  • All jobs in $U$ are launched in parallel for one decode token, and completed jobs are removed; the process repeats at $t + 1$ (Wang et al., 8 Aug 2025).
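The admission condition above can be sketched as follows. This is an illustrative Python transcription (function names and the `(p, d)` tuple layout are assumptions), using the same indicator convention as the displayed constraint:

```python
def batch_feasible(running, candidate, M, t):
    """Check the MC-SF admission condition: at every future step t',
    the memory of already-running jobs plus the candidate batch stays <= M.
    running:   list of (p, d, start_time) for started, unfinished jobs
    candidate: list of (p, d) jobs that would all start at time t
    """
    if not candidate and not running:
        return True
    horizon = t + max((d for _, d in candidate), default=0)
    horizon = max(horizon, max((s + d for _, d, s in running), default=t))
    for tp in range(t, horizon + 1):
        used = sum(p + (tp - s) for p, d, s in running if tp - s <= d)
        used += sum(p + (tp - t) for p, d in candidate if tp - t <= d)
        if used > M:
            return False
    return True

def select_batch(pending, running, M, t):
    """Greedily admit pending (p, d) jobs in ascending decode-length order."""
    batch = []
    for job in sorted(pending, key=lambda pd: pd[1]):
        if batch_feasible(running, batch + [job], M, t):
            batch.append(job)
        else:
            break
    return batch

# The d = 1 job is admitted first; its large prompt then blocks the rest.
print(select_batch([(1, 2), (1, 2), (8, 1)], [], 10, 0))  # [(8, 1)]
```

Note the early `break`: MC-SF stops scanning at the first infeasible job rather than skipping it, exactly as in the rule above.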

3. Competitive Ratio Lower Bound for MC-SF

The competitive ratio (CR) of a scheduling heuristic is defined as the worst-case ratio, over all instances, of its total end-to-end latency (TEL) to that of the offline optimum. For MC-SF, the following holds:

\mathrm{CR}(\mathrm{MC\text{-}SF}) = \sup_{\mathcal I} \frac{\mathrm{TEL}(\mathrm{MC\text{-}SF};\mathcal I)}{\mathrm{TEL}(\mathrm{Optimal};\mathcal I)} \geq \Omega(\sqrt{M}), \qquad M \to \infty,

specifically demonstrated via a constructed instance with

\frac{\mathrm{TEL}(\mathrm{MC\text{-}SF})}{\mathrm{TEL}(\Lambda_\mathrm{opt})} \geq \frac{2}{13}\sqrt{M} \to \infty

as $M$ increases (Wang et al., 8 Aug 2025). Thus, MC-SF's relative inefficiency grows without bound in the memory-rich regime.

4. Construction and Analysis of the Ω(√M) Lower Bound

A worst-case scenario illustrating MC-SF’s suboptimality is constructed as follows:

  • Type 1 jobs: $X = M$ requests with $(p = \sqrt{M} - 1,\ d = 1)$
  • Type 2 jobs: $Y = M^{1.5}$ requests with $(p = 1,\ d = 2)$

Under MC-SF, all Type 1 jobs (with $d = 1$) run first: each has peak footprint $p + d = \sqrt{M}$, so only $M/\sqrt{M} = \sqrt{M}$ of them fit in a batch, requiring roughly $\sqrt{M}$ sequential waves. Every one of the $M^{1.5}$ Type 2 jobs (with $d = 2$) is therefore delayed by about $\sqrt{M}$ steps before being processed, so the MC-SF schedule incurs total TEL $\sim O(M^{1.5} \cdot \sqrt{M}) = O(M^2)$. In contrast, the optimal schedule reverses the order, processing both job types in parallel batches of size $\sim M/3$ and completing all jobs with TEL $\sim O(M^{1.5})$, which yields the competitive ratio $\sim \sqrt{M}$ (Wang et al., 8 Aug 2025).
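The counts in this construction are easy to check numerically. The sketch below (a hypothetical helper, assuming $M$ is a perfect square) builds the instance and reproduces the wave arithmetic:

```python
import math

def lower_bound_instance(M: int):
    """Build the Omega(sqrt(M)) instance: M 'type 1' jobs (p = sqrt(M) - 1,
    d = 1) and M**1.5 'type 2' jobs (p = 1, d = 2); M must be a perfect square."""
    r = math.isqrt(M)
    assert r * r == M, "M must be a perfect square for this construction"
    type1 = [(r - 1, 1)] * M        # X = M requests
    type2 = [(1, 2)] * (M * r)      # Y = M**1.5 requests
    return type1, type2

M = 64
type1, type2 = lower_bound_instance(M)
# Each type-1 job peaks at p + d = sqrt(M) memory, so only M / sqrt(M) = sqrt(M)
# of them fit per batch, forcing sqrt(M) sequential waves under MC-SF.
peak = type1[0][0] + type1[0][1]
waves = len(type1) // (M // peak)
print(peak, waves)   # 8 8
```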

5. Pseudocode Implementation and Computational Complexity

The MC-SF algorithm’s workflow and computational characteristics are as follows:

Algorithm MC-SF
Input: jobs i = 1..n with (p_i, d_i); memory budget M.
Initialize t ← 0, S ← ∅, R ← {1, …, n}.
while S ∪ R ≠ ∅ do
  t ← t + 1
  U ← ∅
  for i in R sorted by ascending d_i do
    if adding i to U keeps
       max_{t' ∈ [t, t + max_{j∈U} d_j]} Σ_{j ∈ S∪U} (p_j + (t' − start_j)) · 1{t' − start_j ≤ d_j} ≤ M
    then U ← U ∪ {i}
    else break
  end for
  R ← R \ U;  S ← S ∪ U
  // Process one token of every job in S in parallel
  for j in S do
    if j has completed d_j tokens then remove j from S and record c_j ← t
  end for
end while
return the completion times {c_i}.
At each time step $t$, up to $n$ jobs are scanned and, for each, an $O(M)$ memory-feasibility check is required (equivalently, an $O(|U|)$ future-step search). The number of time steps is bounded by $\sum_i d_i$, leading to a worst-case runtime of $O(n \cdot (\sum_i d_i) \cdot (n + M))$, precluding practical use for $n$, $M$ in the thousands (Wang et al., 8 Aug 2025).
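The pseudocode can be made runnable. The sketch below is an unoptimized Python transcription, under the simplifying convention (an assumption of this sketch, not the paper's exact accounting) that a job admitted at step t occupies p + k cache slots after its k-th decode step and completes at step t + d − 1:

```python
def peak_memory(active, jobs, t):
    """Worst-case total KV-cache over all steps tp >= t, where a job
    started at step s runs through step s + d - 1 and, while running,
    occupies p + (tp - s + 1) slots (prompt plus tokens decoded so far)."""
    if not active:
        return 0
    horizon = max(s + jobs[i][1] - 1 for i, s in active.items())
    return max(
        sum(jobs[i][0] + (tp - s + 1)
            for i, s in active.items() if tp <= s + jobs[i][1] - 1)
        for tp in range(t, horizon + 1)
    )

def mc_sf(jobs, M):
    """Unoptimized MC-SF simulator; jobs is a list of (p_i, d_i) pairs.
    Returns {job index: completion step}; mirrors the pseudocode directly."""
    pending = sorted(range(len(jobs)), key=lambda i: jobs[i][1])  # by d_i
    running, finish = {}, {}
    t = 0
    while pending or running:
        t += 1
        # Greedy admission: shortest decode first, checked against the
        # worst-case future memory of running + already-admitted jobs.
        admitted = []
        for i in pending:
            trial = {**running, **{j: t for j in admitted}, i: t}
            if peak_memory(trial, jobs, t) <= M:
                admitted.append(i)
            else:
                break
        if pending and not admitted and not running:
            raise ValueError("a single job exceeds the memory budget M")
        for i in admitted:
            pending.remove(i)
            running[i] = t
        # One decode step each; retire jobs that produced their d-th token.
        for i, s in list(running.items()):
            if t == s + jobs[i][1] - 1:
                finish[i] = t
                del running[i]
    return finish

print(mc_sf([(1, 2), (1, 2), (8, 1)], M=10))  # {2: 1, 0: 3, 1: 3}
```

In the usage line, the large-prompt d = 1 job runs alone in the first step, after which the two small-prompt jobs run together, illustrating the shortest-first ordering.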

6. Regimes of Failure and Algorithmic Remedies

MC-SF fails when prompt sizes $p_i$ vary significantly relative to decode lengths $d_i$, since sorting by $d_i$ alone disregards the substantial KV-cache occupancy of jobs with large $p_i$. The aforementioned lower bound exploits exactly such heterogeneity. The principal remedies, as introduced in (Wang et al., 8 Aug 2025), include:

  1. Batch quality metric $F(X)$: For any batch $X$, define

F(X) = \frac{\sum_{i \in X} d_i}{|X|^2},

and at each time step select the subset $X \subseteq R$ minimizing $F(X)$, subject to memory capacity. This "Sorted-F" rule balances prompt size and decode length.

  2. Constant competitive ratio: Sorted-F achieves a constant CR ($\leq 48$), independent of $M$.
  3. Accelerated approximations: Further speedups are enabled by dynamic programming, local search, quantile-greedy strategies, and LP-based heuristics.

The incorporation of both $p_i$ and $d_i$ in selection resolves MC-SF's vulnerability, particularly under non-uniform prompt distributions (Wang et al., 8 Aug 2025).
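For small instances, the Sorted-F metric can be illustrated by brute force. The sketch below is purely illustrative (exhaustive subset search with a simplified peak-memory model for jobs starting together), not the paper's efficient implementation:

```python
from itertools import combinations

def F(batch):
    """Batch quality metric: sum of decode lengths over squared batch size."""
    return sum(d for _, d in batch) / len(batch) ** 2

def peak(batch):
    """Peak total KV-cache if all (p, d) jobs in the batch start together:
    at decode step k, a job with d >= k occupies p + k slots."""
    return max(sum(p + k for p, d in batch if d >= k)
               for k in range(1, max(d for _, d in batch) + 1))

def best_batch(pending, M):
    """Among memory-feasible nonempty subsets, pick the one minimizing F.
    Exponential in len(pending); only for demonstration."""
    best, best_val = None, float("inf")
    for r in range(1, len(pending) + 1):
        for cand in combinations(pending, r):
            if peak(cand) <= M and F(cand) < best_val:
                best, best_val = cand, F(cand)
    return best

# F rewards large batches with small total decode work, so the two
# small jobs beat the single large-prompt job here.
print(best_batch([(1, 1), (1, 1), (8, 1)], M=10))  # ((1, 1), (1, 1))
```

Because $F$ divides by $|X|^2$, larger feasible batches are favored, and large-prompt jobs shrink the feasible batch size through the memory constraint; this is how prompt size enters the selection even though $F$ itself only involves $d_i$.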

7. Context and Significance

MC-SF formalizes a practical but fundamentally limited class of greedy scheduling policies for modern LLM service systems subject to severe KV-cache constraints. While effective for jobs with uniform or near-uniform prompt sizes, it demonstrates provable inefficiency under prompt heterogeneity, as established both analytically and with explicit lower bounds. These findings have prompted the adoption of more sophisticated metrics and batch selection strategies that incorporate both prefill and decode characteristics. The Sorted-F rule, as well as dynamically informed batch optimization, now represent the state-of-the-art in this scheduling regime, substantially outperforming MC-SF and related heuristics while retaining computational practicality (Wang et al., 8 Aug 2025).
