
MC-SF Scheduling: Memory-Constrained LLM Serving

Updated 18 November 2025
  • Memory-Constrained Shortest-First (MC-SF) is a scheduling heuristic that prioritizes jobs by decode length while managing the dynamically growing KV-cache memory during LLM serving.
  • The algorithm builds batches greedily, adding jobs in ascending decode order while ensuring that cumulative memory usage stays within the budget $M$ at every future step.
  • MC-SF suffers a competitive-ratio lower bound of $\Omega(\sqrt{M})$ under memory heterogeneity, which has motivated improved strategies such as the Sorted-F rule.

The Memory-Constrained Shortest-First (MC-SF) algorithm is a scheduling heuristic for online LLM serving with heterogeneous user requests. Each request consists of an input "prefill" phase (prompt tokens) and a sequential "decode" phase (output tokens), both of which consume dynamically changing key-value (KV) cache memory. MC-SF extends the classical Shortest-First (SF) principle to this memory-bounded setting, in which batching and memory utilization involve significant trade-offs. The approach, initially analyzed in the context of LLM-serving optimization by Jaillet et al. (2025), is notable for its suboptimality under memory heterogeneity, which motivates robust alternatives based on more nuanced selection metrics (Wang et al., 8 Aug 2025).

1. Formal Model of Memory-Constrained LLM Scheduling

Consider $n$ jobs, all arriving at time $0$, each specified by a prefill length $p_i$ (input prompt tokens) and a decode length $d_i$ (number of sequential output tokens to be generated), for $i \in \{1, \dotsc, n\}$. A single GPU worker provides limited KV-cache memory $M$. When request $i$ has generated $a_i$ output tokens ($0 \leq a_i \leq d_i$), its KV-cache memory footprint is

\mathrm{mem}_i(a_i) = p_i + a_i.

At each discrete time $t$, a batch $\mathcal{U}_t$ of pending jobs can be started or continued, performing exactly one parallel decode step per job, subject to the requirement that, for all future times $t' \geq t$, the total memory across all jobs (already started and newly batched) does not exceed $M$:

\sum_{i \in S^{(t)}} \bigl(p_i + (t' - s_i)\bigr) \mathbf{1}_{\{t' - s_i \leq d_i\}} + \sum_{i \in \mathcal{U}_t} \bigl(p_i + (t' - t)\bigr) \mathbf{1}_{\{t' - t \leq d_i\}} \leq M,

where $S^{(t)}$ is the set of already-started, unfinished jobs and $s_i$ denotes the start time of job $i$. Request $i$ completes when its $d_i$-th output token is generated; the total end-to-end latency (TEL) of a schedule $\Lambda$ is

\mathrm{TEL}(\Lambda) = \sum_{i=1}^n c_i,

where $c_i$ is the completion time of request $i$ under schedule $\Lambda$ (Wang et al., 8 Aug 2025).
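As a minimal sketch of the model above (the class and function names are illustrative, not from the paper), the footprint and latency definitions translate directly to code:

```python
from dataclasses import dataclass

@dataclass
class Job:
    p: int  # prefill length (prompt tokens)
    d: int  # decode length (output tokens)

def mem(job: Job, a: int) -> int:
    """KV-cache footprint of a job that has generated a output tokens."""
    return job.p + a

def total_latency(completion_times) -> int:
    """Total end-to-end latency (TEL): sum of per-request completion times."""
    return sum(completion_times)

# A job with a 5-token prompt occupies 5 + 3 = 8 cache slots
# after generating 3 output tokens.
print(mem(Job(p=5, d=10), a=3))   # 8
print(total_latency([4, 7, 9]))   # 20
```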

2. The Memory-Constrained Shortest-First (MC-SF) Rule

The MC-SF scheduling rule prioritizes pending jobs in ascending order of decode length $d_i$, greedily filling each batch with the shortest-decode jobs as long as the cumulative worst-case memory across all jobs (currently running plus candidate additions) remains feasible at every future step. Specifically, at time $t$:

  • Define $R^{(t)}$ as the set of pending jobs.
  • Sort $R^{(t)}$ in ascending order of $d_i$.
  • Iteratively add jobs from this ordering to a candidate batch $U$ until including another job would violate:

\max_{t' \in [t,\, t + \max_{i \in U} d_i]} \left[ \sum_{i \in S^{(t)}} \bigl(p_i + (t' - s_i)\bigr) \mathbf{1}_{\{t' - s_i \leq d_i\}} + \sum_{i \in U} \bigl(p_i + (t' - t)\bigr) \mathbf{1}_{\{t' - t \leq d_i\}} \right] \leq M,

where $s_i$ is the start time of job $i$.

  • All jobs in $U$ are launched in parallel for one decode token, and completed jobs are removed; the process repeats at $t + 1$ (Wang et al., 8 Aug 2025).
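The admission condition above can be sketched as follows. This is an illustrative Python transcription (function names and the `(p, d)` tuple layout are assumptions), using the same indicator convention as the displayed constraint:

```python
def batch_feasible(running, candidate, M, t):
    """Check the MC-SF admission condition: at every future step t',
    the memory of already-running jobs plus the candidate batch stays <= M.
    running:   list of (p, d, start_time) for started, unfinished jobs
    candidate: list of (p, d) jobs that would all start at time t
    """
    if not candidate and not running:
        return True
    horizon = t + max((d for _, d in candidate), default=0)
    horizon = max(horizon, max((s + d for _, d, s in running), default=t))
    for tp in range(t, horizon + 1):
        used = sum(p + (tp - s) for p, d, s in running if tp - s <= d)
        used += sum(p + (tp - t) for p, d in candidate if tp - t <= d)
        if used > M:
            return False
    return True

def select_batch(pending, running, M, t):
    """Greedily admit pending (p, d) jobs in ascending decode-length order."""
    batch = []
    for job in sorted(pending, key=lambda pd: pd[1]):
        if batch_feasible(running, batch + [job], M, t):
            batch.append(job)
        else:
            break
    return batch

# The d = 1 job is admitted first; its large prompt then blocks the rest.
print(select_batch([(1, 2), (1, 2), (8, 1)], [], 10, 0))  # [(8, 1)]
```

Note the early `break`: MC-SF stops scanning at the first infeasible job rather than skipping it, exactly as in the rule above.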

3. Competitive Ratio Lower Bound for MC-SF

The competitive ratio (CR) of a scheduling heuristic is defined as the worst-case ratio, over all instances, of its total end-to-end latency (TEL) to that of the offline optimum. For MC-SF, the following holds:

\mathrm{CR}(\mathrm{MC\text{-}SF}) = \sup_{\mathcal I} \frac{\mathrm{TEL}(\mathrm{MC\text{-}SF};\mathcal I)}{\mathrm{TEL}(\mathrm{Optimal};\mathcal I)} \geq \Omega(\sqrt{M}), \qquad M \to \infty,

specifically demonstrated via a constructed instance with

\frac{\mathrm{TEL}(\mathrm{MC\text{-}SF})}{\mathrm{TEL}(\Lambda_\mathrm{opt})} \geq \frac{2}{13}\sqrt{M} \to \infty

as $M$ increases (Wang et al., 8 Aug 2025). Thus, MC-SF's relative inefficiency grows without bound in the memory-rich regime.

4. Construction and Analysis of the Ω(√M) Lower Bound

A worst-case scenario illustrating MC-SF’s suboptimality is constructed as follows:

  • Type 1 jobs: $X = M$ requests with $(p = \sqrt{M} - 1,\ d = 1)$
  • Type 2 jobs: $Y = M^{1.5}$ requests with $(p = 1,\ d = 2)$

Under MC-SF, all Type 1 jobs (with $d = 1$) run first: each has peak footprint $p + d = \sqrt{M}$, so only $M/\sqrt{M} = \sqrt{M}$ of them fit in a batch, requiring roughly $\sqrt{M}$ sequential waves. Every one of the $M^{1.5}$ Type 2 jobs (with $d = 2$) is therefore delayed by about $\sqrt{M}$ steps before being processed, so the MC-SF schedule incurs total TEL $\sim O(M^{1.5} \cdot \sqrt{M}) = O(M^2)$. In contrast, the optimal schedule reverses the order, processing both job types in parallel batches of size $\sim M/3$ and completing all jobs with TEL $\sim O(M^{1.5})$, which yields the competitive ratio $\sim \sqrt{M}$ (Wang et al., 8 Aug 2025).
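The counts in this construction are easy to check numerically. The sketch below (a hypothetical helper, assuming $M$ is a perfect square) builds the instance and reproduces the wave arithmetic:

```python
import math

def lower_bound_instance(M: int):
    """Build the Omega(sqrt(M)) instance: M 'type 1' jobs (p = sqrt(M) - 1,
    d = 1) and M**1.5 'type 2' jobs (p = 1, d = 2); M must be a perfect square."""
    r = math.isqrt(M)
    assert r * r == M, "M must be a perfect square for this construction"
    type1 = [(r - 1, 1)] * M        # X = M requests
    type2 = [(1, 2)] * (M * r)      # Y = M**1.5 requests
    return type1, type2

M = 64
type1, type2 = lower_bound_instance(M)
# Each type-1 job peaks at p + d = sqrt(M) memory, so only M / sqrt(M) = sqrt(M)
# of them fit per batch, forcing sqrt(M) sequential waves under MC-SF.
peak = type1[0][0] + type1[0][1]
waves = len(type1) // (M // peak)
print(peak, waves)   # 8 8
```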

5. Pseudocode Implementation and Computational Complexity

The MC-SF algorithm’s workflow and computational characteristics are as follows:

Algorithm MC-SF
Input: jobs i = 1..n with (p_i, d_i); memory budget M.
Initialize t ← 0, S ← ∅, R ← {1, …, n}.
while S ∪ R ≠ ∅ do
  t ← t + 1
  U ← ∅
  for i in R sorted by ascending d_i do
    if adding i to U keeps
       max_{t' ∈ [t, t + max_{j∈U} d_j]} Σ_{j ∈ S∪U} (p_j + (t' − start_j)) · 1{t' − start_j ≤ d_j} ≤ M
    then U ← U ∪ {i}
    else break
  end for
  R ← R \ U;  S ← S ∪ U
  // Process one token of every job in S in parallel
  for j in S do
    if j has completed d_j tokens then remove j from S and record c_j ← t
  end for
end while
return the completion times {c_i}.
At each time step $t$, up to $n$ jobs are scanned and, for each, an $O(M)$ memory-feasibility check is required (equivalently, an $O(|U|)$ future-step search). The number of time steps is bounded by $\sum_i d_i$, leading to a worst-case runtime of $O(n \cdot (\sum_i d_i) \cdot (n + M))$, precluding practical use for $n$, $M$ in the thousands (Wang et al., 8 Aug 2025).
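The pseudocode can be made runnable. The sketch below is an unoptimized Python transcription, under the simplifying convention (an assumption of this sketch, not the paper's exact accounting) that a job admitted at step t occupies p + k cache slots after its k-th decode step and completes at step t + d − 1:

```python
def peak_memory(active, jobs, t):
    """Worst-case total KV-cache over all steps tp >= t, where a job
    started at step s runs through step s + d - 1 and, while running,
    occupies p + (tp - s + 1) slots (prompt plus tokens decoded so far)."""
    if not active:
        return 0
    horizon = max(s + jobs[i][1] - 1 for i, s in active.items())
    return max(
        sum(jobs[i][0] + (tp - s + 1)
            for i, s in active.items() if tp <= s + jobs[i][1] - 1)
        for tp in range(t, horizon + 1)
    )

def mc_sf(jobs, M):
    """Unoptimized MC-SF simulator; jobs is a list of (p_i, d_i) pairs.
    Returns {job index: completion step}; mirrors the pseudocode directly."""
    pending = sorted(range(len(jobs)), key=lambda i: jobs[i][1])  # by d_i
    running, finish = {}, {}
    t = 0
    while pending or running:
        t += 1
        # Greedy admission: shortest decode first, checked against the
        # worst-case future memory of running + already-admitted jobs.
        admitted = []
        for i in pending:
            trial = {**running, **{j: t for j in admitted}, i: t}
            if peak_memory(trial, jobs, t) <= M:
                admitted.append(i)
            else:
                break
        if pending and not admitted and not running:
            raise ValueError("a single job exceeds the memory budget M")
        for i in admitted:
            pending.remove(i)
            running[i] = t
        # One decode step each; retire jobs that produced their d-th token.
        for i, s in list(running.items()):
            if t == s + jobs[i][1] - 1:
                finish[i] = t
                del running[i]
    return finish

print(mc_sf([(1, 2), (1, 2), (8, 1)], M=10))  # {2: 1, 0: 3, 1: 3}
```

In the usage line, the large-prompt d = 1 job runs alone in the first step, after which the two small-prompt jobs run together, illustrating the shortest-first ordering.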

6. Regimes of Failure and Algorithmic Remedies

MC-SF fails when prompt sizes $p_i$ vary significantly relative to decode lengths $d_i$, since sorting by $d_i$ alone disregards the substantial KV-cache occupancy of jobs with large $p_i$. The aforementioned lower bound exploits exactly such heterogeneity. The principal remedies, as introduced in (Wang et al., 8 Aug 2025), include:

  1. Batch quality metric $F(X)$: For any batch $X$, define

F(X) = \frac{\sum_{i \in X} d_i}{|X|^2},

and at each time step select the subset $X \subseteq R$ minimizing $F(X)$, subject to memory capacity. This "Sorted-F" rule balances prompt size and decode length.

  2. Constant competitive ratio: Sorted-F achieves a constant CR ($\leq 48$), independent of $M$.
  3. Accelerated approximations: Further speedups are enabled by dynamic programming, local search, quantile-greedy strategies, and LP-based heuristics.

The incorporation of both $p_i$ and $d_i$ in selection resolves MC-SF's vulnerability, particularly under non-uniform prompt distributions (Wang et al., 8 Aug 2025).
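For small instances, the Sorted-F metric can be illustrated by brute force. The sketch below is purely illustrative (exhaustive subset search with a simplified peak-memory model for jobs starting together), not the paper's efficient implementation:

```python
from itertools import combinations

def F(batch):
    """Batch quality metric: sum of decode lengths over squared batch size."""
    return sum(d for _, d in batch) / len(batch) ** 2

def peak(batch):
    """Peak total KV-cache if all (p, d) jobs in the batch start together:
    at decode step k, a job with d >= k occupies p + k slots."""
    return max(sum(p + k for p, d in batch if d >= k)
               for k in range(1, max(d for _, d in batch) + 1))

def best_batch(pending, M):
    """Among memory-feasible nonempty subsets, pick the one minimizing F.
    Exponential in len(pending); only for demonstration."""
    best, best_val = None, float("inf")
    for r in range(1, len(pending) + 1):
        for cand in combinations(pending, r):
            if peak(cand) <= M and F(cand) < best_val:
                best, best_val = cand, F(cand)
    return best

# F rewards large batches with small total decode work, so the two
# small jobs beat the single large-prompt job here.
print(best_batch([(1, 1), (1, 1), (8, 1)], M=10))  # ((1, 1), (1, 1))
```

Because $F$ divides by $|X|^2$, larger feasible batches are favored, and large-prompt jobs shrink the feasible batch size through the memory constraint; this is how prompt size enters the selection even though $F$ itself only involves $d_i$.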

7. Context and Significance

MC-SF formalizes a practical but fundamentally limited class of greedy scheduling policies for modern LLM service systems subject to severe KV-cache constraints. While effective for jobs with uniform or near-uniform prompt sizes, it demonstrates provable inefficiency under prompt heterogeneity, as established both analytically and with explicit lower bounds. These findings have prompted the adoption of more sophisticated metrics and batch selection strategies that incorporate both prefill and decode characteristics. The Sorted-F rule, as well as dynamically informed batch optimization, now represent the state-of-the-art in this scheduling regime, substantially outperforming MC-SF and related heuristics while retaining computational practicality (Wang et al., 8 Aug 2025).
