MC-SF: Memory-Constrained Shortest-First

Updated 1 March 2026
  • Memory-Constrained Shortest-First (MC-SF) is a scheduling strategy for LLM serving that accounts for variable prompt and decode lengths while managing finite key-value cache memory.
  • Empirical analysis shows that MC-SF becomes suboptimal under high prompt-length variability, leading to higher overall latency compared to batch-level approaches.
  • Batch-oriented methods like Sorted-F, which use a quality metric to form feasible groups, achieve significantly lower latency in practice and a constant competitive ratio in theory.

Memory-Constrained Shortest-First (MC-SF) is a scheduling strategy designed for optimizing LLM serving workloads where each request exhibits heterogeneous prompt (prefill) and decode lengths. It operates within a system constrained by a finite-size key-value (KV) cache, with the objective of minimizing the total end-to-end latency (TEL), defined as the sum of completion times across all requests. In this context, the MC-SF approach extends the classical Shortest-First (SF) heuristic to account for instantaneous memory limits, but recent research demonstrates its theoretical and empirical suboptimality in settings with substantial heterogeneity in prompt size (Wang et al., 8 Aug 2025).

1. Problem Formulation

The LLM serving task considers $n$ requests $i=1,\dots,n$, where each request $i$ has a prefill (prompt) length $p_i$ and a decode length $d_i$. Beginning service for $i$ requires loading its prompt into the KV cache, consuming $p_i$ units of memory. For each of the $d_i$ output tokens generated, the memory occupancy for $i$ increases incrementally, reaching $p_i + d_i$ units upon completion. KV cache capacity is limited by $M$, and at each discrete scheduling step, at most one token per active request can be processed in parallel, provided the aggregate memory occupancy does not exceed $M$. All requests are assumed to arrive concurrently at time 0. The objective is:

$$\min\;\sum_{i=1}^n C_i$$

where $C_i$ is the completion time for request $i$, i.e., the time at which its final output token is generated.
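The memory model and objective above can be sketched as a small checker. This is a minimal illustration, not the paper's code: the names `memory_used` and `total_latency` are hypothetical, and it assumes a request started at step `s` holds `p + k` units of memory during step `s + k` (for `k = 1..d`) and releases everything once its last token is produced.

```python
from typing import List, Tuple

Request = Tuple[int, int]  # (prefill length p_i, decode length d_i)


def memory_used(req: Request, start: int, t: int) -> int:
    """KV-cache units held by `req` at step t, given it started at `start`.

    Assumption: after k decode steps the request holds p + k units, and
    memory is freed as soon as the d-th token has been generated.
    """
    p, d = req
    k = t - start
    return p + k if 1 <= k <= d else 0


def total_latency(reqs: List[Request], starts: List[int], M: int) -> int:
    """Return the sum of completion times, or raise if the cache overflows."""
    horizon = max(s + d for (_, d), s in zip(reqs, starts))
    for t in range(1, horizon + 1):
        used = sum(memory_used(r, s, t) for r, s in zip(reqs, starts))
        if used > M:
            raise ValueError(f"memory {used} exceeds M={M} at step {t}")
    # Completion time C_i = start time + decode length.
    return sum(s + d for (_, d), s in zip(reqs, starts))
```

With `reqs = [(2, 1), (1, 2)]`, starting both at step 0 needs 5 units at the first step, so it is only valid when `M >= 5`; with `M = 4` the second request has to wait for the first to finish.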

2. Classical Shortest-First with Memory Constraints

The classical SF heuristic schedules requests with the smallest remaining decode length first, greedily filling each batch while enforcing that total cache occupancy does not exceed $M$ at any future time step. Specifically, for a subset $U$ of requests admitted at time $t$, memory feasibility requires that for every $t' \in [t,\,t+\max_{i\in U} d_i]$, the total memory used across existing and newly admitted requests remains at most $M$.
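One admission round of this heuristic might look like the sketch below. The function names `fits` and `admit_shortest_first` are hypothetical, the memory model (a request started at step `s` holds `p + (t' - s)` units at step `t'`) is an assumption, and skipping a request that does not fit while still trying longer ones is a design choice of this sketch rather than something the source specifies.

```python
from typing import List, Tuple

Active = Tuple[int, int, int]  # (p, d, start step)


def fits(active: List[Active], cand: Tuple[int, int], t: int, M: int) -> bool:
    """Check that starting `cand` at step t never pushes memory above M."""
    trial = active + [(cand[0], cand[1], t)]
    last = max(s + d for _, d, s in trial)
    for tp in range(t + 1, last + 1):
        # Only requests that have not yet finished at step tp hold memory.
        used = sum(p + (tp - s) for p, d, s in trial if tp - s <= d)
        if used > M:
            return False
    return True


def admit_shortest_first(waiting, active, t, M):
    """Greedily admit waiting requests in order of increasing decode length."""
    admitted: List[Active] = []
    remaining = []
    for p, d in sorted(waiting, key=lambda r: r[1]):
        if fits(active + admitted, (p, d), t, M):
            admitted.append((p, d, t))
        else:
            remaining.append((p, d))
    return admitted, remaining
```

The look-ahead in `fits` is what separates this from a check on current occupancy alone: a request that fits now can still cause an overflow several steps later as every active request grows by one unit per token.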

Theoretical analysis reveals that SF becomes highly suboptimal when prompt lengths vary significantly. For a constructed instance with

  • Type 1: $p=\sqrt{M}-1$, $d=1$, count $=M$
  • Type 2: $p=1$, $d=2$, count $=M^{1.5}$

the SF strategy prioritizes all type 1 jobs, yielding $\operatorname{TEL} \propto M^2$, whereas an optimal scheduler inverts the order and attains $\operatorname{TEL} \propto M^{1.5}$. Thus, the competitive ratio of SF in this setting grows as $\Theta(\sqrt M)$, diverging as $M$ increases (Wang et al., 8 Aug 2025).

3. Sorted-F: Batch-Level Memory-Constrained Scheduling

To overcome these limitations, the Sorted-F algorithm replaces the pointwise "shortest decode" selection of MC-SF with a batch-oriented metric. For each candidate batch $\mathcal X$, define the quality metric:

$$F(\mathcal X) = \frac{\sum_{i\in\mathcal X} d_i}{|\mathcal X|^2}$$

which trades off aggregate decode workload against batch size to maximize throughput. Sorted-F iteratively constructs feasible batches that lexicographically minimize the pair $(F(\mathcal X), -|\mathcal X|)$, that is, smallest $F$ first with ties broken toward the larger batch, under the same memory feasibility constraint. Requests within each batch are then ordered by $d_i$, and a global service sequence is constructed.
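As a small illustration of the metric and the tie-breaking rule, the sketch below evaluates $F$ on a few candidate batches and picks the preferred one. The helper names `f_metric` and `batch_key` are hypothetical, and memory feasibility of the candidates is assumed to have been checked elsewhere.

```python
from typing import List, Tuple

Request = Tuple[int, int]  # (prefill length p_i, decode length d_i)


def f_metric(batch: List[Request]) -> float:
    """F(X) = (sum of decode lengths in X) / |X|^2."""
    if not batch:
        raise ValueError("F is undefined for an empty batch")
    return sum(d for _, d in batch) / len(batch) ** 2


def batch_key(batch: List[Request]):
    """Sort key for the pair (F(X), -|X|): lower F wins, then larger batch."""
    return (f_metric(batch), -len(batch))


candidates = [
    [(1, 2)],                          # F = 2 / 1  = 2.0
    [(1, 2), (1, 2)],                  # F = 4 / 4  = 1.0
    [(1, 4), (1, 4), (1, 4), (1, 4)],  # F = 16 / 16 = 1.0
]
best = min(candidates, key=batch_key)  # the four-request batch wins the tie
```

Because the denominator is quadratic in batch size, doubling the number of requests while doubling total decode work halves $F$, so the metric favors large batches of short decodes.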

Execution proceeds online: at each step, as many requests as possible from the sorted batch sequence are admitted under the memory limit, then one token per active request is processed. This batch-level scheduling paradigm bypasses pathological cases where short requests with large memory demand exclude more efficient batched groupings, which is the failure mode exhibited by MC-SF.
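The online execution step can be sketched as a loop over a precomputed service sequence. This is an illustrative sketch only: `run_sequence` is a hypothetical name, admission stops at the first request in the sequence that does not fit (one reading of "as many requests as possible"), and the memory model assumes a request started at step `s` holds `p + (t' - s)` units at step `t'` until it completes.

```python
from typing import List, Tuple


def run_sequence(order: List[Tuple[int, int]], M: int):
    """Execute requests (p, d) in the given order; return (TEL, completions)."""

    def peak_ok(active, t):
        # Look ahead over every future step until all active requests finish.
        last = max(s + d for _, d, s in active)
        return all(
            sum(p + (tp - s) for p, d, s in active if tp - s <= d) <= M
            for tp in range(t + 1, last + 1)
        )

    queue = list(order)
    active = []        # entries are (p, d, start step)
    completions = []
    t = 0
    while queue or active:
        # Admit a prefix of the remaining sequence while memory stays feasible.
        while queue and peak_ok(active + [(*queue[0], t)], t):
            p, d = queue.pop(0)
            active.append((p, d, t))
        if not active:
            raise ValueError("head request cannot fit in memory M")
        t += 1  # one token is generated for every active request
        completions += [t for _, d, s in active if s + d == t]
        active = [a for a in active if a[2] + a[1] > t]
    return sum(completions), completions
```

On the two-request example `[(1, 2), (2, 1)]` with `M = 4`, this order gives a total latency of 5, while the reversed order gives 4, which shows how much the service sequence matters once memory is tight.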

4. Theoretical Guarantees

The main theoretical result for Sorted-F establishes a constant competitive ratio, independent of $M$ and $n$, in the regime where $p_i, d_i = o(M)$ and $n \gg M$. Specifically,

$$\mathrm{CR}(\text{Sorted-F}) = \sup_{\mathcal I} \frac{\mathrm{TEL}(\text{Sorted-F};\mathcal I)}{\mathrm{TEL}(\text{Optimal};\mathcal I)} < 48.$$

This upper bound is derived under the assumption that each job's memory footprint satisfies $p_i + d_i \le M$ and that both prefill and decode lengths are individually negligible compared to $M$. The analysis shows that Sorted-F achieves end-to-end latency within a fixed multiplicative factor of the offline optimum for workloads of practical scale (Wang et al., 8 Aug 2025).

5. Practical Algorithmic Variants and Complexity

The batch selection problem in Sorted-F is combinatorial, with naïve construction taking $O(2^n)$ time. To address scalability, the following four algorithms (one exact, three approximate) are proposed:

  • Exact dynamic programming: $O(n^2 M)$ time, effective for $n\lesssim 100$.
  • Scaled DP: Quantizes memory into $B$ bins, yields $(1+\epsilon)$-approximate solutions in $O(nB/\epsilon)$.
  • Local swap search: Start with a greedy batch, iteratively swap in and out requests if $F(\cdot)$ improves. Complexity $O(n^2)$, practical up to $n\approx 500$.
  • Quantile greedy: Build a “core” batch using quantiles of $p_i+d_i$ and $d_i$, then greedily add requests by increasing $d_i/(p_i+d_i)$. Runs in $O(n)$ and scales efficiently past $n=1000$.
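The local swap variant can be illustrated with the rough sketch below. Several details are assumptions rather than the paper's specification: the initial greedy batch is built by increasing footprint $p_i + d_i$, all requests in a batch are taken to start together, only one-for-one swaps are tried, and the names `feasible` and `local_swap` are hypothetical.

```python
from typing import List, Tuple

Request = Tuple[int, int]  # (prefill length p_i, decode length d_i)


def f_metric(batch: List[Request]) -> float:
    return sum(d for _, d in batch) / len(batch) ** 2


def feasible(batch: List[Request], M: int) -> bool:
    """Assumes all requests start together; checks memory at every step."""
    horizon = max(d for _, d in batch)
    return all(
        sum(p + t for p, d in batch if d >= t) <= M
        for t in range(1, horizon + 1)
    )


def local_swap(reqs: List[Request], M: int, max_rounds: int = 100):
    """Greedy initial batch, then one-for-one swaps that improve (F, -|X|)."""
    batch, rest = [], []
    for r in sorted(reqs, key=lambda r: r[0] + r[1]):  # assumed greedy order
        if feasible(batch + [r], M):
            batch.append(r)
        else:
            rest.append(r)
    if not batch:
        raise ValueError("no single request fits in memory M")

    def key(b):
        return (f_metric(b), -len(b))

    for _ in range(max_rounds):
        improved = False
        for i in range(len(batch)):
            for j in range(len(rest)):
                trial = batch[:i] + [rest[j]] + batch[i + 1:]
                if feasible(trial, M) and key(trial) < key(batch):
                    batch[i], rest[j] = rest[j], batch[i]
                    improved = True
                    break  # move on to the next batch position
        if not improved:
            break
    return batch
```

For example, `local_swap([(1, 3), (4, 1)], 5)` starts from the batch `[(1, 3)]` with $F = 3$ and swaps in `(4, 1)`, lowering $F$ to 1. Each accepted swap strictly decreases the key, so the search terminates even without the `max_rounds` cap.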

Further, two LP-based schedulers are introduced:

  • Sorted-LP: LP relaxation yields fractional start times; jobs are scheduled in increasing order of expected start time.
  • LP-Swap: Sorted-LP ordering refined by local $F$-metric swaps.

Phase 2 of every algorithm, the online execution that follows batch construction, has $O(n^2)$ time and $O(n)$ space complexity (Wang et al., 8 Aug 2025).

6. Empirical Analysis

Empirical evaluation was conducted on a workload of 1600 short chat and 400 long summarization requests, totaling $n=2000$, served by LLaMA-2 70B on dual A100 GPUs (KV-cache $M=16{,}492$ tokens). The following table summarizes average latency (ms) for $n \in \{200,\dots,2000\}$:

| Algorithm | $n=200$ | ... | $n=2000$ |
|---|---|---|---|
| FCFS | 85 | ... | 210 |
| MC-SF | 68 | ... | 147 |
| LP-Swap | 35 | ... | 75 |
| Sorted-F* | 32 | ... | 70 |

(*Sorted-F: quantile-greedy or local-swap variant for large $n$.)

The tabulated figures show MC-SF improving on FCFS yet remaining well behind the batch-level schedulers under high prompt-length variability. Sorted-F (swap variant) is reported to achieve approximately $30\%$ lower latency than LP-Swap and $50\%$ lower than MC-SF, although at the two tabulated sizes its margin over LP-Swap is closer to 7 to 9%; quantile variants offer near-matching quality at runtime linear in $n$. These results indicate that batch-level $F$-metric scheduling substantially outperforms token-level MC-SF and FCFS across heterogeneous, realistic workloads (Wang et al., 8 Aug 2025).
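The relative reductions at the two tabulated sizes can be recomputed directly from the table. The snippet below is plain arithmetic on the four rows shown above; the dictionary layout and the helper name `reduction` are illustrative, and the elided intermediate columns are not used.

```python
# Average latency in ms at the two tabulated request counts.
latency_ms = {
    "FCFS": {200: 85, 2000: 210},
    "MC-SF": {200: 68, 2000: 147},
    "LP-Swap": {200: 35, 2000: 75},
    "Sorted-F": {200: 32, 2000: 70},
}


def reduction(baseline: str, method: str, n: int) -> float:
    """Fractional latency reduction of `method` relative to `baseline`."""
    return 1 - latency_ms[method][n] / latency_ms[baseline][n]


for n in (200, 2000):
    for baseline in ("FCFS", "MC-SF", "LP-Swap"):
        pct = 100 * reduction(baseline, "Sorted-F", n)
        print(f"n={n}: Sorted-F vs {baseline}: {pct:.1f}% lower")
```

At both endpoints the reduction relative to MC-SF is a little over 50%, while the reduction relative to LP-Swap at these two sizes is under 10%.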

7. Context and Implications

Memory-Constrained Shortest-First (MC-SF) is a natural generalization of conventional Shortest-First to the constrained and heterogeneous context of LLM serving. However, both theoretical and empirical results establish that MC-SF is not competitive in the presence of heterogeneity in prompt and decode lengths. The transition from MC-SF to batch-wise scheduling using quality metrics such as $F(\cdot)$ represents a paradigm shift for high-throughput, memory-bound LLM inference. A plausible implication is that future LLM system schedulers for heterogeneous workloads should avoid MC-SF in favor of batch-level, globally optimized approaches such as Sorted-F and its fast approximations, especially when system memory is the primary bottleneck (Wang et al., 8 Aug 2025).

References (1)
