MC-SF: Memory-Constrained Shortest-First

Updated 1 March 2026

Memory-Constrained Shortest-First (MC-SF) is a scheduling strategy for LLM serving that accounts for variable prompt and decode lengths while managing finite key-value cache memory.
Empirical analysis shows that MC-SF becomes suboptimal under high prompt-length variability, leading to higher overall latency compared to batch-level approaches.
Batch-oriented methods like Sorted-F, which use a quality metric to form feasible groups, achieve significantly lower latency and constant competitive ratios in practice.

Memory-Constrained Shortest-First (MC-SF) is a scheduling strategy designed for optimizing LLM serving workloads where each request exhibits heterogeneous prompt (prefill) and decode lengths. It operates within a system constrained by a finite-size key-value (KV) cache, with the objective of minimizing the total end-to-end latency (TEL), defined as the sum of completion times across all requests. In this context, the MC-SF approach extends the classical Shortest-First (SF) heuristic to account for instantaneous memory limits, but recent research demonstrates its theoretical and empirical suboptimality in settings with substantial heterogeneity in prompt size (Wang et al., 8 Aug 2025).

1. Problem Formulation

The LLM serving task considers $n$ requests $i=1,\dots,n$, where each request $i$ has a prefill (prompt) length $p_i$ and a decode length $d_i$. Beginning service for $i$ requires loading its prompt into the KV cache, consuming $p_i$ units of memory. For each of the $d_i$ output tokens generated, the memory occupancy for $i$ increases by one unit, reaching $p_i + d_i$ units upon completion. KV cache capacity is limited by $M$, and at each discrete scheduling step, at most one token per active request can be processed in parallel, provided the aggregate memory occupancy does not exceed $M$. All requests are assumed to arrive concurrently at time 0. The objective is:

$\min\;\sum_{i=1}^n C_i$

where $C_i$ is the completion time for request $i$, i.e., the time at which its final output token is generated.
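The memory model and objective above can be sketched in a few lines of Python. This is only an illustration: the names `Request`, `occupancy`, `total_latency`, and `peak_memory` are hypothetical, and the timing convention (the $j$-th token is produced during step `start + j`, while the request holds $p_i + j$ units) is an assumption chosen to match the description, not something the source specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    p: int  # prefill (prompt) length
    d: int  # decode length


def occupancy(req, start, step):
    """KV-cache units held by `req` during `step` when service began at `start`.

    Assumed convention: the j-th token is produced during step start + j,
    and the request holds p + j units while producing it.
    """
    j = step - start
    return req.p + j if 1 <= j <= req.d else 0


def completion_time(req, start):
    # The final token is produced d steps after service begins.
    return start + req.d


def total_latency(requests, starts):
    # TEL: the sum of completion times C_i over all requests.
    return sum(completion_time(r, s) for r, s in zip(requests, starts))


def peak_memory(requests, starts):
    # Largest aggregate occupancy over the whole schedule.
    horizon = max(completion_time(r, s) for r, s in zip(requests, starts))
    return max(
        sum(occupancy(r, s, step) for r, s in zip(requests, starts))
        for step in range(1, horizon + 1)
    )


reqs = [Request(p=3, d=2), Request(p=1, d=4)]
print(total_latency(reqs, [0, 0]))  # 2 + 4 = 6
print(peak_memory(reqs, [0, 0]))    # step 2: (3 + 2) + (1 + 2) = 8
```

Under this convention a schedule is feasible exactly when `peak_memory` does not exceed $M$, so starting both example requests at time 0 would require $M \ge 8$.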

2. Classical Shortest-First with Memory Constraints

The classical SF heuristic schedules requests with the smallest remaining decode length first, greedily filling each batch while enforcing that total cache occupancy does not exceed $M$ at any future time step. Specifically, for a subset $U$ of requests admitted at time $t$, memory feasibility implies that for every $t' \in [t,\,t+\max_{i\in U} d_i]$, the total memory used across existing and newly admitted requests remains at most $M$.
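A minimal sketch of this look-ahead check and one shortest-first admission round follows. The function names are hypothetical, requests are plain `(p, d)` tuples, and two details are assumptions rather than statements from the source: a request started at `s` holds `p + (step - s)` units during each later step until it finishes, and admission stops at the first request that does not fit.

```python
def fits(active, now, M):
    """Return True if the cache stays within M at every future step.

    `active` is a non-empty list of (p, d, start) tuples for requests that
    are running or about to be admitted at time `now`.
    """
    end = max(start + d for _, d, start in active)
    for step in range(now + 1, end + 1):
        used = sum(
            p + (step - start)
            for p, d, start in active
            if step - start <= d  # request is still decoding at this step
        )
        if used > M:
            return False
    return True


def admit_shortest_first(pending, active, now, M):
    """Admit pending (p, d) requests in order of increasing decode length.

    Assumption: admission stops at the first request that would break
    memory feasibility at some future step.
    """
    admitted = []
    for p, d in sorted(pending, key=lambda r: r[1]):
        candidate = active + admitted + [(p, d, now)]
        if not fits(candidate, now, M):
            break
        admitted.append((p, d, now))
    return admitted


print(admit_shortest_first([(1, 4), (3, 2)], [], 0, 8))  # both requests fit
print(admit_shortest_first([(1, 4), (3, 2)], [], 0, 7))  # only (3, 2) fits
```

With $M=7$ the second request is rejected because the two requests together would need 8 units at the second step, even though they fit at the first step. That is the reason the check has to cover every future step rather than only the current one.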

Theoretical analysis reveals that SF becomes highly suboptimal when prompt lengths vary significantly. For a constructed instance with

Type 1: $p=\sqrt{M}-1$, $d=1$, count $=M$
Type 2: $p=1$, $d=2$, count $=M^{1.5}$

the SF strategy prioritizes all type 1 jobs, yielding $\operatorname{TEL} \propto M^2$, whereas an optimal scheduler inverts the order and attains $\operatorname{TEL} \propto M^{1.5}$. Thus, the competitive ratio of SF in this setting grows as $\Theta(\sqrt M)$, diverging as $M$ increases (Wang et al., 8 Aug 2025).

3. Sorted-F: Batch-Level Memory-Constrained Scheduling

To overcome these limitations, the Sorted-F algorithm replaces the pointwise "shortest decode" selection of MC-SF with a batch-oriented metric. For each candidate batch $\mathcal X$, define the quality metric:

$F(\mathcal X) = \frac{\sum_{i\in\mathcal X} d_i}{|\mathcal X|^2}$

which trades off aggregate decode workload against batch size: a lower value indicates less decode work per unit of (squared) batch size. Sorted-F iteratively constructs feasible batches minimizing the pair $(F(\mathcal X), -|\mathcal X|)$, so that ties in $F$ favor the larger batch, under the same memory feasibility constraint. Each batch is then scheduled by $d_i$, and a global service sequence is constructed.
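The batch construction can be illustrated with a brute-force sketch for very small inputs. This is not the paper's implementation: the function names are hypothetical, requests are `(p, d)` tuples, each batch is assumed to start together on an empty cache, and ordering within a batch by increasing $d_i$ is an assumption. Enumerating all subsets is exponential, so it serves only to make the selection rule concrete.

```python
from fractions import Fraction
from itertools import combinations


def f_metric(batch):
    """F(X) = (sum of decode lengths) / |X|^2, kept exact with Fraction."""
    return Fraction(sum(d for _, d in batch), len(batch) ** 2)


def feasible(batch, M):
    """Assumes the batch starts together on an empty cache.

    At step j, every request with d >= j is still running and holds p + j.
    """
    longest = max(d for _, d in batch)
    return all(
        sum(p + j for p, d in batch if d >= j) <= M
        for j in range(1, longest + 1)
    )


def best_batch(pool, M):
    """Indices of the feasible batch minimizing (F, -size); None if none."""
    best, best_key = None, None
    for k in range(1, len(pool) + 1):
        for idx in combinations(range(len(pool)), k):
            batch = [pool[i] for i in idx]
            if not feasible(batch, M):
                continue
            key = (f_metric(batch), -k)
            if best_key is None or key < best_key:
                best, best_key = idx, key
    return best


def sorted_f_order(requests, M):
    """Repeatedly peel off the best batch and order it by decode length."""
    pool, order = list(requests), []
    while pool:
        idx = best_batch(pool, M)
        if idx is None:
            raise ValueError("some request does not fit in memory on its own")
        order.extend(sorted((pool[i] for i in idx), key=lambda r: r[1]))
        pool = [r for i, r in enumerate(pool) if i not in idx]
    return order


reqs = [(5, 1), (5, 1), (1, 2), (1, 2), (1, 2)]
print(sorted_f_order(reqs, 12))
# [(5, 1), (1, 2), (1, 2), (1, 2), (5, 1)]
```

In this toy instance the first batch mixes one memory-heavy short request with the three light ones ($F = 7/16$), beating both the pair of short requests ($F = 1/2$) and the three light requests alone ($F = 2/3$).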

Execution proceeds online: at each step, as many requests as possible from the sorted batch sequence are admitted under the memory limit, then one token per active request is processed. This batch-level scheduling paradigm avoids the pathological cases seen with MC-SF, in which short requests with large memory demand crowd out more efficient batched groupings.

4. Theoretical Guarantees

The main theoretical result for Sorted-F establishes a constant competitive ratio, independent of $M$ and $n$, in the regime where $p_i, d_i = o(M)$ and $n \gg M$. Specifically,

$\mathrm{CR}(\text{Sorted-F}) = \sup_{\mathcal I} \frac{\mathrm{TEL}(\text{Sorted-F};\mathcal I)}{\mathrm{TEL}(\text{Optimal};\mathcal I)} < 48.$

This upper bound is derived under the assumption that each job's memory footprint satisfies $p_i + d_i \le M$ and that both prefill and decode lengths are individually negligible compared to $M$. The analysis shows that Sorted-F achieves end-to-end latency within a fixed multiplicative factor of the offline optimum for workloads of practical scale (Wang et al., 8 Aug 2025).

5. Practical Algorithmic Variants and Complexity

The batch selection problem in Sorted-F is combinatorial, with naïve construction taking $O(2^n)$ time. To address scalability, the following four batch-construction algorithms are proposed, one exact and three approximate:

Exact dynamic programming: $O(n^2M)$ time, effective for $n\lesssim100$.
Scaled DP: Quantizes memory into $B$ bins and yields $(1+\epsilon)$-approximate solutions in $O(nB/\epsilon)$ time.
Local swap search: Starts with a greedy batch, then iteratively swaps requests in and out whenever $F(\cdot)$ improves. Complexity $O(n^2)$, practical up to $n\approx500$.
Quantile greedy: Builds a “core” batch using quantiles of $p_i+d_i$ and $d_i$, then greedily adds requests by increasing $d_i/(p_i+d_i)$. Runs in $O(n)$ and scales efficiently past $n=1000$.
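Of the variants listed above, local swap search is the easiest to sketch. The source gives only the outline (greedy start, then swaps that improve $F$), so everything else here is an assumption: the greedy start takes requests by increasing peak footprint $p_i + d_i$, a move exchanges one member for one outsider, the batch is checked as if started on an empty cache, and the first improving move is taken. The names are hypothetical, and the sketch makes no attempt to match the stated $O(n^2)$ complexity.

```python
from fractions import Fraction


def f_key(batch):
    """Comparison key (F, -size): lower F wins, ties favor larger batches."""
    return (Fraction(sum(d for _, d in batch), len(batch) ** 2), -len(batch))


def feasible(batch, M):
    """Assumes the batch of (p, d) requests starts together on an empty cache."""
    longest = max(d for _, d in batch)
    return all(
        sum(p + j for p, d in batch if d >= j) <= M
        for j in range(1, longest + 1)
    )


def local_swap_batch(pool, M):
    # Greedy start (assumption): add requests by increasing p + d while feasible.
    inside = []
    for i in sorted(range(len(pool)), key=lambda i: sum(pool[i])):
        if feasible([pool[k] for k in inside + [i]], M):
            inside.append(i)
    if not inside:
        raise ValueError("no single request fits in memory")

    improved = True
    while improved:
        improved = False
        current = f_key([pool[k] for k in inside])
        outside = [i for i in range(len(pool)) if i not in inside]
        for a in inside:
            for b in outside:
                # Exchange member a for outsider b and keep it if the key drops.
                trial = [b if k == a else k for k in inside]
                batch = [pool[k] for k in trial]
                if feasible(batch, M) and f_key(batch) < current:
                    inside, improved = trial, True
                    break
            if improved:
                break
    return [pool[k] for k in inside]


print(local_swap_batch([(1, 3), (3, 1), (3, 1)], 8))
# [(3, 1), (3, 1)]
```

In the example the greedy start picks `(1, 3)` and `(3, 1)` with $F = 4/4 = 1$. Exchanging `(1, 3)` for the second `(3, 1)` keeps the batch feasible and lowers $F$ to $2/4$. Because the key strictly decreases with every accepted move, the loop terminates, though only at a local optimum.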

Further, two LP-based schedulers are introduced:

Sorted-LP: LP relaxation yields fractional start times; jobs are scheduled in increasing order of expected start time.
LP-Swap: Sorted-LP ordering refined by local $F$-metric swaps.

The online execution phase (Phase 2) is shared by all algorithms and has $O(n^2)$ time and $O(n)$ space complexity (Wang et al., 8 Aug 2025).

6. Empirical Analysis

Empirical evaluation was conducted on a workload of 1600 short chat and 400 long summarization requests, totaling $n=2000$, served by LLaMA-2 70B on dual A100 GPUs (KV-cache $M=16{,}492$ tokens). The following table summarizes average latency (ms) for $n \in \{200,\dots,2000\}$:

Algorithm	n=200	...	n=2000
FCFS	85	...	210
MC-SF	68	...	147
LP-Swap	35	...	75
Sorted-F*	32	...	70

(*Sorted-F: quantile-greedy or local-swap variant for large $n$.)

Findings demonstrate that MC-SF improves on FCFS but remains well behind the batch-level methods under high prompt-length variability. Sorted-F (swap variant) achieves approximately $30\%$ lower latency than LP-Swap and $50\%$ lower than MC-SF; quantile variants offer near-matching quality at linear (in $n$) runtime. These results confirm that batch-level $F$-metric scheduling substantially outperforms token-level MC-SF and FCFS across heterogeneous, realistic workloads (Wang et al., 8 Aug 2025).

7. Context and Implications

Memory-Constrained Shortest-First (MC-SF) was a natural generalization of conventional Shortest-First to the constrained and heterogeneous context of LLM serving. However, both theoretical and empirical results establish that MC-SF is not competitive in the presence of heterogeneity in prompt and decode lengths. The transition from MC-SF to batch-wise scheduling using quality metrics such as $F(\cdot)$ represents a paradigm shift for high-throughput, memory-bound LLM inference. A plausible implication is that future LLM system schedulers for heterogeneous workloads should avoid MC-SF in favor of batch-level, globally optimized approaches such as Sorted-F and its fast approximations, especially when system memory is the primary bottleneck (Wang et al., 8 Aug 2025).

References (1)

LLM Serving Optimization with Variable Prefill and Decode Lengths (2025)
