MC-SF: Memory-Constrained Shortest-First
- Memory-Constrained Shortest-First (MC-SF) is a scheduling strategy for LLM serving that accounts for variable prompt and decode lengths while managing finite key-value cache memory.
- Empirical analysis shows that MC-SF becomes suboptimal under high prompt-length variability, leading to higher overall latency compared to batch-level approaches.
- Batch-oriented methods like Sorted-F, which use a quality metric to form feasible groups, achieve significantly lower latency in practice and a constant competitive ratio in theory.
Memory-Constrained Shortest-First (MC-SF) is a scheduling strategy designed for optimizing LLM serving workloads where each request exhibits heterogeneous prompt (prefill) and decode lengths. It operates within a system constrained by a finite-size key-value (KV) cache, with the objective of minimizing the total end-to-end latency (TEL), defined as the sum of completion times across all requests. In this context, the MC-SF approach extends the classical Shortest-First (SF) heuristic to account for instantaneous memory limits, but recent research demonstrates its theoretical and empirical suboptimality in settings with substantial heterogeneity in prompt size (Wang et al., 8 Aug 2025).
1. Problem Formulation
The LLM serving task considers $n$ requests $i = 1, \dots, n$, where each request $i$ has a prefill (prompt) length $s_i$ and a decode length $o_i$. Beginning service for request $i$ requires loading its prompt into the KV cache, consuming $s_i$ units of memory. With each output token generated, the memory occupancy of request $i$ grows by one unit, reaching $s_i + o_i$ units upon completion. KV cache capacity is limited to $M$, and at each discrete scheduling step, one token per active request is processed in parallel, provided the aggregate memory occupancy does not exceed $M$. All requests are assumed to arrive concurrently at time 0. The objective is:
$$\min \; \mathrm{TEL} = \sum_{i=1}^{n} c_i,$$
where $c_i$ is the completion time for request $i$, i.e., the time at which its final output token is generated.
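The memory model and objective can be made concrete with a small sketch. The names `Request`, `kv_usage`, `total_latency`, and `peak_usage` are hypothetical, as are the field names `s` (prefill length) and `o` (decode length). The timing convention, in which a request admitted at `start` emits its j-th token in step `start + j`, is an assumption made for illustration rather than something the source specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    s: int  # prefill (prompt) length
    o: int  # decode length


def kv_usage(req: Request, start: int, t: int) -> int:
    """KV-cache units held by `req` in step t, if it was admitted at `start`.

    Assumed convention: the j-th token is produced in step start + j, during
    which the request holds s + j units; memory is freed after step start + o.
    """
    j = t - start
    return req.s + j if 1 <= j <= req.o else 0


def total_latency(reqs, starts):
    """TEL: sum of completion times, with c_i = start_i + o_i (no preemption)."""
    return sum(st + r.o for r, st in zip(reqs, starts))


def peak_usage(reqs, starts):
    """Largest aggregate KV-cache occupancy over the whole schedule."""
    horizon = max(st + r.o for r, st in zip(reqs, starts))
    return max(
        sum(kv_usage(r, st, t) for r, st in zip(reqs, starts))
        for t in range(1, horizon + 1)
    )


reqs = [Request(s=4, o=2), Request(s=1, o=3)]
starts = [0, 0]
print(total_latency(reqs, starts))  # 5
print(peak_usage(reqs, starts))     # 9
```

Under these assumptions, starting both requests at time 0 is only valid when the cache capacity is at least 9 units, even though the two prompts together occupy only 5.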
2. Classical Shortest-First with Memory Constraints
The classical SF heuristic schedules requests with the smallest remaining decode length first, greedily filling each batch while enforcing that total cache occupancy does not exceed the cache capacity $M$ at any future time step. Specifically, for a subset $S$ of requests admitted at time $t$, memory feasibility requires that for every $t' \ge t$, the total memory used across existing and newly admitted requests remains at most $M$.
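A minimal sketch of this admission rule follows. It assumes requests are `(s, o)` tuples of prefill and decode length, that `M` is the cache capacity, that a request admitted at step `t` holds `s + j` units while producing its j-th token, and that admission stops at the first request that does not fit. The function names `fits` and `mc_sf` are hypothetical, and the stop-at-first-misfit rule is one possible reading of "greedily filling", not a detail confirmed by the source.

```python
def fits(active, cand, t, M):
    """True if admitting `cand` at step t keeps usage <= M at every later step.

    active: list of (s, o, start) for running requests; cand: (s, o).
    """
    jobs = active + [(cand[0], cand[1], t)]
    horizon = max(start + o for _, o, start in jobs)
    for u in range(t + 1, horizon + 1):
        # A job still holds memory in step u if it has not emitted all o tokens.
        used = sum(s + (u - start) for s, o, start in jobs if u - start <= o)
        if used > M:
            return False
    return True


def mc_sf(requests, M):
    """Shortest-decode-first admission under the look-ahead memory check.

    Returns (TEL, completion_times), where completion_times maps index -> c_i.
    """
    if any(s + o > M for s, o in requests):
        raise ValueError("a request exceeds the cache capacity on its own")
    order = sorted(range(len(requests)), key=lambda i: requests[i][1])
    active, completion, t = [], {}, 0
    while order:
        # Admit in shortest-first order until the next request no longer fits.
        while order and fits(active, requests[order[0]], t, M):
            i = order.pop(0)
            s, o = requests[i]
            active.append((s, o, t))
            completion[i] = t + o  # no preemption, so c_i is fixed at admission
        t += 1
        active = [(s, o, st) for s, o, st in active if st + o > t]
    return sum(completion.values()), completion


tel, done = mc_sf([(1, 2), (1, 2), (6, 1)], M=8)
print(tel)  # 7
```

In this toy instance the request with the shortest decode length also has the largest prompt, so it runs alone in the first step and delays both small requests.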
Theoretical analysis reveals that SF becomes highly suboptimal when prompt lengths vary significantly. For a constructed instance with
- Type 1: , , count
- Type 2: , , count
the SF strategy prioritizes all type 1 jobs, yielding , whereas an optimal scheduler inverts the order and attains . Thus, the competitive ratio of SF in this setting grows as , diverging as increases (Wang et al., 8 Aug 2025).
3. Sorted-F: Batch-Level Memory-Constrained Scheduling
To overcome these limitations, the Sorted-F algorithm replaces the pointwise "shortest decode" selection of MC-SF with a batch-oriented metric. For each candidate batch $\mathcal{X}$, a quality metric $F(\mathcal{X})$ is defined,
which trades off aggregate decode workload against batch size to maximize throughput. Sorted-F iteratively constructs feasible batches minimizing $F$ under the same memory feasibility constraint. The requests in each batch are then ordered, and a global service sequence is constructed.
Execution proceeds online: at each step, as many requests as possible from the sorted batch sequence are admitted under the memory limit, then one token per active request is processed. This batch-level scheduling paradigm bypasses pathological cases where short requests with large memory demand exclude more efficient batched groupings, as demonstrated in MC-SF.
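The batch-construction phase can be sketched as follows, under loudly stated assumptions. The metric `placeholder_metric` (total decode length divided by squared batch size) is a stand-in chosen only because it trades decode workload against batch size; it should not be read as the source's exact formula. The within-batch ordering by decode length is likewise an assumption, and exhaustive subset search is used purely for clarity on tiny inputs. All function names are hypothetical.

```python
from itertools import combinations


def batch_feasible(batch, M):
    """Batch members start together; usage must stay <= M until all finish."""
    horizon = max(o for _, o in batch)
    return all(
        sum(s + u for s, o in batch if o >= u) <= M
        for u in range(1, horizon + 1)
    )


def placeholder_metric(batch):
    # ASSUMPTION: stand-in quality metric, not taken from the source.
    return sum(o for _, o in batch) / len(batch) ** 2


def sorted_f_sequence(requests, M, metric=placeholder_metric):
    """Build a service sequence by repeatedly extracting the best feasible batch."""
    remaining = list(range(len(requests)))
    sequence = []
    while remaining:
        best = None  # (score, subset of indices)
        for k in range(1, len(remaining) + 1):
            for subset in combinations(remaining, k):
                batch = [requests[i] for i in subset]
                if batch_feasible(batch, M):
                    score = metric(batch)
                    if best is None or score < best[0]:
                        best = (score, subset)
        if best is None:
            raise ValueError("a request exceeds the cache capacity on its own")
        # ASSUMPTION: inside a batch, shorter decode lengths go first.
        sequence.extend(sorted(best[1], key=lambda i: requests[i][1]))
        remaining = [i for i in remaining if i not in best[1]]
    return sequence


print(sorted_f_sequence([(6, 2), (1, 2), (1, 2)], M=9))  # [1, 2, 0]
```

Here the two small requests form the first batch because together they score lower than the large-prompt request alone. The brute-force search is exponential in the number of requests, which is why faster batch-selection procedures matter in practice.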
4. Theoretical Guarantees
The main theoretical result for Sorted-F establishes a constant competitive ratio, independent of and , in the regime where and . Specifically,
This upper bound is derived under the assumption that each job's memory footprint, and hence both its prefill and decode lengths, is individually negligible compared to the KV cache capacity. The analysis shows that Sorted-F achieves end-to-end latency within a fixed multiplicative factor of the offline optimum for workloads of practical scale (Wang et al., 8 Aug 2025).
5. Practical Algorithmic Variants and Complexity
The batch selection problem in Sorted-F is combinatorial, so naïve construction does not scale. To address scalability, the following four algorithms are proposed, one exact and three approximate:
- Exact dynamic programming: time, effective for .
- Scaled DP: Quantizes memory into bins, yields -approximate solutions in .
- Local swap search: Start with a greedy batch, iteratively swap in and out requests if improves. Complexity , practical up to .
- Quantile greedy: Build a “core” batch using quantiles of and , then greedily add requests by increasing . Runs in and scales efficiently past .
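The local swap search in the list above can be illustrated with a short sketch. The greedy starting rule (add requests by increasing prompt length while feasible), the stand-in metric, and all function names are assumptions made for illustration; the source states only that a greedy batch is improved by swaps that lower the quality metric.

```python
def batch_feasible(batch, requests, M):
    """Indices in `batch` start together; usage must stay <= M until all finish."""
    horizon = max(requests[i][1] for i in batch)
    return all(
        sum(requests[i][0] + u for i in batch if requests[i][1] >= u) <= M
        for u in range(1, horizon + 1)
    )


def placeholder_metric(batch, requests):
    # ASSUMPTION: stand-in quality metric, not taken from the source.
    return sum(requests[i][1] for i in batch) / len(batch) ** 2


def local_swap_batch(requests, M, metric=placeholder_metric):
    """Greedy batch followed by single-request swaps that lower the metric."""
    # ASSUMPTION: greedy start adds requests by increasing prompt length.
    batch = []
    for i in sorted(range(len(requests)), key=lambda i: requests[i][0]):
        if batch_feasible(batch + [i], requests, M):
            batch.append(i)
    if not batch:
        raise ValueError("no request fits in the cache on its own")
    improved = True
    while improved:
        improved = False
        outside = [j for j in range(len(requests)) if j not in batch]
        for pos in range(len(batch)):
            for j in outside:
                cand = batch[:pos] + [j] + batch[pos + 1:]
                if (batch_feasible(cand, requests, M)
                        and metric(cand, requests) < metric(batch, requests)):
                    batch, improved = cand, True
                    break
            if improved:
                break  # restart the scan from the improved batch
    return sorted(batch)


print(local_swap_batch([(1, 4), (2, 1), (3, 1)], M=7))  # [1, 2]
```

The greedy start picks requests 0 and 1; swapping request 0 for request 2 keeps the batch feasible and lowers the stand-in metric from 1.25 to 0.5. Each accepted swap strictly decreases the metric, so the loop terminates.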
Further, two LP-based schedulers are introduced:
- Sorted-LP: LP relaxation yields fractional start times; jobs are scheduled in increasing order of expected start time.
- LP-Swap: Sorted-LP ordering refined by local $F$-metric swaps.
Phase 2 for all algorithms—online execution—has time and space complexity (Wang et al., 8 Aug 2025).
6. Empirical Analysis
Empirical evaluation was conducted on a workload of 1600 short chat and 400 long summarization requests (2000 in total), served by LLaMA-2 70B on dual A100 GPUs under a fixed KV-cache token budget. The following table summarizes average latency (ms) for the settings shown in the column headers:
| Algorithm | 200 | ... | 2000 |
|---|---|---|---|
| FCFS | 85 | ... | 210 |
| MC-SF | 68 | ... | 147 |
| LP-Swap | 35 | ... | 75 |
| Sorted-F* | 32 | ... | 70 |
(*Sorted-F: quantile-greedy or local-swap variant for large instances.)
In the table, MC-SF improves on FCFS but remains far behind the batch-level methods under high prompt-length variability. At the two reported endpoints, Sorted-F (swap variant) achieves roughly 7 to 9% lower latency than LP-Swap and about 52 to 53% lower than MC-SF; quantile variants offer near-matching quality at linear runtime. These results indicate that batch-level $F$-metric scheduling substantially outperforms token-level MC-SF and FCFS across heterogeneous, realistic workloads (Wang et al., 8 Aug 2025).
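The relative improvements can be recomputed directly from the two endpoint columns of the table. The sketch below uses only the tabulated values; the dictionary layout and the function name `reduction` are illustrative choices.

```python
# Average latency (ms) at the first and last reported columns of the table.
latency_ms = {
    "FCFS": (85, 210),
    "MC-SF": (68, 147),
    "LP-Swap": (35, 75),
    "Sorted-F": (32, 70),
}


def reduction(baseline, method):
    """Percent latency reduction of `method` relative to `baseline`, per column."""
    return tuple(
        round(100 * (b - m) / b, 1)
        for b, m in zip(latency_ms[baseline], latency_ms[method])
    )


print(reduction("LP-Swap", "Sorted-F"))  # (8.6, 6.7)
print(reduction("MC-SF", "Sorted-F"))    # (52.9, 52.4)
print(reduction("FCFS", "MC-SF"))        # (20.0, 30.0)
```

The last line shows that, on these figures, MC-SF is 20 to 30% faster than FCFS while still more than twice as slow as Sorted-F.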
7. Context and Implications
Memory-Constrained Shortest-First (MC-SF) was a natural generalization of conventional Shortest-First to the memory-constrained and heterogeneous context of LLM serving. However, both theoretical and empirical results establish that MC-SF is not competitive in the presence of heterogeneity in prompt and decode lengths. The transition from MC-SF to batch-wise scheduling using a quality metric such as $F$ represents a paradigm shift for high-throughput, memory-bound LLM inference. A plausible implication is that future LLM system schedulers for heterogeneous workloads should avoid MC-SF in favor of batch-level, globally optimized approaches such as Sorted-F and its fast approximations, especially when system memory is the primary bottleneck (Wang et al., 8 Aug 2025).