WAIT Algorithm for Online Scheduling
- WAIT is a family of online scheduling algorithms for managing multi-stage, resource-intensive processes under strict memory constraints.
- It employs threshold-based inventory and batching policies, guided by a fluid benchmark, to balance throughput, latency, and GPU memory usage for LLM inference; the same paradigm extends to active-time scheduling with laminar job windows.
- Empirical evaluations show that WAIT and its Nested variant improve throughput and latency over standard serving baselines in both synthetic and real-world workloads.
The WAIT (Waiting for Accumulated Inference Threshold) algorithm is a family of online scheduling algorithms designed to optimize the deployment of resource-intensive, multi-stage processes, with principal applications to LLM inference under memory constraints and, in a separate line of work, to active-time scheduling for jobs with laminar (nested) time windows. WAIT leverages threshold-based inventory and batching policies to balance throughput, latency, and resource utilization under stringent online and capacity constraints, and its theoretical analysis establishes near-optimal performance in heavy-traffic regimes (Ao et al., 15 Apr 2025, Cao et al., 2022).
1. Scheduling under Resource and Memory Constraints
The WAIT algorithm was introduced to address inefficiencies in conventional scheduling when applied to LLM inference workloads on a single GPU node. In this setting, each online prompt triggers an inference composed of a "prefill" (input) stage and a variable-length "decode" (output) stage, with an associated key-value (KV) cache whose memory footprint grows monotonically until completion and is then released. The system must maintain the invariant that the total KV cache in active memory never exceeds the GPU memory capacity, and scheduler design is complicated by the competing objectives of high throughput (decode tokens per unit time), low average latency, and low time-to-first-token (TTFT).
WAIT conceptualizes scheduling as a multi-stage queueing problem where each incoming request (or its extension, each job in the laminar window regime) transitions across stages, accumulating both progress and resource demand. The goal is to determine when and which subset of jobs should be scheduled so that overall system performance approaches a tractable benchmark derived from a fluid (continuous) relaxation of the underlying stochastic system.
2. Fluid Benchmark and Threshold Design
Central to the WAIT methodology is its fluid benchmark, which approximates the queueing dynamics via a steady-state equilibrium that can be characterized analytically. In the LLM inference setting, this fluid model identifies the optimal batch size and dispatch intervals that maximize throughput while maintaining feasibility under the given memory capacity:
- For each request type, characterized by a Poisson arrival rate, a prefill (input) length, and a decode (output) length, the fluid model prescribes a steady-state inventory of in-flight prompts of that type.
- The total steady-state KV-cache memory is obtained by summing, over types, the inventory weighted by each prompt's average cache footprint.
- The per-iteration processing time and the resulting throughput then follow from the batch composition and the per-token processing cost.
- Thresholds for each request/job type are chosen so that each type is dispatched at a rate matching its arrivals while the induced memory footprint stays within the GPU capacity.
These constraints ensure that scheduling decisions are both stable and respect capacity limits, as they enforce that each job type is processed sufficiently frequently relative to its arrival rate.
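As a rough illustration of fluid relations of this kind, the sketch below uses assumed notation ($\lambda_j$, $L^{\mathrm{in}}_j$, $L^{\mathrm{out}}_j$ for the arrival rate, prefill length, and decode length of type $j$; $x_j$ for its steady-state inventory; $T$ for the per-iteration time; $M$ for the GPU memory budget); these symbols and relations are an illustrative sketch, not the paper's exact model.

```latex
% Illustrative fluid relations (assumed notation; a sketch, not the paper's exact formulas).
\begin{align*}
  x_j &\approx \lambda_j \, L^{\mathrm{out}}_j \, T
      && \text{Little-style inventory: arrival rate times decode sojourn time} \\
  \sum_j x_j \left( L^{\mathrm{in}}_j + \tfrac{1}{2} L^{\mathrm{out}}_j \right) &\le M
      && \text{average KV-cache footprint kept within the memory budget} \\
  \text{throughput} &\approx \frac{1}{T} \sum_j x_j \;\approx\; \sum_j \lambda_j \, L^{\mathrm{out}}_j
      && \text{one decode token per active prompt per iteration}
\end{align*}
```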
3. Algorithmic Structure: WAIT and Nested WAIT
WAIT for Known Output Lengths
When output (decode) lengths are known in advance, WAIT operates as a threshold-based batching policy:
- Maintain, for each type j and each stage k, a count W[j][k] of prompts of that type currently waiting at that stage, together with a dispatch threshold n[j] per type.
- Whenever W[j][0] >= n[j] for some type j, select a batch of at most n[j] prompts at each stage and process this batch, respecting the total KV cache capacity.
Simplified pseudocode:
```
for t in time:
    process_incoming_requests()
    if any(W[j][0] >= n[j]):
        B = {min(n[j], W[j][k]) for each stage k and type j}
        process_batch(B)
        advance_batch_stages(B)
        free_caches_of_completed_prompts()
```
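As a concrete (if simplified) illustration of the policy above, here is a self-contained Python sketch of WAIT-style threshold batching under an assumed request model; the `Request` class, thresholds, capacity, and arrival process are hypothetical stand-ins, not the paper's implementation or any serving stack's API.

```python
# A minimal simulation sketch of WAIT-style threshold batching.
# Request model, thresholds, and capacity are illustrative assumptions.
import random
from collections import deque

CAPACITY = 4096            # assumed GPU KV-cache budget, in tokens
THRESHOLD = {0: 8, 1: 4}   # assumed dispatch thresholds n[j] per request type j


class Request:
    def __init__(self, rtype, prefill, decode):
        self.rtype = rtype       # request type j
        self.prefill = prefill   # prefill (input) length
        self.remaining = decode  # decode tokens still to generate
        self.generated = 0       # decode tokens generated so far

    def kv_tokens(self):
        # The KV cache grows by one token per decode step until completion.
        return self.prefill + self.generated


def dispatch(waiting, active):
    """Admit at most n[j] prompts of type j, but only once at least n[j]
    prompts of that type are waiting and memory permits."""
    for rtype, queue in waiting.items():
        if len(queue) < THRESHOLD[rtype]:
            continue  # WAIT: keep accumulating until the threshold is hit
        admitted = 0
        while queue and admitted < THRESHOLD[rtype]:
            used = sum(r.kv_tokens() for r in active)
            if used + queue[0].kv_tokens() > CAPACITY:
                break  # admitting this prompt would exceed the memory cap
            active.append(queue.popleft())
            admitted += 1


def decode_iteration(active):
    """Generate one decode token for every active request; free finished ones."""
    tokens = len(active)
    for r in active:
        r.generated += 1
        r.remaining -= 1
    active[:] = [r for r in active if r.remaining > 0]  # completed caches freed
    return tokens


if __name__ == "__main__":
    random.seed(0)
    waiting = {0: deque(), 1: deque()}
    active, produced = [], 0
    for t in range(200):
        # Bernoulli arrivals of two synthetic request types per iteration.
        for rtype, rate in ((0, 0.9), (1, 0.3)):
            if random.random() < rate:
                waiting[rtype].append(
                    Request(rtype, prefill=64, decode=random.randint(16, 64)))
        dispatch(waiting, active)
        produced += decode_iteration(active)
    print(f"decode tokens produced: {produced}")
```

The actual scheduler batches prefill and decode stages jointly and tracks per-stage counts W[j][k]; this sketch collapses that to threshold-gated admission plus one decode step per iteration.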
Nested WAIT for Unknown Output Lengths
When output lengths are not known on arrival, the Nested WAIT algorithm introduces a hierarchy of m nested segments, where segment i covers a contiguous block of cumulative decode stages. Thresholds are defined per segment, and batching is triggered only when the current segment's inventory exceeds its threshold. Prompts transition between nested segments as decoding progresses.
Pseudocode sketch:
```
initialize W[i][k] for segments i = 1..m and stages k in block_i
t = 0
while system_running:
    wait_for_arrival_or_batch_completion()
    if arrival:
        W[1][0] += 1
    if end_batch_in_segment_i:
        advance W[i][k], move or free completed prompts
    i_star = max{i : for all j <= i, W[j][first_stage] >= n[j]}
    if i_star > 0:
        B = union_batches_for_all_segments_up_to_i_star()
        process_batch(B)
```
This design enables WAIT to handle jobs with unknown or distributional output requirements while controlling resource risk using adaptive, segment-wise buffering.
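The central decision in this sketch is computing the deepest segment index i_star whose entire prefix of segments meets its thresholds. A small Python illustration, with hypothetical W and n structures mirroring the pseudocode above:

```python
def deepest_ready_segment(W, n, m):
    """Return the largest i such that every segment j <= i has at least n[j]
    prompts waiting at its first stage, or 0 if no prefix qualifies.
    W[i] is a hypothetical list of per-stage counts for segment i (i = 1..m),
    and n[i] that segment's threshold."""
    i_star = 0
    for i in range(1, m + 1):
        if W[i][0] >= n[i]:
            i_star = i   # the prefix 1..i still satisfies every threshold
        else:
            break        # prefix property broken; deeper segments cannot fire
    return i_star

# Example: segments 1 and 2 meet their thresholds, segment 3 does not.
W = {1: [5, 2], 2: [4, 0], 3: [1, 0]}
n = {1: 3, 2: 3, 3: 3}
print(deepest_ready_segment(W, n, m=3))  # -> 2
```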
4. Theoretical Guarantees and Heavy Traffic Analysis
WAIT and Nested WAIT are analyzed in the heavy-traffic regime, obtained by scaling up arrival rates (with correspondingly rescaled time constants), which yields simplified asymptotics:
- For WAIT, the throughput loss relative to the fluid optimum vanishes as the system scales, and it vanishes at a faster rate when the memory constraint has strict slack.
- Average latency and TTFT admit analogous bounds, again tighter under strict slack.
- For Nested WAIT, with proper threshold tuning and buffer sizing, comparable near-optimality guarantees hold, and the memory cap is satisfied with high probability.
The proofs leverage couplings between the true discrete process and Lindley-type recursions, along with application of Kingman's bound and Doob's inequality to establish control over backlog and overflow.
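For reference, the two standard tools named above take the following generic forms (notation here is generic, not the paper's):

```latex
% Lindley recursion for the backlog/waiting-time process of a single-server queue,
% with S_k the k-th service time and A_k the k-th interarrival time:
\[
  W_{k+1} = \max\{\, W_k + S_k - A_k,\; 0 \,\}.
\]
% Kingman's bound on the mean steady-state wait (rho = lambda * E[S] < 1):
\[
  \mathbb{E}[W] \;\le\; \frac{\lambda \left( \sigma_A^2 + \sigma_S^2 \right)}{2\,(1 - \rho)}.
\]
```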
5. Implementation Complexity and Integration
WAIT and Nested WAIT incur minimal computational overhead in practice:
- Each arrival or batch-completion event triggers a constant number of inventory updates.
- Threshold checks and batch selection per iteration require a single scan of the inventory, so their cost scales with the number of request types (or segments) and stages.
- The inventory is maintained in a small array indexed by type (or segment) and stage.
- Memory safety is preserved by holding "waiting" KV caches on GPU but only processing them as scheduled.
- Integration with LLM inference stacks (e.g., vLLM, PyTorch) is direct: the scheduler is called after each completion event to select the next batch, with scheduling overhead that is negligible relative to the roughly 50 ms of model computation per iteration; a schematic sketch of this integration point follows below.
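A minimal, runnable sketch of that integration loop, using entirely hypothetical stub classes rather than the actual vLLM or PyTorch interfaces:

```python
# Schematic of where a WAIT-style scheduler plugs into a serving loop:
# it is invoked after every completion event to pick the next batch.
# StubEngine and ThresholdScheduler are hypothetical stand-ins, not real APIs.
from collections import deque


class StubEngine:
    """Fake inference engine that 'finishes' whatever batch it was last given."""
    def __init__(self, arrivals):
        self.arrivals, self.running = deque(arrivals), []

    def wait_for_batch_completion(self):
        done, self.running = self.running, []
        return done

    def drain_new_requests(self):
        new = list(self.arrivals)
        self.arrivals.clear()
        return new

    def submit(self, batch):
        self.running = list(batch)


class ThresholdScheduler:
    """Dispatches only once at least `threshold` requests are waiting."""
    def __init__(self, threshold):
        self.threshold, self.waiting = threshold, deque()

    def release(self, finished):
        pass  # in a real system, the KV caches of `finished` are freed here

    def admit(self, new_requests):
        self.waiting.extend(new_requests)

    def next_batch(self):
        if len(self.waiting) < self.threshold:
            return []  # WAIT: keep accumulating below the threshold
        return [self.waiting.popleft() for _ in range(self.threshold)]


engine = StubEngine(["r1", "r2", "r3", "r4", "r5"])
scheduler = ThresholdScheduler(threshold=2)
for _ in range(4):  # a few iterations of the serving loop
    scheduler.release(engine.wait_for_batch_completion())
    scheduler.admit(engine.drain_new_requests())
    batch = scheduler.next_batch()
    if batch:
        engine.submit(batch)
    print("dispatched:", batch)
```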
6. Empirical Evaluation
WAIT and Nested WAIT have been empirically benchmarked on Llama2-7B inference using NVIDIA A100 hardware (simulated via Microsoft Vidur) against baselines such as vLLM (FCFS/new-arrival-priority) and Sarathi (ongoing-priority):
| Algorithm | Throughput (k tok/s) | Latency (s) | TTFT (s) |
|---|---|---|---|
| vLLM | 200 (real, QPS 55) | 3.2 | 0.85 |
| Sarathi | 215 | 3.3 | 0.90 |
| WAIT | 365 (synthetic) | 2.7 | 0.80 |
| Nested WAIT | 240 (real, QPS 55) | 2.9 | 0.82 |
Results demonstrate that WAIT increases throughput by up to 9% (synthetic workloads) and Nested WAIT by up to 13% (real-world workloads), with modest reductions in average latency and TTFT. For instance, at QPS = 550 on LMSYS-Chat workloads, Nested WAIT achieves 505k tok/s versus 420k (vLLM) and 445k (Sarathi).
7. WAIT in Nested Active-Time Scheduling
In addition to LLM inference, the WAIT paradigm appears in the "nested active-time scheduling" problem (Cao et al., 2022), where the objective is to minimize the number of active time slots needed to process preemptible jobs whose time windows form a laminar family, on a parallel machine with a fixed per-slot capacity. This formulation also admits a WAIT-like threshold approach, yielding a deterministic 9/5-approximation: an LP relaxation is solved, followed by a bottom-up laminar tree decomposition that merges child interval solutions with controlled additive overhead.
The analysis confirms that the rounding cost is tight up to lower-order corrections, and the complexity for both LP solution and decomposition is polynomial in the number of jobs and time slots, making the scheme practical for large-scale settings with nested constraints.
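For orientation, a standard LP relaxation from the active-time scheduling literature has the form below, with $x_t$ the (fractional) openness of slot $t$, $y_{j,t}$ the work of job $j$ done in slot $t$, $p_j$ its processing requirement, $[r_j, d_j]$ its window, and $g$ the per-slot capacity; the LP in (Cao et al., 2022) follows this general pattern, though its exact formulation may differ.

```latex
\begin{align*}
  \min\quad & \sum_{t} x_t \\
  \text{s.t.}\quad
  & \sum_{t \in [r_j, d_j]} y_{j,t} = p_j && \forall j
      \quad \text{(each job fully processed within its window)} \\
  & y_{j,t} \le x_t && \forall j,\, t
      \quad \text{(at most one unit of a job per open slot)} \\
  & \sum_{j} y_{j,t} \le g \, x_t && \forall t
      \quad \text{(per-slot machine capacity)} \\
  & 0 \le x_t \le 1, \quad y_{j,t} \ge 0 && \forall j,\, t
\end{align*}
```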
Conclusion
The WAIT and Nested WAIT algorithms provide a threshold-based approach to online, memory-constrained scheduling, offering provably near-optimal trade-offs among throughput, latency, and resource compliance in both LLM inference and laminar active-time scheduling. Their tractable fluid benchmarks, clean heavy-traffic asymptotics, and low-overhead implementations enable rigorous, high-performance deployment for real-world large-scale machine learning and scheduling workloads.