WAIT Algorithm for Online Scheduling

Updated 16 November 2025
  • The WAIT Algorithm is a family of online scheduling algorithms that manage multi-stage, resource-intensive processes under strict memory constraints.
  • It employs threshold-based inventory and batching policies, together with a fluid benchmark, to balance throughput, latency, and GPU memory utilization in LLM inference and laminar active-time scheduling.
  • Empirical evaluations show that WAIT and its Nested variant improve throughput and latency over standard baselines, demonstrating their effectiveness in real-world scheduling scenarios.

The WAIT (Waiting for Accumulated Inference Threshold) algorithm is a family of online scheduling algorithms designed to optimize the deployment of resource-intensive, multi-stage processes, with principal applications in LLM inference under memory constraints and in active-time scheduling for jobs with laminar (nested) time windows. WAIT leverages threshold-based inventory and batching policies to balance throughput, latency, and resource utilization under stringent online and capacity constraints, and its theoretical analysis establishes near-optimal performance in heavy-traffic regimes (Ao et al., 15 Apr 2025; Cao et al., 2022).

1. Scheduling under Resource and Memory Constraints

The WAIT algorithm was introduced to address inefficiencies in conventional scheduling when applied to LLM inference workloads on a single GPU node. In this setting, each online prompt triggers an inference composed of a "prefill" (input) stage and a variable-length "decode" (output) stage, with an associated key-value (KV) cache whose memory footprint grows monotonically until completion and is then released. The system must maintain the invariant that the total KV cache in active memory never exceeds the GPU capacity $C$, and the design of an efficient scheduler is complicated by the competing objectives of high throughput (decode tokens per unit time), low average latency, and low time-to-first-token (TTFT).

WAIT conceptualizes scheduling as a multi-stage queueing problem in which each incoming request (or, in the laminar window regime, each job) transitions across $k$ stages, accumulating both progress and resource demand. The goal is to determine when and which subset of jobs should be scheduled so that overall system performance approaches a tractable benchmark derived from a fluid (continuous) relaxation of the underlying stochastic system.
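
The following minimal Python sketch makes the per-request state and the memory invariant concrete; the field names and the token-level memory accounting are illustrative assumptions, not the authors' implementation.

from dataclasses import dataclass

@dataclass
class Request:
    prefill_len: int   # prompt (prefill) tokens, corresponding to l_j
    decode_len: int    # total decode tokens to generate, corresponding to l_j'
    decoded: int = 0   # decode stage reached so far

    def kv_cache_tokens(self) -> int:
        # KV cache grows by one token per decode step until completion
        return self.prefill_len + self.decoded

    def done(self) -> bool:
        return self.decoded >= self.decode_len

def memory_feasible(active: list["Request"], capacity_tokens: int) -> bool:
    # Scheduling invariant: total KV cache of active requests never exceeds
    # the GPU capacity C (measured here in tokens)
    return sum(r.kv_cache_tokens() for r in active) <= capacity_tokens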

2. Fluid Benchmark and Threshold Design

Central to the WAIT methodology is its fluid benchmark, which approximates the queueing dynamics via a steady-state equilibrium that can be characterized analytically. In the LLM inference setting, this fluid model identifies the optimal batch size and dispatch intervals that maximize throughput while maintaining feasibility under the given memory capacity:

  • For request type $j$ with Poisson arrival rate $\lambda_j$, prefill length $\ell_j$, and decode length $\ell_j'$, let $n_j^*$ be the steady-state inventory of type-$j$ requests.
  • The total steady-state memory is given by

$$M^* = \sum_{j=1}^m n_j^* \,(\ell_j + \ell_j'/2)$$

  • The per-iteration time and throughput are

$$\Delta T^* = d_0 + d_1 M^*, \qquad \bar{T}^* = \sum_{j=1}^m \lambda_j (\ell_j' + 1)$$

Thresholds $n_j$ for each request/job type are then chosen to satisfy

$$\Delta T(n_{1:m}) = d_0 + d_1 \sum_{j} n_j (\ell_j' + 1)(\ell_j + \ell_j'/2) \;\leq\; \frac{n_j}{\lambda_j}, \qquad \forall j$$

These constraints ensure that scheduling decisions are both stable and respect capacity limits, as they enforce that each job type is processed sufficiently frequently relative to its arrival rate.
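
As a concrete illustration, the stability condition can be checked numerically for a candidate threshold vector. The sketch below only encodes the formulas stated above; all parameter values are hypothetical.

def delta_T(n, prefill, decode, d0, d1):
    # Per-iteration time as an affine function of the batched memory load
    load = sum(n_j * (dec_j + 1) * (pre_j + dec_j / 2)
               for n_j, pre_j, dec_j in zip(n, prefill, decode))
    return d0 + d1 * load

def thresholds_feasible(n, lam, prefill, decode, d0, d1):
    # Each type j must be served at least as fast as it arrives: Delta_T <= n_j / lambda_j
    dt = delta_T(n, prefill, decode, d0, d1)
    return all(dt <= n_j / lam_j for n_j, lam_j in zip(n, lam))

# Example with two request types (illustrative numbers only)
lam     = [0.5, 0.2]    # arrival rates lambda_j
prefill = [128, 512]    # prefill lengths l_j
decode  = [64, 256]     # decode lengths l_j'
print(thresholds_feasible([4, 2], lam, prefill, decode, d0=0.01, d1=1e-6))  # True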

3. Algorithmic Structure: WAIT and Nested WAIT

WAIT for Known Output Lengths

When output (decode) lengths are known in advance, WAIT operates as a threshold-based batching policy:

  • Maintain, for each type $j$ and each stage $k$, inventory counts $W_{j,k}(t)$.
  • Whenever $W_{j,0}(t) \geq n_j$ for some $j$, select a batch of at most $n_j$ prompts at each stage and process this batch, respecting the total KV cache capacity $C$.

Simplified pseudocode:

for t in time_steps:
    process_incoming_requests()              # enqueue new prompts, update W[j][0]
    if any(W[j][0] >= n[j] for j in types):
        # batch up to n[j] prompts of each type at every stage k
        B = select_batch(min(n[j], W[j][k]) for j in types for k in stages)
        process_batch(B)                     # run one prefill/decode iteration
        advance_batch_stages(B)              # move prompts to their next stage
        free_caches_of_completed_prompts()   # release KV memory of finished prompts

Nested WAIT for Unknown Output Lengths

When output lengths are not known on arrival, the Nested WAIT algorithm introduces a hierarchy of nested segments: the decode stages are partitioned into $m$ segments, where segment $i$ covers a contiguous block of cumulative decode stages. Thresholds $n_1 > n_2 > \cdots > n_m$ are defined such that batching is only triggered when the current segment inventory exceeds its threshold, and prompts transition between nested segments as decoding progresses.

Pseudocode sketch:

initialize W[i][k] for segments i = 1..m and stages k in block_i
while system_running:
    wait_for_arrival_or_batch_completion()
    if arrival:
        W[1][0] += 1                         # new prompts enter the first segment
    if end_of_batch_in_segment(i):
        advance_stage_counts(W[i])           # shift stage counts within segment i
        move_or_free_completed_prompts()     # promote to segment i+1 or release KV cache
    # deepest segment whose entire prefix of segments meets its thresholds
    i_star = max((i for i in range(1, m + 1)
                  if all(W[j][first_stage(j)] >= n[j] for j in range(1, i + 1))),
                 default=0)
    if i_star > 0:
        B = union_batches_for_segments(1, i_star)   # eligible prompts of segments 1..i_star
        process_batch(B)

This design enables Nested WAIT to handle jobs with unknown or distributional output requirements while controlling resource risk through adaptive, segment-wise buffering.
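
For concreteness, the nested trigger can be illustrated with hypothetical thresholds and inventories; the rule selects the deepest segment whose entire prefix of segments meets its thresholds.

# Hypothetical illustration of the nested trigger rule
n       = [8, 4, 2]    # decreasing thresholds n_1 > n_2 > n_3
W_first = [9, 5, 1]    # inventory at the first stage of each segment

i_star = 0
for i, (w, thr) in enumerate(zip(W_first, n), start=1):
    if w >= thr:
        i_star = i     # segments 1..i all meet their thresholds
    else:
        break
print(i_star)  # -> 2: segments 1 and 2 are batched together; segment 3 keeps waiting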

4. Theoretical Guarantees and Heavy Traffic Analysis

WAIT and Nested WAIT are theoretically analyzed in the heavy-traffic regime, which is instantiated by scaling arrival rates and shrinking time constants, yielding simplified asymptotics:

  • For WAIT, the throughput loss relative to the fluid optimum is $O((\zeta T)^{-1/2})$; with strict slack, this improves to $O((\zeta T)^{-1})$.
  • Latency and TTFT are bounded as $O((\zeta T)^{1/2})$ (or $O(1)$ under strict slack).
  • For Nested WAIT, with proper threshold tuning and buffer sizing,

$$\bar{T}^* - \mathbb{E}[\text{Throughput}^{(\zeta)}] = O((\zeta T)^{-1}), \qquad \mathbb{E}[\text{Latency}^{(\zeta)}],\ \mathbb{E}[\text{TTFT}^{(\zeta)}] = O(1),$$

and the memory cap is satisfied with probability at least $1 - \delta$.

The proofs leverage couplings between the true discrete process and Lindley-type recursions, along with application of Kingman's bound and Doob's inequality to establish control over backlog and overflow.
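
For orientation, the coupling argument rests on standard queueing identities; stated generically (not in the paper's exact notation), the Lindley recursion for the waiting time $W_n$ of the $n$-th job, with service time $S_n$ and interarrival time $A_{n+1}$, is

$$W_{n+1} = \max\{0,\; W_n + S_n - A_{n+1}\},$$

and Kingman's bound controls the stationary mean waiting time of a $G/G/1$ queue with utilization $\rho < 1$ and interarrival/service-time variances $\sigma_a^2, \sigma_s^2$:

$$\mathbb{E}[W] \;\leq\; \frac{\lambda\,(\sigma_a^2 + \sigma_s^2)}{2\,(1-\rho)}.$$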

5. Implementation Complexity and Integration

WAIT and Nested WAIT incur minimal computational overhead in practice:

  • Each scheduling event or arrival triggers $O(m)$ inventory updates.
  • Threshold checks and batch selection per iteration cost $O(m \cdot \text{max\_stages})$.
  • Inventory is maintained in an array of size at most $\sum_j (\ell_j' + 1)$.
  • Memory safety is preserved by holding "waiting" KV caches on GPU but only processing them as scheduled.
  • Integration with LLM inference stacks (e.g., vLLM, PyTorch) is direct: the scheduler is called after each completion event to select the next batch, typically with $\leq 1$ ms overhead versus roughly $50$ ms of model computation per iteration (see the sketch below).
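
A minimal sketch of this integration point is given below; the hook name and surrounding event loop are hypothetical and do not correspond to the actual vLLM API.

def wait_schedule(W, n, types, stages):
    # W[j][k]: count of type-j prompts waiting at stage k; n[j]: threshold for type j
    if not any(W[j][0] >= n[j] for j in types):
        return None                        # keep accumulating inventory
    # batch up to n[j] prompts of each type at every stage (memory check elided)
    return {(j, k): min(n[j], W[j][k]) for j in types for k in stages}

# Sketch of the surrounding event loop in an inference server:
#   on arrival(prompt):    W[prompt.type][0] += 1
#   on batch completion:   advance W, free finished KV caches, then
#                          batch = wait_schedule(W, n, types, stages)
#                          if batch is not None: engine.run(batch)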

6. Empirical Evaluation

WAIT and Nested WAIT have been empirically benchmarked on Llama2-7B inference using NVIDIA A100 hardware (simulated via Microsoft Vidur) against baselines such as vLLM (FCFS/new-arrival-priority) and Sarathi (ongoing-priority):

Algorithm    | Throughput (k tok/s) | Latency (s) | TTFT (s)
vLLM         | 200 (real, QPS 55)   | 3.2         | 0.85
Sarathi      | 215                  | 3.3         | 0.90
WAIT         | 365 (synthetic)      | 2.7         | 0.80
Nested WAIT  | 240 (real, QPS 55)   | 2.9         | 0.82

Results demonstrate that WAIT increases throughput by up to 9% (synthetic workloads) and Nested WAIT by up to 13% (real-world workloads), with modest reductions in average latency and TTFT. For instance, at QPS = 550 on LMSYS-Chat workloads, Nested WAIT achieves 505k tok/s versus 420k (vLLM) and 445k (Sarathi), with $\Delta\text{latency} \approx +0.6$ s.

7. WAIT in Nested Active-Time Scheduling

In addition to LLM inference, the WAIT paradigm appears in the "nested active-time scheduling" problem (Cao et al., 2022), where the objective is to minimize the number of active slots needed to process preemptible jobs with laminar window constraints on a parallel machine of slot capacity $g$. This formulation also admits a WAIT-like threshold approach, yielding a deterministic $9/5$-approximation: an LP relaxation is solved first, and a bottom-up laminar tree decomposition (editor's term: "Decompose") then merges child interval solutions with controlled additive overhead.
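
For context, one standard fractional relaxation of active-time scheduling (written here generically; the exact LP used by Cao et al. may differ) opens slot $t$ to extent $x_t$ and assigns $y_{j,t}$ units of job $j$ to slots within its window $w(j)$:

$$\min \sum_t x_t \quad \text{s.t.} \quad \sum_{t \in w(j)} y_{j,t} \geq p_j \;\; \forall j, \qquad \sum_j y_{j,t} \leq g\, x_t \;\; \forall t, \qquad 0 \leq y_{j,t} \leq x_t \leq 1.$$

Rounding such a fractional solution bottom-up over the laminar tree is what incurs the controlled additive overhead described above.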

The analysis confirms that the rounding cost is tight up to lower-order corrections, and the complexity for both LP solution and decomposition is polynomial in the number of jobs and time slots, making the scheme practical for large-scale settings with nested constraints.

Conclusion

The WAIT and Nested WAIT algorithms provide a threshold-based approach to online, memory-constrained scheduling, offering provably near-optimal trade-offs among throughput, latency, and resource compliance in both LLM inference and laminar active-time scheduling. Their tractable fluid-benchmark foundations, heavy-traffic asymptotics, and low-overhead implementations enable rigorous, high-performance deployment for real-world large-scale machine learning and scheduling workloads.
