Fluid-Guided Online Scheduling
- Fluid-guided online scheduling is a paradigm that uses deterministic fluid approximations to convert complex, stochastic systems into tractable optimization problems with provable near-optimality.
- It employs threshold-based and index policies that adapt resource allocation in multi-class environments, achieving significant computational speed-ups and performance gains.
- The approach is applied in areas like LLM inference and distributed processing, effectively balancing throughput, latency, and memory constraints.
Fluid-guided online scheduling refers to a class of real-time scheduling and control policies that leverage tractable fluid (deterministic) approximations of underlying stochastic or high-dimensional queueing/network systems to inform, benchmark, or directly guide online decision-making. These methods are increasingly prominent in large-scale, resource-constrained systems such as distributed processing networks, cloud serving infrastructures, LLM inference engines, and multiclass queueing environments. Fluid-guided algorithms exploit deterministic, usually continuous, “fluid” limits to translate high-complexity online scheduling challenges into optimization problems or stylized policy classes that achieve provable near-optimality or significant practical performance gains within system constraints.
1. Mathematical Formulation and the Role of Fluid Models
The foundation of fluid-guided scheduling lies in describing a dynamic processing or service system using a deterministic "fluid" model—a limiting behavior where job/service arrivals and departures become continuous flows rather than discrete events. For example, in a separated continuous linear program (SCLP) for fluid processing networks, the system is modeled by buffer (queue) levels $x(t)$ and control (service) rates $u(t)$ over a time horizon $[0, T]$, subject to resource and dynamical constraints:

$$\max_{u(\cdot),\,x(\cdot)} \;\int_0^T \bigl(\gamma^\top + (T-t)\,c^\top\bigr)\, u(t)\,dt \;+\; \int_0^T d^\top x(t)\,dt$$

subject to

$$\int_0^t G\,u(s)\,ds + F\,x(t) \le \alpha + a\,t, \qquad H\,u(t) \le b, \qquad x(t) \ge 0,\; u(t) \ge 0, \qquad t \in [0, T],$$

where $G$ is the control-to-buffer routing matrix, $F$ encodes state constraints (such as inventories), $H$ captures capacity constraints, and $\alpha, a, b$ (exogenous data) together with $\gamma, c, d$ (cost/reward vectors) parameterize the instance (Shindin et al., 2021).
Fluid models replace stochastic queue evolution with ODEs or integral constraints, yielding tractable deterministic optimization problems. In LLM inference scheduling, the system is similarly abstracted: prompt arrivals of various types enter pipeline stages, resource (e.g., GPU memory) consumption is represented as fluid variables, and constraints are enforced on the instantaneous or time-averaged usage (Ao et al., 15 Apr 2025).
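To make the abstraction concrete, the sketch below discretizes a single-buffer fluid model and solves it as an ordinary LP with `scipy`. It illustrates only the fluid viewpoint, not the SCLP-simplex method of Shindin et al. (2021); all parameter values are invented.

```python
# Minimal sketch: single-buffer fluid model x'(t) = a - u(t), solved on a grid.
import numpy as np
from scipy.optimize import linprog

T, N = 10.0, 100            # horizon and number of grid cells
dt = T / N
a, cap, x0 = 1.0, 2.0, 5.0  # inflow rate, service capacity, initial buffer level

# Decision variables: service rates u_0..u_{N-1}, one per grid cell.
# Buffer after cell k: x_k = x0 + dt * sum_{j<=k} (a - u_j), required to stay >= 0.
# Total holding cost dt * sum_k x_k equals, up to an additive constant, the linear
# objective below: serving earlier removes fluid from more future cells.
c = -dt * dt * np.arange(N, 0, -1).astype(float)

# Buffer nonnegativity: dt * sum_{j<=k} u_j <= x0 + a * dt * (k + 1) for each k.
A_ub = dt * np.tril(np.ones((N, N)))
b_ub = x0 + a * dt * np.arange(1, N + 1)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0.0, cap)] * N, method="highs")
u_opt = res.x  # fluid-optimal service-rate trajectory on the grid
print("peak service rate used:", u_opt.max())
```

As expected for this toy instance, the solution drains the initial buffer at full capacity and then tracks the inflow rate, the qualitative behavior a fluid-guided policy would mimic online.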
2. Fluid-Guided Online Algorithms and Threshold Policies
A central methodological advance in fluid-guided online scheduling is constructing online decision rules that are explicitly informed by the solution structure or equilibrium of the fluid model. This involves either direct computation of fluid-optimal values, or design of index- or threshold-based policies that closely track the optimal allocation predicted by the fluid system.
In queueing contexts with heterogeneous job classes and service priorities, an archetypal fluid-guided rule maintains a running estimate of system occupancy (e.g., the number of queued tolerant jobs), consults a Pareto-efficient lookup table derived from the fluid trade-off frontier, and implements a randomized or deterministic sub-policy at each state (Chaudhary et al., 2019). For instance, the Pareto-complete scheduling class switches between admission control parameters or blocking probabilities when queue levels exceed thresholds dictated by the fluid limit equations.
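A schematic example of such a rule is given below; the threshold and the two blocking probabilities are invented placeholders that, in practice, would be read off the fluid trade-off frontier of Chaudhary et al. (2019).

```python
# Minimal sketch of a fluid-guided threshold rule for a two-class system
# (eager jobs are admission-controlled, tolerant jobs are queued).
import random

FLUID_THRESHOLD = 50                          # tolerant-queue level from the fluid solution (assumed)
P_BLOCK_LENIENT, P_BLOCK_STRICT = 0.05, 0.30  # sub-policy blocking probabilities (assumed)

def admit_eager_job(tolerant_queue_len: int) -> bool:
    """Randomized admission decision for an arriving eager job."""
    # Below the fluid threshold, use the lenient sub-policy; above it,
    # protect tolerant jobs by blocking eager arrivals more aggressively.
    p_block = P_BLOCK_LENIENT if tolerant_queue_len < FLUID_THRESHOLD else P_BLOCK_STRICT
    return random.random() >= p_block
```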
In LLM inference systems, the "WAIT" and "Nested WAIT" algorithms are constructed to meet fluid equilibrium thresholds for the number of prompts of each type in each stage, so that decisions about when to launch processing batches track the capacity and allocation constraints of the deterministic fluid system. These algorithms maintain per-stage prompt counters and trigger batch processing when thresholds—explicitly computed from the fluid solution—are met (Ao et al., 15 Apr 2025).
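A minimal sketch of this counter-and-threshold mechanic follows; the grouping into (type, stage) pairs, the names, and the threshold values are illustrative assumptions rather than the exact construction of Ao et al. (2025).

```python
from collections import defaultdict

# Fluid-derived batch-launch thresholds per (prompt type, pipeline stage);
# keys and values below are illustrative placeholders.
THRESHOLDS = {("short", "prefill"): 8, ("short", "decode"): 32,
              ("long", "prefill"): 4, ("long", "decode"): 16}

counters = defaultdict(int)  # number of waiting prompts per (type, stage)

def on_prompt_ready(ptype: str, stage: str, launch_batch) -> None:
    """Register a waiting prompt; launch a batch once the fluid threshold is hit."""
    key = (ptype, stage)
    counters[key] += 1
    if counters[key] >= THRESHOLDS[key]:
        launch_batch(key, counters[key])  # hand the accumulated prompts to the engine
        counters[key] = 0                 # start accumulating the next batch
```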
3. Fluid Limit Analysis, Conservation Laws, and Trade-Off Frontiers
Analyzing the fluid-limit system often reveals pseudo-conservation laws or aggregate performance relationships that fundamentally constrain the achievable trade-offs. For example, in two-class queues (eager and tolerant jobs), as formalized in the stylized sketch after this list:
- The tolerant class sees an effective service rate given by the leftover capacity after accounting for blocking and service of the eager class, which is a function only of the stand-alone blocking probability for the eager class.
- Mean sojourn time and blocking probability for the two classes are linked by a “fluid conservation law”: for a given blocking policy, one can precisely locate the achievable pair on a trade-off curve (Chaudhary et al., 2019).
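As a stylized illustration of such a conservation relation (the symbols and functional form are a simplified sketch, not the exact expressions of Chaudhary et al., 2019):

```latex
% C: total service capacity, \lambda_E, \lambda_T: eager/tolerant arrival rates,
% B_E: eager blocking probability, W_T: mean tolerant sojourn time (fluid limit).
\mu_T^{\mathrm{eff}} \;=\; C - \lambda_E\,(1 - B_E),
\qquad
W_T \;\approx\; \frac{1}{\mu_T^{\mathrm{eff}} - \lambda_T}.
```

Varying the single knob $B_E$ then traces out a one-dimensional curve of achievable $(B_E, W_T)$ pairs, which is exactly the trade-off frontier the online policy consults.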
Similarly, for resource-constrained inference serving (LLMs under GPU memory limits), the limiting throughput and memory consumption are computed by balancing average arrival and completion rates through fluid equations; constraints on average memory lead directly to capacity constraints on achievable throughput and latency (Ao et al., 15 Apr 2025).
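A stylized form of this balance (illustrative notation, not that of Ao et al., 2025) ties throughput to the memory cap via a Little's-law-type argument:

```latex
% \theta_i: fluid throughput of type-i prompts, m_i: average KV-cache/memory
% footprint, s_i: average residence time in the system, M: GPU memory cap.
\sum_i \theta_i\, m_i\, s_i \;\le\; M,
\qquad
0 \;\le\; \theta_i \;\le\; \lambda_i .
```

Any throughput vector outside this region would, in the fluid limit, push average memory occupancy above the cap, so the constraint directly carves out the achievable throughput–latency region.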
The Pareto frontier of achievable performance—such as pairs of mean latency and throughput—can be characterized by threshold or mixing policies parameterized to trace out this fluid-optimal trade-off.
4. Online Implementation and Receding-Horizon Control
Practical deployment of fluid-guided online scheduling hinges on warm-startable algorithms and event-driven updating schemes. In SCLP-based systems, rolling-horizon control operates as follows (a minimal code sketch appears after the list):
- Upon arrival of new information (e.g., job burst, capacity change), the current state is truncated to the present time.
- Constraints and inflow data are updated, and the SCLP is re-solved for the remainder of the horizon, typically by continuing the previous solution path—resulting in minimal additional computational overhead (Shindin et al., 2021).
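The following sketch captures this event-driven loop; `solve_sclp` and the `plan`/`event` objects are hypothetical placeholders standing in for the warm-started SCLP-simplex machinery of Shindin et al. (2021).

```python
# Minimal sketch of event-driven rolling-horizon control (all APIs are assumed).
def rolling_horizon_loop(events, horizon_end, solve_sclp, initial_state):
    state, plan = initial_state, None
    for event in events:                      # e.g., job burst, capacity change
        t_now = event.time
        # 1. Truncate: read off the buffer levels implied by the previous plan at t_now.
        state = plan.state_at(t_now) if plan is not None else state
        # 2. Update exogenous data (inflows, capacities) revealed by the event.
        data = event.updated_data()
        # 3. Re-solve the fluid problem on [t_now, horizon_end], warm-starting from
        #    the previous solution so only a small amount of extra work is needed.
        plan = solve_sclp(state, data, t_start=t_now, t_end=horizon_end,
                          warm_start=plan)
        plan.apply_until_next_event()         # execute the fluid-optimal controls
    return plan
```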
In batch scheduling contexts (such as LLM serving), each scheduling cycle (or batch) is controlled by fluid-derived thresholds; the system only launches batches when expected buffer buildup matches the fluid equilibrium required to maintain optimal throughput and latency. The “Nested WAIT” design adapts to additional uncertainty (e.g., unknown output lengths) by partitioning into nested segments with type-specific thresholds, coupled via their evolution in the fluid model (Ao et al., 15 Apr 2025).
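One plausible rendering of the nested-segment bookkeeping is sketched below; the segment boundaries, thresholds, and the coupling rule are simplifying assumptions, not the precise design of Ao et al. (2025).

```python
# Prompts with unknown output lengths are tracked by decoding progress ("segments");
# each nested segment carries its own fluid-derived threshold. Values are illustrative.
SEGMENT_EDGES = (0, 128, 512)        # decoded-token counts delimiting the segments
SEGMENT_THRESHOLDS = (32, 16, 8)     # assumed fluid thresholds, one per segment

def segment_of(decoded_tokens: int) -> int:
    """Map a prompt's decoding progress to its nested segment index."""
    return sum(decoded_tokens >= edge for edge in SEGMENT_EDGES) - 1

def should_launch(segment_counts: list[int]) -> bool:
    """One simple coupling: launch a batch only when every segment's count
    has reached its own fluid threshold."""
    return all(n >= t for n, t in zip(segment_counts, SEGMENT_THRESHOLDS))
```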
5. Computational and Theoretical Performance
Empirical studies consistently report that fluid-guided online scheduling yields computational advantages and strong optimality guarantees in large-scale or heavy-traffic regimes:
- In large-scale SCLP instances, the revised SCLP-simplex algorithm produces exact optimal values significantly faster than finely discretized LP solvers, with speed-ups that grow with problem size (Shindin et al., 2021).
- The WAIT/Nested WAIT algorithms in LLM inference sustain higher throughput over state-of-the-art online batching baselines (such as vLLM and Sarathi) without exceeding memory limits, with latency remaining comparable or slightly improved (Ao et al., 15 Apr 2025).
- Theoretical results establish that, under heavy-traffic scaling, the gap between the fluid-optimal throughput and the throughput achieved by the policy becomes asymptotically negligible, while latency/TTFT remains controlled provided the fluid threshold inequalities are strictly satisfied.
- Standard policies such as FCFS may incur throughput deficits even when resource constraints are met, underlining the necessity of fluid-informed designs for optimality.
6. Broader Impact, Limitations, and Extensions
Fluid-guided scheduling bridges methodologies from operations research (control, queueing theory, large deviation analysis) and machine learning (resource-efficient model serving, adaptive inference). The approach:
- Enables tractable, provably near-optimal solutions in settings with high-dimensionality and dynamic constraints, avoiding the prohibitive overhead of full stochastic optimization or exhaustive simulation.
- Extends readily to model-predictive and robust/receding horizon control, as well as to hybrid systems combining discrete and continuous elements.
Notable limitations include:
- Reliance on accurate knowledge of key parameters (arrival rates, task sizes) to set thresholds; adaptive thresholding under nonstationary or adversarial workloads remains an open area.
- Extensions to multi-resource, multi-agent, and distributed settings may require new fluid approximations to account for inter-node/memory/pipeline bottlenecks.
- Some classes of time-varying or semi-infinite constraint systems remain beyond current fluid-guided scheduling theory (Shindin et al., 2021).
A plausible implication is that further integration of fluid models with learning-based or data-driven parameter estimation may yield scalable and self-tuning schedulers for highly heterogeneous, unpredictable environments.
7. Representative Applications
Fluid-guided online scheduling has demonstrated efficacy across several domains:
| Domain | Key Challenge | Fluid-Guided Solution (Example) |
|---|---|---|
| Semiconductor fabs | High-dimensional process flow | SCLP-simplex for transitional scheduling |
| Multiclass queues | Blocking vs. delay tradeoff | Pareto-complete threshold policies |
| LLM inference serving | Dynamic memory bottlenecks | WAIT & Nested WAIT for batch scheduling |
For LLM inference, prompt arrivals of mixed-length and variable output are scheduled to maximize throughput and minimize latency subject to strict memory caps by precomputing thresholds via a fluid model and applying them directly to scheduling decisions online (Ao et al., 15 Apr 2025). In classical service networks, fluid-optimal policies inform admission control, server splitting, and resource allocation to balance loss and delay in non-convex trade-off regimes (Chaudhary et al., 2019).
Fluid-guided online scheduling is thus a unifying paradigm for resource-constrained, high-load, and multi-class service environments, offering both rigorous theoretical guarantees and documented empirical impact.