Load-Aware Request Scheduling

Updated 18 February 2026
  • Load-Aware Request Scheduling is a dynamic methodology that assigns and reorders service requests based on real-time load, service-level objectives, and resource constraints.
  • It employs adaptive policies and multi-objective optimizations, such as decentralized auctions and hybrid dispatch strategies, to improve throughput and reduce latency.
  • Practical implementations in cloud, LLM inference, and networked systems have demonstrated significant throughput gains, latency reductions, and fairer resource utilization.

Load-aware request scheduling is the systematic assignment and reordering of service requests—spanning cloud environments, LLM inference clusters, networked systems, and beyond—based on dynamic system load, service-level objectives, and resource constraints. Modern research emphasizes schemes that adapt in real time to fluctuating arrival patterns, heterogeneous request workloads, and intricate resource bottlenecks. The core objective is to maximize throughput and minimize (often tail) latency or deadline violations while optimizing auxiliary metrics such as fairness, resource utilization, and cache affinity.

1. Core Principles and Motivations

Load-aware request scheduling originated in classical queueing theory and distributed systems, but its centrality has intensified with the rise of cloud-scale services and resource-intensive AI inference workloads. The driving principles include:

  • Dynamic Load Sensing: Real-time observation and quantification of load metrics (e.g., pending tokens in LLM engines (Yuan et al., 6 Feb 2026), aggregate queue lengths (Yuan et al., 22 Dec 2025), or CPU/memory utilization (Chhabra et al., 2022)).
  • Adaptive Scheduling Policies: Dynamic modification of request assignments and priorities contingent on current system state (e.g., shifting between cache-affinity and load balance (Yuan et al., 6 Feb 2026), migration in response to hotspots (Yuan et al., 22 Dec 2025), or runtime re-partitioning of request classes (Sidik et al., 29 Jan 2026)).
  • Hybrid and Multi-objective Optimization: Simultaneous optimization for multiple, potentially conflicting goals—such as tight SLO attainment, fairness, throughput, and resource conservation.
  • Scalability and Statelessness: Distributed or stateless algorithms that maintain efficiency as cluster scale and load increase (Da et al., 5 Aug 2025, Lutz et al., 2013), minimizing centralized bottlenecks.

The imperative for load-awareness is particularly acute in systems characterized by high arrival-rate variability, diversified workload structure, and stringent SLOs (e.g., interactive LLM APIs, agentic pipelines (Peng et al., 8 May 2025), or ultra-low-latency networked applications).

2. Load Measurement and System Modeling

Load-aware schedulers exhibit substantial diversity in how load is defined and measured, adapted to distinct problem domains:

| System/Domain | Load Metric(s) | Capacity Constraints |
|---|---|---|
| LLM Serving (DualMap, L4, Block, FlowKV) | Pending prefill tokens, active KV cache, GPU memory | GPU memory, per-instance queue length |
| Cloud Data Centers (DRALB) | Weighted sum of normalized resource demands | CPU, memory, energy, bandwidth |
| Networked Systems (ATLAS) | Fractional time demand ("persistence") per node | Channel capacity, interference region |
| FaaS/Serverless (OpenWhisk + SEPT) | Predicted or historical function durations | Number of cores, CPU time slots |

Load is typically aggregated at instance- or queue-level, taking into account predicted service times, real queue lengths, or multidimensional resource footprints (e.g., token counts, bytes, FLOPS) (Da et al., 5 Aug 2025, Chhabra et al., 2022). Advanced methods extract contextual and workload features per request (e.g., prompt and expected output length for LLMs (Yuan et al., 22 Dec 2025, Da et al., 5 Aug 2025); resource vectors for VM placement (Chhabra et al., 2022)) to enable accurate cost simulation and predictive dispatch.
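A weighted, normalized load aggregate of this kind can be sketched in a few lines. The snippet below is illustrative only: the field names (`pending_prefill_tokens`, `kv_cache_tokens`, `kv_cache_capacity`) and weights are hypothetical, not taken from any of the cited systems.

```python
from dataclasses import dataclass

@dataclass
class InstanceLoad:
    """Hypothetical per-instance load snapshot (field names are illustrative)."""
    pending_prefill_tokens: int   # queued prompt tokens awaiting prefill
    kv_cache_tokens: int          # tokens currently resident in the KV cache
    kv_cache_capacity: int        # KV-cache budget in tokens

def normalized_load(inst: InstanceLoad,
                    w_queue: float = 0.5, w_mem: float = 0.5) -> float:
    """Weighted sum of normalized load dimensions, in the spirit of
    multi-resource metrics like DRALB's weighted resource demand."""
    cap = max(inst.kv_cache_capacity, 1)
    queue_term = inst.pending_prefill_tokens / cap
    mem_term = inst.kv_cache_tokens / cap
    return w_queue * queue_term + w_mem * mem_term
```

A dispatcher would compute this score per instance and route new requests toward the minimum.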

3. Representative Algorithms and Strategies

Distributed Hashing and Power-of-Two-Choices

Recent LLM serving systems (DualMap (Yuan et al., 6 Feb 2026)) employ dual-hash mappings to assign each request to two candidate instances based on a function of request content (e.g., prompt prefix). Tie-breaking and routing decisions employ TTFT (Time-to-First-Token) estimation and enforce SLO-aware load balancing using the power-of-two-choices principle. Analytically, this reduces the maximum load deviation to O(log log n), compared to O(log n) for single-choice assignment.

Priority and Urgency-Based Queues

Short-job bias and urgency-driven scheduling are realized in FaaS (Żuk et al., 2022), LLM serving (Sidik et al., 29 Jan 2026), and batch scheduling. Techniques include predicted-duration ordering (SEPT), queue-based partitioning (EWSJF: Effective Workload-based Shortest Job First (Sidik et al., 29 Jan 2026)), adaptive urgency scores (Peng et al., 8 May 2025), and density-weighted scoring (combining request length, wait time, and expected resource consumption). These reduce head-of-line blocking and balance per-queue and per-workload throughput.
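A density-weighted priority queue of this kind might look like the sketch below. The scoring function and its weights are hypothetical, chosen only to show the shape of the idea: shorter, cheaper requests run first, while accumulated wait time raises a request's priority to prevent starvation.

```python
import heapq
import itertools

_tiebreak = itertools.count()  # stable FIFO order among equal scores

def priority_score(req_len: int, wait_time: float, est_cost: float,
                   w_len: float = 1.0, w_wait: float = 0.5) -> float:
    """Illustrative density-weighted score (lower = runs sooner):
    penalize length and expected cost, reward time already waited."""
    return w_len * req_len + est_cost - w_wait * wait_time

heap: list = []

def submit(req_id: str, req_len: int, wait_time: float, est_cost: float) -> None:
    heapq.heappush(heap, (priority_score(req_len, wait_time, est_cost),
                          next(_tiebreak), req_id))

def next_request() -> str:
    """Pop the highest-priority (lowest-score) request."""
    return heapq.heappop(heap)[2]
```

Note how the aging term (`w_wait * wait_time`) eventually promotes a long request past fresh short ones, which is the standard remedy for the starvation risk inherent in pure shortest-job-first.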

Hierarchical and Two-Tier Dispatch

Hierarchical approaches separate global dispatch (e.g., balancing load and hardware suitability (Peng et al., 8 May 2025, Chhabra et al., 2022)) from local execution (adaptive priority/urgency queues). Hexgen-Text2SQL (Peng et al., 8 May 2025) leverages a parameter α to interpolate between minimization of local queue occupancy and service time, optimized in real time via trace-driven simulation. GoRouting in PROSERVE (Huang et al., 15 Dec 2025) simulates batch gain to assign requests to instances with maximal projected SLO gain, dynamically reserving capacity for high-value arrivals.
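The α-interpolated global dispatch can be sketched in a few lines. This is a simplified reading of the idea, not Hexgen-Text2SQL's code: it assumes both signals are pre-normalized to [0, 1] and uses hypothetical dictionary keys.

```python
def dispatch(instances: list[dict], alpha: float) -> int:
    """Global-tier dispatch sketch: pick the instance minimizing an
    alpha-weighted blend of local queue occupancy and estimated service
    time. alpha=1 is pure load balancing; alpha=0 is pure speed matching."""
    def score(inst: dict) -> float:
        return (alpha * inst["queue_occupancy"]
                + (1 - alpha) * inst["est_service_time"])
    return min(range(len(instances)), key=lambda i: score(instances[i]))
```

In the papers above, the value of α itself is tuned online (e.g., via trace-driven simulation) rather than fixed, which is what makes the dispatch load-aware over time.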

Distributed Auctions and Decentralized Adaptation

The REACT auction protocol in ATLAS (Lutz et al., 2013) enables fully distributed, asynchronous channel slot allocation. Nodes exchange minimal metadata (offer and claim bytes), yielding rapid, lexicographic max-min allocations and continuous adaptation to topology and load changes. This principle extends to real-time rebalancing protocols in L4 (Yuan et al., 22 Dec 2025), where decentralized bid–ask procedures balance loads within and across length-specialized pipelines.
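One round of such a bid-ask rebalance can be illustrated with a deliberately minimal sketch. This is not REACT or L4's protocol: it compresses the offer/claim exchange into a single synchronous step and migrates one load unit at a time, purely to show the mechanism.

```python
def rebalance_step(loads: dict[str, int], threshold: int = 2) -> dict[str, int]:
    """One simplified bid-ask round: the most-loaded node posts an 'offer',
    the least-loaded node posts a 'claim', and one unit of load migrates
    if their gap exceeds a threshold (hysteresis against thrashing)."""
    seller = max(loads, key=loads.get)   # overloaded node offering work
    buyer = min(loads, key=loads.get)    # underloaded node claiming work
    if loads[seller] - loads[buyer] > threshold:
        loads = dict(loads)              # work on a copy
        loads[seller] -= 1
        loads[buyer] += 1
    return loads
```

Iterating this step converges toward a max-min-fair allocation; the real protocols achieve the same effect asynchronously with minimal per-message metadata.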

Disaggregation and Stage Pipelining

In L4 (Yuan et al., 22 Dec 2025), FlowKV (Li et al., 3 Apr 2025), and Staggered Batch Scheduling (Tian et al., 18 Dec 2025), clusters are explicitly partitioned by request or resource characteristics (input length, prefill vs. decode), forming pipelines of length-specialized, role-adaptive, or workload-homogeneous groups. Dynamic programming and periodic runtime refinement determine group boundaries for optimal throughput and minimal kurtosis in latency.
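The dynamic-programming step for choosing group boundaries can be illustrated with the classic contiguous-partition formulation. This is a generic sketch, not the cited systems' algorithm: it assumes requests are bucketed by length, and splits the buckets into k contiguous groups so that the heaviest group is as light as possible.

```python
def partition_min_max(loads: list[int], k: int) -> int:
    """DP over contiguous partitions: dp[j][i] is the best achievable
    maximum group load when loads[:i] is split into j groups.
    Returns the minimized heaviest-group load."""
    n = len(loads)
    prefix = [0]
    for x in loads:                      # prefix sums for O(1) range totals
        prefix.append(prefix[-1] + x)
    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            for m in range(j - 1, i):    # last group spans loads[m:i]
                cost = max(dp[j - 1][m], prefix[i] - prefix[m])
                dp[j][i] = min(dp[j][i], cost)
    return dp[k][n]
```

In a length-specialized pipeline, each resulting group would be assigned its own set of instances; rerunning the DP periodically is what makes the boundaries track workload drift.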

4. SLA and Deadline Awareness

Practical schedulers explicitly model SLOs, integrating request-level deadlines into batch admission (SABER (Chang et al., 24 Jun 2025)), latency prediction (Block (Da et al., 5 Aug 2025)), and token-level scheduling (PROSERVE (Huang et al., 15 Dec 2025)). A common technique involves dynamic estimation of system throughput (as a function of active concurrency), enabling precise admission control to maximize goodput—the count of requests completed within SLA thresholds—and to preemptively reject or defer infeasible arrivals (Chang et al., 24 Jun 2025, Huang et al., 15 Dec 2025). This results in improved SLO attainment under load, e.g., 26% higher goodput and up to 45% lower latency variability in CodeLLM serving (Chang et al., 24 Jun 2025).
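The admission-control pattern described above can be sketched as follows. The snippet is a simplification under stated assumptions: `throughput_fn` stands in for the dynamically estimated concurrency-to-throughput curve, and service time is modeled as tokens divided by throughput, ignoring queueing ahead of the request.

```python
from typing import Callable

def admit(queue_depth: int, deadline_s: float,
          throughput_fn: Callable[[int], float], est_tokens: int) -> bool:
    """Goodput-oriented admission sketch: estimate this request's
    completion time from the concurrency-dependent throughput, and
    admit it only if it can finish within its SLA deadline."""
    tput = throughput_fn(queue_depth + 1)      # throughput if we admit
    est_finish = est_tokens / max(tput, 1e-9)  # naive service-time model
    return est_finish <= deadline_s
```

Rejecting (or deferring) infeasible arrivals up front is what keeps doomed requests from consuming capacity that feasible ones need, which is the mechanism behind the goodput gains reported above.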

5. Analytical Guarantees and Empirical Results

Empirical evaluation across LLM inference clusters, FaaS platforms, and datacenter simulators consistently validates the premise that real-time, load-aware, and often decentralized scheduling outperforms static or heuristic dispatch, especially under bursty, resource-heterogeneous, or SLO-stringent workloads.

6. Scalability, Generalization, and Limitations

Load-aware request scheduling frameworks are designed to scale:

  • Distributed stateless schedulers (Block (Da et al., 5 Aug 2025), ATLAS (Lutz et al., 2013)) eliminate global bottlenecks via per-instance prediction and local context.
  • Decentralized rebalancing (L4 (Yuan et al., 22 Dec 2025)) and partitioned queues (EWSJF (Sidik et al., 29 Jan 2026)) ensure responsiveness to workload shifts, hardware failures, or scaling events.
  • Queue- and context-driven approaches extend naturally to multi-objective (cost, energy, fairness) settings, multi-tenant deployments, and heterogeneous devices (e.g., A100, H20, L40 GPUs).

Limitations include sensitivity to workload feature selection (e.g., heavy reliance on prompt length can fail in irregular workloads (Sidik et al., 29 Jan 2026)), meta-optimization timescales (slow adaptation under extreme bursts), and challenges in globally coordinated, multi-model, or cross-node settings.

7. Intersections with Related Domains

The theoretical and applied advances in load-aware request scheduling intersect with:

  • Resource allocation in cloud/FaaS (e.g., DRALB (Chhabra et al., 2022), DDLS (Alizadeh et al., 2012)): Use of queue-based, cost-sensitive MPC for coordinated admission and start time selection.
  • MAC channel allocation and wireless scheduling (ATLAS (Lutz et al., 2013)): Lex-max-min fairness and piggybacked control signals realize ultra-fast adaptation.
  • Multi-priority and gain-maximizing systems (PROSERVE (Huang et al., 15 Dec 2025)): Explicit design for weighted priorities, token-level deadline gains, and capacity reservation for high-value tasks.

Future research will further integrate semantic request profiling, end-to-end learning (e.g., GP/Bayesian optimization of meta-parameters (Sidik et al., 29 Jan 2026)), multi-resource and multi-objective criteria, and global–local coordination for multi-tenant, multi-workflow AI services.
