Online Scheduling for LLM Inference
- Online scheduling for LLM inference is a dynamic process that batches requests and allocates hardware resources in real time, addressing variable output lengths and memory constraints.
- Adaptive methods like meta-learning, prefix-aware in-context refinement, and hybrid scheduling dynamically tune parameters to achieve significant reductions in latency and improvements in throughput.
- Integrating efficient cache management and dynamic resource allocation yields measurable performance gains, exemplified by up to 96% throughput improvement and substantial latency reduction.
Online scheduling for LLM inference refers to the dynamic, real-time decision process that arranges incoming inference requests for efficient execution on resource-constrained hardware while meeting performance targets such as latency, throughput, memory usage, and Service Level Objectives (SLOs). LLM inference systems face unique bottlenecks: unpredictable output lengths, rapidly growing key-value (KV) caches, high concurrency, heterogeneous workloads, and stringent latency demands in production environments. Modern scheduling frameworks optimize not only queue order and request batching, but also engine parameters, cache reuse, hardware allocation, and algorithmic adaptation to workload dynamics.
1. Problem Formulation and Key Bottlenecks
Online LLM inference scheduling is formalized as solving, at each decision point, both the grouping of requests into inference batches and the configuration of serving parameters (batch size, number of concurrent sequences, scheduler delay, memory allocation). For a workload characterized by prompt arrivals, output-length uncertainty, and a fixed GPU KV-cache budget, the scheduler must select parameters (max-batched-tokens, max-num-seqs, scheduler-delay-factor) that minimize metrics such as end-to-end latency within a bounded number of tuning trials (the tuning budget), while respecting concurrency and memory limits (Wang et al., 11 Jul 2025). This formulation is pervasive across online log parsing (Wang et al., 11 Jul 2025), multi-GPU serving (Li et al., 28 Jul 2025), edge-cloud orchestration (Yang et al., 23 May 2024), and user-facing chatbot deployments (Fu et al., 28 Aug 2024, Tao et al., 25 Sep 2025).
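To make this concrete, the following minimal sketch names the decision variables as a configuration object and checks the batching constraints for one candidate batch. The parameter names mirror common vLLM-style engine knobs, and the token-level KV accounting (prompt plus expected output tokens per request) is an illustrative simplification rather than the exact formulation of the cited work.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class SchedulerConfig:
    max_batched_tokens: int        # per-iteration budget of batched prompt tokens
    max_num_seqs: int              # cap on concurrently scheduled sequences
    scheduler_delay_factor: float  # how long requests may be held before batching

def is_feasible(cfg: SchedulerConfig,
                batch: List[Tuple[str, int, int]],
                kv_budget_tokens: int) -> bool:
    """Check one candidate batch (request_id, prompt_tokens, expected_output_tokens)
    against the serving constraints; KV usage per request is approximated as
    prompt plus expected output tokens."""
    if len(batch) > cfg.max_num_seqs:
        return False
    if sum(prompt for _, prompt, _ in batch) > cfg.max_batched_tokens:
        return False
    projected_kv = sum(prompt + out for _, prompt, out in batch)
    return projected_kv <= kv_budget_tokens
```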
A pivotal challenge is the management of the KV-cache, whose growth is tied to both prompt and output lengths. Under uncertainty in output length, conservative (upper-bound) scheduling can sharply degrade performance, whereas adaptive policies based on lower-bound predictions approach near-optimal throughput (Chen et al., 20 Aug 2025). For high-volume log streams, inference efficiency—not parser accuracy—is the dominant bottleneck (Wang et al., 11 Jul 2025).
2. Algorithmic Approaches to Online Scheduling
2.1 Meta-Learning and Adaptive Tuning
InferLog leverages a hybrid meta-learning pipeline: offline attention-augmented MAML trains a latency-predictor across historical workloads, embedding features such as mean token length, template entropy, and concurrency. Online adaptation uses SMBO (Sequential Model-Based Optimization) warm-started from the meta-model, selecting engine configurations via expected improvement, thereby finding configurations within 2% of the optimal latency after only 15 trials—substantially fewer than vanilla Bayesian or random search (Wang et al., 11 Jul 2025).
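A minimal sketch of the warm-started online phase is shown below, assuming configurations are represented as distinct numeric tuples and that `meta_predict` is the offline-trained latency predictor. The distance-weighted surrogate and greedy acquisition are simplifications standing in for InferLog's attention-augmented MAML model and expected-improvement criterion.

```python
def warm_started_smbo(candidates, meta_predict, measure_latency, trials=15):
    """Warm-started sequential model-based optimization (illustrative sketch).

    `candidates` is a list of distinct numeric config tuples, `meta_predict(cfg)`
    is the offline meta-learned latency predictor, and `measure_latency(cfg)`
    runs the engine under `cfg` and returns observed latency. Greedy surrogate
    minimization stands in for a full expected-improvement acquisition.
    """
    observed = {}

    def surrogate(cfg):
        if not observed:
            return meta_predict(cfg)                  # pure prior before any trials
        num = den = 0.0
        for seen, lat in observed.items():            # distance-weighted observations
            w = 1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(cfg, seen)))
            num, den = num + w * lat, den + w
        return 0.5 * meta_predict(cfg) + 0.5 * num / den   # blend prior and evidence

    for _ in range(min(trials, len(candidates))):
        cfg = min((c for c in candidates if c not in observed), key=surrogate)
        observed[cfg] = measure_latency(cfg)
    return min(observed, key=observed.get)            # best configuration found
```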
2.2 Prefix-Aware In-Context Learning (ICL) Refinement
For log parsing workloads, the PAIR policy maximizes prefix caching by reshaping each request's ICL demonstration set so that its initial examples match cached prefixes, making the corresponding KV blocks bit-identical and maximizing reuse. Empirically, PAIR boosts cache hit-rate from ~55% to >80%, reducing prefill times by 40–50%, which is crucial for high-concurrency serving (Wang et al., 11 Jul 2025).
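The core reordering idea can be sketched as follows, under the assumption that demonstrations are represented as strings and that the serving engine caches KV blocks for previously seen demonstration prefixes; the function and parameter names are hypothetical.

```python
def reorder_demonstrations(demos, cached_prefixes):
    """Prefix-aware reordering of an ICL demonstration set (PAIR-style sketch).

    `demos` is the demonstration set chosen for a request; `cached_prefixes`
    is a list of demonstration sequences whose KV blocks are already cached.
    Place the longest usable cached prefix first, in its cached order, so the
    serialized prompt shares a bit-identical prefix with earlier requests.
    """
    demo_set = set(demos)
    best = []
    for prefix in cached_prefixes:
        usable = []
        for d in prefix:
            if d not in demo_set:
                break                       # prefix reuse stops at the first miss
            usable.append(d)
        if len(usable) > len(best):
            best = usable
    tail = [d for d in demos if d not in set(best)]
    return best + tail                      # cached examples first, remainder appended
```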
2.3 Resource-Aware Hybrid Scheduling
Hybrid architectures (e.g., NEO, APEX) exploit asymmetric pipelining and load-aware partitioning to split decoding across GPU-resident and CPU-offloaded sub-batches. Each iteration balances compute between GPU linear/attention and CPU attention, maximizing parallel resource usage while respecting memory and compute bounds. APEX employs profiling-informed, dynamic dispatch to maximize CPU–GPU overlap, yielding up to 96% throughput improvement on constrained hardware with no latency penalty (Fan et al., 3 Jun 2025, Jiang et al., 2 Nov 2024).
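A hedged sketch of a load-aware split is shown below: requests with the largest KV caches are moved to the CPU side as long as doing so reduces the slower of the two sub-batch finish times. The linear cost model and its constants are placeholders, not measured NEO/APEX profiles.

```python
def split_decode_batch(requests, gpu_cost_per_seq_us=2.0, cpu_cost_per_kv_token_us=0.05):
    """Load-aware GPU/CPU split of a decode batch (illustrative sketch).

    Each request is (request_id, kv_cache_tokens). GPU time is modelled as
    linear in the number of GPU-resident sequences; CPU attention time is
    linear in offloaded KV tokens. Constants are placeholders, not profiles.
    """
    gpu = sorted(requests, key=lambda r: r[1], reverse=True)  # largest contexts first
    cpu = []

    def gpu_time():
        return gpu_cost_per_seq_us * len(gpu)

    def cpu_time():
        return cpu_cost_per_kv_token_us * sum(kv for _, kv in cpu)

    while gpu:
        _, kv = gpu[0]
        moved_cpu = cpu_time() + cpu_cost_per_kv_token_us * kv
        moved_gpu = gpu_cost_per_seq_us * (len(gpu) - 1)
        if max(moved_cpu, moved_gpu) < max(cpu_time(), gpu_time()):
            cpu.append(gpu.pop(0))          # offloading this request improves balance
        else:
            break
    return gpu, cpu                         # GPU-resident and CPU-offloaded sub-batches
```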
2.4 Memory and Uncertainty-Constrained Policies
Memory-Constrained Shortest-First (MC-SF) greedily packs as many short-output requests as fit within the projected future KV-cache budget, prioritizing in-flight jobs to clear memory and minimize latency. For unknown output lengths, adaptively robust policies (A_min) schedule based on the predicted lower bound and refine predictions during inference, achieving a provably bounded competitive ratio for latency, in contrast to the far weaker guarantees of upper-bound policies under broad uncertainty (Chen et al., 20 Aug 2025, Jaillet et al., 10 Feb 2025).
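A minimal sketch of this admission rule, assuming lower-bound output-length predictions and token-level KV accounting, could look as follows; the tuple layout and helper names are illustrative.

```python
def admit_requests(running, waiting, kv_budget_tokens):
    """Memory-constrained shortest-first admission (illustrative sketch).

    `running` and `waiting` hold (request_id, prompt_tokens, predicted_output_tokens),
    where the prediction is a lower bound refined as decoding proceeds. Peak KV
    usage per request is approximated as prompt plus predicted output tokens.
    """
    def peak_kv(req):
        _, prompt, out = req
        return prompt + out

    used = sum(peak_kv(r) for r in running)           # in-flight jobs keep their memory
    admitted = []
    for req in sorted(waiting, key=lambda r: r[2]):   # shortest predicted output first
        if used + peak_kv(req) <= kv_budget_tokens:
            admitted.append(req)
            used += peak_kv(req)
    return admitted
```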
Fluid-guided threshold policies (WAIT and Nested WAIT) approximate the batching equilibrium of a multi-stage fluid relaxation, achieving provable near-optimality under heavy load and memory constraints (Ao et al., 15 Apr 2025).
3. Integration of Cache Management and Prefix Reuse
Prefix reuse and efficient cache management are central to minimizing decoding cost and latency. PAIR and related policies refine prompt construction or batch order to maximize KV reuse across requests. RadixAttention and longest-prefix-match (LPM) scheduling variants formalize the trade-off between fairness (via FCFS resets) and prefix reuse (via longest-match selection), with hybrid scheduling strictly improving tail latencies (P99 TTFT reduced by up to 30%) (Dexter et al., 7 Feb 2025).
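One hedged reading of such a hybrid policy is sketched below: the scheduler normally picks the waiting request with the longest cached prefix but periodically falls back to strict FCFS so requests with rare prefixes are not starved. The reset period and the `cached_prefix_len` helper are assumptions, not the cited algorithms' exact parameters.

```python
def pick_next(waiting, cached_prefix_len, step, fcfs_reset_every=8):
    """Hybrid selection between prefix reuse and FCFS fairness (sketch).

    `waiting` is an FCFS-ordered list of (request_id, prompt);
    `cached_prefix_len(prompt)` returns how many prompt tokens hit the prefix
    cache. Every `fcfs_reset_every` scheduling steps the policy reverts to
    strict FCFS so requests with uncommon prefixes are not starved.
    """
    if not waiting:
        return None
    if step % fcfs_reset_every == 0:
        return waiting[0]                                       # fairness reset: oldest first
    return max(waiting, key=lambda r: cached_prefix_len(r[1]))  # otherwise maximize reuse
```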
Adapter caching for LoRA environments (Chameleon) further generalizes this, combining multi-queue token budgets and cache residency scores to prevent head-of-line blocking and starvation, yielding up to 80.7% P99 TTFT reduction and 1.5× throughput (Iliakopoulou et al., 24 Nov 2024).
4. Dynamic Scheduling Workflows
A typical online serving workflow arranges incoming requests into a queue, immediately refines prompts for cache reuse, holds requests until scheduler-delay limits are met, batches requests subject to max-token and max-seq constraints, performs prefill with maximal cache reuse, and finally admits newly parsed KV blocks into an eviction-managed cache (Wang et al., 11 Jul 2025).
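The sketch below strings these steps together as one serving loop, assuming prompts are token lists and that `pop_request`, `refine_prompt`, `run_prefill_decode`, and `kv_cache.admit` are stand-ins for the engine's own interfaces rather than any specific framework's API.

```python
import time
from collections import deque

def serve_loop(pop_request, refine_prompt, run_prefill_decode, kv_cache,
               max_batched_tokens=8192, max_num_seqs=64, delay_s=0.01):
    """One online serving loop following the workflow above (sketch).

    Prompts are token lists, so len() gives token counts. `pop_request()`
    returns (request_id, prompt) or None; `refine_prompt` rewrites a prompt
    for cache reuse (e.g. PAIR-style reordering); `run_prefill_decode(batch)`
    executes an engine step and returns (request_id, kv_blocks) pairs; and
    `kv_cache.admit` stores new KV blocks under its eviction policy.
    """
    queue = deque()
    while True:
        req = pop_request()
        if req is not None:
            rid, prompt = req
            queue.append((rid, refine_prompt(prompt)))   # refine immediately on arrival
        time.sleep(delay_s)                              # scheduler delay window
        batch, tokens = [], 0
        while queue and len(batch) < max_num_seqs:
            rid, prompt = queue[0]
            if tokens + len(prompt) > max_batched_tokens:
                break                                    # respect the per-iteration token budget
            queue.popleft()
            batch.append((rid, prompt))
            tokens += len(prompt)
        if batch:
            for rid, kv_blocks in run_prefill_decode(batch):  # prefill reuses cached prefixes
                kv_cache.admit(rid, kv_blocks)                # eviction-managed admission
```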
At the multi-instance or edge-cloud scale, dynamic rescheduling (Llumnix) coordinates live migration of requests and associated KV-cache across model instances to rebalance load, mitigate fragmentation, and prioritize latency-sensitive requests. This context switching lowers P99 prefill latency by up to 15× and yields 36% cost savings in production clusters (Sun et al., 5 Jun 2024).
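A minimal rebalancing rule in this spirit might look as follows; the free-memory watermarks and the choice of the cheapest-to-move request are illustrative assumptions rather than Llumnix's actual migration policy.

```python
def pick_migration(instances, low_watermark=0.10, high_watermark=0.40):
    """Choose a (source, target, request) live-migration candidate (sketch).

    `instances` maps instance_id -> {"free_frac": float, "requests": [(rid, kv_tokens)]}.
    Migrate away from an instance whose free KV-cache fraction is below
    `low_watermark` toward one above `high_watermark`, picking the request with
    the smallest KV footprint so the migration itself stays cheap.
    """
    sources = [i for i, s in instances.items()
               if s["free_frac"] < low_watermark and s["requests"]]
    targets = [i for i, s in instances.items() if s["free_frac"] > high_watermark]
    if not sources or not targets:
        return None                                         # cluster is balanced enough
    src = min(sources, key=lambda i: instances[i]["free_frac"])
    dst = max(targets, key=lambda i: instances[i]["free_frac"])
    victim = min(instances[src]["requests"], key=lambda r: r[1])
    return src, dst, victim
```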
Cluster-wide two-layer frameworks (SynergySched) employ structurally-informed, online performance models at both engine and cluster routers, predicting batch latency, memory capacity, and cache affinity, to proactively orchestrate routing and batching. This closes information gaps, improving SLO attainment by 43% and delivering up to 3× throughput speedup under high load and heterogeneity (Zhang et al., 27 Sep 2025).
5. Empirical Results and Performance Benchmarks
Across diverse benchmarks and real-world traces, online scheduling methods consistently outperform FCFS and static heuristics:
- InferLog net latency reduction: 43% compared to baseline; throughput increase: 2.14×; prefill time halved (Wang et al., 11 Jul 2025).
- NEO/Hybrid: throughput gains range 14%–750% depending on GPU-class; APEX achieves 49%–96% gain on memory-constrained hardware with preserved end-to-end latency (Jiang et al., 2 Nov 2024, Fan et al., 3 Jun 2025).
- Fluid-guided WAIT/Nested-WAIT: 5%–25% throughput improvements at minor latency cost (Ao et al., 15 Apr 2025).
- Learning-to-rank schedulers (Efficient LLM Scheduling, PARS): 30–40% latency cuts, up to 6.5× throughput increase; cross-model generalization remains strong (Fu et al., 28 Aug 2024, Tao et al., 25 Sep 2025).
- Chameleon: up to 80% improvement in P99 TTFT under many-adapter workloads (Iliakopoulou et al., 24 Nov 2024).
- Llumnix: up to 15× reduction in prefill latency for bursty and heterogeneous workloads (Sun et al., 5 Jun 2024).
- SLS/SmartLLMs: 198% average performance gain, 63% reduction in processing time for multi-model job assignment with adaptive caching and cost-aware scheduling (Liu et al., 5 Aug 2025).
- SLICE (edge): up to 35× increase in SLO attainment and multi-fold speedups in average completion time for heterogeneous real-time workloads (Zhou et al., 21 Oct 2025).
A summary of performance improvements:
| Framework | Latency / SLO improvement | Throughput / efficiency gain |
|---|---|---|
| InferLog | 43% (p95 latency) | 2.14× |
| NEO/Hybrid | end-to-end latency preserved | 14–750% (1.2–7.5×) |
| Chameleon | 80.7% (P99 TTFT) | 1.5–1.9× |
| SynergySched | 43% (SLO attainment) | up to 3× |
| Llumnix | up to 15× (prefill) | up to 36% cost saved |
| SLICE | up to 35× (SLO att.) | 3.4× (completion) |
6. Trade-offs, Fairness, and Future Directions
Trade-offs between latency, throughput, memory usage, and fairness are central. Prefix reuse and large batches can harm tail latency if not counterbalanced by fairness guards (e.g. FCFS resets or starvation-aware promotions). Virtual-time fair queuing (Justitia) achieves provable long-term fairness guarantees, bounding application completion time within a small additive delay of the ideal fluid schedule (Yang et al., 19 Oct 2025).
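A compact sketch of virtual-time fair queuing across applications is given below; charging each application's virtual clock by token count (as a proxy for memory and compute) is an assumption, not necessarily Justitia's exact cost model.

```python
class VirtualTimeFairQueue:
    """Weighted fair queuing across applications by virtual time (sketch).

    Each application's virtual clock advances by tokens_consumed / weight, and
    the next request is drawn from the application with the smallest virtual
    time, so long-term service is proportional to weights.
    """
    def __init__(self, weights):
        self.weights = weights                         # app_id -> positive weight
        self.vtime = {app: 0.0 for app in weights}     # per-app virtual clocks
        self.queues = {app: [] for app in weights}     # per-app FIFO request queues

    def submit(self, app, request):
        self.queues[app].append(request)

    def next_request(self):
        ready = [a for a, q in self.queues.items() if q]
        if not ready:
            return None
        app = min(ready, key=lambda a: self.vtime[a])  # least-served application first
        return app, self.queues[app].pop(0)

    def charge(self, app, tokens):
        self.vtime[app] += tokens / self.weights[app]  # advance its virtual clock
```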
Multi-objective policies (LeMix) balance throughput, SLO adherence, and model quality in unified scoring models, demonstrating that hierarchical assignment and dynamic adjustment are necessary for shared inference/training workloads in training-serving-unified systems (Li et al., 28 Jul 2025).
Emergent research directions include:
- Integration of reinforcement learning and stochastic programming for scheduler policy adaptation (Pang et al., 14 Feb 2025).
- Proactive, cross-layer orchestration in multi-tenant and heterogeneous clusters (Zhang et al., 27 Sep 2025).
- Information-efficient uncertainty reduction and Bayesian profiling in compound LLM applications (Zhu et al., 4 Apr 2025).
- Dynamic edge-cloud allocation under energy, network, and latency constraints (Yang et al., 23 May 2024, Zhou et al., 21 Oct 2025).
- Cost-aware, performance-driven selection from pools of available LLMs with adaptive cache and feedback mechanisms (Liu et al., 5 Aug 2025).
- Theoretical analysis of fairness and competitive bounds under memory-centric cost models (Yang et al., 19 Oct 2025, Chen et al., 20 Aug 2025).
7. Technical Best Practices and Implementation Insights
- Utilize meta-learning or hybrid optimization for rapid configuration tuning under workload dynamics (Wang et al., 11 Jul 2025).
- Maximize prefix KV-cache reuse (via PAIR, RadixAttention, or LPM-based policies) to lower prefill and end-to-end latency for frequent-template workloads (Dexter et al., 7 Feb 2025, Wang et al., 11 Jul 2025).
- Profile hardware and calibrate performance models to enable effective hybrid scheduling and dynamic batching (Jiang et al., 2 Nov 2024, Fan et al., 3 Jun 2025).
- Structure scheduling policies to prioritize in-flight jobs for memory clearance, and batch as many short-output requests as safely permitted (Jaillet et al., 10 Feb 2025).
- Employ adaptive learning-to-rank or neural-predictor-based task sorting to approximate SJF/SRTF in practice, as in the sketch after this list (Fu et al., 28 Aug 2024, Tao et al., 25 Sep 2025).
- Protect fairness via virtual-time queuing and starvation-prevention mechanisms, especially in multi-application and multi-tenant setups (Yang et al., 19 Oct 2025).
- When serving compound LLM applications, incorporate uncertainty quantification and Bayesian inference to prioritize uncertainty-reducing stages and optimize job completion time (Zhu et al., 4 Apr 2025).
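As referenced above, a minimal sketch of predictor-based ordering is shown here; the predictor interface and tuple layout are hypothetical, and only the relative ranking of requests matters.

```python
def rank_waiting_requests(waiting, predict_remaining_tokens):
    """Approximate SJF/SRTF ordering with a learned predictor (sketch).

    `waiting` holds (request_id, prompt, generated_so_far); the predictor can be
    any learned model (ranking or regression) estimating remaining output length.
    Only the relative order matters, so a ranking model suffices in practice.
    """
    return sorted(
        waiting,
        key=lambda r: predict_remaining_tokens(r[1], r[2]),  # shortest predicted remaining first
    )
```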
In sum, online scheduling for LLM inference represents a rapidly evolving intersection of systems, scheduling theory, and ML-optimized resource management. It is foundational for scaling high-concurrency applications and production AI services under strict latency, memory, and QoS constraints, with the synthesis of meta-learning, cache-aware prompt management, and adaptive resource partitioning now demonstrably critical for cutting-edge, real-world deployments.