Online Scheduling for LLM Inference

Updated 18 November 2025
  • Online scheduling for LLM inference is a dynamic process that batches requests and allocates hardware resources in real time, addressing variable output lengths and memory constraints.
  • Adaptive methods like meta-learning, prefix-aware in-context refinement, and hybrid scheduling dynamically tune parameters to achieve significant reductions in latency and improvements in throughput.
  • Integrating efficient cache management and dynamic resource allocation yields measurable performance gains, exemplified by up to 96% throughput improvement and substantial latency reduction.

Online scheduling for LLM inference refers to the dynamic, real-time decision process that arranges incoming inference requests for efficient execution on resource-constrained hardware, under performance objectives such as latency, throughput, memory usage, and Service Level Objectives (SLOs). LLM inference systems face unique bottlenecks: unpredictable output lengths, a rapidly growing key-value (KV) cache, high concurrency, heterogeneous workloads, and stringent latency demands in production environments. Modern scheduling frameworks optimize not only queue order and request batching, but also engine parameters, cache reuse, hardware allocation, and algorithmic adaptation to workload dynamics.

1. Problem Formulation and Key Bottlenecks

Online LLM inference scheduling is formalized as solving, at each decision point, both the grouping of requests into inference batches and the configuration of serving parameters (batch size, number of concurrent sequences, scheduler delay, memory allocation). For a workload characterized by prompt arrivals, output-length uncertainty, and a fixed GPU KV-cache budget $M$, the scheduler must select parameters $c = (c_1, c_2, c_3)$ (max-batched-tokens, max-num-seqs, scheduler-delay-factor) to minimize metrics such as end-to-end $p_{95}$ latency $f(c)$ under the constraint $\sum_{k=1}^{K} f(c_k) \leq T_\text{budget}$ (where $K$ is the tuning budget), along with concurrency and memory limitations (Wang et al., 11 Jul 2025). This formulation is pervasive across online log parsing (Wang et al., 11 Jul 2025), multi-GPU serving (Li et al., 28 Jul 2025), edge-cloud orchestration (Yang et al., 23 May 2024), and user-facing chatbot deployments (Fu et al., 28 Aug 2024, Tao et al., 25 Sep 2025).
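
The loop implied by this formulation can be sketched as follows. This is a minimal sketch, not any paper's implementation: `measure_p95_latency` is a hypothetical benchmark hook (a short workload replay), the candidate grid is illustrative, and a fixed per-trial cost stands in for the measured $f(c_k)$ terms.

```python
import itertools
import random

def measure_p95_latency(max_batched_tokens, max_num_seqs, scheduler_delay_factor):
    """Hypothetical benchmark hook: replay a short workload slice with this engine
    configuration and return the measured end-to-end p95 latency in seconds."""
    random.seed(hash((max_batched_tokens, max_num_seqs, scheduler_delay_factor)))
    return random.uniform(0.5, 3.0)  # stand-in for a real measurement

# Candidate grid for c = (c1, c2, c3) = (max-batched-tokens, max-num-seqs, scheduler-delay-factor).
GRID = list(itertools.product([2048, 4096, 8192], [64, 128, 256], [0.0, 0.25, 0.5]))

def tune(tuning_budget_s, per_trial_cost_s=30.0):
    """Evaluate configurations until the tuning-time budget (sum of trial costs <= T_budget)
    is exhausted; return the best configuration found and its measured p95 latency."""
    best_cfg, best_p95, spent = None, float("inf"), 0.0
    for cfg in random.sample(GRID, len(GRID)):
        if spent + per_trial_cost_s > tuning_budget_s:
            break  # respect the tuning budget
        p95 = measure_p95_latency(*cfg)
        spent += per_trial_cost_s
        if p95 < best_p95:
            best_cfg, best_p95 = cfg, p95
    return best_cfg, best_p95

if __name__ == "__main__":
    print(tune(tuning_budget_s=300.0))  # at most 10 trials fit a 300 s budget here
```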

A pivotal challenge is the management of the KV-cache, whose growth is tied to both prompt and output lengths. Under uncertainty in output length, conservative (upper-bound) scheduling can sharply degrade performance, whereas adaptive policies based on lower-bound predictions approach near-optimal throughput (Chen et al., 20 Aug 2025). For high-volume log streams, inference efficiency—not parser accuracy—is the dominant bottleneck (Wang et al., 11 Jul 2025).

2. Algorithmic Approaches to Online Scheduling

2.1 Meta-Learning and Adaptive Tuning

InferLog leverages a hybrid meta-learning pipeline: offline, attention-augmented MAML trains a latency predictor $f_\theta$ across historical workloads, embedding features such as mean token length, template entropy, and concurrency. Online adaptation uses SMBO (Sequential Model-Based Optimization) warm-started from the meta-model, selecting engine configurations via expected improvement and finding configurations within 2% of the optimal latency after only 15 trials, substantially fewer than vanilla Bayesian or random search (Wang et al., 11 Jul 2025).
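
A compressed sketch of this warm-started loop is shown below. The meta-learned predictor and the measurement routine are stand-ins, and the surrogate simply shifts the meta-model's mean by the average observed residual with a fixed uncertainty; a real SMBO would fit both from data. Names do not correspond to InferLog's API.

```python
import math
import random

def meta_predict(cfg):
    """Stand-in for the meta-learned latency predictor f_theta over (config, workload features)."""
    tokens, seqs, delay = cfg
    return 0.0004 * tokens / seqs + 0.3 * delay + 0.8

def measure_latency(cfg):
    """Stand-in for a real benchmark trial of one engine configuration."""
    return meta_predict(cfg) * random.uniform(0.85, 1.15)

def expected_improvement(mu, sigma, best):
    """EI of a Gaussian surrogate prediction (mu, sigma) against the incumbent best (minimization)."""
    if sigma <= 0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (best - mu) * cdf + sigma * pdf

def smbo(candidates, trials=15):
    """SMBO warm-started from the meta-model: start at its predicted optimum, then pick
    the unevaluated configuration with the highest expected improvement each round."""
    observed = {}
    start = min(candidates, key=meta_predict)          # warm start
    observed[start] = measure_latency(start)
    for _ in range(trials - 1):
        best = min(observed.values())
        # Crude surrogate: meta-model mean shifted by the average observed residual.
        resid = sum(observed[c] - meta_predict(c) for c in observed) / len(observed)
        sigma = 0.1 * best                              # fixed uncertainty for the sketch
        nxt = max((c for c in candidates if c not in observed),
                  key=lambda c: expected_improvement(meta_predict(c) + resid, sigma, best))
        observed[nxt] = measure_latency(nxt)
    return min(observed, key=observed.get)

if __name__ == "__main__":
    grid = [(t, s, d) for t in (2048, 4096, 8192) for s in (64, 128, 256) for d in (0.0, 0.25, 0.5)]
    print(smbo(grid))
```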

2.2 Prefix-Aware In-Context Learning (ICL) Refinement

For log parsing workloads, the PAIR policy maximizes prefix caching by reshaping ICL demonstration sets DS so that their initial $k$ examples match cached prefixes, thereby making KV blocks bit-identical and maximizing reuse. Empirically, PAIR boosts cache hit-rate from ~55% to >80%, reducing prefill times by 40–50%, which is crucial for high-concurrency serving (Wang et al., 11 Jul 2025).
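
A minimal sketch of the prefix-aware reordering idea follows; it assumes demonstrations carry stable ids and that the serving layer can report which demonstration ids currently sit at the head of a cached prefix (both assumptions, not PAIR's actual interfaces).

```python
def reorder_for_prefix_reuse(demos, cached_prefix_ids):
    """Reorder ICL demonstrations so that the first k examples match the cached prefix
    exactly (keeping those KV blocks bit-identical), then append the rest in order."""
    by_id = {d["id"]: d for d in demos}
    prefix, used = [], set()
    for demo_id in cached_prefix_ids:
        if demo_id in by_id:
            prefix.append(by_id[demo_id])
            used.add(demo_id)
        else:
            break  # only a contiguous match from the start of the prompt is reusable
    tail = [d for d in demos if d["id"] not in used]
    return prefix + tail

if __name__ == "__main__":
    demos = [{"id": "d3"}, {"id": "d1"}, {"id": "d2"}]
    print([d["id"] for d in reorder_for_prefix_reuse(demos, cached_prefix_ids=["d1", "d2", "d5"])])
    # -> ['d1', 'd2', 'd3']: d1 and d2 hit the cached prefix; d3 follows unchanged.
```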

2.3 Resource-Aware Hybrid Scheduling

Hybrid architectures (e.g., NEO, APEX) exploit asymmetric pipelining and load-aware partitioning to split decoding across GPU-resident and CPU-offloaded sub-batches. Each iteration balances compute between GPU linear/attention and CPU attention, maximizing parallel resource usage while respecting memory and compute bounds. APEX employs profiling-informed, dynamic dispatch to maximize CPU–GPU overlap, yielding up to 96% throughput improvement on constrained hardware with no latency penalty (Fan et al., 3 Jun 2025, Jiang et al., 2 Nov 2024).
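
The load-aware split can be illustrated with a toy partitioner that balances the estimated attention time of GPU-resident and CPU-offloaded sub-batches while respecting the GPU KV budget; the throughput numbers and field names are illustrative, not NEO's or APEX's profiling model.

```python
def split_decode_batch(requests, gpu_kv_budget_tokens, gpu_tok_per_s, cpu_tok_per_s):
    """Load-aware split of one decode iteration: each request's attention runs on the
    device that currently finishes earlier, subject to the GPU KV-cache budget, so the
    GPU and CPU sub-batches execute in parallel with balanced completion times."""
    gpu_batch, cpu_batch = [], []
    gpu_tokens = cpu_tokens = 0
    for req in sorted(requests, key=lambda r: r["kv_tokens"], reverse=True):
        gpu_finish = (gpu_tokens + req["kv_tokens"]) / gpu_tok_per_s
        cpu_finish = (cpu_tokens + req["kv_tokens"]) / cpu_tok_per_s
        fits_gpu = gpu_tokens + req["kv_tokens"] <= gpu_kv_budget_tokens
        if fits_gpu and gpu_finish <= cpu_finish:
            gpu_batch.append(req)
            gpu_tokens += req["kv_tokens"]
        else:
            cpu_batch.append(req)
            cpu_tokens += req["kv_tokens"]
    # The iteration time is the slower of the two parallel sub-batches.
    step_time = max(gpu_tokens / gpu_tok_per_s, cpu_tokens / cpu_tok_per_s)
    return gpu_batch, cpu_batch, step_time

if __name__ == "__main__":
    reqs = [{"id": i, "kv_tokens": n} for i, n in enumerate([4000, 3500, 1200, 800, 600])]
    gpu, cpu, t = split_decode_batch(reqs, gpu_kv_budget_tokens=6000,
                                     gpu_tok_per_s=2e6, cpu_tok_per_s=2e5)
    print([r["id"] for r in gpu], [r["id"] for r in cpu], f"{t * 1e3:.1f} ms")
```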

2.4 Memory and Uncertainty-Constrained Policies

Memory-Constrained Shortest-First (MC-SF) greedily packs as many short-output requests as fit within the future KV-cache budget, prioritizing in-flight jobs to clear memory and minimize latency. For unknown output lengths, adaptively robust policies ($A_{\min}$) schedule based on the lower bound and refine predictions during inference, achieving an $O(\log(\alpha^{-1}))$ competitive ratio for latency, compared to a much worse $\alpha^{-2}$ dependence for upper-bound policies under broad uncertainty (Chen et al., 20 Aug 2025, Jaillet et al., 10 Feb 2025).
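
A simplified admission step in the spirit of MC-SF is sketched below: in-flight requests keep their reservations, and waiting requests are admitted shortest-predicted-output-first while their peak KV footprint still fits the budget. The field names and the (lower-bound) length predictions are assumptions, not the paper's exact policy.

```python
def mc_sf_admit(in_flight, waiting, kv_budget_tokens):
    """Memory-constrained shortest-first admission for one scheduling step.

    in_flight: requests already decoding; each reserves prompt + predicted output KV.
    waiting:   queued requests with a (lower-bound) predicted output length.
    Returns the list of newly admitted requests."""
    reserved = sum(r["prompt_tokens"] + r["predicted_output_tokens"] for r in in_flight)
    admitted = []
    # Shortest predicted output first: short jobs free memory sooner and cut queueing delay.
    for req in sorted(waiting, key=lambda r: r["predicted_output_tokens"]):
        peak = req["prompt_tokens"] + req["predicted_output_tokens"]
        if reserved + peak <= kv_budget_tokens:
            admitted.append(req)
            reserved += peak
    return admitted

if __name__ == "__main__":
    in_flight = [{"id": "a", "prompt_tokens": 900, "predicted_output_tokens": 300}]
    waiting = [
        {"id": "b", "prompt_tokens": 500, "predicted_output_tokens": 800},
        {"id": "c", "prompt_tokens": 400, "predicted_output_tokens": 100},
        {"id": "d", "prompt_tokens": 600, "predicted_output_tokens": 200},
    ]
    print([r["id"] for r in mc_sf_admit(in_flight, waiting, kv_budget_tokens=2600)])
    # -> ['c', 'd']: 'b' would exceed the 2600-token KV budget this step.
```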

Fluid-guided threshold policies WAIT/Nested-WAIT approximate multi-stage batch equilibrium, achieving provable near-optimality under heavy load and memory constraints (Ao et al., 15 Apr 2025).
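
A toy threshold rule conveys the flavor of such policies: arrivals are held until the queue reaches a threshold, then dispatched as one batch. The threshold here is a placeholder, not the fluid-derived value from the paper.

```python
from collections import deque

class WaitThresholdScheduler:
    """Hold arriving prompts until at least `threshold` are queued, then release a batch
    of up to `max_batch` requests, so larger batches amortize per-step overhead."""

    def __init__(self, threshold, max_batch):
        self.threshold = threshold
        self.max_batch = max_batch
        self.queue = deque()

    def arrive(self, request):
        self.queue.append(request)

    def maybe_dispatch(self):
        if len(self.queue) < self.threshold:
            return []  # keep waiting
        n = min(self.max_batch, len(self.queue))
        return [self.queue.popleft() for _ in range(n)]

if __name__ == "__main__":
    sched = WaitThresholdScheduler(threshold=4, max_batch=8)
    for i in range(6):
        sched.arrive(f"req-{i}")
    print(sched.maybe_dispatch())  # all six dispatch at once since the threshold is met
```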

3. Integration of Cache Management and Prefix Reuse

Prefix reuse and efficient cache management are central to minimizing decoding cost and latency. PAIR and related policies refine prompt construction or batch order to maximize KV reuse across requests. RadixAttention and $k$-LPM algorithms formalize the trade-off between fairness (via FCFS resets) and prefix reuse (via longest-match selection), showing that hybrid scheduling strictly improves tail latencies (P99 TTFT reduction by up to 30%) (Dexter et al., 7 Feb 2025).
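
The hybrid rule can be sketched as follows: pick by longest cached-prefix match, but every $k$-th decision fall back to FCFS so prefix-poor requests cannot starve. The prefix-length lookup here is a stand-in for a radix-tree query, not the papers' implementations.

```python
import os

def longest_cached_prefix(prompt, cached_prefixes):
    """Stand-in for a radix-tree lookup: length of the longest cached prefix of `prompt`."""
    return max((len(os.path.commonprefix([prompt, p])) for p in cached_prefixes), default=0)

class KLpmScheduler:
    """Pick by longest prefix match, but every k-th pick fall back to FCFS as a fairness reset."""

    def __init__(self, k, cached_prefixes):
        self.k = k
        self.cached_prefixes = cached_prefixes
        self.picks = 0

    def pick(self, queue):
        """`queue` holds (arrival_order, prompt) tuples; removes and returns one entry."""
        self.picks += 1
        if self.picks % self.k == 0:
            chosen = min(queue, key=lambda item: item[0])   # FCFS reset
        else:
            chosen = max(queue, key=lambda item: longest_cached_prefix(item[1], self.cached_prefixes))
        queue.remove(chosen)
        return chosen

if __name__ == "__main__":
    sched = KLpmScheduler(k=3, cached_prefixes=["You are a log parser.", "Translate to French:"])
    q = [(0, "Summarize this ticket"), (1, "You are a log parser. Parse: <...>"),
         (2, "Translate to French: hello")]
    print([sched.pick(q)[0] for _ in range(3)])  # -> [1, 2, 0]
```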

Adapter caching for LoRA environments (Chameleon) further generalizes this, combining multi-queue token budgets and cache residency scores to prevent head-of-line blocking and starvation, yielding up to 80.7% P99 TTFT reduction and 1.5× throughput (Iliakopoulou et al., 24 Nov 2024).
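
As a rough illustration (not Chameleon's algorithm), per-adapter queues can be served with a round token budget, preferring adapters already resident in the cache and capping each adapter's share to avoid head-of-line blocking:

```python
def schedule_round(queues, resident_adapters, round_token_budget, per_adapter_budget):
    """One scheduling round over per-adapter FIFO queues.

    queues: {adapter_id: [requests with 'prompt_tokens']}. Adapters already resident
    in the (LoRA) cache are served first to avoid reload stalls; per-adapter token
    budgets keep one hot adapter from monopolizing the round."""
    order = sorted(queues, key=lambda a: a not in resident_adapters)
    picked, remaining = [], round_token_budget
    for adapter in order:
        adapter_budget = min(per_adapter_budget, remaining)
        q = queues[adapter]
        while q and q[0]["prompt_tokens"] <= adapter_budget:
            req = q.pop(0)
            picked.append((adapter, req))
            adapter_budget -= req["prompt_tokens"]
            remaining -= req["prompt_tokens"]
        if remaining <= 0:
            break
    return picked

if __name__ == "__main__":
    queues = {
        "adapter-hot": [{"id": 1, "prompt_tokens": 1500}, {"id": 2, "prompt_tokens": 1500}],
        "adapter-cold": [{"id": 3, "prompt_tokens": 1000}],
    }
    picked = schedule_round(queues, resident_adapters={"adapter-hot"},
                            round_token_budget=4000, per_adapter_budget=2000)
    print([(a, r["id"]) for a, r in picked])
    # -> [('adapter-hot', 1), ('adapter-cold', 3)]: the per-adapter cap stops request 2
    #    from blocking the cold adapter's head-of-line request.
```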

4. Dynamic Scheduling Workflows

A typical online serving workflow arranges incoming requests into a queue, immediately refines prompts for cache reuse, holds requests until scheduler-delay limits are met, batches requests subject to max-token and max-seq constraints, performs prefill with maximal cache reuse, and finally admits newly parsed KV blocks into an eviction-managed cache (Wang et al., 11 Jul 2025).
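
The batching step of such a workflow, with the max-token, max-seq, and scheduler-delay knobs, might look like the following sketch (field names and the FIFO-eligibility rule are illustrative, not a specific engine's scheduler):

```python
import time

def form_batch(queue, max_batched_tokens, max_num_seqs, scheduler_delay_s, now=None):
    """Form one inference batch from a FIFO queue of requests.

    A request becomes eligible once it has waited at least `scheduler_delay_s` (letting
    short waits accumulate larger, cache-friendlier batches). The batch respects both
    the max-batched-tokens and max-num-seqs limits."""
    now = time.monotonic() if now is None else now
    batch, tokens = [], 0
    for req in list(queue):
        if now - req["enqueue_time"] < scheduler_delay_s:
            break  # FIFO: later requests have waited even less
        if len(batch) >= max_num_seqs or tokens + req["prompt_tokens"] > max_batched_tokens:
            break
        batch.append(req)
        tokens += req["prompt_tokens"]
        queue.remove(req)
    return batch

if __name__ == "__main__":
    t0 = 0.0
    q = [{"id": i, "prompt_tokens": 700, "enqueue_time": t0} for i in range(5)]
    print([r["id"] for r in form_batch(q, max_batched_tokens=2048, max_num_seqs=8,
                                       scheduler_delay_s=0.05, now=t0 + 0.1)])
    # -> [0, 1]: a third request would exceed the 2048-token limit.
```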

At the multi-instance or edge-cloud scale, dynamic rescheduling (Llumnix) coordinates live migration of requests and associated KV-cache across model instances to rebalance load, mitigate fragmentation, and prioritize latency-sensitive requests. This context switching lowers P99 prefill latency by up to 15× and yields 36% cost savings in production clusters (Sun et al., 5 Jun 2024).
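
A toy version of the rebalancing decision (a generic heuristic, not Llumnix's migration protocol) picks an overloaded source, an underloaded destination, and a non-latency-sensitive request whose KV footprint relieves the most pressure:

```python
def plan_migration(instances, high_water=0.9, low_water=0.6):
    """Return (request, src, dst) to migrate, or None if the cluster is balanced.

    instances: dicts with 'kv_used', 'kv_capacity', and 'requests' (each request carries
    its current KV footprint in tokens and an optional latency-sensitivity flag)."""
    def load(inst):
        return inst["kv_used"] / inst["kv_capacity"]

    src = max(instances, key=load)
    dst = min(instances, key=load)
    if load(src) < high_water or load(dst) > low_water:
        return None  # no pressure imbalance worth the migration cost
    movable = [r for r in src["requests"]
               if not r.get("latency_sensitive")
               and r["kv_tokens"] <= dst["kv_capacity"] - dst["kv_used"]]
    if not movable:
        return None
    # Move the request that frees the most memory on the overloaded instance.
    victim = max(movable, key=lambda r: r["kv_tokens"])
    return victim, src, dst

if __name__ == "__main__":
    cluster = [
        {"name": "i0", "kv_used": 15000, "kv_capacity": 16000,
         "requests": [{"id": "r1", "kv_tokens": 6000},
                      {"id": "r2", "kv_tokens": 2000, "latency_sensitive": True}]},
        {"name": "i1", "kv_used": 4000, "kv_capacity": 16000, "requests": []},
    ]
    req, src, dst = plan_migration(cluster)
    print(req["id"], "from", src["name"], "to", dst["name"])  # -> r1 from i0 to i1
```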

Cluster-wide two-layer frameworks (SynergySched) employ structurally-informed, online performance models at both engine and cluster routers, predicting batch latency, memory capacity, and cache affinity, to proactively orchestrate routing and batching. This closes information gaps, improving SLO attainment by 43% and delivering up to 3× throughput speedup under high load and heterogeneity (Zhang et al., 27 Sep 2025).
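
The router side of such a two-layer design can be caricatured as a scoring function over per-engine predictions of step latency, free KV memory, and prefix-cache affinity; the weights and fields below are placeholders rather than SynergySched's learned models.

```python
def route(request, engines, w_latency=1.0, w_memory=0.5, w_affinity=2.0):
    """Score each engine with a simple structural model and route to the best one.

    engines: dicts with the predicted per-step latency if this request joins the batch,
    free and total KV-cache tokens, and the fraction of the request's prompt that would
    hit that engine's prefix cache."""
    def score(engine):
        mem_headroom = engine["free_kv_tokens"] - request["prompt_tokens"]
        if mem_headroom < 0:
            return float("-inf")  # would not fit: never route here
        return (w_affinity * engine["prefix_hit_fraction"]
                + w_memory * mem_headroom / engine["total_kv_tokens"]
                - w_latency * engine["predicted_step_latency_s"])
    return max(engines, key=score)

if __name__ == "__main__":
    engines = [
        {"name": "gpu-0", "predicted_step_latency_s": 0.040, "free_kv_tokens": 3000,
         "total_kv_tokens": 16000, "prefix_hit_fraction": 0.1},
        {"name": "gpu-1", "predicted_step_latency_s": 0.055, "free_kv_tokens": 9000,
         "total_kv_tokens": 16000, "prefix_hit_fraction": 0.7},
    ]
    print(route({"prompt_tokens": 2000}, engines)["name"])  # -> 'gpu-1' (strong cache affinity)
```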

5. Empirical Results and Performance Benchmarks

Across diverse benchmarks and real-world traces, online scheduling methods consistently outperform FCFS and static heuristics:

  • InferLog net $p_{95}$ latency reduction: 43% compared to baseline; throughput increase: 2.14×; prefill time halved (Wang et al., 11 Jul 2025).
  • NEO/Hybrid: throughput gains range from 14% to 750% depending on GPU class; APEX achieves 49%–96% gains on memory-constrained hardware while preserving end-to-end latency (Jiang et al., 2 Nov 2024, Fan et al., 3 Jun 2025).
  • Fluid-guided WAIT/Nested-WAIT: 5%–25% throughput improvements at minor latency cost (Ao et al., 15 Apr 2025).
  • Learning-to-rank schedulers (Efficient LLM Scheduling, PARS): 30–40% latency cuts, up to 6.5× throughput increase; cross-model generalization remains strong (Fu et al., 28 Aug 2024, Tao et al., 25 Sep 2025).
  • Chameleon: up to 80% improvement in P99 TTFT under many-adapter workloads (Iliakopoulou et al., 24 Nov 2024).
  • Llumnix: up to 15× reduction in prefill latency for bursty and heterogeneous workloads (Sun et al., 5 Jun 2024).
  • SLS/SmartLLMs: 198% average performance gain, 63% reduction in processing time for multi-model job assignment with adaptive caching and cost-aware scheduling (Liu et al., 5 Aug 2025).
  • SLICE (edge): up to 35× increase in SLO attainment and multi-fold speedups in average completion time for heterogeneous real-time workloads (Zhou et al., 21 Oct 2025).

A summary of performance improvements:

| Framework | Tail Latency Reduction | Throughput Increase |
|---|---|---|
| InferLog | 43% (p95 latency) | 2.14× |
| NEO/Hybrid | 14–750% | 1.2–7.5× |
| Chameleon | 80.7% (P99 TTFT) | 1.5–1.9× |
| SynergySched | 43% (SLO attainment) | up to 3× |
| Llumnix | up to 15× (prefill) | up to 36% cost saved |
| SLICE | up to 35× (SLO attainment) | 3.4× (completion time) |

6. Trade-offs, Fairness, and Future Directions

Trade-offs between latency, throughput, memory usage, and fairness are central. Prefix reuse and large batches can harm tail latency if not counterbalanced by fairness guards (e.g. FCFS resets or starvation-aware promotions). Virtual-time fair queuing (Justitia) achieves provable long-term fairness guarantees, bounding application completion time within a small additive delay of the ideal fluid schedule (Yang et al., 19 Oct 2025).
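
A minimal virtual-time sketch (generic weighted fair queuing, not Justitia's exact algorithm) captures the mechanism: each application's virtual time advances by tokens served divided by its weight, and the scheduler always serves the application with the smallest virtual time.

```python
import heapq

class VirtualTimeFairQueue:
    """Serve the application with the smallest virtual time; virtual time advances by
    tokens_served / weight, so long-run service is proportional to the weights."""

    def __init__(self, weights):
        self.vtime = {app: 0.0 for app in weights}
        self.weights = dict(weights)
        self.heap = [(0.0, app) for app in weights]
        heapq.heapify(self.heap)

    def next_app(self):
        return self.heap[0][1]

    def account(self, app, tokens_served):
        self.vtime[app] += tokens_served / self.weights[app]
        # Lazy heap update: push the new key; stale entries are skipped on pop.
        heapq.heappush(self.heap, (self.vtime[app], app))
        while self.heap[0][0] != self.vtime[self.heap[0][1]]:
            heapq.heappop(self.heap)

if __name__ == "__main__":
    fq = VirtualTimeFairQueue({"chat": 2.0, "batch": 1.0})
    served = {"chat": 0, "batch": 0}
    for _ in range(300):
        app = fq.next_app()
        served[app] += 100           # pretend each turn serves 100 tokens
        fq.account(app, 100)
    print(served)                    # roughly 2:1 in favor of 'chat'
```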

Multi-objective policies (LeMix) balance throughput, SLO adherence, and model quality in unified scoring models, demonstrating that hierarchical assignment and dynamic adjustment are necessary for shared inference and training workloads in unified training-serving systems (Li et al., 28 Jul 2025).
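
In its simplest form, such multi-objective balancing reduces to a weighted utility over candidate placements; the weights and the quality proxy below are purely illustrative and are not LeMix's scoring model.

```python
def assignment_score(option, w_throughput=1.0, w_slo=2.0, w_quality=1.5):
    """Score a candidate placement of a job (inference or training step) on a node.

    option: expected tokens/s, probability of meeting the latency SLO, and a
    normalized model-quality estimate (e.g., when choosing among model variants)."""
    return (w_throughput * option["tokens_per_s"] / 1000.0
            + w_slo * option["slo_attain_prob"]
            + w_quality * option["quality"])

def choose_assignment(options):
    return max(options, key=assignment_score)

if __name__ == "__main__":
    options = [
        {"node": "A", "tokens_per_s": 900, "slo_attain_prob": 0.99, "quality": 0.80},
        {"node": "B", "tokens_per_s": 1400, "slo_attain_prob": 0.70, "quality": 0.80},
    ]
    print(choose_assignment(options)["node"])  # -> 'A': SLO adherence outweighs raw throughput here
```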

Emergent research directions continue along these lines, extending fairness guarantees and multi-objective scheduling to unified training-serving deployments.

7. Technical Best Practices and Implementation Insights

In sum, online scheduling for LLM inference represents a rapidly evolving intersection of systems, scheduling theory, and ML-optimized resource management. It is foundational for scaling high-concurrency applications and production AI services under strict latency, memory, and QoS constraints; the synthesis of meta-learning, cache-aware prompt management, and adaptive resource partitioning is now demonstrably critical for real-world deployments.
