
Adaptively Robust LLM Inference Optimization under Prediction Uncertainty (2508.14544v1)

Published 20 Aug 2025 in cs.LG, cs.AI, and math.OC

Abstract: We study the problem of optimizing LLM inference scheduling to minimize total latency. LLM inference is an online, multi-task, and heavily energy-consuming service process in which a pre-trained LLM processes input requests and generates output tokens sequentially. It is therefore vital to improve scheduling efficiency and reduce power consumption as large volumes of prompt requests arrive. A key challenge in LLM inference scheduling is that while the prompt length is known upon arrival, the output length, which critically impacts memory usage and processing time, is unknown. To address this uncertainty, we propose algorithms that leverage machine learning to predict output lengths, assuming the prediction provides an interval classification (min-max range) for each request. We first design a conservative algorithm, $\mathcal{A}_{\max}$, which schedules requests based on the upper bound of predicted output lengths to prevent memory overflow. However, this approach is overly conservative: as prediction accuracy decreases, performance degrades significantly due to potential overestimation. To overcome this limitation, we propose $\mathcal{A}_{\min}$, an adaptive algorithm that initially treats the predicted lower bound as the output length and dynamically refines this estimate during inference. We prove that $\mathcal{A}_{\min}$ achieves a log-scale competitive ratio. Through numerical simulations, we demonstrate that $\mathcal{A}_{\min}$ often performs nearly as well as the hindsight scheduler, highlighting both its efficiency and robustness in practical scenarios. Moreover, $\mathcal{A}_{\min}$ relies solely on the lower bound of the prediction interval, an advantageous design choice since upper bounds on output length are typically more challenging to predict accurately.


Summary

  • The paper introduces an adaptive algorithm, A_min, that leverages lower-bound estimates to optimize LLM inference latency while adhering to GPU memory limits.
  • It demonstrates that A_min achieves a competitive ratio of O(log(1/α)) and outperforms naive conservative methods, especially under wide prediction intervals.
  • Empirical evaluations on the LMSYS-Chat-1M dataset show that A_min closely approximates hindsight-optimal performance even with significant output uncertainty.

Adaptively Robust LLM Inference Optimization under Prediction Uncertainty

Introduction and Problem Formulation

The paper addresses the problem of online scheduling for LLM inference under output length uncertainty, with the objective of minimizing total end-to-end latency subject to GPU memory (KV cache) constraints. In LLM serving, each request's prompt length is known at arrival, but the output length—directly impacting both memory usage and processing time—is unknown and must be predicted. The scheduling challenge is compounded by the need to batch requests efficiently while avoiding memory overflows and minimizing latency, especially as LLM inference is both compute- and energy-intensive.

The authors formalize the scheduling problem as a semi-online, resource-constrained batch scheduling task. Each request $i$ has a known prompt size $s_i$ and an unknown output length $o_i \in [\ell, u]$, where $[\ell, u]$ is a prediction interval provided by a machine learning model. The system processes requests in batches, with the total memory usage at any time $t$ constrained by the available KV cache $M$. The performance metric is the total end-to-end latency (TEL), defined as the sum of completion times for all requests.
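To make the formulation concrete, the following minimal Python sketch models requests and the KV-cache accounting implied above. The field names, the one-slot-per-token memory model, and the assumption that all requests arrive at time zero are illustrative choices, not taken from the paper's code.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int           # s_i, known on arrival
    true_output_len: int      # o_i, unknown to the scheduler
    lo: int                   # predicted lower bound (ell)
    hi: int                   # predicted upper bound (u)
    generated: int = 0        # output tokens decoded so far
    finish_time: int = -1     # completion time, filled in by the scheduler

def memory_used(batch):
    """KV-cache slots held by a batch: prompt tokens plus tokens generated so far."""
    return sum(r.prompt_len + r.generated for r in batch)

def total_end_to_end_latency(requests):
    """TEL: sum of completion times over all requests (arrival at t = 0 assumed)."""
    return sum(r.finish_time for r in requests)
```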

Baseline and Naive Algorithms

The paper first reviews the hindsight-optimal algorithm (H-SF), which assumes perfect knowledge of all $o_i$ and greedily batches the shortest jobs first, achieving the minimum possible TEL. The competitive ratio (CR) is used to evaluate online algorithms relative to H-SF.
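As a point of reference, here is a rough sketch of a shortest-first hindsight scheduler in the same style as the `Request` model above (and reusing its helpers). The exact batching rule in the paper may differ; treat the admission check and the one-token-per-iteration loop as simplifying assumptions, and assume every request fits in $M$ on its own.

```python
def hindsight_shortest_first(requests, M):
    """Hindsight baseline sketch: with every o_i known, process shortest jobs first,
    admitting a job only if the batch fits in memory M even when all admitted jobs
    run to completion. Each loop iteration models one decoding step."""
    pending = sorted(requests, key=lambda r: r.true_output_len)
    running, t = [], 0
    while pending or running:
        while pending:
            cand = pending[0]
            peak = sum(r.prompt_len + r.true_output_len for r in running + [cand])
            if peak <= M:
                running.append(pending.pop(0))
            else:
                break
        t += 1                                    # one decoding iteration
        for r in running:
            r.generated += 1
        for r in [r for r in running if r.generated >= r.true_output_len]:
            r.finish_time = t
            running.remove(r)
    return total_end_to_end_latency(requests)
```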

A naive conservative algorithm, $\mathcal{A}_{\max}$, is introduced, which pessimistically assumes every request has output length $u$ (the upper bound of the prediction interval). This approach guarantees no memory overflows but is highly conservative, leading to poor memory utilization and increased latency as the prediction interval widens. Theoretical analysis shows that the competitive ratio of $\mathcal{A}_{\max}$ is upper bounded by $\frac{\alpha^{-1}(1+\alpha^{-1})}{2}$ and lower bounded by $\frac{\alpha^{-1}(1+\alpha^{-1/2})}{2}$, where $\alpha = \ell/u$.
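The defining feature of $\mathcal{A}_{\max}$ is its admission rule. A one-function sketch (hypothetical helper name, reusing the `Request` fields from the sketch above) shows why wide intervals hurt it: memory is reserved for $u$ tokens that may never be generated.

```python
def fits_conservatively(running, candidate, M):
    """A_max-style admission check (sketch): reserve KV-cache space as if every job,
    running or newly admitted, will generate its full upper bound hi of output tokens.
    This can never overflow, but when hi >> o_i most of the reservation is wasted."""
    reserved = sum(r.prompt_len + r.hi for r in running)
    return reserved + candidate.prompt_len + candidate.hi <= M
```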

Figure 1: Competitive ratio of $\mathcal{A}_{\min}$ as a function of the upper bound $u$, demonstrating logarithmic scaling with $u$.

Robust Adaptive Algorithm: $\mathcal{A}_{\min}$

To address the limitations of $\mathcal{A}_{\max}$, the authors propose $\mathcal{A}_{\min}$, an adaptive algorithm that leverages only the lower bound $\ell$ of the prediction interval. $\mathcal{A}_{\min}$ initializes each request's estimated output length to $\ell$ and dynamically refines this estimate as tokens are generated. If a memory overflow is imminent, the algorithm evicts jobs with the smallest accumulated lower bounds, updating their estimates accordingly. Batch formation is always based on the current lower bounds, and the algorithm never relies on the upper bound $u$.

Theoretical analysis establishes that $\mathcal{A}_{\min}$ achieves a competitive ratio of $\mathcal{O}(\log(\alpha^{-1}))$ as $M \to \infty$, a significant improvement over $\mathcal{A}_{\max}$, especially when prediction intervals are wide. The analysis leverages a Rayleigh quotient formulation and matrix spectral analysis to bound the competitive ratio logarithmically in $u$.
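The following sketch captures one reading of $\mathcal{A}_{\min}$'s main loop, reusing the `Request` model and helpers from the earlier sketch: estimates start at the lower bound, are bumped whenever a job outlives its estimate, and on an imminent overflow the jobs with the smallest current estimates are evicted and re-queued. Details such as the exact eviction accounting and how an evicted job's KV cache is rebuilt are simplified here; this is not the authors' implementation.

```python
def schedule_A_min(requests, M):
    """Adaptive lower-bound scheduler (illustrative sketch)."""
    for r in requests:
        r.estimate = r.lo                          # initial guess: predicted lower bound
    pending = sorted(requests, key=lambda r: r.lo)
    running, t = [], 0
    while pending or running:
        # Admit by current estimates only; the upper bound hi is never consulted.
        while pending:
            cand = pending[0]
            peak = sum(r.prompt_len + max(r.estimate, r.generated) for r in running + [cand])
            if peak <= M:
                running.append(pending.pop(0))
            else:
                break
        t += 1                                     # one decoding iteration
        for r in running:
            r.generated += 1
            if r.estimate <= r.generated < r.true_output_len:
                r.estimate = r.generated + 1       # the job outlived its estimate; refine it
        for r in [r for r in running if r.generated >= r.true_output_len]:
            r.finish_time = t
            running.remove(r)
        # If the batch no longer fits, evict the jobs with the smallest estimates.
        while memory_used(running) > M:
            victim = min(running, key=lambda r: r.estimate)
            running.remove(victim)
            victim.generated = 0                   # simplification: evicted work is redone
            pending.insert(0, victim)
    return total_end_to_end_latency(requests)
```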

Distributional Extensions and Algorithm Selection

The paper further analyzes $\mathcal{A}_{\min}$ under specific output length distributions:

  • Two-point distribution: For $o_i \in \{\ell, u\}$, $\mathcal{A}_{\min}$ achieves a competitive ratio $\leq 1.5$, while a specialized promote-$\ell$ policy $\mathcal{A}_\ell$ can further improve the bound to $1 + \frac{\alpha}{2(1-\alpha)}$ for $\alpha < 0.38$.
  • Geometric and linearly weighted geometric distributions: Empirical and analytical results show that $\mathcal{A}_{\min}$'s competitive ratio is uniformly bounded by $1.7$ and $1.56$, respectively.

Figure 2: Upper-bound curves for the competitive ratios under $\mathcal{D}_2$; $\mathcal{A}_\ell$ outperforms $\mathcal{A}_{\min}$ for $\alpha < 0.38$.

Figure 3: Competitive ratio as a function of parameter $q$: (Left) geometric distribution $G(p)$; (Right) linearly weighted geometric distribution $LG(p)$.

These results suggest that algorithm selection can be adapted to the empirical distribution of output lengths, switching between $\mathcal{A}_{\min}$ and $\mathcal{A}_\ell$ to optimize worst-case guarantees.
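As a toy illustration of such workload-aware switching, a serving system could compare the quoted worst-case bounds directly. The function name and the decision to apply the bounds verbatim are assumptions made here for illustration.

```python
def pick_policy(alpha, output_dist):
    """Select a scheduler from the worst-case guarantees quoted above (illustrative only).
    Figure 2 places the crossover between A_ell and A_min near alpha = 0.38 under D_2."""
    if output_dist == "two_point" and alpha < 0.38:
        return "A_ell"   # bound 1 + alpha / (2 * (1 - alpha)) beats A_min's guarantee here
    return "A_min"       # A_min stays bounded (<= 1.5 under D_2, <= 1.7 / 1.56 geometric cases)
```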

Numerical Experiments

The authors conduct extensive experiments on the LMSYS-Chat-1M dataset, simulating LLM inference scheduling under three prediction regimes: rough prediction (wide intervals), non-overlapping classification (bucketed intervals), and individualized overlapping intervals (centered on the true $o_i$). Across all settings, $\mathcal{A}_{\min}$ consistently matches or closely approaches the performance of the hindsight-optimal scheduler, even when prediction intervals are extremely wide. In contrast, $\mathcal{A}_{\max}$ only performs well when predictions are highly accurate.
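The three regimes can be mimicked with a small interval generator. The bucket width, the half-width of the centered intervals, and the function name below are illustrative choices, not the exact settings used in the paper.

```python
def make_interval(o_true, regime, width=100):
    """Build a (lo, hi) prediction interval for a request with true output length o_true."""
    if regime == "rough":
        return 1, 1000                                # one wide interval for every request
    if regime == "bucketed":
        lo = ((o_true - 1) // width) * width + 1      # non-overlapping classification buckets
        return lo, lo + width - 1
    if regime == "individual":
        return max(1, o_true - width // 2), o_true + width // 2   # interval centered on o_i
    raise ValueError(f"unknown regime: {regime}")
```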

Key empirical findings include:

  • Under rough prediction ($[1,1000]$), $\mathcal{A}_{\min}$ achieves average latency nearly identical to H-SF, while $\mathcal{A}_{\max}$ suffers from high latency due to underutilized memory.
  • As prediction accuracy improves (narrower intervals), both algorithms improve, but $\mathcal{A}_{\min}$ remains robust even with poor predictions.
  • For individualized intervals, $\mathcal{A}_{\min}$ maintains low latency across all levels of prediction uncertainty, while $\mathcal{A}_{\max}$ degrades rapidly as uncertainty increases.

Theoretical and Practical Implications

The results demonstrate that robust, adaptive scheduling algorithms can achieve near-optimal LLM inference efficiency even under substantial prediction uncertainty. The key insight is that leveraging only the lower bound of output length predictions, and dynamically updating estimates during execution, is sufficient to guarantee strong performance both in theory and in practice. The avoidance of upper bound reliance is particularly advantageous, as upper bounds are typically harder to predict accurately in real-world systems.

The theoretical framework—combining competitive analysis, memory-preserving combinatorial techniques, and Rayleigh quotient spectral analysis—provides a foundation for future work on robust online scheduling under uncertainty. The distributional analysis and adaptive algorithm selection further suggest that practical LLM serving systems can benefit from workload-aware policy switching.

Conclusion

This work advances the theory and practice of LLM inference scheduling by introducing and analyzing robust algorithms that operate effectively under prediction uncertainty. The adaptive lower-bound-based approach of $\mathcal{A}_{\min}$ achieves provably logarithmic competitive ratios and demonstrates strong empirical performance across a range of realistic scenarios. The results have direct implications for the design of scalable, efficient, and robust LLM serving systems, and open avenues for further research on learning-augmented online algorithms and resource-constrained scheduling in AI systems.
