A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

Published 6 May 2026 in cs.LG, cs.AI, and math.OC | (2605.04595v1)

Abstract: The rapid adoption of LLMs has created significant challenges for efficient inference at scale. Unlike traditional workloads, LLM inference is constrained by both computation and the memory overhead of key-value (KV) caching, which accelerates decoding but quickly exhausts GPU memory. In this paper, we introduce the first queueing-theoretic framework that explicitly incorporates both computation and GPU memory constraints into the analysis of LLM inference. Based on this framework, we derive rigorous stability and instability conditions that determine whether an LLM inference service can sustain incoming demand without unbounded queue growth. This result offers a powerful tool for system deployment, potentially addressing the core challenge of GPU provisioning. By combining an estimated request arrival rate with our derived stable service rate, operators can calculate the necessary cluster size to avoid both costly over-purchasing and performance-violating under-provisioning. We further validate our theoretical predictions through extensive experiments in real GPU production environments. Our results show that the predicted stability conditions are highly accurate, with deviations typically within 10%.


Authors (3)

Summary

The paper introduces a mathematical framework to analyze LLM inference stability under KV cache constraints using discrete-time Markov chains.
It derives critical processing rates and stability conditions based on memory growth profiles, offering guidelines for GPU provisioning and scheduling.
Empirical validations on Meta-Llama and multi-GPU setups confirm close alignment between theoretical predictions and practical performance.

Queueing-Theoretic Stability Analysis for LLM Inference with KV Cache Constraints

Introduction

The paper "A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints" (2605.04595) analyzes LLM inference from a queueing-theory perspective, explicitly incorporating the GPU memory limits imposed by key-value (KV) caching. Unlike classical ML inference workloads, LLM inference is memory-intensive because each request's KV cache grows over its lifetime: both the prompt length and the sequentially generated tokens add to GPU memory demand. The framework models the inference system as a discrete-time Markov chain and derives rigorous stability and instability conditions, with a focus on practical GPU provisioning, scheduling, and scaling for sustained operation under stochastic request arrivals.

Modeling LLM Inference with KV Cache Constraints

The model represents a single GPU worker constrained by a KV cache memory threshold $M$ (measured in tokens, not bytes), serving prompt requests that arrive stochastically. Each request $i$ has a prompt size $s_i$ and an output length $o_i$, with the pair $(s_i, o_i)$ sampled independently across requests from a joint distribution $p(s,o)$. LLM inference is decomposed into two phases:

Prompt Phase: The input is split into chunks of $\hat{s}$ tokens that are processed sequentially, so the memory held grows in proportion to the chunk index.
Decode Phase: Tokens are generated one at a time. Each new token increases memory by one unit, peaking at $s_i + o_i$ before the KV cache is released.
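
The two phases above can be made concrete with a minimal sketch of one request's memory trajectory. The function names `memory_trajectory` and `memory_time` are hypothetical, and the exact step accounting (one step per prompt chunk, one step per decoded token) is an assumption for illustration rather than the paper's definition.

```python
from math import ceil


def memory_trajectory(s: int, o: int, chunk: int) -> list[int]:
    """KV cache tokens held by one request after each processing step.

    s     -- prompt size in tokens
    o     -- output length in tokens
    chunk -- prompt chunk size (the summary's s-hat)
    Assumption: one step per prompt chunk, then one step per decoded token.
    """
    if s <= 0 or o < 0 or chunk <= 0:
        raise ValueError("s and chunk must be positive, o non-negative")
    trajectory = []
    # Prompt phase: memory grows with the chunk index, capped at s.
    for k in range(1, ceil(s / chunk) + 1):
        trajectory.append(min(k * chunk, s))
    # Decode phase: each generated token adds one unit, peaking at s + o.
    for t in range(1, o + 1):
        trajectory.append(s + t)
    return trajectory


def memory_time(s: int, o: int, chunk: int) -> int:
    """Total memory held summed over all steps (token-steps)."""
    return sum(memory_trajectory(s, o, chunk))
```

For example, `memory_trajectory(5, 3, 2)` walks through three prompt chunks and three decode steps, and its maximum equals `s + o`.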

Batching constraints enforce that aggregate memory usage never exceeds $M$. Scheduling is work-conserving; both first-come-first-served (FCFS) and shortest-job-first (SJF) policies are permissible. Processing time is normalized per batch, enabling discrete-slot analysis with arrival rate $\lambda$ (requests/slot) and mean batch duration $\bar{b}$.
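
As an illustration of the batching constraint, the sketch below admits requests in FCFS order while the memory budget $M$ holds. It makes a conservative assumption that is not stated in the source: each admitted request reserves its peak footprint $s_i + o_i$ up front. The function name `admit_fcfs` is hypothetical.

```python
from collections import deque


def admit_fcfs(queue: deque, reserved: int, M: int) -> list[tuple[int, int]]:
    """Admit requests from the head of the queue while memory allows.

    queue    -- deque of (s, o) pairs in arrival order
    reserved -- tokens already reserved by requests in the running batch
    M        -- KV cache threshold in tokens
    Assumption: a request reserves its peak memory s + o on admission.
    """
    admitted = []
    used = reserved
    while queue:
        s, o = queue[0]
        peak = s + o
        if used + peak > M:
            break  # FCFS: never skip past the head of the queue
        queue.popleft()
        admitted.append((s, o))
        used += peak
    return admitted
```

Peak reservation avoids evicting a request mid-decode at the cost of some unused memory; a policy that tracks the actual per-step footprint would pack batches more tightly.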

Rigorous Stability and Instability Conditions

The system's stability is formalized through positive recurrence of the underlying Markov chain. The paper derives a critical processing rate in closed form, built on a term that quantifies the total memory a request consumes across its prompt and decode phases. Stability requires the arrival rate $\lambda$ to stay below this critical rate, with a correction that reserves headroom for the largest single request. Conversely, when $\lambda$ exceeds the critical rate, the backlog grows without bound and the service is overloaded. The proof constructs a Lyapunov function over total outstanding memory demand and shows strictly negative drift outside a compact set under work-conserving policies.
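
For orientation, the drift argument follows the pattern of the standard Foster-Lyapunov criterion, stated here in its generic textbook form. The symbols $X_t$, $V$, $C$, and $\varepsilon$ are introduced for illustration only and are not taken from the paper.

```latex
% Generic Foster-Lyapunov drift criterion (textbook form, not the paper's exact statement).
% X_t : state of an irreducible discrete-time Markov chain on a countable state space
% V   : nonnegative Lyapunov function, e.g. total outstanding memory demand
% C   : finite set of states;  \varepsilon > 0 : drift margin
\begin{align*}
\mathbb{E}\big[V(X_{t+1}) - V(X_t) \,\big|\, X_t = x\big] &\le -\varepsilon, && x \notin C, \\
\mathbb{E}\big[V(X_{t+1}) \,\big|\, X_t = x\big] &< \infty, && x \in C.
\end{align*}
```

When both conditions hold, the chain is positive recurrent, which is the notion of stability used here.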

Empirical Validation and Numerical Results

Validation is conducted on single- and multi-GPU setups using Meta-Llama-3-8B served with vLLM, over diverse joint prompt/decode length regimes. Theoretical predictions of the critical processing rate align closely with empirical GPU service rates: the Gap Absolute Percentage (GAP) error stays below 10%, even under time-varying distributions and heavy-tailed workload profiles.
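
A minimal sketch of how such a gap could be computed is shown below. The summary does not give the exact GAP formula, so normalizing by the measured rate is an assumption, and `gap_percent` is a hypothetical helper.

```python
def gap_percent(predicted: float, measured: float) -> float:
    """Absolute percentage gap between a predicted and a measured rate.

    Assumption: the gap is normalized by the measured value.
    """
    if measured <= 0:
        raise ValueError("measured rate must be positive")
    return abs(predicted - measured) / measured * 100.0


def within_tolerance(predicted: float, measured: float, tol: float = 10.0) -> bool:
    """True when the gap stays below the given percentage tolerance."""
    return gap_percent(predicted, measured) < tol
```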

The cumulative distribution function of batch processing times for requests with a uniform 1:1 prefill/decode ratio illustrates the near-constant per-batch cost in real deployments:

Figure 1: Cumulative Distribution Function for batch execution time with prefill/decode ratio 1:1 requests.

Queue growth dynamics under different arrival rates are visualized: the system is stable when the arrival rate $\lambda$ is below the critical processing rate, exhibiting bounded queue sizes; when $\lambda$ exceeds it, queue length grows linearly, confirming overload.
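
The qualitative behavior can be reproduced with a toy discrete-slot simulation. This is a sketch under strong assumptions that are not from the paper: identical requests, Poisson arrivals, FCFS admission with peak-memory reservation, and one processing step per slot. The names `poisson` and `simulate`, and all default parameter values, are illustrative.

```python
import random
from math import ceil, exp


def poisson(rng: random.Random, lam: float) -> int:
    """Draw a Poisson(lam) sample with Knuth's multiplication method."""
    limit, k, p = exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate(lam: float, M: int = 100, s: int = 5, o: int = 5,
             chunk: int = 5, slots: int = 2000, seed: int = 0) -> list[int]:
    """Return the number of waiting requests at the end of each slot."""
    rng = random.Random(seed)
    steps = ceil(s / chunk) + o   # slots a request spends in service
    peak = s + o                  # memory reserved per admitted request
    waiting = 0
    active: list[int] = []        # remaining steps of each running request
    history = []
    for _ in range(slots):
        waiting += poisson(rng, lam)
        # Work-conserving FCFS admission while the reservation fits in M.
        while waiting > 0 and (len(active) + 1) * peak <= M:
            active.append(steps)
            waiting -= 1
        # Process one step of every running request; drop finished ones.
        active = [r - 1 for r in active if r > 1]
        history.append(waiting)
    return history
```

With these defaults the toy server completes roughly `(M // peak) / steps` requests per slot, so an arrival rate well below that value keeps the queue short, while one well above it makes the queue grow roughly linearly in the number of slots.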

Figure 2: Number of waiting requests in the queue during the time horizon; left: overloaded (arrival rate above the critical rate), right: stable (arrival rate below the critical rate).

Marginal and joint probability density functions from the LongBench v2 dataset demonstrate the importance of modeling real input-output distributions in capacity planning.

Figure 3: Marginal probability density function of prefill length ($s$) from LongBench v2.

Multi-GPU cluster experiments corroborate the theoretical approach's scalability. For eight parallel GPUs, predicted aggregate service rates closely match empirical measurements, with a 3.38% gap, under uniform load balancing.

Practical and Theoretical Implications

The presented framework enables quantitative GPU provisioning: given an estimated arrival rate $\lambda$ and the derived stable service rate of the system, operators can calculate the cluster size needed to run at a chosen target utilization. This avoids both over-provisioning (idle resources) and under-provisioning (queue instability). The close agreement between theoretical and empirical rates across scheduling, memory, and workload regimes supports practical adoption in production LLM inference services.
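
A back-of-the-envelope version of this calculation is sketched below. It assumes uniform load balancing across identical GPUs, so that the cluster's sustainable rate is the per-GPU rate times the GPU count; the function `gpus_needed` and its parameter names are hypothetical, not the paper's notation.

```python
from math import ceil


def gpus_needed(arrival_rate: float, per_gpu_rate: float,
                target_utilization: float = 0.8) -> int:
    """Smallest GPU count keeping per-GPU load at or below the target.

    arrival_rate       -- estimated request arrival rate (requests/slot)
    per_gpu_rate       -- stable service rate of one GPU (requests/slot)
    target_utilization -- fraction of the stable rate to actually use
    Assumption: identical GPUs with arrivals split evenly among them.
    """
    if arrival_rate < 0 or per_gpu_rate <= 0:
        raise ValueError("rates must be non-negative / positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return ceil(arrival_rate / (target_utilization * per_gpu_rate))
```

Choosing a target utilization strictly below 1 leaves a safety margin for estimation error in the arrival rate and for the prediction gap reported in the experiments.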

Architecturally, the framework generalizes to tensor parallelism by treating the parallel group as one logical worker. Further research is required for pipeline parallelism and prefill-decode disaggregation, where tandem or networked queues with separate constraints emerge. Because batch processing times can be heavy-tailed, the mean batch duration $\bar{b}$ should be estimated with a trimmed mean rather than a plain average.
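
A trimmed mean can be sketched in a few lines. The symmetric trimming and the default fraction below are assumptions for illustration; the summary does not specify which variant or trim level the authors use.

```python
def trimmed_mean(samples: list[float], trim_fraction: float = 0.05) -> float:
    """Mean after discarding the lowest and highest trim_fraction of samples.

    Assumption: symmetric trimming on both tails.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(samples)
    k = int(len(ordered) * trim_fraction)  # samples dropped from each tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)
```

A single very slow batch can dominate a plain average of batch durations, whereas the trimmed estimate stays close to the typical per-batch cost.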

Theoretically, this work connects LLM inference to stochastic bin packing and queueing networks, extending classical models by capturing the dynamic growth of per-request memory and chunked processing. It offers a closed-form operational stability criterion grounded in the actual memory trajectory of each request, filling a notable gap in the literature.

Conclusion

The paper formalizes the queueing-theoretic stability boundary for LLM inference with explicit KV cache constraints, demonstrating high predictive accuracy for practical GPU resource management. The Lyapunov-based approach yields precise guidance for system provisioning and scaling, minimizing latency and maximizing throughput without violating memory limits. Extensions to more complex queueing topologies for advanced LLM serving architectures are a promising avenue for future research.
