Eliminating Queue-Level Blocking in LLM Serving with Continuous Batching
Develop scheduling policies for large language model (LLM) inference serving systems that use continuous batching, with the goal of eliminating queue-level blocking: long-running requests should not block shorter requests waiting in the queue. Such policies would reduce average latency and improve overall throughput.
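One family of policies that addresses queue-level blocking is shortest-predicted-job-first admission into the running batch. The sketch below is a hypothetical simulation, not an implementation from the cited paper: request names, the `simulate` function, and the assumption that each decode iteration emits one token per running request are all illustrative. It shows how admitting the shortest queued request whenever a batch slot frees up keeps long-running requests from monopolizing the batch.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

@dataclass(order=True)
class Request:
    predicted_len: int              # predicted output length (scheduling key)
    seq: int                        # arrival order, breaks ties FCFS
    remaining: int = field(compare=False)
    name: str = field(compare=False)

def simulate(requests, batch_size):
    """Shortest-predicted-first continuous batching.

    At every decode iteration, free batch slots are filled with the
    shortest queued request, so long requests cannot block short ones
    (mitigating head-of-line / queue-level blocking).
    Returns {request name: completion time in decode iterations}.
    """
    ticket = count()
    queue = []
    for name, predicted in requests:
        # Assume predicted length equals true length for this sketch.
        heapq.heappush(queue, Request(predicted, next(ticket), predicted, name))
    running, t, finish = [], 0, {}
    while queue or running:
        # Continuous batching: refill free slots before each iteration.
        while queue and len(running) < batch_size:
            running.append(heapq.heappop(queue))
        t += 1  # one decode iteration: each running request emits a token
        still_running = []
        for r in running:
            r.remaining -= 1
            if r.remaining == 0:
                finish[r.name] = t
            else:
                still_running.append(r)
        running = still_running
    return finish
```

With `batch_size=2` and requests `[("long", 8), ("a", 2), ("b", 2)]`, the two short requests are admitted first and finish at iteration 2, while the long request completes at iteration 10; a strict FCFS policy would instead make `"b"` wait behind `"long"`. Real systems must of course rely on *predicted* lengths, which is exactly the uncertainty the cited paper targets.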
References
However, queue-level blocking remains an open challenge: long-running requests can still block shorter ones waiting in the queue, leading to increased average latency and reduced system throughput.
— Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions
(2604.00499 - Zheng et al., 1 Apr 2026) in Related Work — LLM Serving Systems