Eliminating Queue-Level Blocking in LLM Serving with Continuous Batching

Develop scheduling policies for large language model (LLM) inference serving systems that use continuous batching, ensuring that long-running requests do not block shorter requests waiting in the queue. Eliminating this queue-level blocking should reduce average latency and improve throughput.

Background

Many LLM serving systems use first-come-first-served (FCFS) scheduling and have adopted continuous batching, which admits new requests into the running batch as soon as slots free up, so that long-running requests do not stall an entire batch. However, even with continuous batching, short requests can still be delayed behind long requests in the waiting queue, creating queue-level blocking that harms both latency and throughput.

Queue-level blocking is identified as an open challenge in LLM serving, motivating scheduling strategies that prevent long-running requests from blocking shorter ones while maintaining system efficiency.
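To make the failure mode concrete, here is a minimal sketch (not from the cited paper; all names, slot counts, and lengths are illustrative): a toy discrete-time simulator of continuous batching in which an admission policy chooses which waiting request fills a freed batch slot. Under FCFS, two long requests monopolize both slots and a 5-step request waits 100 steps before it even starts; a shortest-first admission policy removes that queueing delay.

```python
from typing import Callable, Dict, List, Tuple

Request = Tuple[str, int]  # (request id, decode steps until completion)

def simulate(arrivals: List[Request], batch_slots: int,
             admit: Callable[[List[Request]], Request]) -> Dict[str, int]:
    """Toy discrete-time model of continuous batching.

    All requests arrive at t=0. The running batch holds at most
    `batch_slots` requests, and each step decodes one token per
    running request. Continuous batching refills a slot the moment
    it frees up, but the `admit` policy decides *which* waiting
    request gets the slot -- that choice is where queue-level
    blocking arises or is avoided.
    """
    waiting = list(arrivals)
    running: Dict[str, int] = {}
    finished: Dict[str, int] = {}  # request id -> completion time
    t = 0
    while waiting or running:
        # Refill freed slots every step (continuous batching),
        # rather than waiting for the whole batch to drain.
        while waiting and len(running) < batch_slots:
            rid, steps = admit(waiting)
            running[rid] = steps
        t += 1
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                finished[rid] = t
                del running[rid]
    return finished

def fcfs(waiting: List[Request]) -> Request:
    # First-come-first-served: admit the oldest waiting request.
    return waiting.pop(0)

def shortest_first(waiting: List[Request]) -> Request:
    # Admit the request with the smallest (predicted) output length.
    i = min(range(len(waiting)), key=lambda j: waiting[j][1])
    return waiting.pop(i)

reqs = [("long-1", 100), ("long-2", 100), ("short", 5)]
print(simulate(reqs, batch_slots=2, admit=fcfs))
# "short" finishes at t=105: it needs only 5 steps but waits
# 100 steps for a slot -- queue-level blocking.
print(simulate(reqs, batch_slots=2, admit=shortest_first))
# "short" finishes at t=5; the long requests finish at 100 and 105.
```

Real schedulers do not know output lengths in advance, which is why the cited work relies on (uncertainty-aware) output length predictions to approximate a shortest-first order; a length-aware policy also needs starvation safeguards for long requests, which this sketch omits.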

References

However, queue-level blocking remains an open challenge: long-running requests can still block shorter ones waiting in the queue, leading to increased average latency and reduced system throughput.

Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions (2604.00499 - Zheng et al., 1 Apr 2026) in Related Work — LLM Serving Systems