
Ordering Requests Under Prompt Sharing in LLM Serving

Determine optimal request-ordering policies for LLM serving systems when prompt segments are shared across requests, balancing the batching benefits of shared prefixes against prioritizing small standalone requests, so as to minimize latency and resource underutilization.


Background

Prompt sharing is common in LLM applications, enabling reuse of KV cache computations and potentially reducing prefill cost. Systems such as SGLang and HydraGen exploit shared prefixes to improve throughput, but integrating this reuse into latency-oriented schedulers is nontrivial.
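To make the potential prefill savings concrete, the following sketch estimates prefill-token counts with and without prefix reuse. It is purely illustrative; the function name, parameters, and example numbers are assumptions and do not come from the paper or from any specific serving system.

```python
# Illustrative only: estimate prefill-token savings when requests share a prefix.
# All names and numbers here are hypothetical.

def prefill_tokens(num_requests: int, shared_prefix_len: int, suffix_len: int,
                   reuse_prefix: bool) -> int:
    """Total tokens that must be prefilled for a batch of requests."""
    if reuse_prefix:
        # The shared prefix's KV cache is computed once and reused by every request.
        return shared_prefix_len + num_requests * suffix_len
    # Without sharing, each request recomputes its full prompt during prefill.
    return num_requests * (shared_prefix_len + suffix_len)

# Example: 8 requests with a 2,000-token shared context and 100-token unique suffixes.
without_reuse = prefill_tokens(8, 2000, 100, reuse_prefix=False)  # 16,800 tokens
with_reuse = prefill_tokens(8, 2000, 100, reuse_prefix=True)      #  2,800 tokens
print(f"prefill tokens: {without_reuse} -> {with_reuse}")
```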

The authors highlight a tension: always prioritizing the smallest request may miss opportunities to batch larger requests with shared contexts, suggesting that the best ordering policy remains unresolved and may need to adapt to workload and system conditions.
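One way to see this trade-off is as a tunable ordering heuristic. The sketch below is not a policy from the paper; the scoring rule, the `alpha` weight, and all identifiers are hypothetical. It discounts a request's prefill cost by the tokens it shares with an already-cached context, so different weights interpolate between shortest-prompt-first and aggressive prefix batching.

```python
# A minimal sketch of one possible ordering heuristic, not a policy from the paper.
# Each queued request is scored by its standalone prefill cost minus a bonus for
# tokens it shares with an already-cached context; lower scores are served first.
# `alpha` is a hypothetical knob trading latency against batching gains.

from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    prompt_len: int         # total prompt tokens
    shared_prefix_len: int  # tokens shared with a cached context (0 if none)

def order_requests(queue: list[Request], alpha: float = 0.5) -> list[Request]:
    """Order queued requests by effective prefill work, discounting shared tokens."""
    def effective_cost(r: Request) -> float:
        return r.prompt_len - alpha * r.shared_prefix_len
    return sorted(queue, key=effective_cost)

# alpha = 0 recovers shortest-prompt-first; alpha = 1 fully credits shared tokens,
# so a large request whose prefix is already cached can jump ahead of a small
# standalone request.
queue = [
    Request("small-standalone", prompt_len=200, shared_prefix_len=0),
    Request("large-shared", prompt_len=2100, shared_prefix_len=2000),
]
print([r.rid for r in order_requests(queue, alpha=1.0)])  # large-shared runs first
```

How such a weight should be set, and whether it should adapt online to workload and system conditions, is exactly the open question posed here.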

References

Shared prompts can reduce the cost of the prefill phase when requests sharing the same context are batched; however, it remains unclear how best to order such requests.

Queueing, Predictions, and LLMs: Challenges and Open Problems (2503.07545 - Mitzenmacher et al., 10 Mar 2025) in Section 4.2 (Adaptive Scheduling)