Including API Delay in Size Estimates for Augmented LLM Scheduling

Determine whether the expected duration of external API calls in augmented large language model (LLM) inference should be included in request size estimates used by size-based scheduling policies, and characterize the impact of each choice on scheduling effectiveness and memory usage.

Background

Augmented LLMs invoke external tools or retrieval during decoding, which pauses generation for the duration of each API call and complicates KV cache management. During a call, the system must choose among three strategies for the request's KV memory: preserve it in place, discard it and recompute it when the call returns, or swap it out and back in. Each strategy has different latency and memory implications.
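The trade-off among the three strategies can be illustrated with a toy cost model. The sketch below is not taken from the cited paper: the class `ApiCall`, the function `strategy_costs`, and both per-token constants are hypothetical placeholders, and the model assumes that preserving costs only held memory while discarding and swapping cost only added latency.

```python
from dataclasses import dataclass


@dataclass
class ApiCall:
    context_tokens: int   # tokens whose KV entries exist when the call starts
    api_delay_s: float    # expected duration of the external call, in seconds


# Hypothetical per-token costs; placeholders, not measurements.
RECOMPUTE_S_PER_TOKEN = 0.0005   # time to rebuild one token's KV entry
SWAP_S_PER_TOKEN = 0.0001        # time to move one token's KV entry one way


def strategy_costs(call: ApiCall) -> dict[str, tuple[float, float]]:
    """Map each strategy to (token-seconds of KV memory held, extra latency in s)."""
    return {
        # KV entries stay resident for the whole call; nothing to redo afterwards.
        "preserve": (call.context_tokens * call.api_delay_s, 0.0),
        # Memory is freed, but the context must be recomputed on return.
        "discard": (0.0, call.context_tokens * RECOMPUTE_S_PER_TOKEN),
        # Memory is freed, but entries are transferred out and back in.
        "swap": (0.0, 2 * call.context_tokens * SWAP_S_PER_TOKEN),
    }


if __name__ == "__main__":
    for name, (held, extra) in strategy_costs(ApiCall(1000, 2.0)).items():
        print(f"{name:9s} held={held:8.1f} token-s  extra_latency={extra:.3f} s")
```

Under these assumptions, preserving becomes more expensive as the API delay grows, while the cost of discarding or swapping depends only on context length.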

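Size-based scheduling typically orders requests by an estimate of their size. For API-augmented requests, memory cost and time cost do not scale together: a request waiting on an API call consumes wall-clock time without decoding, and whether it also occupies KV memory depends on the strategy chosen. It is therefore uncertain whether API latency should count toward the request size used for scheduling.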
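A small example shows why the choice matters for a shortest-job-first policy. This is a minimal sketch under stated assumptions: the `Request` fields, the `size` and `sjf_order` helpers, and the numeric predictions are all hypothetical, and size is taken to be a simple sum of predicted times.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    compute_s: float   # predicted prefill plus decode time, in seconds
    api_s: float       # predicted total time spent in external API calls


def size(req: Request, include_api: bool) -> float:
    """Candidate size estimate, with or without the expected API delay."""
    return req.compute_s + (req.api_s if include_api else 0.0)


def sjf_order(requests: list[Request], include_api: bool) -> list[str]:
    """Names in shortest-job-first order under the chosen size definition."""
    return [r.name for r in sorted(requests, key=lambda r: size(r, include_api))]


# Hypothetical predictions for three requests.
requests = [
    Request("A", compute_s=2.0, api_s=10.0),
    Request("B", compute_s=5.0, api_s=0.0),
    Request("C", compute_s=3.0, api_s=1.0),
]

if __name__ == "__main__":
    print("excluding API delay:", sjf_order(requests, include_api=False))
    print("including API delay:", sjf_order(requests, include_api=True))
```

With these numbers, request A is scheduled first when API delay is excluded and last when it is included, so the two definitions can produce opposite priorities for the same request.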

References

In this context, it is unclear whether the API delay should be included in the size estimate.

— Queueing, Predictions, and LLMs: Challenges and Open Problems (2503.07545 - Mitzenmacher et al., 10 Mar 2025) in Section 5.1 (Augmented LLMs)
