Including API Delay in Size Estimates for Augmented LLM Scheduling
Determine whether the expected duration of external API calls in augmented large language model (LLM) inference should be included in request size estimates used by size-based scheduling policies, and characterize the impact of each choice on scheduling effectiveness and memory usage.
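To make the two choices concrete, the following is a minimal sketch, not a method from the cited paper. It assumes a toy shortest-job-first ordering and hypothetical requests with made-up predicted compute times and API delays. The names `Request`, `size_estimate`, and `sjf_order` are illustrative only. The sketch shows only how the service order changes when the API delay is counted in the size estimate; it does not model memory usage or preemption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    compute_time: float  # hypothetical predicted inference time (arbitrary units)
    api_delay: float     # hypothetical predicted external API wait (same units)


def size_estimate(req: Request, include_api_delay: bool) -> float:
    """Size used by the scheduler: compute time, optionally plus API delay."""
    return req.compute_time + (req.api_delay if include_api_delay else 0.0)


def sjf_order(requests: list[Request], include_api_delay: bool) -> list[str]:
    """Return request names sorted by increasing size estimate (toy SJF)."""
    ranked = sorted(requests, key=lambda r: size_estimate(r, include_api_delay))
    return [r.name for r in ranked]


# Assumed example data: request A is short to compute but waits long on an API.
requests = [
    Request("A", compute_time=2.0, api_delay=10.0),
    Request("B", compute_time=5.0, api_delay=0.0),
    Request("C", compute_time=8.0, api_delay=0.0),
]

order_without = sjf_order(requests, include_api_delay=False)
order_with = sjf_order(requests, include_api_delay=True)

print("API delay excluded:", order_without)
print("API delay included:", order_with)
```

In this toy instance, request A is served first when the API delay is excluded and last when it is included, so the choice of size estimate can reverse a request's priority. Which ordering is preferable, once memory held during the API call is accounted for, is the open question.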
References
In this context, it is unclear whether the API delay should be included in the size estimate.
— Queueing, Predictions, and LLMs: Challenges and Open Problems
(2503.07545 - Mitzenmacher et al., 10 Mar 2025) in Section 5.1 (Augmented LLMs)