Cause of throughput decline with longer outputs in vLLM adapter serving
Determine the exact cause of the observed reduction in maximum throughput, and in the number of adapters that can be served, as output sequence lengths increase when serving LoRA adapters with the vLLM framework. In particular, ascertain whether preemption arising from vLLM's greedy memory allocation strategy is responsible, by disproportionately affecting longer requests.
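The hypothesized mechanism can be illustrated with a toy simulation; this is not vLLM's actual scheduler (which manages multi-token PagedAttention blocks and supports preemption by recomputation or swapping), but a deliberately simplified model in which each decoded token greedily claims one KV-cache block, admission is unbounded while any block is free, and an evicted request loses all progress. The function name `simulate` and all parameters are illustrative. Under these assumptions, short-output batches finish without preemption, while long-output batches trigger repeated evictions:

```python
from collections import deque

def simulate(total_blocks: int, num_requests: int, output_len: int) -> int:
    """Toy model: one KV-cache block per decoded token, greedy admission,
    preempt-by-recomputation (an evicted request loses all progress).
    Assumes output_len <= total_blocks so a lone request can always finish."""
    free = total_blocks
    waiting = deque(range(num_requests))
    running = {}  # rid -> [blocks_held, tokens_left]; dict order = admission order
    preemptions = 0
    while waiting or running:
        while waiting and free > 0:                # greedy admission
            running[waiting.popleft()] = [0, output_len]
        for rid in list(running):
            if rid not in running:                 # evicted earlier in this step
                continue
            while free == 0:                       # need a block for the next token:
                victim = next(reversed(running))   # evict the youngest request,
                free += running.pop(victim)[0]     # reclaim its blocks,
                waiting.append(victim)             # and requeue it (full recompute)
                preemptions += 1
                if victim == rid:
                    break
            if rid not in running:                 # this request was the victim
                continue
            free -= 1                              # allocate one block for this token
            running[rid][0] += 1
            running[rid][1] -= 1
            if running[rid][1] == 0:               # finished: release all its blocks
                free += running.pop(rid)[0]
    return preemptions

short = simulate(total_blocks=64, num_requests=16, output_len=4)
long_ = simulate(total_blocks=64, num_requests=16, output_len=32)
print(f"preemptions: short={short}, long={long_}")  # short is 0; long is positive
```

In this model, a batch of short requests fits within the block budget and completes preemption-free, whereas the same number of long requests repeatedly exhausts free blocks mid-decode and forces evictions whose recomputation cost falls on the longer sequences, which is consistent with the throughput decline hypothesized in the question above.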
References
While the exact cause of this behavior is not fully understood, we hypothesize that it is related to the preemption coming from the greedy memory allocation strategy of vLLM, which may disproportionately affect longer requests (also apparent in Figure~\ref{fig:performance_analysis-memory_overhead_full}).
— Maximizing GPU Efficiency via Optimal Adapter Caching: An Analytical Approach for Multi-Tenant LLM Serving
(2508.08343 - Agullo et al., 11 Aug 2025) in Section: Performance analysis, Subsection: Optimal placement variability