Cause of throughput decline with longer outputs in vLLM adapter serving

Determine the exact cause of the observed drop in maximum throughput and in the number of served adapters as output sequence lengths increase when serving LoRA adapters with the vLLM framework, and ascertain whether the degradation stems from preemption under vLLM’s greedy memory allocation strategy, which may disproportionately affect longer requests.

Background

In the performance analysis of adapter-serving workloads, the authors observe that increasing output sequence lengths leads to reduced maximum throughput and fewer adapters being served at the optimal placement. This effect appears alongside other overheads such as increased memory usage from adapter weights and computational overhead from mixing adapters in batches.

The paper uses vLLM’s online batching with a greedy memory allocation strategy. The authors hypothesize that preemption introduced by this strategy may disproportionately affect longer requests, but they explicitly note that the exact cause of the throughput and placement degradation with longer outputs is not fully understood.
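To make the hypothesized mechanism concrete, below is a minimal Python sketch of a greedy KV-cache block allocator. This is not vLLM's actual scheduler: the capacity (`TOTAL_BLOCKS`), block size (`BLOCK_TOKENS`), workload shape, and the newest-first recomputation-based preemption policy are all illustrative assumptions. It shows how greedy admission can force mid-generation evictions, and why longer generations, which cross more block boundaries and hold memory longer, can accumulate more preemptions.

```python
# Toy sketch, NOT vLLM's actual scheduler: a greedy KV-cache block allocator
# that admits requests while free blocks remain and preempts (by recomputation)
# the most recently admitted request when a decode step cannot get a block.
# All capacities and policies here are illustrative assumptions.

TOTAL_BLOCKS = 256   # assumed KV-cache capacity, in blocks
BLOCK_TOKENS = 16    # assumed tokens per block


class Request:
    def __init__(self, rid, prompt_len, output_len):
        self.rid = rid
        self.tokens = prompt_len            # tokens materialized so far
        self.target = prompt_len + output_len
        self.preemptions = 0

    def blocks(self):
        # Blocks currently held by this request's KV cache (ceil division).
        return -(-self.tokens // BLOCK_TOKENS)


def simulate(output_len, n_requests=64, prompt_len=128):
    waiting = [Request(i, prompt_len, output_len) for i in range(n_requests)]
    running, done = [], []
    free = TOTAL_BLOCKS

    while waiting or running:
        # Greedy admission: start queued requests while their prompt fits.
        while waiting and waiting[0].blocks() <= free:
            req = waiting.pop(0)
            free -= req.blocks()
            running.append(req)

        # One decode step: each running request appends one token and may
        # need a fresh block; on failure, evict the newest request and
        # send it back to the queue to be recomputed from its prompt.
        for req in list(running):
            need = 1 if req.tokens % BLOCK_TOKENS == 0 else 0
            while need > free and len(running) > 1:
                victim = running.pop()       # newest-first preemption
                free += victim.blocks()
                victim.tokens = prompt_len   # recompute from scratch later
                victim.preemptions += 1
                waiting.append(victim)
                if victim is req:
                    break
            if req not in running:
                continue
            free -= need
            req.tokens += 1
            if req.tokens >= req.target:
                running.remove(req)
                free += req.blocks()
                done.append(req)

    return sum(r.preemptions for r in done) / len(done)


for out_len in (64, 256, 1024):
    print(f"output_len={out_len:5d} -> "
          f"avg preemptions/request: {simulate(out_len):.2f}")
```

Under these assumptions, raising `output_len` tends to raise the average preemption count per request; each preemption discards generated tokens and forces recomputation, which would translate into the kind of throughput decline the question asks about.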

References

While the exact cause of this behavior is not fully understood, we hypothesize that it is related to the preemption coming from the greedy memory allocation strategy of vLLM, which may disproportionately affect longer requests (also apparent in the figure labeled fig:performance_analysis-memory_overhead_full).

Maximizing GPU Efficiency via Optimal Adapter Caching: An Analytical Approach for Multi-Tenant LLM Serving (2508.08343 - Agullo et al., 11 Aug 2025) in Section: Performance analysis, Subsection: Optimal placement variability