Arrow: Adaptive Scheduling Mechanisms for Disaggregated LLM Inference Architecture (2505.11916v1)

Published 17 May 2025 in cs.DC

Abstract: Existing LLM serving systems typically employ a Prefill-Decode (PD) disaggregated architecture to prevent computational interference between the prefill and decode phases. However, real-world LLM serving workloads often exhibit significant fluctuations in request input and output lengths. Under a traditional static ratio of prefill to decode nodes, these fluctuations leave the computational load imbalanced between the two node types, which prevents efficient use of computing resources and limits the system's goodput. To address this challenge, we design and implement Arrow, an adaptive scheduler that leverages stateless instances and elastic instance pools to achieve efficient adaptive request and instance scheduling. Arrow dynamically adjusts the number of instances handling prefill and decode tasks based on real-time cluster performance metrics, significantly enhancing the system's capability to handle traffic spikes and load variations. Our evaluation under diverse real-world workloads shows that Arrow achieves up to $5.62\times$ and $7.78\times$ higher request serving rates than state-of-the-art PD-colocated and PD-disaggregated serving systems, respectively.
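To make the idea of adjusting the prefill/decode split from live metrics concrete, here is a minimal sketch in Python. It is not Arrow's scheduling algorithm, which the abstract does not specify. The names `PoolState` and `rebalance`, the per-instance load metric, and the 20% imbalance threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    """Hypothetical view of an elastic pool of stateless instances."""
    prefill: int  # instances currently assigned to prefill
    decode: int   # instances currently assigned to decode


def rebalance(state: PoolState, prefill_load: float, decode_load: float,
              threshold: float = 0.2) -> PoolState:
    """Move at most one instance toward the more heavily loaded phase.

    `prefill_load` and `decode_load` are assumed aggregate load metrics
    (for example, queued tokens) for each phase. The heuristic compares
    load per instance and flips one instance when the imbalance exceeds
    `threshold`, always leaving at least one instance per phase.
    """
    per_prefill = prefill_load / state.prefill
    per_decode = decode_load / state.decode

    if per_prefill > per_decode * (1 + threshold) and state.decode > 1:
        # Prefill is the bottleneck: borrow one decode instance.
        return PoolState(state.prefill + 1, state.decode - 1)
    if per_decode > per_prefill * (1 + threshold) and state.prefill > 1:
        # Decode is the bottleneck: borrow one prefill instance.
        return PoolState(state.prefill - 1, state.decode + 1)
    return state  # Load is balanced enough; keep the current split.


# A burst of long prompts overloads two prefill instances.
print(rebalance(PoolState(prefill=2, decode=6), prefill_load=80, decode_load=60))
```

Calling `rebalance` periodically would shift capacity one instance at a time. A real scheduler would also have to account for in-flight requests and KV-cache placement, which this sketch ignores.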

Authors (9)
