Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference (2508.19559v1)

Published 27 Aug 2025 in cs.DC and cs.AI

Abstract: Serving LLMs is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces significant operational challenges, including inefficient use of heterogeneous hardware, network bottlenecks, and critical imbalances between prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving. HeteroScale combines a topology-aware scheduler that adapts to heterogeneous hardware and network constraints with a novel metric-driven policy derived from the first large-scale empirical study of autoscaling signals in production. By leveraging a single, robust metric to jointly scale prefill and decode pools, HeteroScale maintains architectural balance while ensuring efficient, adaptive resource management. Deployed in a massive production environment on tens of thousands of GPUs, HeteroScale has proven its effectiveness, increasing average GPU utilization by a significant 26.6 percentage points and saving hundreds of thousands of GPU-hours daily, all while upholding stringent service level objectives.

Collections

Summary

We haven't generated a summary for this paper yet.

Summarize Now

Paper Prompts

Explore 10 Community Prompts

Follow-up Questions

We haven't generated follow-up questions for this paper yet.

Generate Now

Authors (11)

Tweets

https://twitter.com/pstAsiatech/status/1962502639058407465

alphaXiv

Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference (23 likes, 0 questions)