
Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference (2508.19559v1)

Published 27 Aug 2025 in cs.DC and cs.AI

Abstract: Serving LLMs is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces significant operational challenges, including inefficient use of heterogeneous hardware, network bottlenecks, and critical imbalances between prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving. HeteroScale combines a topology-aware scheduler that adapts to heterogeneous hardware and network constraints with a novel metric-driven policy derived from the first large-scale empirical study of autoscaling signals in production. By leveraging a single, robust metric to jointly scale prefill and decode pools, HeteroScale maintains architectural balance while ensuring efficient, adaptive resource management. Deployed in a massive production environment on tens of thousands of GPUs, HeteroScale has proven its effectiveness, increasing average GPU utilization by a significant 26.6 percentage points and saving hundreds of thousands of GPU-hours daily, all while upholding stringent service level objectives.

Summary

  • The paper introduces HeteroScale, a coordinated autoscaling framework for disaggregated LLM inference that dynamically adjusts resource allocation via topology-aware scheduling and key metrics.
  • It employs a topology-aware scheduler and metrics-driven policy based on TPS, SM activity, and GPU utilization to optimize performance across heterogeneous hardware.
  • In production, HeteroScale increased average GPU utilization by 26.6 percentage points and significantly reduced GPU-hours, demonstrating enhanced operational efficiency.

Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference

Introduction

The paper presents HeteroScale, an innovative autoscaling framework tailored for serving LLMs in Prefill-Decode (P/D) disaggregated architectures. Traditional autoscalers face challenges in such settings due to inefficient resource usage on heterogeneous hardware, network bottlenecks, and imbalances between prefill and decode stages. HeteroScale addresses these challenges through a coordinated autoscaling approach that adapts to the specific needs of P/D disaggregation, optimizing both resource utilization and operational costs.

HeteroScale delivers substantial improvements by combining a topology-aware scheduler with a metrics-driven policy derived from an extensive empirical study of autoscaling signals in a production environment spanning tens of thousands of GPUs. The outcome includes a 26.6 percentage point increase in average GPU utilization and significant daily savings in GPU-hours (Figure 1).

Figure 1: HeteroScale System Architecture.

Background and Motivation

P/D Disaggregated Serving

P/D disaggregation separates the compute-intensive prefill phase from the memory-bound decode phase, allowing each to be provisioned and optimized independently. This separation can improve resource utilization but also introduces scaling challenges: prefill requires powerful compute resources, while decode, which generates tokens sequentially, is less demanding in raw computation but is bound by memory bandwidth and depends on high-bandwidth KV-cache transfers from the prefill stage.
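To make the split concrete, the following is a minimal sketch of the request flow under P/D disaggregation. The class names (PrefillWorker, DecodeWorker, KVCache) and the in-memory hand-off are illustrative assumptions, not the paper's serving stack; in practice the KV cache travels over a high-bandwidth interconnect between the two pools.

```python
# Conceptual sketch of P/D disaggregated serving (hypothetical names, not
# the paper's implementation). Prefill builds the KV cache for the prompt;
# decode then generates tokens one at a time against that cache.
from dataclasses import dataclass, field

@dataclass
class KVCache:
    request_id: str
    blocks: list = field(default_factory=list)  # opaque KV blocks from prefill

class PrefillWorker:
    def prefill(self, request_id: str, prompt: str) -> KVCache:
        # Compute-bound: a single forward pass over the whole prompt.
        return KVCache(request_id, [f"kv({tok})" for tok in prompt.split()])

class DecodeWorker:
    def decode(self, cache: KVCache, max_new_tokens: int) -> list[str]:
        # Memory-bandwidth-bound: one token per step, reusing the cache.
        return [f"tok{i}" for i in range(max_new_tokens)]

def serve(prompt: str, prefill_pool: list, decode_pool: list) -> list[str]:
    cache = prefill_pool[0].prefill("req-1", prompt)  # runs on compute-heavy GPUs
    # In a real deployment the cache is shipped over RDMA to the decode pool.
    return decode_pool[0].decode(cache, max_new_tokens=8)

print(serve("the quick brown fox", [PrefillWorker()], [DecodeWorker()]))
```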

Motivation and Challenges

Deployed on the Seed Serving Platform, HeteroScale confronts several challenges: conventional autoscalers that are ineffective for P/D workloads, heterogeneous resource management, complex network topology, and the difficulty of selecting reliable scaling metrics. To scale capacity effectively and manage resource distribution across different pools, the prefill and decode phases must be balanced while respecting hardware heterogeneity and network-affinity constraints (Figure 2).

Figure 2: Topological resource tree illustrating the hierarchical organization of resources from data centers (VDCs) down to individual nodes.
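A minimal sketch of such a resource tree follows; only the VDC and node levels are named in the caption, so the intermediate levels (sub-cluster, RDMA segment) are assumptions drawn from the system description, not the paper's exact hierarchy.

```python
# Hypothetical resource tree: VDC -> sub-cluster -> RDMA segment -> node.
# Intermediate level names are assumed for illustration.
tree = {
    "vdc-east": {
        "sub-cluster-1": {
            "rdma-a": ["node-001", "node-002"],
            "rdma-b": ["node-003"],
        },
    },
    "vdc-west": {
        "sub-cluster-2": {
            "rdma-c": ["node-101", "node-102"],
        },
    },
}

def nodes_in(vdc: str, tree: dict) -> list[str]:
    """Flatten all nodes reachable under one virtual data center."""
    return [node for segments in tree[vdc].values()
                 for nodes in segments.values()
                 for node in nodes]

print(nodes_in("vdc-east", tree))  # ['node-001', 'node-002', 'node-003']
```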

System Design

Architecture and Components

HeteroScale is structured into several layers: an autoscaling layer with a policy engine, a federated pre-scheduling layer, and sub-cluster scheduling layers. The design yields resource-efficient scaling decisions by examining operational metrics and aligning resource allocation with workload requirements.
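A rough sketch of how a scaling decision might flow through these layers, using hypothetical interfaces and thresholds (the paper does not publish this API):

```python
# Hypothetical three-layer decision flow: policy engine -> federated
# pre-scheduler -> sub-cluster scheduler. Names and shapes are assumptions.
from dataclasses import dataclass

@dataclass
class ScalingDecision:
    service: str
    prefill_delta: int  # instances to add (+) or remove (-)
    decode_delta: int

class PolicyEngine:
    def decide(self, service: str, metrics: dict) -> ScalingDecision:
        # Metric-driven policy; see "Scaling Policies" below for a sketch.
        need = metrics["decode_tps"] > metrics["tps_high_watermark"]
        return ScalingDecision(service, prefill_delta=1 if need else 0,
                               decode_delta=2 if need else 0)

class FederatedPreScheduler:
    def assign(self, decision: ScalingDecision, sub_clusters: list[str]) -> dict:
        # Split the delta across sub-clusters with free, compatible capacity.
        return {sub_clusters[0]: decision}

class SubClusterScheduler:
    def place(self, assignment: dict) -> None:
        for cluster, decision in assignment.items():
            print(f"{cluster}: +{decision.prefill_delta}P / +{decision.decode_delta}D")

engine, pre, sched = PolicyEngine(), FederatedPreScheduler(), SubClusterScheduler()
decision = engine.decide("chat", {"decode_tps": 9500, "tps_high_watermark": 8000})
sched.place(pre.assign(decision, ["cluster-a", "cluster-b"]))
```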

Scaling Policies

The policy engine is pivotal, facilitating both real-time and periodic scaling based on workload conditions. It employs token throughput (TPS), SM activity, and GPU utilization as its core metrics. Decode TPS was identified as a robust signal for guiding scaling decisions because it behaves consistently across diverse workloads and hardware configurations.
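A hedged sketch of a proportional, decode-TPS-driven joint scaling rule in this spirit; the per-replica TPS target, the fixed prefill-to-decode ratio, and the function name are illustrative assumptions, not values from the paper.

```python
import math

def joint_scale(decode_tps: float, target_tps_per_decode_replica: float,
                prefill_per_decode: float, min_replicas: int = 1) -> tuple[int, int]:
    """Derive prefill and decode replica counts from a single decode-TPS signal.

    Driving both pools from one metric keeps the P/D ratio balanced, which is
    the essence of the coordinated policy. All thresholds here are illustrative.
    """
    decode_replicas = max(min_replicas,
                          math.ceil(decode_tps / target_tps_per_decode_replica))
    prefill_replicas = max(min_replicas,
                           math.ceil(decode_replicas * prefill_per_decode))
    return prefill_replicas, decode_replicas

# Example: 120k tokens/s of decode load, an assumed 8k tokens/s per decode
# replica, and an assumed 1 prefill instance for every 2 decode instances.
print(joint_scale(120_000, 8_000, prefill_per_decode=0.5))  # -> (8, 15)
```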

Topology-Aware Scheduling

The Deployment Group abstraction maintains network affinity by strategically co-locating prefill and decode instances, minimizing the latency of KV-cache transfers, and it uses RDMA subgroups to make efficient use of scarce high-performance hardware.
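A simplified sketch of that placement idea, assuming a hypothetical labeling of nodes by RDMA segment; the production scheduler's constraints (hardware pools, quotas, affinity rules) are much richer.

```python
# Hypothetical placement helper: keep each prefill/decode pairing inside the
# same RDMA-connected segment so KV-cache transfers stay on the fast path.
from collections import defaultdict

def place_pairs(nodes: list[dict], pairs_needed: int) -> list[tuple[str, str]]:
    """nodes: [{"name": ..., "segment": ..., "role": "prefill" | "decode"}, ...]"""
    by_segment = defaultdict(lambda: {"prefill": [], "decode": []})
    for n in nodes:
        by_segment[n["segment"]][n["role"]].append(n["name"])

    placements = []
    for pools in by_segment.values():
        while pools["prefill"] and pools["decode"] and len(placements) < pairs_needed:
            placements.append((pools["prefill"].pop(), pools["decode"].pop()))
    return placements

nodes = [
    {"name": "p0", "segment": "rdma-a", "role": "prefill"},
    {"name": "d0", "segment": "rdma-a", "role": "decode"},
    {"name": "p1", "segment": "rdma-b", "role": "prefill"},
    {"name": "d1", "segment": "rdma-b", "role": "decode"},
]
print(place_pairs(nodes, pairs_needed=2))  # [('p0', 'd0'), ('p1', 'd1')]
```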

Evaluation

Metrics Analysis

Eight candidate scaling metrics were analyzed under simulated workload conditions. TPS emerged as the most responsive and reliable signal, offering a strong signal-to-noise ratio under realistic conditions and enabling effective autoscaling.
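As an illustration of the kind of offline comparison such a study involves (not the paper's methodology), one can score each candidate signal by how closely it tracks offered load and how noisy it is:

```python
import statistics

def signal_quality(metric: list[float], load: list[float]) -> dict:
    """Score a candidate autoscaling signal against a known offered load."""
    mean = statistics.fmean(metric)
    noise = statistics.pstdev(metric) / mean if mean else float("inf")
    corr = statistics.correlation(metric, load)  # requires Python 3.10+
    return {"noise": round(noise, 3), "load_correlation": round(corr, 3)}

load     = [10, 20, 30, 40, 50]        # synthetic offered load
tps      = [11, 19, 31, 42, 49]        # tracks load closely
gpu_util = [70, 72, 69, 71, 70]        # nearly flat: a weak scaling signal
print("tps     ", signal_quality(tps, load))
print("gpu_util", signal_quality(gpu_util, load))
```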

Production Deployment Impact

In production, HeteroScale efficiently scaled resources across tens of thousands of GPUs, significantly raising average GPU utilization and SM activity. Figure 3 highlights the performance metrics of an open-domain dialogue service under TPS-based autoscaling, showing improved stability and resource savings.


Figure 3: Production environment: Performance metrics of an open-domain dialogue service with TPS-based autoscaling.

Related Work

The paper situates HeteroScale within existing LLM serving systems and autoscaling technologies, highlighting its contributions in heterogeneous resource management and network-aware scheduling relative to frameworks such as Kubernetes HPA, Autoscale, and various deep learning schedulers.

Conclusion

HeteroScale represents a significant advance for large-scale LLM serving platforms, particularly in P/D disaggregated settings. Its topology-aware scheduling and metric-driven policy deliver substantial efficiency improvements while upholding stringent service-level objectives, positioning it as a reference point for future scalable LLM serving infrastructure. Future work will explore dynamic P/D-ratio adaptation and cache-aware scaling strategies to further improve autoscaling efficiency and robustness.
