High-Throughput LLM Inference on Heterogeneous Clusters: Technical Summary
Key Points
- The paper introduces a deployment configuration optimization that uses lightweight profiling and analytical estimation to maximize throughput.
- The proposed load-aware scheduler dynamically assigns requests based on compute capacity and KV cache usage, reducing workload imbalance.
- Empirical results show throughput improvements of up to 122.5% over traditional round-robin scheduling, validating the proposed methods.
Motivation and Problem Statement
The paper "High-Throughput LLM Inference on Heterogeneous Clusters" (arXiv:2504.15303) addresses the operational bottlenecks and configuration challenges of serving LLMs across clusters containing heterogeneous AI accelerators, including GPUs and NPUs. Deploying LLMs in real-world environments faces two core difficulties: (1) optimizing resource allocation to match the diverse computational capacities of the nodes, and (2) balancing request scheduling to fully utilize the available hardware, preventing resource underutilization and workload imbalance. The investigation is grounded in maximizing throughput (tokens processed per second), an essential metric for cost-efficient inference at scale.
Deployment Configuration Optimization
A key technical contribution is their formulation of a deployment configuration optimization problem, wherein each machine’s tensor parallelism degree is determined with respect to GPU count, memory capacity, and model footprint. The process avoids costly full throughput benchmark sweeps by instead leveraging lightweight profiling and analytical throughput estimation. The model considers both prefill and decode phases of LLM inference, capturing their respective compute-bound and bandwidth-bound characteristics. Parameters affecting token generation latency are fitted using least squares over sampled batches, and these fitted parameters inform configuration search.
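As a concrete illustration, per-step decode latency can be fitted from a handful of profiled batches. The linear model form, the sample numbers, and the function name below are illustrative assumptions, not the paper's exact formulation:

```python
# Minimal sketch: fit a decode-latency model t(b) = alpha + beta * b by
# least squares, where alpha captures fixed per-step cost and beta the
# per-request cost. The linear form is an assumption for illustration.

def fit_latency_params(batch_sizes, latencies):
    """Closed-form simple linear regression: t(b) = alpha + beta * b."""
    n = len(batch_sizes)
    mb = sum(batch_sizes) / n
    ml = sum(latencies) / n
    beta = sum((b - mb) * (t - ml) for b, t in zip(batch_sizes, latencies)) \
        / sum((b - mb) ** 2 for b in batch_sizes)
    alpha = ml - beta * mb
    return alpha, beta

# Hypothetical profiled samples: (batch size, decode-step latency in ms).
bs = [1, 2, 4, 8, 16]
lat = [10.1, 10.9, 12.2, 14.8, 20.3]
alpha, beta = fit_latency_params(bs, lat)

# Estimated decode throughput (tokens per ms) at batch size b: larger
# batches amortize the fixed cost alpha, which is what makes the
# configuration search throughput-sensitive.
def est_throughput(b):
    return b / (alpha + beta * b)
```

Once alpha and beta are fitted, the analytical estimator can rank candidate configurations without running a full benchmark sweep for each one.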
For an instance s deployed on machine i, the available memory is constrained to host both the LLM weights and the KV cache for at least one inference request, formalized as:
KVSize(s) ≥ 2 · l · h · (I_max + O_max) · b_byte
where l is the number of layers, h is the hidden size, I_max and O_max are the maximal input and output lengths, and b_byte is the byte size per parameter; the leading factor 2 accounts for the key and value tensors.
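This constraint translates directly into a feasibility check. The helper below implements the formula as stated; the function names and the example model dimensions (roughly Llama-3-8B-like, fp16) are assumptions for illustration:

```python
def kv_cache_bytes(num_layers, hidden_size, i_max, o_max, bytes_per_elem=2):
    """KV cache for one request: 2 * l * h * (I_max + O_max) * b_byte.
    The factor 2 covers the key and value tensors; bytes_per_elem=2
    assumes fp16/bf16 storage."""
    return 2 * num_layers * hidden_size * (i_max + o_max) * bytes_per_elem

def fits_on_instance(mem_free_bytes, model_bytes, num_layers, hidden_size,
                     i_max, o_max, bytes_per_elem=2):
    """Memory feasibility: weights plus KV cache for at least one
    request must fit in the instance's free memory."""
    need = model_bytes + kv_cache_bytes(num_layers, hidden_size,
                                        i_max, o_max, bytes_per_elem)
    return need <= mem_free_bytes
```

For example, with l = 32, h = 4096, and I_max = O_max = 2048 at fp16, one request's KV cache comes to 2 GiB, which an instance must hold on top of the model weights.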
Through systematic analysis of all feasible deployment configurations, each defined by a specific assignment of tensor-parallelism degree and instance count, the paper demonstrates that optimal throughput is configuration-dependent and cannot be trivially inferred from hardware specifications alone. Notably, their estimator, although based on static batching, yields consistent configuration rankings even when state-of-the-art dynamic batching engines (e.g., vLLM) are used for ground-truth throughput measurement.
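The search over feasible configurations can be sketched as an enumeration that filters by the memory constraint and ranks by estimated aggregate throughput. The candidate TP degrees, the even sharding assumption, and the estimator interface below are simplifications, not the paper's exact procedure:

```python
def enumerate_configs(gpu_count, mem_per_gpu, model_bytes,
                      kv_bytes_one_req, est_throughput):
    """Pick the best (tp_degree, instance_count) pair on one machine.

    est_throughput is a caller-supplied estimator mapping a TP degree to
    tokens/s for one instance -- an assumed interface standing in for
    the paper's profile-fitted analytical model.
    """
    best = None
    for tp in (1, 2, 4, 8):          # candidate TP degrees (assumed set)
        if gpu_count % tp:
            continue                  # must evenly partition the GPUs
        # Weights and the one-request KV cache are sharded across tp GPUs.
        per_gpu_need = (model_bytes + kv_bytes_one_req) / tp
        if per_gpu_need > mem_per_gpu:
            continue                  # infeasible: memory constraint fails
        instances = gpu_count // tp
        total = instances * est_throughput(tp)
        if best is None or total > best[2]:
            best = (tp, instances, total)
    return best
```

With a sublinear estimator (TP scaling rarely pays linearly), more small instances often beat fewer large ones whenever memory permits, which is exactly why the ranking cannot be read off the hardware specs.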
Load-Aware Request Scheduling
The paper introduces a novel scheduler designed for runtime request mapping in heterogeneous platforms. Unlike conventional round-robin policies, which induce severe workload fragmentation and bottlenecks when instance capacities diverge, the proposed scheduler incorporates both fitted compute parameters and real-time KV cache usage. Each incoming request is assigned to an instance by minimizing the maximum instance workload, calculated using:
w_rs = T_rs · e^(θ · kvusage(s))
where T_rs is the estimated processing time (prefill and decode) for request r on instance s, and kvusage(s) captures the instance's KV cache memory utilization. The parameter θ tunes the dynamic sensitivity to overload.
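A minimal sketch of this scoring and the min-max placement objective; θ = 2.0 and all concrete numbers are illustrative assumptions:

```python
import math

def workload_score(t_rs, kv_usage, theta=2.0):
    """w_rs = T_rs * exp(theta * kvusage(s)). theta=2.0 is an assumed
    value; kv_usage is the instance's KV cache utilization in [0, 1]."""
    return t_rs * math.exp(theta * kv_usage)

def pick_instance(request_times, current_loads, kv_usages, theta=2.0):
    """Assign the request to the instance that minimizes the resulting
    maximum workload across all instances (min-max objective).

    request_times maps instance -> estimated T_rs for this request,
    so a slow instance naturally carries a larger score."""
    best_s, best_peak = None, float("inf")
    for s, t_rs in request_times.items():
        w = workload_score(t_rs, kv_usages[s], theta)
        trial = dict(current_loads)
        trial[s] = trial.get(s, 0.0) + w
        peak = max(trial.values())
        if peak < best_peak:
            best_s, best_peak = s, peak
    return best_s
```

The exponential term makes a nearly full KV cache sharply inflate an instance's score, steering new requests away before it saturates.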
The workload calculator leverages output length predictors (distributional or LLM-based) to anticipate resource demands. The mapper maintains workload statistics, and each request assignment triggers atomic updates to ensure consistency in load tracking; completion hooks guarantee robust subtraction of finished workloads.
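The bookkeeping described above can be sketched as a mutex-protected tracker whose assignment returns a completion hook ("atomic" here means lock-protected); the class and method names are hypothetical:

```python
import threading

class LoadTracker:
    """Minimal sketch of load tracking: assignments and completions
    update per-instance workload under a lock, and each assignment
    returns a hook that subtracts the finished request's contribution."""

    def __init__(self, instances):
        self._lock = threading.Lock()
        self._load = {s: 0.0 for s in instances}

    def assign(self, instance, workload):
        """Record a placement; returns the completion hook."""
        with self._lock:
            self._load[instance] += workload

        def on_complete():
            # Invoked when the request finishes, so long-running
            # requests keep counting against the instance until done.
            with self._lock:
                self._load[instance] -= workload

        return on_complete

    def snapshot(self):
        """Consistent copy of current per-instance workloads."""
        with self._lock:
            return dict(self._load)
```

Returning the hook from `assign` ties each subtraction to the exact workload that was added, which keeps the accounting robust even when the output-length prediction was wrong.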
Empirical Results
Experiments with Meta-Llama-3-8B and DeepSeek-R1-Distill-Qwen-14B models deployed on clusters containing V100 and A800 GPUs substantiate the effectiveness of both configuration and request scheduling algorithms:
- The throughput ranking for deployment configurations is invariant under both estimator and actual engine measurements, validating profile-driven configuration optimization.
- The scheduler improves throughput over round-robin by up to 122.5% in two-instance scenarios with high resource disparity. Weighted round-robin matches this performance only when weights are heuristically tuned to hardware ratios, a difficult task in diverse clusters.
- On a 2-machine cluster (4×V100, 1×A800), the scheduler yields a 33.6% throughput increase compared with round-robin, indicating strong scalability and generalization.
Instance completion times under load-aware scheduling exhibit dramatically reduced variance, confirming that load balancing directly translates to improved cluster utilization and throughput.
Implications and Future Perspectives
The results indicate that throughput-optimal LLM inference in heterogeneous clusters is dominated by configuration and scheduling strategies that directly account for compute capacity and memory bandwidth heterogeneity. Profile-driven deployment avoids brute-force benchmarking, while adaptive scheduling mitigates bottlenecks induced by naive load partitioning. These design principles generalize to large, diverse industrial clusters and offer a blueprint for cost-efficient, scalable model serving.
Ongoing research should consider (a) further generalizing the instance deployment to pipeline-parallel and cross-node settings, (b) integrating advanced output length prediction via LLMs or sequence modeling, and (c) extending the workload metric to incorporate network communication and latency constraints. The interplay between memory allocation, batching strategy, and real-time predictive scheduling will remain a central axis for LLM serving optimization.
Conclusion
The paper systematically develops a deployment and runtime scheduling architecture for high-throughput LLM inference on heterogeneous clusters. Lightweight profiling coupled with analytic throughput estimation enables efficient configuration search. An adaptive, load-aware scheduler delivers robust improvements in both throughput and workload balance, empirically validated on mixed GPU environments. This work establishes foundational algorithms for future optimizations in heterogeneous LLM inference serving (2504.15303).