Disaggregated Prefill and Decoding Inference System for Large Language Model Serving on Multi-Vendor GPUs (2509.17542v1)

Published 22 Sep 2025 in cs.DC

Abstract: LLM-based applications are widely used across industries, but as model sizes grow, an efficient LLM inference system has become an urgent need for service providers. Inference is divided into two stages with different characteristics, Prefill and Decode, and the two stages interfere with each other when colocated. To address this, some researchers have proposed P-D disaggregated inference frameworks. However, current research targets homogeneous GPUs and lacks deployment solutions tailored to business scenarios. Compared with homogeneous GPUs, building inference systems with heterogeneous GPUs can further improve resource utilization and reduce costs. Even when GPUs from different vendors are used, resource utilization can be improved while reducing both costs and dependence on a single vendor. Therefore, a P-D disaggregated inference system based on heterogeneous GPUs is designed, including a heterogeneous compatible transmission module that addresses data compatibility issues across heterogeneous GPUs. A joint optimization algorithm over parallel strategies and instance-count allocation is then proposed to obtain deployment solutions. Finally, the experimental results show that the P-D disaggregated inference system effectively solves the hybrid inference problem across GPUs from different vendors, and that the joint optimization algorithm obtains the optimal deployment solution.

Summary

  • The paper introduces a disaggregated inference framework that partitions LLM serving into prefill and decode stages to optimize GPU resource utilization.
  • It demonstrates that deploying heterogeneous GPUs with specialized strengths enhances throughput, reduces latency, and efficiently manages VRAM.
  • A two-step optimization algorithm jointly configures parallel strategies and P-D ratios, validated by comprehensive simulation under varied QPS demands.

Disaggregated Prefill and Decoding Inference System for LLM Serving on Multi-Vendor GPUs

Introduction

The inference system of LLMs is pivotal for achieving rapid and accurate responses in user-facing applications. As model sizes grow, balancing computational requirements and VRAM consumption becomes increasingly challenging. This paper introduces a disaggregated inference framework that leverages heterogeneous GPU resources from multiple vendors to improve resource utilization and minimize costs.

Figure 1: System architecture.

System Design

This paper proposes a P-D disaggregated inference system that partitions inference into a prefill stage and a decode stage. Prefill produces the first token and demands significant computational power, while decode generates the subsequent tokens and is intensive in VRAM capacity and bandwidth. By deploying GPUs according to their strengths, with compute-strong GPUs serving the prefill stage and GPUs with strong memory-access capabilities handling the decode stage, the system improves efficiency and resource utilization.

Figure 2: Workflow of P-D disaggregated Inference System.
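
The request flow below is a minimal, framework-agnostic sketch of this split; the class and function names are hypothetical and not taken from the paper. A prefill instance runs the full prompt once to produce the first token and the KV cache, which is then handed to a decode instance that generates the remaining tokens one at a time.

```python
# Hypothetical sketch of a P-D disaggregated request flow, not the paper's implementation.
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_tokens: list[int]
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)


class PrefillInstance:
    """Compute-heavy stage: processes the whole prompt and emits the first token."""

    def run(self, req: Request) -> tuple[int, dict]:
        kv_cache = {"num_tokens": len(req.prompt_tokens)}  # stand-in for per-layer KV blocks
        first_token = 0                                     # stand-in for the sampled token id
        return first_token, kv_cache


class DecodeInstance:
    """Memory-heavy stage: extends the KV cache by one token per step."""

    def step(self, req: Request, kv_cache: dict) -> int:
        kv_cache["num_tokens"] += 1
        return len(req.generated)                           # stand-in for the next token id


def serve(req: Request, prefill: PrefillInstance, decode: DecodeInstance) -> list[int]:
    first_token, kv_cache = prefill.run(req)                # stage 1 on a compute-strong GPU
    req.generated.append(first_token)
    # The KV cache would be converted and transferred between vendors at this point.
    while len(req.generated) < req.max_new_tokens:          # stage 2 on a memory-strong GPU
        req.generated.append(decode.step(req, kv_cache))
    return req.generated


tokens = serve(Request(prompt_tokens=[1, 2, 3], max_new_tokens=4),
               PrefillInstance(), DecodeInstance())
assert len(tokens) == 4
```

In a real deployment, the hand-off between the two stages is exactly where the heterogeneous compatible transmission module described next comes into play.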

Heterogeneous Compatible Transmission Module

To facilitate data interchange between diverse GPU architectures, a heterogeneous compatible transmission module is designed. It addresses the disparities in VRAM management across GPU vendors by converting KV-cache layouts, aligning block sizes, and aligning parallel strategies between heterogeneous GPUs, keeping data transfers correct and efficient in asynchronous computing environments.

Figure 3: VRAM Management Alignment.

Figure 4

Figure 4: Heterogeneous Parallel Strategy Alignment.
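
As an illustration of two of the module's tasks, the sketch below converts a KV-cache block from one tensor layout to another and re-chunks blocks along the token axis so that a sender using 16-token blocks can feed a receiver that expects 8-token blocks. The shapes and layouts are assumptions for illustration, not the paper's exact formats.

```python
# Illustrative layout conversion and block-size alignment for KV-cache transfer
# between GPUs from different vendors (assumed shapes, not the paper's formats).
import numpy as np


def convert_layout(kv_block: np.ndarray) -> np.ndarray:
    """Convert a block from [num_heads, block_size, head_dim] to [block_size, num_heads, head_dim]."""
    return np.ascontiguousarray(kv_block.transpose(1, 0, 2))


def align_block_size(kv_blocks: list[np.ndarray], dst_block: int) -> list[np.ndarray]:
    """Re-chunk KV blocks along the token axis to match the receiver's block size."""
    tokens = np.concatenate(kv_blocks, axis=0)  # [total_tokens, num_heads, head_dim]
    return [tokens[i:i + dst_block] for i in range(0, tokens.shape[0], dst_block)]


# One 16-token block in [heads, tokens, dim] layout becomes two 8-token blocks
# in [tokens, heads, dim] layout before transmission to the decode-side GPU.
sender_blocks = [convert_layout(np.zeros((32, 16, 128), dtype=np.float16))]
receiver_blocks = align_block_size(sender_blocks, dst_block=8)
assert len(receiver_blocks) == 2 and receiver_blocks[0].shape == (8, 32, 128)
```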

Joint Optimization of Parallel Strategy and P-D Ratio

The optimization of deployment strategies is addressed through a two-step algorithm that calculates ideal configurations for the system's parallel strategies and the ratio of P to D instances. This method maximizes throughput while adhering to constraints on time and VRAM capacity, ensuring efficient handling of high QPS user demands.
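
A schematic reading of this two-step procedure is sketched below; it reflects our interpretation, not the authors' exact algorithm. Step one selects, per stage, the parallel strategy that maximizes single-instance throughput while meeting latency and VRAM constraints; step two searches over P and D instance counts under a total GPU budget, since the slower stage bounds the end-to-end throughput.

```python
# Schematic two-step search over parallel strategies and P-D instance counts
# (assumed structure; constants and signatures are hypothetical).
from itertools import product

LATENCY_SLO_S = 2.0   # assumed end-to-end latency budget (seconds)
VRAM_PER_GPU_GB = 64  # assumed per-GPU VRAM capacity (GB)


def best_parallel_strategy(candidates, simulate):
    """Step 1: pick the parallel strategy with the highest single-instance throughput.

    candidates: iterable of (tensor_parallel, pipeline_parallel) pairs.
    simulate:   callable returning (throughput_qps, latency_s, vram_gb) for one instance,
                e.g. backed by a performance simulator.
    """
    measured = [(simulate(c), c) for c in candidates]
    feasible = [(m, c) for m, c in measured
                if m[1] <= LATENCY_SLO_S and m[2] <= VRAM_PER_GPU_GB]
    return max(feasible, key=lambda mc: mc[0][0], default=None)


def allocate_instances(p_gpus, d_gpus, p_qps, d_qps, total_gpus, target_qps):
    """Step 2: choose prefill and decode instance counts under a GPU budget.

    p_gpus/d_gpus: GPUs consumed by one prefill / decode instance (from step 1).
    p_qps/d_qps:   per-instance throughput of each stage.
    """
    best = None
    for n_p, n_d in product(range(1, total_gpus + 1), repeat=2):
        if n_p * p_gpus + n_d * d_gpus > total_gpus:
            continue
        system_qps = min(n_p * p_qps, n_d * d_qps)  # the slower stage bounds the system
        if system_qps >= target_qps and (best is None or system_qps > best[0]):
            best = (system_qps, n_p, n_d)
    return best  # (achievable QPS, prefill instances, decode instances) or None
```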

Simulator Design

The simulator builds a comprehensive system model that integrates transformer structures, hardware features, and framework optimizations. By simulating various operational scenarios, the simulator aids in refining the joint optimization algorithm, ensuring robust performance across diverse inference contexts.

Figure 5: Simulator Model.
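
The paper's simulator equations are not reproduced here, but the flavor of such a model can be conveyed with back-of-the-envelope estimates: prefill time is roughly compute-bound, while each decode step is roughly memory-bandwidth-bound. The formulas and hardware numbers below are common rules of thumb, not values from the paper.

```python
# Rough analytical estimates of prefill and decode cost (assumed rules of thumb).
def prefill_time_s(prompt_len, params_b, peak_tflops, efficiency=0.5):
    """Compute-bound estimate: ~2 * params * tokens FLOPs for one forward pass."""
    flops = 2 * params_b * 1e9 * prompt_len
    return flops / (peak_tflops * 1e12 * efficiency)


def decode_step_time_s(params_b, kv_bytes, mem_bw_gbs, efficiency=0.7):
    """Memory-bound estimate: fp16 weights plus the KV cache are read once per token."""
    bytes_moved = params_b * 1e9 * 2 + kv_bytes
    return bytes_moved / (mem_bw_gbs * 1e9 * efficiency)


# Example: a 13B-parameter model with a 1024-token prompt on a GPU with
# 312 TFLOPS of fp16 compute and 2 TB/s of memory bandwidth (assumed figures).
print(f"prefill ~{prefill_time_s(1024, 13, 312):.2f} s")
print(f"decode  ~{decode_step_time_s(13, 2e9, 2000) * 1e3:.0f} ms/token")
```

For this assumed configuration the estimates come out to roughly 0.17 s of prefill and about 20 ms per decoded token, which illustrates why compute-strong GPUs suit the prefill stage and bandwidth-strong GPUs suit the decode stage.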

Performance Evaluation

Extensive evaluations demonstrate that the P-D disaggregated inference system significantly enhances throughput and reduces latency.

Figure 6: Influence of Different Context Lengths.

Figure 7: Influence of P-D Ratio (256+256, QPS 2).

Figure 8: Influence of P-D Ratio (1024+1024, QPS 3).

Performance metrics indicate the system's superior handling of varied context lengths and its flexibility in adjusting P-D ratios to optimize processing speed and resource utilization.

Influence of Heterogeneous P-D Strategies

Deploying heterogeneous GPU architectures from different vendors has a distinct impact on system performance: each stage can be provisioned with the hardware best suited to it, accommodating larger-scale demands without the constraint of keeping the whole system homogeneous.

Figure 9: Influence of Heterogeneous P-D (512+1024, QPS 3).

Figure 10: Influence of Heterogeneous P-D (1024+1024, QPS 2).

Conclusion

The paper provides compelling evidence that a heterogeneous P-D disaggregated inference framework offers tangible benefits in LLM deployment scenarios. It effectively leverages diverse GPU architecture capabilities to maximize throughput, reduce latency, and minimize infrastructure costs, paving the way for more efficient and adaptable LLM applications across various industries. Future work will focus on refining transmission latencies and optimizing parallel computing strategies to further enhance system capabilities.
