- The paper introduces Huawei CloudMatrix384 and CloudMatrix-Infer, achieving scalable and high-performance serving of large language and MoE models.
- It combines a peer-to-peer serving architecture, which disaggregates prefill, decode, and caching into independently scalable resource pools, with hardware-aware optimizations for the Ascend 910C to achieve efficient load balancing.
- Evaluation on the 671B-parameter DeepSeek-R1 model demonstrates higher compute efficiency (tokens/s per TFLOPS) than published H100/H800 baselines while preserving accuracy under INT8 quantization.
The paper "Serving LLMs on Huawei CloudMatrix384" (2506.12708) introduces Huawei CloudMatrix, a novel AI datacenter architecture, and CloudMatrix-Infer, an optimized LLM serving solution built upon it. The work addresses the growing challenges posed by large-scale LLMs (increasing parameters, MoE architectures, long contexts) for conventional AI infrastructure, which struggles with compute intensity, memory bandwidth, inter-chip communication, and latency requirements under dynamic workloads.
Problem: Existing AI clusters face limitations in scaling communication-intensive parallelism (like Tensor Parallelism and Expert Parallelism) across node boundaries, maintaining high utilization under diverse workloads, enabling converged execution of AI and data tasks, and delivering memory-class storage performance needed for large KV caches and model checkpoints.
Solution (Hardware): The paper presents Huawei CloudMatrix384 as the first production-grade implementation of the CloudMatrix vision. It is a supernode integrating 384 Ascend 910C NPUs and 192 Kunpeng CPUs, interconnected by an ultra-high-bandwidth, low-latency Unified Bus (UB) network. This peer-to-peer UB network allows direct all-to-all communication, enabling dynamic pooling, uniform access, and independent scaling of compute, memory, and network resources. This contrasts with conventional hierarchical designs and is particularly beneficial for communication-heavy MoE workloads and distributed KV cache access. CloudMatrix384 incorporates three network planes: the high-bandwidth UB plane for intra-supernode scale-up, an RDMA plane for inter-supernode scale-out, and a VPC plane for standard datacenter networking and storage access. The Ascend 910C NPU is a dual-die package with significant BF16/FP16/INT8 compute power and high on-package memory bandwidth.
Solution (Serving System): CloudMatrix-Infer is proposed as a comprehensive LLM serving solution for CloudMatrix384, exemplified using the DeepSeek-R1 model. Its peer-to-peer serving architecture applies Prefill-Decode-Caching (PDC) disaggregation, splitting prefill, decode, and caching into independent, scalable resource pools that exchange KV caches over the high-bandwidth UB network. This decouples scheduling from data locality, simplifying task scheduling, improving load balancing, and enhancing cache efficiency compared to KV cache-centric architectures. A minimal sketch of the resulting request flow follows below.
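A minimal sketch of that request flow, assuming hypothetical pool classes and method names (the real system schedules NPU groups and moves KV blocks over the UB network; nothing below is the paper's actual API):

```python
# Illustrative sketch of the peer-to-peer PDC (prefill-decode-caching) flow.
# Class and method names are hypothetical placeholders.

class PrefillPool:
    def run(self, prompt_tokens):
        # Placeholder: prompt processing on the prefill NPU group,
        # producing one KV-cache entry per prompt token.
        return [("kv", t) for t in prompt_tokens]

class CachePool:
    """Stand-in for the disaggregated DRAM pool reachable from any NPU."""
    def __init__(self):
        self._store = {}
    def put(self, key, kv_blocks):
        self._store[key] = kv_blocks
    def get(self, key):
        return self._store.get(key)

class DecodePool:
    def run(self, kv_blocks, max_new_tokens=4):
        # Placeholder: autoregressive generation on the decode NPU group.
        return [f"tok{i}" for i in range(max_new_tokens)]

def serve(request_id, prompt_tokens, prefill, cache, decode):
    kv = cache.get(request_id) or prefill.run(prompt_tokens)  # reuse if cached
    cache.put(request_id, kv)    # KV lands in the shared pool, not a fixed NPU
    return decode.run(kv)        # any decode NPU can pull it with uniform bandwidth

print(serve("req-1", ["The", "capital", "of"], PrefillPool(), CachePool(), DecodePool()))
```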
CloudMatrix-Infer introduces three core innovations:
- Peer-to-Peer Serving Architecture: Decouples prefill (processing input prompt), decode (autoregressive token generation), and caching (historical KV cache and model parameters) into distinct NPU/CPU groups connected via the UB network. This allows any NPU to access cached data from a disaggregated memory pool with uniform bandwidth.
- Large-Scale Expert Parallelism (LEP): Optimized for MoE models, supporting high EP degrees (e.g., EP320), where each NPU die hosts one expert for low decode latency. It leverages the UB network for efficient token dispatch and expert output combination. Novel fused communication operators (FusedDispatch, FusedCombine) exploit AIV-Direct communication, early INT8 quantization, static memory pre-allocation, and pipelined data sending to reduce MoE communication overheads (see the dispatch sketch after this list).
- Hardware-Aware Optimizations:
- MLA Optimization: A tailored MLA implementation for the Ascend 910C fuses operators (MLAProlog, FusedAttention), stores the KV cache natively in NZ format for more efficient memory access, and applies MTP-aware tiling with a BSND layout to balance load under variable sequence lengths.
- Microbatch-Based Pipelining: Asymmetric microbatch pipelines for both the prefill and decode phases overlap computation and communication. In decode, Stream 0 (the attention path) and Stream 1 (the MoE path) execute concurrently; in prefill, heterogeneous units (AICs, AIVs, SDMA) execute specialized tasks in an overlapped fashion (see the pipeline sketch after this list).
- Low-interference Prefill-Decode Transfer: KV cache transfer uses the dedicated RDMA plane. Prefill scheduling and transfer are handled asynchronously by a background thread in the decode scheduler. A model-aware connection grouping scheme balances KV cache transfer load between prefill and decode ranks.
- UB-Driven Distributed Caching (EMS): A disaggregated memory pool built from CPU DRAM across nodes, accessible via the UB network and managed by the Elastic Memory Service (EMS), provides high-performance Context Caching (KV cache reuse) and Model Caching (fast model loading and switching). This offers memory-class performance for cache access, significantly reducing load times and memory footprint compared to local caching or direct storage access (see the context-cache sketch after this list).
- INT8 Quantization: A training-free, hierarchical INT8 scheme for weights and activations using a mixed-precision strategy, adaptive scale search, outlier suppression via structural transformation, efficient INT8 GEMM kernels, and block-level clipping/error compensation to maintain accuracy while boosting efficiency (a simplified sketch follows after this list).
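For the LEP bullet above, a simplified sketch of token dispatch with early INT8 quantization, assuming per-token symmetric quantization and a Python dict in place of peer-to-peer buffers; the real FusedDispatch is a fused NPU kernel that writes via AIV-Direct into statically pre-allocated remote memory:

```python
# Simplified token dispatch for large-scale expert parallelism (e.g., EP320,
# one expert per NPU die). Quantizing activations to INT8 *before* dispatch
# shrinks the payload sent to remote experts.
import numpy as np

def quantize_int8(x: np.ndarray):
    """Per-token symmetric INT8 quantization (an illustrative assumption)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0 + 1e-8
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def fused_dispatch(tokens: np.ndarray, router_logits: np.ndarray, top_k: int = 8):
    """Group INT8-quantized tokens by destination expert."""
    q_tokens, scales = quantize_int8(tokens)                 # quantize early
    top_experts = np.argsort(-router_logits, axis=-1)[:, :top_k]
    per_expert = {}                                          # expert id -> payload
    for tok_id, experts in enumerate(top_experts):
        for e in experts:
            per_expert.setdefault(int(e), []).append(
                (tok_id, q_tokens[tok_id], scales[tok_id])   # int8 data + scale
            )
    return per_expert  # in the real system: written directly into peer NPU buffers

tokens = np.random.randn(4, 16).astype(np.float32)           # 4 tokens, dim 16
logits = np.random.randn(4, 320)                             # routing over 320 experts
sends = fused_dispatch(tokens, logits)
print({e: len(v) for e, v in list(sends.items())[:5]})
```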
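For the microbatch pipelining bullet, a toy decode-phase schedule in which two microbatches alternate between the attention and MoE paths so that one's computation overlaps the other's communication-heavy MoE work; Python threads and sleeps stand in for concurrent NPU streams:

```python
# Toy decode-phase microbatch pipeline: two microbatches alternate between an
# attention stage (Stream 0) and an MoE stage (Stream 1). Timings and stage
# bodies are placeholders, not measurements.
import time
from concurrent.futures import ThreadPoolExecutor

def attention_stage(mb):        # Stream 0: attention path
    time.sleep(0.01)
    return f"{mb}:attn"

def moe_stage(mb):              # Stream 1: dispatch -> experts -> combine
    time.sleep(0.01)
    return f"{mb}:moe"

def decode_steps(num_steps=4):
    microbatches = ["A", "B"]
    with ThreadPoolExecutor(max_workers=2) as pool:
        for step in range(num_steps):
            # While one microbatch does attention, the other does MoE.
            attn_mb = microbatches[step % 2]
            moe_mb = microbatches[(step + 1) % 2]
            f0 = pool.submit(attention_stage, attn_mb)
            f1 = pool.submit(moe_stage, moe_mb)
            print(f"step {step}: {f0.result()} || {f1.result()}")

decode_steps()
```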
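For the EMS context-caching bullet, a toy block-granular prefix cache: KV blocks are keyed by a hash of the token prefix they cover, so a new request reuses the longest cached prefix and only prefills the remainder (block size, hashing, and class names are assumptions, not the EMS API):

```python
# Toy block-granular context cache for KV reuse. Block size and hashing
# are illustrative; the real pool spans CPU DRAM across nodes over UB.
BLOCK = 4

def block_keys(tokens):
    """One key per full block, each covering the prefix up to that block."""
    return [hash(tuple(tokens[: (i + 1) * BLOCK])) for i in range(len(tokens) // BLOCK)]

class ContextCache:
    def __init__(self):
        self.blocks = {}                      # key -> KV block (placeholder payload)

    def put(self, tokens, kv_blocks):
        for key, kv in zip(block_keys(tokens), kv_blocks):
            self.blocks[key] = kv

    def longest_prefix(self, tokens):
        """Return how many prompt tokens are covered by cached KV blocks."""
        hit, kv = 0, []
        for i, key in enumerate(block_keys(tokens)):
            if key not in self.blocks:
                break
            hit, kv = (i + 1) * BLOCK, kv + [self.blocks[key]]
        return hit, kv

cache = ContextCache()
prompt = list(range(16))
cache.put(prompt, [f"kv[{i}]" for i in range(len(prompt) // BLOCK)])
hit, _ = cache.longest_prefix(prompt + [99, 100])   # shares a 16-token prefix
print(f"reused {hit} tokens; prefill only the remaining suffix")
```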
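For the INT8 quantization bullet, a minimal training-free sketch with a small grid search over the clipping ratio; the paper's hierarchical scheme adds mixed precision, structural outlier suppression, and block-level error compensation on top of this basic idea:

```python
# Minimal training-free INT8 quantization with a grid search over the clipping
# ratio, minimizing round-trip reconstruction error (a simplification of the
# paper's adaptive scale search).
import numpy as np

def quantize_int8_adaptive(w: np.ndarray, candidates=np.linspace(0.8, 1.0, 9)):
    """Pick a per-tensor clipping ratio that minimizes MSE after dequantization."""
    best_scale, best_err = None, np.inf
    max_abs = np.abs(w).max()
    for ratio in candidates:                         # adaptive scale search
        scale = max(ratio * max_abs / 127.0, 1e-12)
        q = np.clip(np.round(w / scale), -127, 127)
        err = np.mean((q * scale - w) ** 2)          # reconstruction error
        if err < best_err:
            best_scale, best_err = scale, err
    q = np.clip(np.round(w / best_scale), -127, 127).astype(np.int8)
    return q, best_scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8_adaptive(w)
print("max |error|:", np.abs(q.astype(np.float32) * s - w).max())
```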
Evaluation: The system was evaluated on a CloudMatrix384 using the 671B-parameter DeepSeek-R1 model with INT8 quantization.
- Overall Performance:
- Prefill: 6,688 tokens/s per NPU (Perfect EPLB, 4K prompt, 16K tokens per NPU per batch), i.e., 4.45 tokens/s/TFLOPS, exceeding SGLang on H100 (3.75) and DeepSeek on H800 (3.96); the tokens/s/TFLOPS normalization is illustrated in the sketch after this list. The default configuration yielded 5,655 tokens/s (3.76 tokens/s/TFLOPS).
- Decode: 1,943 tokens/s per NPU (batch size 96, 4K KV length, <50 ms TPOT), i.e., 1.29 tokens/s/TFLOPS, higher than SGLang on H100 (1.10) and DeepSeek on H800 (0.93/1.17). Under a stricter sub-15 ms TPOT target (batch size 8), it sustains 538 tokens/s per NPU.
- Accuracy: INT8 quantization on Ascend 910C maintains accuracy comparable to the official DeepSeek-R1 API across 16 benchmarks spanning English, Code, Math, and Chinese tasks.
- Ablation Study:
- Microbatch Pipeline: Improves decode throughput by 5.8%-9.4% (by overlapping computation with communication across the two streams) and prefill throughput by 23%-31% (by exploiting heterogeneous cores and SDMA).
- MTP: Enables 6%-49% higher decode throughput (with a 70% acceptance rate for 1 speculative token) despite increasing per-layer latency by ~44%, because each decode step now yields roughly 1.7 tokens on average (1 verified token plus 0.7 accepted speculative tokens) rather than one.
- Context Caching (EMS): Increases prefill throughput by up to 2.28x (at 90% reuse) and reduces TTFT by up to 59% by reusing KV cache. Accessing the cache via the high-bandwidth UB network improves throughput by up to 1.52x compared to using the slower VPC network.
- Operator Performance: CANN EP on CM384 demonstrates lower latency and higher per-rank bandwidth for MoE Dispatch and Combine operations compared to DeepEP on H800, especially for Combine. CANN MLA shows comparable compute (65.4%) and memory bandwidth (84.1%) utilization to FlashMLA on H800 in compute-intensive and memory-intensive settings, respectively. INT8 GEMM kernels show high compute utilization (77.4%-82.7%) on Ascend 910C, indicating they are compute-bound.
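The per-chip comparisons above normalize throughput by each accelerator's peak compute, and the MTP gain follows from the expected number of tokens emitted per decode step. A small worked sketch of both calculations, using a placeholder peak-TFLOPS value rather than any chip's actual spec:

```python
# Worked arithmetic behind two of the reported metrics. The 1,000-TFLOPS peak
# below is an illustrative placeholder, not an Ascend 910C or H100 figure.

def tokens_per_s_per_tflops(tokens_per_s_per_chip: float, peak_tflops: float) -> float:
    """Compute-normalized efficiency: per-chip throughput / per-chip peak compute."""
    return tokens_per_s_per_chip / peak_tflops

# Hypothetical chip with a 1,000-TFLOPS peak serving 4,000 tokens/s per chip:
print(tokens_per_s_per_tflops(4_000, 1_000))      # -> 4.0 tokens/s/TFLOPS

# MTP with one speculative token and a 70% acceptance rate: each decode step
# emits 1 verified token plus, on average, 0.7 accepted speculative tokens.
acceptance_rate = 0.70
print(1 + acceptance_rate)                        # -> ~1.7 tokens per step
```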
Future Directions:
- Further CloudMatrix evolution: unifying the VPC and RDMA planes, building larger-scale supernodes to improve resource-allocation efficiency and match future model scales, and physically disaggregating and pooling CPU resources.
- Future Serving System Enhancements: Moving towards finer-grained component-level disaggregation (e.g., Attention-Decode, Attention-MoE disaggregation) and hybrid/adaptive deployment strategies that dynamically map microservices to the most suitable heterogeneous hardware resources based on workload characteristics.
In conclusion, the paper demonstrates that the Huawei CloudMatrix architecture and the CloudMatrix-Infer serving solution provide a scalable, high-performance, and efficient platform for serving large-scale LLMs, particularly MoE models, setting a new benchmark for AI infrastructure.