CloudMatrix-Infer: Scalable LLM Inference
- CloudMatrix-Infer is an advanced LLM serving system that uses disaggregated resource pools and expert parallelism for high efficiency.
- It leverages peer-to-peer architecture, unified bus communication, and INT8 quantization to optimize throughput and reduce latency.
- Designed for models with large parameter counts and MoE architectures, it supports dynamic resource scaling and real-time load balancing for production-scale inference.
CloudMatrix-Infer is an advanced LLM serving system developed for the Huawei CloudMatrix384 datacenter architecture. It is designed to meet the demands of contemporary LLM inference, particularly for models with large parameter counts, mixture-of-experts (MoE) architectures, and extended context lengths. The system integrates innovations in serving architecture, parallelism strategy, hardware-software co-optimization, and quantization to significantly improve efficiency, scalability, and adaptability under stringent service-level objectives and variable workloads.
1. Peer-to-Peer Serving Architecture
CloudMatrix-Infer employs a peer-to-peer serving architecture that disaggregates the LLM inference workflow into three independently scalable resource clusters:
- Prefill Cluster: Handles prompt processing, building the initial key-value (KV) cache and generating the first token.
- Decode Cluster: Autoregressively generates subsequent tokens, updating and consuming the KV cache.
- Caching Cluster: Provides a distributed memory pool for both context caching (KV cache reuse) and model caching, accessible by all NPUs via the Unified Bus (UB) interconnect.
This disaggregation removes the tight coupling of request scheduling to KV cache block locality typical of previous architectures, enabling:
- Independent scaling of each subsystem in response to workload fluctuations.
- Stateless, uniform scheduling: All NPUs can access the distributed cache pool equally, supporting lightweight load balancing.
- Elastic resource allocation: The system adapts in real time to bursty or heterogeneous workload patterns, ensuring high utilization.
The Unified Bus (UB) provides ultra-high-bandwidth, low-latency all-to-all communication among the 384 Ascend 910C NPUs, so any NPU can execute any processing step or access any cached data with minimal locality-induced overhead.
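To make the scheduling model concrete, the following is a minimal Python sketch of locality-free dispatch under the assumptions above: every worker can read and write a shared KV-cache pool, so the scheduler only balances load. The class and function names (SharedKVPool, StatelessScheduler, and so on) are illustrative, not part of CloudMatrix-Infer's actual API.

```python
# Illustrative sketch (not the CloudMatrix-Infer API): with a shared KV-cache
# pool reachable from every NPU, the scheduler picks workers by load alone,
# ignoring where a request's cache blocks physically live.
from dataclasses import dataclass, field
import heapq


@dataclass
class SharedKVPool:
    """Stand-in for the UB-accessible distributed cache pool."""
    blocks: dict = field(default_factory=dict)  # request_id -> cached KV blocks

    def put(self, request_id: str, kv_blocks: list) -> None:
        self.blocks[request_id] = kv_blocks

    def get(self, request_id: str) -> list:
        return self.blocks.get(request_id, [])


class StatelessScheduler:
    """Pure load-based scheduling: any worker can serve any request,
    because KV-cache blocks live in the shared pool, not on a worker."""

    def __init__(self, prefill_workers, decode_workers, cache):
        # Min-heaps keyed by outstanding work; no per-request affinity state.
        self.prefill = [(0, w) for w in prefill_workers]
        self.decode = [(0, w) for w in decode_workers]
        heapq.heapify(self.prefill)
        heapq.heapify(self.decode)
        self.cache = cache

    def _least_loaded(self, pool, cost):
        load, worker = heapq.heappop(pool)
        heapq.heappush(pool, (load + cost, worker))
        return worker

    def submit_prefill(self, request_id, prompt_tokens):
        worker = self._least_loaded(self.prefill, cost=len(prompt_tokens))
        # The chosen worker builds KV blocks and publishes them to the pool.
        n_blocks = len(prompt_tokens) // 128 + 1
        self.cache.put(request_id, [f"{request_id}/kv-{i}" for i in range(n_blocks)])
        return worker

    def submit_decode(self, request_id):
        # Any decode worker may consume the cached blocks over the UB.
        worker = self._least_loaded(self.decode, cost=1)
        return worker, self.cache.get(request_id)


sched = StatelessScheduler(["p0", "p1"], ["d0", "d1", "d2"], SharedKVPool())
sched.submit_prefill("req-42", prompt_tokens=list(range(1024)))
worker, kv_blocks = sched.submit_decode("req-42")
print(worker, len(kv_blocks))  # any decode worker, with the cached blocks
```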
2. Large-Scale Expert Parallelism and Token Dispatch
CloudMatrix-Infer enables large-scale expert parallelism (EP320) by assigning one expert per NPU die across the 320 dies of the decode cluster. This is essential for high-capacity sparse MoE architectures such as DeepSeek-R1, which uses 256 routed experts.
Efficient "token dispatch" and output gathering for MoE layers—which require sending each token to its Top-K experts and combining their results—are achieved via UB-based communication primitives:
- FusedDispatch: Quantizes token embeddings to INT8 and uses "AIV-Direct" to write them directly into remote expert memory over the UB with negligible startup latency.
- FusedCombine: Gathers expert outputs using direct UB writes, double-buffered memory, and pipeline address generation for high throughput.
- Static memory pre-allocation: All communication/workspace buffers are allocated at initialization to eliminate runtime dynamic allocation overhead.
This approach ensures that, even at high expert parallelism degrees (EP320), communication bottlenecks are minimized and decode latency remains low. The interconnect’s characteristics (latency, bandwidth) are comparable to intra-node communication, enabling uniform and efficient scaling.
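As a rough illustration of the dispatch-and-combine pattern these primitives implement, here is a host-side NumPy analogue under simplifying assumptions: per-token top-K routing, symmetric INT8 quantization of activations before dispatch, and gate-weighted accumulation of expert outputs. It sketches only the data flow; the real FusedDispatch/FusedCombine are on-device kernels that write over the UB, and the function names here are hypothetical.

```python
# Host-side analogue of MoE token dispatch/combine (assumed simplification;
# the real FusedDispatch/FusedCombine run as UB-resident NPU kernels).
import numpy as np


def quantize_int8(x: np.ndarray):
    """Symmetric per-token INT8 quantization of activations before dispatch."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0 + 1e-8
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale


def moe_dispatch_combine(tokens, router_logits, experts, top_k=8):
    """Route each token to its top_k experts, run them, and recombine."""
    # Top-k expert selection and softmax gating over the selected logits.
    topk_idx = np.argsort(router_logits, axis=-1)[:, -top_k:]
    topk_logits = np.take_along_axis(router_logits, topk_idx, axis=-1)
    gates = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)

    # "Dispatch": quantize activations once, then route them to each expert.
    q_tokens, scales = quantize_int8(tokens)
    output = np.zeros_like(tokens)
    for e, expert_fn in enumerate(experts):
        token_ids, slot = np.where(topk_idx == e)
        if token_ids.size == 0:
            continue
        # Experts consume dequantized inputs (stand-in for on-device INT8 GEMM).
        x = q_tokens[token_ids].astype(np.float32) * scales[token_ids]
        # "Combine": weight each expert output by its gate and accumulate.
        output[token_ids] += gates[token_ids, slot][:, None] * expert_fn(x)
    return output


# Toy usage: 16 tokens, hidden size 64, 32 experts implemented as tiny linear maps.
rng = np.random.default_rng(0)
experts = [
    (lambda x, W=rng.normal(0, 0.02, (64, 64)).astype(np.float32): x @ W)
    for _ in range(32)
]
tokens = rng.normal(size=(16, 64)).astype(np.float32)
router_logits = rng.normal(size=(16, 32)).astype(np.float32)
print(moe_dispatch_combine(tokens, router_logits, experts).shape)  # (16, 64)
```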
3. Hardware-Aware Optimizations
CloudMatrix-Infer applies multiple hardware-specific optimizations targeting the Ascend 910C NPUs and the CloudMatrix384 hardware platform:
- Specialized Fused Operators: Operator fusion (RMSNorm, projections, RoPE) consolidates operations to minimize kernel launch overhead and bandwidth waste.
- Microbatch-Based Pipelining: The decode workload is split into two interleaved microbatch streams so that attention and MoE stages overlap across separate compute and communication resources, reducing per-layer latency and maximizing accelerator utilization (see the sketch after this list).
- On-the-Fly Tensor Layout Alignment: KV caches are kept in the NPUs' preferred tensor layouts, with any alignment performed on the fly, so no separate conversion pass is needed and memory-bandwidth efficiency improves.
- Component-Level Delegation: Prefill-phase operations are mapped to the hardware units best suited to them (AIC cube cores for matrix compute, AIV cores for vector ops, SDMA engines for data movement).
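To show how dual-microbatch interleaving hides MoE communication behind attention compute, here is a schematic Python serialization of the schedule. It is an assumed simplification: the stage methods are stand-ins, and on real hardware the paired calls in each slot run concurrently on separate execution streams rather than sequentially as Python executes them.

```python
# Schematic serialization of the dual-microbatch decode schedule (assumed
# simplification; stage names are stand-ins, not CloudMatrix-Infer APIs).
from dataclasses import dataclass


@dataclass
class DecodeLayer:
    """Stand-in layer exposing the two stages that the pipeline interleaves."""
    idx: int

    def attention(self, trace):
        return trace + [f"L{self.idx}:attn"]

    def moe(self, trace):
        return trace + [f"L{self.idx}:moe"]


def pipelined_decode_step(layers, mb_a, mb_b):
    """Interleave two microbatches so attention and MoE stages can overlap.

    Each 'slot' below pairs one attention call with one MoE call; on the
    hardware those two calls run on different execution streams at the
    same time, which is what hides MoE dispatch/combine latency."""
    prev_layer = None
    for layer in layers:
        # Slot 1: A's attention overlaps B's MoE from the previous layer.
        mb_a = layer.attention(mb_a)
        if prev_layer is not None:
            mb_b = prev_layer.moe(mb_b)
        # Slot 2: B's attention overlaps A's MoE of the current layer.
        mb_b = layer.attention(mb_b)
        mb_a = layer.moe(mb_a)
        prev_layer = layer
    if prev_layer is not None:
        mb_b = prev_layer.moe(mb_b)  # drain B's last pending MoE stage
    return mb_a, mb_b


a, b = pipelined_decode_step([DecodeLayer(i) for i in range(2)], [], [])
print(a)  # ['L0:attn', 'L0:moe', 'L1:attn', 'L1:moe']
print(b)  # same stage order for the second microbatch
```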
INT8 Quantization
CloudMatrix-Infer introduces a mixed-precision quantization scheme:
- The most memory- and compute-intensive modules (feed-forward and attention) run in INT8, while the remaining components stay in BF16/FP32 where higher precision is needed.
- Adaptive scaling and block-level clipping keep quantization error small.
- Outlier suppression and error compensation further reduce accuracy degradation.
- INT8 GEMM kernels are optimized to achieve 77.4–82.7% of peak NPU TFLOPS, confirming that matrix multiplication is compute-bound on Ascend 910C.
Quantization does not negatively impact accuracy, with results over 16 benchmarks confirming parity with FP16 models.
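For intuition on what block-level clipping and adaptive scaling mean in practice, the snippet below sketches a block-wise symmetric INT8 weight quantizer that tries a few clipping ratios per block and keeps the one with the lowest reconstruction error. The block size and candidate ratios are hypothetical choices for illustration, not the exact CloudMatrix-Infer recipe.

```python
# Minimal sketch of block-wise INT8 quantization with clipping search
# (illustrative; block size and clip ratios are assumed, not the paper's).
import numpy as np


def quantize_blockwise_int8(w: np.ndarray, block: int = 128, clip_ratios=(1.0, 0.95, 0.9)):
    """Quantize each block of columns with its own scale, picking the
    clipping ratio that minimizes squared reconstruction error."""
    rows, cols = w.shape
    q = np.empty_like(w, dtype=np.int8)
    scales = np.empty((rows, (cols + block - 1) // block), dtype=np.float32)
    for b, start in enumerate(range(0, cols, block)):
        blk = w[:, start:start + block]
        best_err, best_q, best_s = np.inf, None, None
        for r in clip_ratios:
            s = (np.abs(blk).max(axis=1, keepdims=True) * r) / 127.0 + 1e-12
            qb = np.clip(np.round(blk / s), -127, 127)
            err = np.square(blk - qb * s).sum()
            if err < best_err:
                best_err, best_q, best_s = err, qb, s
        q[:, start:start + block] = best_q.astype(np.int8)
        scales[:, b] = best_s[:, 0]
    return q, scales


def dequantize_blockwise(q: np.ndarray, scales: np.ndarray, block: int = 128):
    """Expand per-block scales back to full width and dequantize."""
    expanded = np.repeat(scales, block, axis=1)[:, :q.shape[1]]
    return q.astype(np.float32) * expanded


w = np.random.default_rng(0).normal(size=(256, 512)).astype(np.float32)
q, s = quantize_blockwise_int8(w)
print(np.abs(w - dequantize_blockwise(q, s)).mean())  # small reconstruction error
```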
4. Performance Metrics and Empirical Results
CloudMatrix-Infer achieves high levels of throughput, low latency, and robust hardware utilization:
Throughput and Latency
| Metric | CloudMatrix384 (INT8) |
| --- | --- |
| Prefill throughput per NPU (4K prompt) | 6,688 tokens/s |
| Prefill efficiency | 4.45 tokens/s per TFLOPS |
| Decode throughput per NPU (batch size 96) | 1,943 tokens/s |
| Decode efficiency | 1.29 tokens/s per TFLOPS |
| Minimum achievable TPOT | 15 ms (538 tokens/s per NPU) |
- Prefill throughput reaches 5,655–6,688 tokens/s per NPU, and decode throughput reaches 1,943 tokens/s per NPU (4K KV cache, batch size 96) with time per output token (TPOT) under 50 ms; a quick arithmetic check follows below.
- Microbatch-based pipelining improves prefill throughput by up to 31% and reduces decode latency by roughly 10%.
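As a sanity check, the arithmetic below derives TPOT from batch size and per-NPU decode throughput; the implied concurrency at the 15 ms operating point is a derived estimate, not a reported number.

```python
# Consistency check on the reported decode figures (pure arithmetic).
batch_size = 96
decode_tokens_per_s = 1943           # per NPU, INT8, TPOT < 50 ms
tpot_ms = batch_size / decode_tokens_per_s * 1000
print(f"implied TPOT at batch {batch_size}: {tpot_ms:.1f} ms")   # ~49.4 ms

low_latency_tpot_s = 0.015           # 15 ms TPOT operating point
low_latency_tokens_per_s = 538       # per NPU at that operating point
implied_sequences = low_latency_tokens_per_s * low_latency_tpot_s
print(f"implied concurrent sequences: ~{implied_sequences:.0f}")  # ~8 (derived)
```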
Accuracy
- INT8-quantized DeepSeek-R1 matches the accuracy of DeepSeek’s official FP16 API/report within 1 point on all tasks (e.g., MMLU INT8: 90.82 / API: 91.05), as shown in benchmarking tables.
- No significant performance degradation is observed under quantization.
Operator Microbenchmarks
- MLA, compute-bound regime: sustains over 65% of peak NPU compute.
- MLA, memory-bound regime: sustains over 84% of peak memory bandwidth.
- MoE communication (EP256): 152 μs dispatch latency at 54 GB/s dispatch bandwidth, significantly faster than comparable GPU-based infrastructures.
5. Comparison with Prior LLM Serving Systems
CloudMatrix-Infer distinguishes itself from traditional LLM serving systems in several aspects:
- Peer-to-Peer Scheduling: Disaggregated clusters and decoupled cache location remove the need for affinity-aware or locality-constrained scheduling (a limitation in systems like Dynamo or Mooncake).
- Unified Bus (UB) Communication: UB enables all-to-all, high-bandwidth, low-latency communication, outperforming RDMA-based GPU clusters, notably in large-scale MoE token dispatch.
- Cache and Memory Optimization: The cloud-level unified memory pool significantly reduces DRAM usage (1× model size vs. 8× for local DRAM caching) and increases KV cache hit rates.
- Superior Hardware Utilization: Empirical results exhibit higher tokens/s/TFLOPS for both prefill and decode phases than top-performing GPU-based systems.
- Elasticity and Resource Pooling: Each resource pool (prefill, decode, cache) can be independently provisioned, supporting elasticity not seen in traditional block-based scheduling or batch-centric serving software.
6. Impact and Significance in Scalable LLM Deployment
CloudMatrix-Infer represents a significant step in scalable, efficient, and service-level-aware LLM inference. Its architecture is tailored for variable-length, bursty, and heterogeneous LLM workloads, as seen in production-scale settings. The ability to combine peer-to-peer resource pooling, high-degree expert parallelism, and hardware-aware pipeline and quantization optimizations provides:
- Consistent high throughput and low latency even under heavy or variable system loads.
- Robustness to workload diversity due to stateless and uniform scheduling enabled by the UB fabric.
- Energy and cost efficiency via fine-grained hardware utilization, quantization, and adaptive scheduling.
- New possibilities in model and data hosting via disaggregation and dynamic pooling, with implications for future cloud AI infrastructure designs.
Summary Table: Core Metrics and Innovations in CloudMatrix-Infer
| Category | Innovation/Result | Quantitative Value |
| --- | --- | --- |
| Serving architecture | Peer-to-peer, disaggregated resource pools | Separate prefill, decode, and caching pools |
| Expert parallelism | EP320 across 320 NPU dies, UB-based token dispatch | 152 μs dispatch latency (EP256) |
| Throughput (INT8, per NPU) | Prefill: 6,688 tokens/s; decode: 1,943 tokens/s | TPOT < 50 ms (batch size 96) |
| Hardware utilization | INT8 GEMM at 77–83% of peak NPU TFLOPS | 4.45 prefill tokens/s per TFLOPS |
| Cache efficiency | UB-accessible shared memory pool, 1× model size in DRAM | High hit rates, low wastage |
| Accuracy | Parity with FP16 across 16 benchmarks | <1 point difference on all tasks |
This combination of architectural design and empirical performance establishes CloudMatrix-Infer as a leading solution for large-scale, adaptive, and efficient LLM serving on datacenter-class infrastructure.