CloudMatrix Datacenter Architecture
- CloudMatrix Datacenter Architecture is a next-generation AI infrastructure that integrates high-density NPUs and CPUs for large language model inference using a co-designed UB fabric.
- It employs a two-tier non-blocking Clos/leaf–spine UB network that offers near-uniform low latency and high bandwidth for efficient all-to-all communication between NPUs and CPUs.
- Dynamic resource pooling and an optimized software stack enable scalable, low-latency LLM serving with advanced techniques like operator fusion and INT8 quantization.
CloudMatrix Datacenter Architecture is a next-generation AI infrastructure designed to address the demands imposed by LLMs, particularly those utilizing massive parameter counts, mixture-of-experts (MoE) topologies, and extended context lengths. These trends challenge traditional cluster architectures, necessitating innovations in compute density, memory bandwidth, and low-latency interconnects. The CloudMatrix384 supernode, realized via co-design of hardware and system software, integrates 384 Ascend 910C neural processing units (NPUs) and 192 Kunpeng CPUs via an ultra-high-bandwidth Unified Bus (UB) network. This ensemble enables direct all-to-all communication, dynamic resource pooling, and high-efficiency operation under variable and stringent service-level objectives (Zuo et al., 15 Jun 2025).
1. Hardware Architecture and System Composition
The CloudMatrix384 supernode is organized across 48 compute nodes, each comprising 8 Ascend 910C NPUs (dual-die, 752 TFLOPS FP16/BF16 per package, 128 GB on-package HBM, 3.2 TB/s memory bandwidth), 4 Kunpeng CPUs (each providing approximately 160 GB/s of UB bandwidth), and seven first-tier UB switches. This composition aggregates dense compute and memory resources for both model execution and memory-intensive tasks, with the supernode-level totals shown below.
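The aggregate figures follow directly from the per-node composition quoted above; a quick tally (all constants are taken from the text):

```python
# Aggregate resources of one CloudMatrix384 supernode, from per-node figures.
NODES = 48
NPUS_PER_NODE, CPUS_PER_NODE = 8, 4
HBM_PER_NPU_GB, HBM_BW_PER_NPU_TBPS = 128, 3.2
FP16_TFLOPS_PER_NPU = 752

total_npus = NODES * NPUS_PER_NODE                            # 384 Ascend 910C NPUs
total_cpus = NODES * CPUS_PER_NODE                            # 192 Kunpeng CPUs
total_hbm_tb = total_npus * HBM_PER_NPU_GB / 1024             # ~48 TB of pooled HBM
total_hbm_bw_pbps = total_npus * HBM_BW_PER_NPU_TBPS / 1000   # ~1.2 PB/s aggregate HBM bandwidth
total_fp16_pflops = total_npus * FP16_TFLOPS_PER_NPU / 1000   # ~289 PFLOPS FP16/BF16

print(total_npus, total_cpus, round(total_hbm_tb),
      round(total_hbm_bw_pbps, 2), round(total_fp16_pflops))
```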
Unified Bus (UB) Network
The UB operates as a two-tier non-blocking Clos/leaf–spine topology. Each on-node UB switch (L1) provides 16 links into an arrangement of seven independent L2 sub-planes housed in four dedicated communication racks. The network achieves:
- Per-NPU-die uplink bandwidth: provisioned uniformly for every die, in both the unidirectional and bidirectional sense
- Inter-node bandwidth: approximately 98% of the intra-node value
- Hop latencies: low and near-uniform for both intra-node and inter-node paths
Direct all-to-all peer-to-peer routing, without host-CPU mediation, yields uniform latency and bandwidth across NPU and CPU memory endpoints, as the hop-count model below illustrates.
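The near-uniform behavior follows from the bounded hop count of a two-tier Clos: intra-node traffic crosses a single L1 switch, while inter-node traffic crosses L1 → L2 → L1. A minimal schematic model (the per-hop latency constants are illustrative placeholders, not measured values from the paper):

```python
# Schematic latency model for the two-tier UB Clos fabric.
# Per-hop latencies are illustrative placeholders, not measured values.

L1_HOP_US = 0.5   # assumed latency contribution of one L1 switch traversal
L2_HOP_US = 0.5   # assumed latency contribution of one L2 switch traversal

def ub_path_latency_us(src_node: int, dst_node: int) -> float:
    """Return the modeled switch-traversal latency between two NPU dies.

    Intra-node traffic crosses a single L1 switch; inter-node traffic
    crosses L1 -> L2 -> L1, so the worst case is bounded and uniform.
    """
    if src_node == dst_node:
        return L1_HOP_US                      # one L1 hop
    return 2 * L1_HOP_US + L2_HOP_US          # L1 up, L2 across, L1 down

# Any pair of dies in the supernode differs by at most one fixed increment,
# which is why latency and bandwidth look near-uniform across all 384 NPUs.
print(ub_path_latency_us(0, 0), ub_path_latency_us(0, 47))
```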
Dynamic Resource Pooling
A global address space makes all NPU and CPU memory accessible across the UB fabric, supporting zero-copy DMA and runtime-disaggregated allocation of compute and memory resources. Any NPU can bind to any memory page or CPU, improving utilization and enabling fine-grained load balancing; the sketch below illustrates the idea.
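A purely illustrative sketch of how such pooling could be modeled: global addresses name (owning device, offset) pairs, and any NPU may bind any page. The types and the `MemoryPool` class are hypothetical and do not correspond to any published CloudMatrix API:

```python
# Conceptual model of UB-style memory pooling: every page of NPU HBM or CPU
# DRAM is addressable as (owner_device, offset), and any NPU may map it.
# All names here are illustrative, not a real CloudMatrix interface.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

PAGE_SIZE = 2 * 1024 * 1024  # assumed 2 MiB pages, for illustration only

@dataclass(frozen=True)
class GlobalAddress:
    device_id: int   # NPU die or CPU socket that physically owns the page
    offset: int      # byte offset within that device's memory

@dataclass
class MemoryPool:
    free_pages: Dict[int, List[int]]                  # device_id -> free offsets
    bindings: Dict[Tuple[int, GlobalAddress], bool] = field(default_factory=dict)

    def allocate(self, preferred_device: int) -> GlobalAddress:
        # Fall back to any device with free capacity: the fabric makes remote
        # pages usable at near-local bandwidth, so placement is a soft hint.
        for dev in [preferred_device, *self.free_pages]:
            if self.free_pages.get(dev):
                return GlobalAddress(dev, self.free_pages[dev].pop())
        raise MemoryError("pool exhausted")

    def bind(self, npu_id: int, addr: GlobalAddress) -> None:
        # Zero-copy DMA: the NPU maps the page directly, no host mediation.
        self.bindings[(npu_id, addr)] = True

pool = MemoryPool(free_pages={0: [0, PAGE_SIZE], 1: [0]})
addr = pool.allocate(preferred_device=1)
pool.bind(npu_id=7, addr=addr)   # any NPU can bind any page
```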
2. Communication and Performance Modeling
The CloudMatrix UB fabric enables the distributed operations essential for MoE LLM inference and serving. For an expert-parallel group spanning $N$ NPU dies:
- Total token dispatch bandwidth: roughly $N$ times the per-die UB uplink bandwidth, since every die injects into the fabric concurrently.
- Aggregate latency per dispatch–combine roundtrip: $T_{\text{roundtrip}} \approx T_{\text{dispatch}} + T_{\text{combine}}$, with each term proportional to the routed payload divided by the per-die bandwidth, plus a fixed startup cost.
MoE Expert-Parallel Communication Complexity
Let $E$ be the EP degree, $B$ the local batch size per die, $h$ the expert hidden dimension (e.g., 7,168), and $k$ the number of experts selected per token (e.g., 8):
- Dispatch payload: $k \cdot h$ activation elements per token, i.e., $B \cdot k \cdot h$ elements per die per MoE layer.
- Communication time per FusedDispatch satisfies $T_{\text{dispatch}} \approx \frac{B \cdot k \cdot h \cdot s}{BW_{\text{die}}} + t_{\text{startup}}$, where $s$ is the element width in bytes (1 for INT8) and $BW_{\text{die}}$ is the per-die UB bandwidth.
Empirical results show that at EP8, FusedDispatch and Combine on CloudMatrix UB require 116–118 μs per NPU die (batch=128), compared to 163–318 μs using reference H800 RDMA interfaces.
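A back-of-envelope instantiation of this dispatch-time model. The UB bandwidth and startup constants below are placeholder assumptions, not figures from the paper; only the batch size, hidden dimension, top-k value, and INT8 element width come from the text:

```python
# Rough FusedDispatch time estimate: payload / bandwidth + fixed startup.
# ASSUMED values are placeholders; substitute the actual UB figures.

HIDDEN_DIM = 7168          # expert hidden dimension h (from the text)
TOP_K = 8                  # experts selected per token k (from the text)
BATCH = 128                # local batch per die (from the EP8 measurement)
BYTES_PER_ELEM = 1         # INT8 payload after early quantization

UB_BW_GBPS = 150.0         # ASSUMED per-die unidirectional UB bandwidth (GB/s)
STARTUP_US = 20.0          # ASSUMED fixed per-dispatch startup/sync cost (us)

def dispatch_time_us(batch: int, bw_gbps: float = UB_BW_GBPS) -> float:
    payload_bytes = batch * TOP_K * HIDDEN_DIM * BYTES_PER_ELEM
    transfer_us = payload_bytes / (bw_gbps * 1e9) * 1e6
    return transfer_us + STARTUP_US

# ~7.3 MB of routed activations per die at batch=128; the estimate lands in
# the tens of microseconds, the same order as the measured 116-118 us.
print(f"{dispatch_time_us(BATCH):.1f} us per FusedDispatch (modeled)")
```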
Distributed KV-Cache Access
For a remote KV-cache load of a block of size $S$, the transfer time follows the same model, $T_{\text{fetch}} \approx \frac{S}{BW_{\text{die}}} + t_{\text{hop}}$; with typical block sizes (512 KB), this enables sub-100 μs remote KV-cache fetches.
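A two-line instantiation of the fetch-time estimate, again with placeholder bandwidth and latency constants:

```python
# Remote KV-cache fetch time ~= block_size / bandwidth + per-hop latency.
BLOCK_BYTES = 512 * 1024          # typical KV block size (from the text)
UB_BW_GBPS = 150.0                # ASSUMED per-die UB bandwidth, placeholder
HOP_LATENCY_US = 5.0              # ASSUMED fabric + DMA setup cost, placeholder

fetch_us = BLOCK_BYTES / (UB_BW_GBPS * 1e9) * 1e6 + HOP_LATENCY_US
print(f"~{fetch_us:.1f} us per 512 KB block")   # well under 100 us in this model
```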
3. CloudMatrix-Infer Software Stack
CloudMatrix-Infer is the software stack co-designed for CloudMatrix384 to optimize LLM serving across prefill, decode, and caching.
Peer-to-Peer Prefill–Decode–Caching (PDC) Disaggregation
- Prefill Cluster: Implements EP32 per instance (16 NPUs), focusing on context construction and initial KV cache population.
- Decode Cluster: Scales tightly-coupled MoE inference up to EP320 (160 NPUs).
- Caching Cluster: Leverages a disaggregated DRAM pool across 32 CPU nodes for both KV and model cache, accessed uniformly over UB.
KV cache transfers between prefill and decode stages utilize an isolated RDMA plane to prevent decode interference.
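A conceptual sketch of this prefill–decode–caching flow; the classes and function names are hypothetical stand-ins, not the CloudMatrix-Infer scheduler API:

```python
# Illustrative prefill/decode/caching (PDC) disaggregation flow.
# Names and interfaces are hypothetical, not the CloudMatrix-Infer API.
from collections import deque

class CachingCluster:
    """Disaggregated DRAM pool reachable over UB from both stages."""
    def __init__(self):
        self.kv_store = {}

    def put(self, request_id, kv_blocks):
        self.kv_store[request_id] = kv_blocks

    def get(self, request_id):
        return self.kv_store[request_id]

def prefill(request_id, prompt_tokens, cache: CachingCluster):
    # Prefill instance: build the context and populate the KV cache.
    kv_blocks = [f"kv({request_id},{i})" for i in range(len(prompt_tokens) // 512 + 1)]
    cache.put(request_id, kv_blocks)     # handoff over UB / isolated RDMA plane
    return request_id

def decode(request_id, cache: CachingCluster, max_new_tokens=4):
    # Decode instance: pull KV blocks and generate tokens step by step.
    kv_blocks = cache.get(request_id)
    return [f"tok{t}@{len(kv_blocks)}blocks" for t in range(max_new_tokens)]

cache = CachingCluster()
pending = deque([("req-0", list(range(4096)))])   # 4K-token prompt
while pending:
    rid, prompt = pending.popleft()
    decode(prefill(rid, prompt, cache), cache)
```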
Large-Scale Expert Parallelism (EP320) and Pipelining
- FusedDispatch/FusedCombine Operators: AI-vector (AIV) cores directly write to remote NPU memory over UB, mitigating SDMA startup latency. Early quantization (BF16→INT8) reduces message size; pre-allocated double buffers balance dispatch and combine occupancy.
- Microbatch-Based Decode Pipeline: Decoding employs two interleaved streams (attention, MoE paths), each utilizing distinct AIC/AIV core allocations, providing layer-level latency reduction via overlap.
- Multi-Token Prediction (MTP): Utilizes speculative generation with on-NPU sampling and aggregation, yielding roughly a 30% throughput gain at the observed acceptance rate (see the sketch after this list).
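A back-of-envelope relation between MTP acceptance rate and decode throughput gain; the acceptance-rate and overhead values below are illustrative placeholders, not the paper's measurements:

```python
# Simple model of multi-token-prediction (MTP) decode speedup.
# accept_rate and step_overhead are placeholders, not measured values.

def mtp_speedup(accept_rate: float, draft_len: int = 1, step_overhead: float = 0.0) -> float:
    """Expected tokens per step divided by the relative step cost.

    draft_len extra tokens are proposed per step; each is kept with
    probability accept_rate. step_overhead is the fractional extra cost of
    verifying the speculated tokens in the same forward pass.
    """
    tokens_per_step = 1.0 + accept_rate * draft_len
    return tokens_per_step / (1.0 + step_overhead)

# Example: a ~30% net gain is consistent with, e.g., a high acceptance rate
# partially offset by per-step verification overhead (values illustrative).
print(f"{mtp_speedup(accept_rate=0.7, draft_len=1, step_overhead=0.3):.2f}x")
```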
Hardware-Aware Optimizations
- Operator Fusion: MLAProlog combines normalization, projection, and positional encoding into a single pass; FusedAttention integrates attention and associated data layout operations.
- Microbatch Pipeline and Hybrid Parallelism: SP–TP–SP hybridization, with microbatch pipelining overlapping compute and communication, maximizing both AIV and bulk network utilization.
- INT8 Quantization: Employs mixed precision (INT8 for GEMMs, BF16/FP32 elsewhere), adaptive scale search, and outlier suppression, all without any retraining or fine-tuning (a sketch follows this list).
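A minimal sketch of training-free per-channel INT8 weight quantization with a small scale search, in the spirit of the recipe above; the search grid and clipping heuristic are illustrative choices, not the exact CloudMatrix-Infer algorithm:

```python
# Training-free per-channel INT8 quantization with a small scale search.
# The grid and error metric are illustrative choices, not the paper's recipe.
import numpy as np

def quantize_int8_per_channel(w: np.ndarray, n_grid: int = 20):
    """Quantize weights [out_channels, in_features] to INT8, one scale per channel.

    For each output channel, search a few candidate clipping ratios and keep
    the scale that minimizes reconstruction error (a crude outlier-suppression
    heuristic: clipping trims the influence of rare large-magnitude weights).
    """
    q = np.empty_like(w, dtype=np.int8)
    scales = np.empty(w.shape[0], dtype=np.float32)
    for c in range(w.shape[0]):
        row, best_err = w[c], np.inf
        max_abs = np.abs(row).max() + 1e-12
        for ratio in np.linspace(0.5, 1.0, n_grid):      # candidate clip ratios
            scale = (ratio * max_abs) / 127.0
            q_row = np.clip(np.round(row / scale), -127, 127)
            err = np.square(row - q_row * scale).sum()   # reconstruction error
            if err < best_err:
                best_err, scales[c], q[c] = err, scale, q_row.astype(np.int8)
    return q, scales

w = np.random.randn(16, 7168).astype(np.float32)   # toy weight matrix
q, s = quantize_int8_per_channel(w)
dequant = q.astype(np.float32) * s[:, None]        # BF16/FP32 path elsewhere
```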
4. Operator and Microbenchmark Evaluation
CloudMatrix384's UB fabric and operator stack deliver high utilization:
| Operator | Achieved (per NPU) | Peak (per NPU) | Utilization |
|---|---|---|---|
| MLA on Ascend die | 246 TFLOPS | 376 TFLOPS | 65.4% |
| INT8 GEMM (128×152) | 582–622 TFLOPS | 752 TFLOPS | 77–83% |
| Memory BW (MLA) | 1,346 GB/s | 1,600 GB/s | 84.1% |
For FusedDispatch/Combine at EP256, UB achieves 149–152 μs latencies and per-rank bandwidth of 54–103 GB/s, improving significantly over RDMA.
5. LLM Serving Performance
CloudMatrix-Infer attains the following efficiency and latency metrics for serving LLMs (e.g., DeepSeek-R1):
- Prefill phase (4K prompt, EP32): 5,655 tokens/s per NPU with INT8 quantization; best case with expert-parallelism load balancing (EPLB): 6,688 tokens/s
- Decode phase (4K KV cache, batch=96): TPOT = 49.4 ms; throughput = 1,943 tokens/s per NPU
- SLO enforcement: at a 30 ms TPOT SLO (batch=24), 974 tokens/s with TPOT = 24.6 ms; at a 15 ms SLO (batch=8), 538 tokens/s with TPOT = 14.9 ms
Compared to leading GPU-based platforms, CloudMatrix-Infer demonstrates superior per-unit throughput and maintains full model accuracy under INT8 quantization. For example:
| System | Prefill (tok/s) | Prefill (tok/s/TFLOPS) | Decode (tok/s) | Decode (tok/s/TFLOPS) |
|---|---|---|---|---|
| CloudMatrix-Infer | 6,688 | 4.45 | 1,943 | 1.29 |
| SGLang@H100 | 6,288 | 3.18 | ~2,172 | 1.10 |
| DeepSeek@H800 | 4,026 | 2.03 | 1,850 | 0.93 |
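The normalized columns divide throughput by each accelerator's peak dense-compute rating; a trivial helper makes the relationship explicit (the peak value in the example is back-solved from the table's own ratio, not an official spec):

```python
def tokens_per_tflops(tokens_per_s: float, peak_tflops: float) -> float:
    # Hardware-normalized serving-efficiency metric used in the table above.
    return tokens_per_s / peak_tflops

# Placeholder peak rating back-solved from the table's ratio (6,688 / 4.45);
# substitute each chip's actual dense-compute spec for real comparisons.
print(round(tokens_per_tflops(6688, 1503), 2))   # ~4.45 tok/s/TFLOPS
```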
6. Architectural Significance and Outlook
CloudMatrix384 demonstrates the practical advantages of co-designing hardware interconnects, dynamic pooling strategies, and highly optimized software stacks for contemporary LLM workloads. The peer-to-peer UB fabric, global address space, and scalable expert parallelism support sustained high throughput, low latency, and efficient memory utilization, even under demanding MoE and distributed KV cache workloads. The system establishes new reference points for AI datacenter design, with demonstrated superior utilization and model-serving efficiency relative to conventional GPU clusters. Further research may explore extended scaling properties, scheduling under more heterogeneous batch and context distributions, and architectural adaptations for emergent model paradigms (Zuo et al., 15 Jun 2025).