
CloudMatrix Datacenter Architecture

Updated 18 March 2026
  • CloudMatrix Datacenter Architecture is a next-generation AI infrastructure that integrates high-density NPUs and CPUs for large language model inference using a co-designed UB fabric.
  • It employs a two-tier non-blocking Clos/leaf–spine UB network that offers near-uniform low latency and high bandwidth for efficient all-to-all communication between NPUs and CPUs.
  • Dynamic resource pooling and an optimized software stack enable scalable, low-latency LLM serving with advanced techniques like operator fusion and INT8 quantization.

CloudMatrix Datacenter Architecture is a next-generation AI infrastructure designed to address the demands imposed by LLMs, particularly those utilizing massive parameter counts, mixture-of-experts (MoE) topologies, and extended context lengths. These trends challenge traditional cluster architectures, necessitating innovations in compute density, memory bandwidth, and low-latency interconnects. The CloudMatrix384 supernode, realized via co-design of hardware and system software, integrates 384 Ascend 910C neural processing units (NPUs) and 192 Kunpeng CPUs via an ultra-high-bandwidth Unified Bus (UB) network. This ensemble enables direct all-to-all communication, dynamic resource pooling, and high-efficiency operation under variable and stringent service-level objectives (Zuo et al., 15 Jun 2025).

1. Hardware Architecture and System Composition

The CloudMatrix384 supernode is organized across 48 compute nodes, each comprising 8 Ascend 910C NPUs (dual-die, 752 TFLOPS FP16/BF16 per package, 128 GB on-package HBM, 3.2 TB/s memory bandwidth), 4 Kunpeng CPUs (each providing approximately 160 GB/s of UB bandwidth), and seven first-tier UB switches. This composition aggregates dense compute and memory resources for both model execution and memory-intensive tasks.
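As a back-of-envelope check, the per-node figures above can be aggregated to supernode scale (a quick arithmetic sketch, using only numbers quoted in this section):

```python
# Aggregate figures for a CloudMatrix384 supernode, derived from the
# per-node composition above: 48 nodes x (8 Ascend 910C NPUs + 4 Kunpeng CPUs).
NODES = 48
NPUS_PER_NODE = 8
CPUS_PER_NODE = 4
FP16_TFLOPS_PER_NPU = 752       # dual-die package, FP16/BF16
HBM_GB_PER_NPU = 128            # on-package HBM
HBM_BW_TBPS_PER_NPU = 3.2       # memory bandwidth per package

npus = NODES * NPUS_PER_NODE                       # 384 NPUs
cpus = NODES * CPUS_PER_NODE                       # 192 CPUs
peak_pflops = npus * FP16_TFLOPS_PER_NPU / 1000    # ~288.8 PFLOPS FP16
total_hbm_tb = npus * HBM_GB_PER_NPU / 1024        # 48 TB of HBM
agg_hbm_bw_tbps = npus * HBM_BW_TBPS_PER_NPU       # ~1,229 TB/s aggregate
```

These totals match the 384-NPU / 192-CPU composition stated in the overview.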

Unified Bus (UB) Network

The UB operates as a two-tier non-blocking Clos/leaf–spine topology. Each on-node UB switch (L1) provides 16 links into an arrangement of seven independent L2 sub-planes housed in four dedicated communication racks. The network achieves:

  • Per-NPU-die uplink bandwidth: $B_{UB} \approx 196\,\mathrm{GB/s}$ unidirectional ($392\,\mathrm{GB/s}$ bidirectional)
  • Inter-node bandwidth: $\approx 164\,\mathrm{GB/s}$ (98% of the intra-node value)
  • Hop latencies: $L_{hop,\,intra} \approx 1.2\,\mu s$; $L_{hop,\,inter} \approx 1.9\,\mu s$

Direct all-to-all peer-to-peer routing, without host-CPU mediation, yields near-uniform latency and bandwidth across NPU and CPU memory endpoints.

Dynamic Resource Pooling

A global address space enables all NPU and CPU memories to be accessible across the UB fabric, supporting zero-copy DMA and runtime-disaggregated compute and memory resource allocation. Any NPU can bind to any memory page or CPU, enhancing utilization and facilitating fine-grained load balancing.

2. Communication and Performance Modeling

The CloudMatrix UB fabric enables efficient distributed operations essential for MoE LLM inference and serving. For an expert-parallel group with $N$ dies:

  • Total token dispatch bandwidth: $N \times B_{UB}$
  • Aggregate latency per roundtrip: $2 L_{hop} + T_{comm}$, with $T_{comm}$ proportional to $\text{message size}/B_{UB}$
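The two quantities above can be sketched directly in code (a simple model using the per-die figures quoted earlier; $N = 320$ is chosen to match the EP320 decode configuration discussed later):

```python
# Sketch of the UB communication model: aggregate dispatch bandwidth scales
# with the number of dies N, and a roundtrip costs two hop latencies plus
# serialization time msg_size / B_UB.
B_UB = 196e9          # per-die unidirectional UB bandwidth, bytes/s
L_HOP_INTER = 1.9e-6  # inter-node hop latency, seconds

def roundtrip_time(msg_bytes: float, l_hop: float = L_HOP_INTER) -> float:
    """2 * L_hop + T_comm, with T_comm = msg_size / B_UB."""
    return 2 * l_hop + msg_bytes / B_UB

N = 320
aggregate_bw_tbps = N * B_UB / 1e12     # ~62.7 TB/s of token-dispatch bandwidth
t = roundtrip_time(7.5e3)               # one ~7.5 KB token message: ~3.8 us
```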

MoE Expert-Parallel Communication Complexity

Let $N_{ranks}$ be the EP degree, $b$ the local batch size, $D$ the expert dimension (e.g., 7,168), and $K$ the number of top experts per token (e.g., 8):

  • $\text{max tokens} = b \times \min(K, \text{experts per die})$
  • $\text{msg size}_{dispatch} = \text{max tokens} \times (D \times 1\,\mathrm{B} + \text{overhead}) \approx 7.5\,\mathrm{KB}$ per token
  • $\text{buffer size} = N_{ranks} \times \text{max tokens} \times \text{msg size}_{dispatch}$

Communication time per FusedDispatch satisfies:

$$T_{dispatch} \approx \alpha + \frac{\text{msg size}_{dispatch}}{B_{UB}} \times \text{max tokens}$$

Empirical results show that at EP8, FusedDispatch and Combine on CloudMatrix UB require 116–118 μs per NPU die (batch=128), compared to 163–318 μs using reference H800 RDMA interfaces.
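The dispatch-time model can be transcribed directly. Note that the values in parentheses below are illustrative assumptions, not figures from the source: ~320 B of per-token overhead (so the INT8 payload lands near the quoted ~7.5 KB) and 32 experts per die; the measured 116–118 μs at EP8 also includes scheduling and contention effects this simple model omits.

```python
# Transcription of the dispatch model: T ~ alpha + per_token_msg / B_UB * max_tokens.
# D = 7168 and K = 8 come from the text; overhead and experts_per_die are assumed.
B_UB = 196e9            # bytes/s per die
ALPHA = 2 * 1.9e-6      # fixed startup term, taken here as two inter-node hops

def dispatch_time(b: int, experts_per_die: int, K: int = 8,
                  D: int = 7168, overhead: int = 320) -> float:
    max_tokens = b * min(K, experts_per_die)
    per_token_msg = D * 1 + overhead   # INT8: one byte per element, plus metadata
    return ALPHA + per_token_msg / B_UB * max_tokens

t = dispatch_time(b=128, experts_per_die=32)   # ~43 us of pure serialization
```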

Distributed KV-Cache Access

For a remote load,

$$T_{get} = L_{hop} + \frac{\text{block size}}{B_{UB}}$$

with typical block sizes ($\sim$512 KB), enabling sub-100 μs remote KV-cache fetches.
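Plugging the earlier UB figures into this formula shows how much headroom remains under the 100 μs bound (a quick numeric check using the quoted inter-node hop latency and per-die bandwidth):

```python
# Remote KV-cache fetch model: T_get = L_hop + block_size / B_UB.
B_UB = 196e9            # bytes/s per die
L_HOP = 1.9e-6          # inter-node hop latency, seconds

def t_get(block_bytes: float) -> float:
    return L_HOP + block_bytes / B_UB

t = t_get(512 * 1024)   # ~4.6 us for a 512 KB block, well under 100 us
```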

3. CloudMatrix-Infer Software Stack

CloudMatrix-Infer is the software stack co-designed for CloudMatrix384 to optimize LLM serving across prefill, decode, and caching.

Peer-to-Peer PDC Disaggregation

  • Prefill Cluster: Implements EP32 per instance (16 NPUs), focusing on context construction and initial KV cache population.
  • Decode Cluster: Scales tightly-coupled MoE inference up to EP320 (160 NPUs).
  • Caching Cluster: Leverages a disaggregated DRAM pool across 32 CPU nodes for both KV and model cache, accessed uniformly over UB.

KV cache transfers between prefill and decode stages utilize an isolated RDMA plane to prevent decode interference.

Large-Scale Expert Parallelism (EP320) and Pipelining

  • FusedDispatch/FusedCombine Operators: AI-vector (AIV) cores directly write to remote NPU memory over UB, mitigating SDMA startup latency. Early quantization (BF16→INT8) reduces message size; pre-allocated double buffers balance dispatch and combine occupancy.
  • Microbatch-Based Decode Pipeline: Decoding employs two interleaved streams (attention, MoE paths), each utilizing distinct AIC/AIV core allocations, providing layer-level latency reduction via overlap.
  • Multiple-Token Prediction (MTP): Utilizes speculative generation, with on-NPU sampling and aggregation, yielding a ~30% throughput gain (acceptance rate $p \approx 70\%$).
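A hedged sketch of the speculative-decoding arithmetic behind MTP: with $k$ speculative tokens per step and an (assumed independent) per-token acceptance probability $p$, the expected number of accepted tokens per step is the geometric sum $1 + p + \dots + p^k$. The ~30% end-to-end gain quoted above is lower than this raw factor because MTP's draft computation adds overhead the formula does not model.

```python
# Expected accepted tokens per decode step under a standard speculative-decoding
# model (independence of acceptances is an assumption, not a claim of the source).
def expected_tokens_per_step(p: float, k: int = 1) -> float:
    return sum(p ** i for i in range(k + 1))

e = expected_tokens_per_step(0.7, k=1)   # 1.7 tokens per step at p = 0.7
```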

Hardware-Aware Optimizations

  • Operator Fusion: MLAProlog combines normalization, projection, and positional encoding into a single pass; FusedAttention integrates attention and associated data layout operations.
  • Microbatch Pipeline and Hybrid Parallelism: SP–TP–SP hybridization, with microbatch pipelining overlapping compute and communication, maximizing both AIV and bulk network utilization.
  • INT8 Quantization: Employs mixed precision (INT8 for GEMMs, BF16/FP32 elsewhere), adaptive scale search, and outlier suppression, requiring no retraining or fine-tuning.
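The mixed-precision GEMM path can be illustrated with a minimal sketch: per-channel symmetric INT8 for the matmul, higher precision elsewhere. This is not the Ascend implementation (the adaptive scale search and outlier suppression are omitted); it only shows the quantize → INT8 GEMM → dequantize flow.

```python
import numpy as np

def quantize_int8(x: np.ndarray, axis: int):
    """Per-channel symmetric quantization: scale = max|x| / 127."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
a, b = rng.standard_normal((4, 64)), rng.standard_normal((64, 8))
qa, sa = quantize_int8(a, axis=1)    # quantize along the reduction axis
qb, sb = quantize_int8(b, axis=0)
# INT8 GEMM accumulated in INT32, then dequantized back to float
y = (qa.astype(np.int32) @ qb.astype(np.int32)) * (sa * sb)
err = np.abs(y - a @ b).max()        # small residual quantization error
```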

4. Operator and Microbenchmark Evaluation

CloudMatrix384's UB fabric and operator stack deliver high utilization:

| Operator | Achieved (per NPU) | Peak (per NPU) | Utilization |
|---|---|---|---|
| MLA on Ascend die | 246 TFLOPS | 376 TFLOPS | 65.4% |
| INT8 GEMM (128×152) | 582–622 TFLOPS | 752 TFLOPS | 77–83% |
| Memory BW (MLA) | 1,346 GB/s | 1,600 GB/s | 84.1% |
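The utilization column is simply achieved divided by peak, which can be verified directly from the table's own numbers:

```python
# Consistency check of the utilization figures above (achieved / peak, per NPU).
rows = {
    "MLA":       (246.0, 376.0),
    "INT8 GEMM": (622.0, 752.0),    # upper end of the 582-622 TFLOPS range
    "Memory BW": (1346.0, 1600.0),
}
util = {name: 100.0 * achieved / peak for name, (achieved, peak) in rows.items()}
# MLA -> ~65.4%, INT8 GEMM (upper bound) -> ~82.7%, memory bandwidth -> ~84.1%
```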

For FusedDispatch/Combine at EP256, UB achieves 149–152 μs latencies and per-rank bandwidth of 54–103 GB/s, improving significantly over RDMA.

5. LLM Serving Performance

CloudMatrix-Infer attains the following efficiency and latency metrics for serving LLMs (e.g., DeepSeek-R1):

  • Prefill phase (4K prompt, EP32): 5,655 tokens/s per NPU ($3.76\,\mathrm{tok/s/TFLOPS}$, INT8); best-case EPLB: 6,688 tokens/s ($4.45\,\mathrm{tok/s/TFLOPS}$)
  • Decode phase (4K cache, batch=96): TPOT = 49.4 ms; throughput = 1,943 tokens/s per NPU ($1.29\,\mathrm{tok/s/TFLOPS}$)
  • SLO enforcement: at SLO = 30 ms (batch=24): 974 tokens/s, TPOT = 24.6 ms; at SLO = 15 ms (batch=8): 538 tokens/s, TPOT = 14.9 ms
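The decode figures above are mutually consistent: per-NPU throughput is approximately batch size divided by TPOT, as a quick check confirms:

```python
# Throughput ~ batch_size / TPOT for each decode configuration quoted above.
def throughput(batch: int, tpot_s: float) -> float:
    return batch / tpot_s

t96 = throughput(96, 49.4e-3)   # ~1943 tok/s (quoted: 1,943)
t24 = throughput(24, 24.6e-3)   # ~976 tok/s  (quoted: 974)
t8  = throughput(8, 14.9e-3)    # ~537 tok/s  (quoted: 538)
```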

Compared to leading GPU-based platforms, CloudMatrix-Infer demonstrates superior per-unit throughput and maintains full model accuracy under INT8 quantization. For example:

| System | Prefill (tok/s) | Prefill (tok/s/TFLOPS) | Decode (tok/s) | Decode (tok/s/TFLOPS) |
|---|---|---|---|---|
| CloudMatrix-Infer | 6,688 | 4.45 | 1,943 | 1.29 |
| SGLang@H100 | 6,288 | 3.18 | ~2,172 | 1.10 |
| DeepSeek@H800 | 4,026 | 2.03 | 1,850 | 0.93 |

6. Architectural Significance and Outlook

CloudMatrix384 demonstrates the practical advantages of co-designing hardware interconnects, dynamic pooling strategies, and highly optimized software stacks for contemporary LLM workloads. The peer-to-peer UB fabric, global address space, and scalable expert parallelism support sustained high throughput, low latency, and efficient memory utilization, even under demanding MoE and distributed KV cache workloads. The system establishes new reference points for AI datacenter design, with demonstrated superior utilization and model-serving efficiency relative to conventional GPU clusters. Further research may explore extended scaling properties, scheduling under more heterogeneous batch and context distributions, and architectural adaptations for emergent model paradigms (Zuo et al., 15 Jun 2025).
