CloudMatrix384 Supernode

Updated 30 June 2025
  • CloudMatrix384 Supernode is a high-performance AI datacenter architecture that integrates 384 NPUs and 192 CPUs to scale large language model inference.
  • It employs a peer-to-peer network and resource disaggregation to deliver uniform, low-latency communication and dynamic workload allocation.
  • Innovative hardware-software co-design, fused operators, and INT8 quantization collectively drive significant throughput improvements and energy efficiency.

CloudMatrix384 Supernode is a tightly integrated AI datacenter architecture engineered for efficient LLM inference and serves as a foundational unit within Huawei's CloudMatrix platform. Its design addresses specific challenges in scaling, communication efficiency, memory bandwidth, and workload elasticity for state-of-the-art LLM deployments, particularly those employing mixture-of-experts (MoE) architectures and large distributed key-value (KV) caches. From both computational and graph-theoretic perspectives, the CloudMatrix384 Supernode is characterized by its peer-to-peer topology, resource disaggregation, and graph-based signal-processing abstractions, combining innovations in hardware-software co-design with signal analysis frameworks.

1. Architecture and Hardware Integration

CloudMatrix384 Supernode integrates 384 Ascend 910C Neural Processing Units (NPUs) and 192 Kunpeng CPUs, interconnected via the Unified Bus (UB), a non-blocking, all-to-all, ultra-high-bandwidth network. Each node houses four Kunpeng CPUs and eight Ascend NPUs, operating as independent, dynamically assignable resources.

  • Ascend 910C NPUs each provide up to 752 INT8 TFLOPS (per die), supporting high-throughput matrix computation for transformer inference.
  • The Unified Bus (UB) delivers >390 GB/s of unidirectional bandwidth per NPU, with inter-node bandwidth degradation below 3% and added latency below 1 μs, enabling uniform communication latency and bandwidth irrespective of physical node boundaries.
  • Resource disaggregation allows CPUs, NPUs, and DRAM to be allocated elastically. Memory pooling across the supernode is critical for large model parameter storage and distributed caching.
  • The software stack is built on CANN (Compute Architecture for Neural Networks), which interfaces with mainstream ML frameworks and orchestrates AI computation and data movement at hardware speed and scale.

This architecture enables the supernode to function as a tightly-coupled computational entity, mitigating bottlenecks often encountered in traditional AI clusters reliant on PCIe, NVLink, or InfiniBand for inter-node connectivity.
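
As a point of reference, the following sketch (hypothetical Python, not Huawei tooling) tallies the node count and nominal aggregate UB bandwidth implied by the figures above; the class, its field names, and the aggregate-bandwidth arithmetic are illustrative only.

```python
# Hypothetical inventory of one CloudMatrix384 supernode, using only the
# figures quoted in this section; not an official specification or API.
from dataclasses import dataclass

@dataclass
class SupernodeSpec:
    npus: int = 384                    # Ascend 910C NPUs
    cpus: int = 192                    # Kunpeng CPUs
    npus_per_node: int = 8
    cpus_per_node: int = 4
    ub_bw_per_npu_gbs: float = 390.0   # >390 GB/s unidirectional per NPU

    @property
    def nodes(self) -> int:
        # 384 NPUs / 8 per node = 48 nodes; the CPU count must agree (48 * 4 = 192).
        assert self.npus // self.npus_per_node == self.cpus // self.cpus_per_node
        return self.npus // self.npus_per_node

spec = SupernodeSpec()
print(spec.nodes)                                  # 48 physical nodes
print(spec.npus * spec.ub_bw_per_npu_gbs / 1000)   # ~149.8 TB/s nominal aggregate (384 x 390 GB/s)
```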

2. Peer-to-Peer Serving and Resource Pooling

The CloudMatrix384 Supernode adopts a peer-to-peer architecture for LLM serving through the CloudMatrix-Infer framework:

  • Disaggregation of Prefill, Decode, and Caching Tasks:
    • The serving pipeline is partitioned into prefill (prompt processing), decode (sequential generation), and caching (KV cache management).
    • Each cluster (prefill, decode, cache) can scale independently, matching computing resources to request profiles.
  • Direct, Stateless Communication:
    • Any NPU or CPU can access any memory location in the DRAM pool over the UB, eliminating data locality constraints.
    • Request scheduling is stateless; the assignment of user queries does not need to consider KV cache placement.
  • Contrast with Conventional Infrastructures:
    • Prior solutions bind request routing to node-locality due to slow cross-node transfers, resulting in straggler effects and underutilized memory.
    • CloudMatrix384's uniform interconnect topology flattens these constraints.

This stateless, peer-to-peer model results in improved DRAM utilization, more predictable latency profiles, and better adaptability to bursty workloads.
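
To make the stateless model concrete, here is a minimal hypothetical Python sketch: because any NPU can pull KV-cache blocks from the shared DRAM pool over the UB, the scheduler simply picks the least-loaded decode instance and never consults cache placement. The names below are illustrative and not part of the CloudMatrix-Infer API.

```python
# Illustrative only: stateless request scheduling under uniform UB access.
# A locality-bound cluster would instead have to route each request to the
# node that already holds its KV cache; here placement is ignored entirely.
from dataclasses import dataclass, field

@dataclass
class DecodeInstance:
    name: str
    active_requests: int = 0

@dataclass
class StatelessScheduler:
    instances: list[DecodeInstance] = field(default_factory=list)

    def schedule(self, request_id: str) -> DecodeInstance:
        # Pure load balancing: no KV-cache placement lookup is required,
        # since every instance reads KV blocks from the shared DRAM pool.
        target = min(self.instances, key=lambda inst: inst.active_requests)
        target.active_requests += 1
        return target

sched = StatelessScheduler([DecodeInstance(f"decode-{i}") for i in range(4)])
print(sched.schedule("req-42").name)   # "decode-0"
```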

3. Large-Scale Expert Parallelism and Operator Optimization

CloudMatrix384 Supernode is tuned for large-scale expert parallelism required by sparse MoE LLM architectures, as exemplified by the DeepSeek-R1 LLM (256+ experts):

  • Expert Parallelism (EP320):
    • Each NPU die can be allocated to a unique expert, permitting 320-way expert parallelism.
  • Fused Communication Operators:
    • Custom FusedDispatch and FusedCombine primitives leverage AIV-Direct (Ascend AI vector direct memory writes) to perform remote peer transfers, reducing overhead compared to traditional DMA-based approaches.
    • Early INT8 quantization reduces inter-NPU communication payloads; for example, each MoE token dispatch carries roughly 7 kB of INT8 data plus a small quantization-scale field per token.
  • Static Buffering:
    • Pre-allocated, double-buffered memory regions of size $\text{buffer\_size} = \text{rank\_num} \times \text{max\_tokens} \times \text{msg\_size}$, where $\text{msg\_size}$ shrinks thanks to quantization, enable efficient, contention-free dynamic routing of tokens to experts.

MoE dispatch/combine achieves 103–131 GB/s bandwidth over UB, outperforming contemporary InfiniBand (H800) configurations. Microbatch-based pipelining further maximizes concurrency between compute and communication.
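
As a concrete illustration of the static buffer sizing above, the short sketch below evaluates buffer_size = rank_num × max_tokens × msg_size for a hypothetical EP320 configuration; the per-token payload (about 7 kB of INT8 data plus a small scale field) matches the dispatch figure quoted earlier, while max_tokens and the double-buffering factor are illustrative assumptions.

```python
# Hypothetical sizing of the pre-allocated dispatch buffers (not production code).
def dispatch_buffer_bytes(rank_num: int, max_tokens: int, msg_size: int,
                          double_buffered: bool = True) -> int:
    """buffer_size = rank_num * max_tokens * msg_size, doubled for double buffering."""
    size = rank_num * max_tokens * msg_size
    return 2 * size if double_buffered else size

# Illustrative numbers: 320 EP ranks, up to 128 in-flight tokens per rank,
# ~7 KiB of INT8 payload plus an assumed 16 B of quantization scales per token.
msg_size = 7 * 1024 + 16
total = dispatch_buffer_bytes(rank_num=320, max_tokens=128, msg_size=msg_size)
print(f"{total / 2**20:.1f} MiB reserved under these assumptions")   # ~561 MiB
```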

4. Hardware-Aware Optimization and Quantized Inference

CloudMatrix-Infer incorporates hardware-specific operator and kernel optimizations:

  • Fused Operators:
    • Layers such as RMSNorm, projections, and RoPE are combined into larger operator kernels, mitigating kernel launch overhead and maximizing per-core utilization.
  • KV Cache Optimization:
    • KV cache is stored in the Ascend NPUs’ native NZ layout, bypassing explicit format conversions.
  • Distributed Caching (EMS) via UB:
    • KV and model block caches are managed by a cluster-wide, DRAM+SSD-based key-value store accessed over the UB; a single shared copy serves all instances (1× DRAM overhead versus 8× for per-instance local replication).
  • INT8 Quantization:
    • Key matrix operations (FFN, dense projections, attention) are quantized to INT8 using an adaptive scale search, $s^* = \arg\min_s \| Q(W \cdot s)\,(s^{-1} X) - W X \|$, with per-token activation and per-channel weight granularity.
    • Block-wise clipping with parameter $\alpha$ minimizes the block error $\min_\alpha \| \mathrm{Block}(X; W) - \mathrm{Block}(X; Q(W; \alpha)) \|$.
    • Mixed-precision policy retains BF16/FP32 for sensitive operations.

INT8 quantization yields a 4–5× throughput improvement over full-precision execution with negligible impact on LLM accuracy across a range of benchmarks, including MMLU and HumanEval.
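
The adaptive scale search can be sketched in a few lines of NumPy. The toy below sweeps a one-parameter family of per-input-channel scales $s$ and keeps the candidate minimizing $\| Q(W \cdot s)(s^{-1} X) - W X \|$; it omits the block-wise clipping and mixed-precision handling described above and is not the production quantizer.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> np.ndarray:
    """Per-output-channel symmetric INT8 quantize + dequantize of a weight matrix."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0 + 1e-12
    return np.clip(np.round(w / scale), -127, 127) * scale

def adaptive_scale_search(W: np.ndarray, X: np.ndarray, grid=np.linspace(0.2, 2.0, 10)):
    """Toy search for s* = argmin_s || Q(W.diag(s)) (diag(s)^-1 X) - W X ||.

    Candidate scales are a single power of the mean activation magnitude, to
    keep the sketch short; real implementations search richer families.
    """
    act_mag = np.abs(X).mean(axis=1) + 1e-12        # per input channel
    ref = W @ X                                     # full-precision reference
    best_err, best_s = np.inf, None
    for alpha in grid:
        s = act_mag ** alpha                        # candidate per-channel scales
        err = np.linalg.norm(quantize_int8(W * s) @ (X / s[:, None]) - ref)
        if err < best_err:
            best_err, best_s = err, s
    return best_s, best_err

rng = np.random.default_rng(0)
W, X = rng.normal(size=(64, 128)), rng.normal(size=(128, 16))
s_star, err = adaptive_scale_search(W, X)
print(err)   # reconstruction error of the best candidate scale
```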

5. System Performance and Scalability

Empirical evaluation using DeepSeek-R1 demonstrates the following performance metrics for CloudMatrix384 Supernode:

| Metric | Value / Result |
| --- | --- |
| Prefill throughput | 6,688 tokens/s/NPU |
| Decode throughput | 1,943 tokens/s/NPU |
| SLO-constrained decode throughput (15 ms) | 538 tokens/s/NPU |
| MoE dispatch bandwidth (UB) | 103–131 GB/s (256–8 way EP) |
| MLA (attention) operator utilization | Up to 65.4% of MCU peak; 84.1% of memory bandwidth |
| Model load time (INT8, DRAM) | ~320 s for 8 instances, 1× DRAM overhead |
| LLM accuracy (INT8) | Within margin of, or matching, full precision across tasks |

The supernode achieves higher compute efficiency (up to 4.45 tokens/s/TFLOPS for prefill, 1.29 tokens/s/TFLOPS for decode) than reported for SGLang on H100 GPUs. Under high KV cache reuse (≥90%), prefill throughput sees a 2.28× increase and TTFT (time-to-first-token) drops by 59%.
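
A quick consistency check on the efficiency figures, assuming each Ascend 910C NPU comprises two compute dies at the 752 INT8 TFLOPS quoted per die (i.e. 1,504 INT8 TFLOPS per NPU, an assumption not stated explicitly above):

```python
# Reported per-NPU throughputs divided by assumed per-NPU INT8 peak compute.
per_npu_tflops = 2 * 752                   # two dies per 910C (assumption)
print(round(6688 / per_npu_tflops, 2))     # 4.45 tokens/s/TFLOPS (prefill)
print(round(1943 / per_npu_tflops, 2))     # 1.29 tokens/s/TFLOPS (decode)
```

Under that assumption, the stated efficiencies follow directly from the per-NPU throughput numbers.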

6. Supernodes and Multiscale Graph Signal Processing

Beyond LLM infrastructure, the term "supernode" is central to multiscale graph signal analysis frameworks wherein each supernode represents a community or subgraph arising from graph partitioning algorithms (e.g., Louvain method, Infomap) (1509.05642). In such frameworks:

  • Partition-based Coarsening: The original graph is decomposed into connected subgraphs; each subgraph $\mathcal{G}^{(k)}$ becomes a supernode, inheriting edges that aggregate the inter-subgraph connections.
  • Filterbank Design: Local Laplacian eigenmodes are computed per subgraph, yielding a biorthogonal basis supporting critical sampling, compression, and denoising of graph signals.
  • Mathematical Formulation: Local Fourier modes for each subgraph, e.g., $\bm{q}_1^{(k)} = \frac{1}{\sqrt{|\mathcal{G}^{(k)}|}} \mathbf{1}$ for local averages, are zero-padded and assembled across the graph.

In graph contexts relevant to CloudMatrix384-scale compute and data networks, supernodes enable adaptive, structure-aware aggregation, supporting information compression and noise suppression in complex, community-rich topologies.
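
The partition-based coarsening and the zero-padded local-averaging modes can be sketched with NetworkX and NumPy as below; louvain_communities (NetworkX ≥ 2.8) stands in for the Louvain/Infomap partitioners mentioned above, and the example graph is arbitrary.

```python
import numpy as np
import networkx as nx

# Sketch: communities -> supernodes, plus the constant local Fourier mode
# q_1^(k) = |G^(k)|^(-1/2) * 1 per subgraph, zero-padded to the full vertex set.
G = nx.karate_club_graph()
n = G.number_of_nodes()
communities = nx.community.louvain_communities(G, seed=0)

local_average_modes = []
for nodes_k in communities:
    idx = sorted(nodes_k)
    q1 = np.zeros(n)
    q1[idx] = 1.0 / np.sqrt(len(idx))      # local average mode, zero-padded
    local_average_modes.append(q1)

# Coarsened (supernode) graph: one vertex per community; edge weights
# aggregate the inter-community edges of the original graph.
node_to_super = {v: k for k, nodes_k in enumerate(communities) for v in nodes_k}
S = nx.Graph()
S.add_nodes_from(range(len(communities)))
for u, v in G.edges():
    ku, kv = node_to_super[u], node_to_super[v]
    if ku != kv:
        w = S[ku][kv]["weight"] + 1 if S.has_edge(ku, kv) else 1
        S.add_edge(ku, kv, weight=w)

print(len(communities), S.number_of_edges())   # e.g. 4 supernodes and their links
```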

7. Summary Table of Features and Impacts

| Component / Technique | Contribution / Impact |
| --- | --- |
| Unified Bus (UB) | High-bandwidth, all-to-all resource coupling |
| Resource disaggregation | Pooling of compute and memory for workload elasticity |
| Peer-to-peer serving | Stateless scheduling, uniform access, elastic scaling |
| Expert parallelism (EP320) | Maximum hardware utilization for MoE LLMs |
| Hardware-aware operator design | Fused kernels, minimized memory transformations |
| Distributed caching (EMS) | Fast, DRAM-based shared model/context cache |
| INT8 quantization | Substantial throughput gains, stable accuracy |
| Multiscale supernodes (graph analysis) | Adaptive, community-based signal decomposition |

CloudMatrix384 Supernode, through tight hardware-software co-design and graph-based abstraction, demonstrates high throughput, elasticity, and efficiency across both practical LLM inference and advanced graph signal processing domains, setting operational baselines for future large-scale AI infrastructure.
