Huawei CloudMatrix Platform
- Huawei CloudMatrix is an integrated AI datacenter platform featuring specialized NPUs, CPUs, and a unified high-bandwidth network for dynamic resource pooling.
- It optimizes large-scale deep learning and cloud workloads through advanced scheduling, elastic training, and hardware-aware operator fusion.
- CloudMatrix enhances operational reliability with real-time anomaly detection, scalable resource allocation, and automated monitoring frameworks.
Huawei CloudMatrix is a production-grade AI datacenter and cloud management platform designed to address computational, reliability, and optimization needs at hyperscale. It spans hardware, software, and algorithmic innovations supporting large-scale deep learning, real-time inference, cloud resource allocation, observability, and operational automation. CloudMatrix integrates high-bandwidth hardware interconnects, dynamic pooling of compute and memory resources, advanced scheduling algorithms, and monitoring frameworks to support demanding workloads such as LLM serving, elastic training, and secure operations.
1. CloudMatrix Datacenter Architecture
CloudMatrix features a tightly integrated hardware-software architecture focused on AI training and inference at unprecedented scale. The CloudMatrix384 supernode comprises:
- Core Hardware:
- 384 Ascend 910C NPUs: Each dual-die NPU provides 376 TFLOPS (BF16/FP16), supports INT8 operations, and offers high memory bandwidth (1.6 TB/s per die).
- 192 Kunpeng CPUs: Four CPUs per node (paired with eight NPUs), interconnected in a NUMA mesh, aggregating substantial DRAM for memory pooling.
- Unified Bus (UB) Network: A high-bandwidth, low-latency interconnect that unifies all CPUs and NPUs, enabling direct peer-to-peer (all-to-all) communication across the supernode. Non-blocking two-tier Clos topology ensures near-uniform latency and bandwidth both intra- and inter-node.
- Disaggregated Memory Pool: CPU DRAM across the supernode forms a versatile, UB-accessible memory pool supporting distributed caching for models and key-value (KV) stores.
- Complementary Planes:
- RDMA Plane for scale-out, NPU-to-NPU communication across supernodes and KV cache transfer.
- Virtual Private Cloud (VPC) Plane for connectivity to the broader datacenter network and external storage services.
This design enables dynamic pooling and flexible deployment of compute and memory resources, decoupled from the legacy constraints imposed by node and rack boundaries (a toy pooling sketch follows).
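To make the pooling model concrete, the minimal Python sketch below mimics a UB-accessible memory pool in which any consumer can read a cached block without knowing which CPU's DRAM physically holds it. The `UBMemoryPool` class and its `read`/`write` methods are hypothetical illustrations of the concept, not CloudMatrix's actual interface.

```python
# Hypothetical sketch of a UB-style disaggregated memory pool.
from dataclasses import dataclass, field

@dataclass
class MemorySegment:
    host_cpu: int                               # Kunpeng CPU owning this DRAM segment
    blocks: dict = field(default_factory=dict)  # block_id -> payload

class UBMemoryPool:
    """A flat, supernode-wide address space over per-CPU DRAM segments."""
    def __init__(self, segments):
        self.segments = segments

    def write(self, block_id, payload):
        # Place the block on the least-loaded segment; any NPU may read it
        # later without caring which CPU physically holds it.
        seg = min(self.segments, key=lambda s: len(s.blocks))
        seg.blocks[block_id] = payload
        return seg.host_cpu

    def read(self, block_id):
        # Uniform lookup: location is irrelevant to the caller, mirroring
        # the near-uniform latency of the UB interconnect.
        for seg in self.segments:
            if block_id in seg.blocks:
                return seg.blocks[block_id]
        raise KeyError(block_id)

pool = UBMemoryPool([MemorySegment(host_cpu=c) for c in range(4)])
pool.write("kv:req42:layer0", b"...")   # cached by whichever CPU has room
assert pool.read("kv:req42:layer0") == b"..."
```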
2. Advanced LLM Serving: CloudMatrix-Infer
CloudMatrix-Infer is the LLM serving solution built to exploit CloudMatrix’s hardware capabilities. Its three core innovations are:
- Peer-to-Peer Disaggregation of Prefill, Decode, and Caching (PDC):
- The inference process is decomposed into specialized clusters for prefill, decode, and caching. Each cluster is independently scalable and mapped to its own set of (potentially overlapping) NPUs and CPUs.
- The UB network enables stateless, flexible scheduling and uniform access to the distributed cache, eliminating the rigid data-locality and affinity constraints seen in conventional KV-cache-centric GPU clusters (a scheduling sketch follows this item).
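The practical consequence of uniform cache access is that the scheduler can ignore KV-cache placement entirely. The toy Python sketch below (all names hypothetical) dispatches requests to independently scaled prefill and decode workers purely by load, a policy that would be unsafe in a locality-bound cluster:

```python
# Illustrative sketch of cache-location-independent scheduling under PDC
# disaggregation: workers are picked by load alone, since the UB-attached
# cache pool is reachable from every NPU at near-uniform cost.
import heapq

class Cluster:
    def __init__(self, name, workers):
        self.name = name
        # (current_load, worker_id) min-heap: least-loaded worker on top.
        self.heap = [(0, w) for w in workers]
        heapq.heapify(self.heap)

    def dispatch(self, cost):
        load, worker = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (load + cost, worker))
        return worker

prefill = Cluster("prefill", workers=range(8))
decode  = Cluster("decode",  workers=range(8, 16))

def serve(request):
    # Prefill and decode scale independently; neither step consults
    # KV-cache location.
    p = prefill.dispatch(cost=request["prompt_tokens"])
    d = decode.dispatch(cost=request["max_new_tokens"])
    return p, d

print(serve({"prompt_tokens": 4096, "max_new_tokens": 256}))
```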
- Large-scale Expert Parallelism (LEP/EP320):
- Supports expert parallelism degrees up to EP320 (320 experts each mapped to an NPU die), critical for Mixture-of-Experts (MoE) models.
- Efficient all-to-all token dispatch between routing and expert nodes leverages the fused operators "FusedDispatch" and "FusedCombine", which minimize communication and synchronization overhead by directly utilizing NPU vector cores and overlapping computation with communication (a toy dispatch sketch follows).
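As a rough illustration of what the dispatch step computes, the sketch below performs top-k gating and buckets tokens per expert; it is a plain NumPy stand-in for the fused, communication-overlapped operators described above, not their implementation:

```python
# Toy MoE token dispatch at high expert parallelism: top-k gating followed
# by per-expert bucketing that an all-to-all exchange would ship to the
# NPU die hosting each expert.
import numpy as np

def dispatch(tokens, gate_logits, top_k=2):
    # tokens: (n, d); gate_logits: (n, num_experts)
    n, num_experts = gate_logits.shape
    topk = np.argsort(gate_logits, axis=1)[:, -top_k:]   # (n, top_k)
    buckets = {e: [] for e in range(num_experts)}        # expert -> token ids
    for t in range(n):
        for e in topk[t]:
            buckets[e].append(t)
    # In practice this becomes one all-to-all message per (rank, expert die).
    return {e: tokens[ids] for e, ids in buckets.items() if ids}

rng = np.random.default_rng(0)
out = dispatch(rng.normal(size=(16, 8)), rng.normal(size=(16, 320)))
print({e: v.shape for e, v in list(out.items())[:3]})
```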
- Hardware-Aware Optimizations:
- Customized operator fusion (e.g., for attention and normalization), microbatch-based pipelining that adapts to computational heterogeneity, and hierarchical INT8 quantization.
- Optimized quantization is achieved via adaptive scale search, outlier suppression, and block-level error compensation without requiring retraining, enabling high efficiency while preserving model accuracy (a simplified block-quantization sketch follows).
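A much-simplified sketch of the block-wise idea follows: each weight block receives its own INT8 scale, chosen by a small grid search that minimizes reconstruction error. The block size, search grid, and MSE objective here are illustrative assumptions, not the published recipe:

```python
# Block-wise INT8 quantization with a toy adaptive scale search.
import numpy as np

def quantize_block(x, grid=np.linspace(0.8, 1.2, 9)):
    base = np.max(np.abs(x)) / 127.0 + 1e-12
    best_s, best_err = base, np.inf
    for g in grid:                       # search scales around the abs-max scale
        s = base * g
        q = np.clip(np.round(x / s), -127, 127)
        err = np.mean((q * s - x) ** 2)  # reconstruction error for this scale
        if err < best_err:
            best_s, best_err = s, err
    q = np.clip(np.round(x / best_s), -127, 127).astype(np.int8)
    return q, best_s

def quantize(weights, block=128):
    flat = weights.reshape(-1)
    out = np.empty_like(flat, dtype=np.int8)
    scales = []
    for i in range(0, flat.size, block):
        q, s = quantize_block(flat[i:i + block])
        out[i:i + block] = q
        scales.append(s)
    return out.reshape(weights.shape), np.array(scales)

w = np.random.default_rng(1).normal(size=(256, 128)).astype(np.float32)
q, s = quantize(w)
print("max abs error:", np.max(np.abs(q.reshape(-1, 128) * s[:, None] - w.reshape(-1, 128))))
```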
3. Performance Evaluation
Evaluation on the DeepSeek-R1 model (671B parameters, INT8 quantized) with 256 Ascend 910C NPUs demonstrates:
- Prefill Throughput: Up to 6,688 tokens/s per NPU (4,096-length input, batch size 16,384), or 4.45 tokens/s/TFLOPS, which surpasses contemporary SGLang/H100 benchmarks.
- Decode Throughput: 1,943 tokens/s per NPU (<50 ms Time Per Output Token, TPOT). At 15 ms TPOT, throughput is 538 tokens/s per NPU.
- Scalability: Maintains high efficiency as expert parallelism and batch size increase.
- Accuracy: INT8 quantization preserves model accuracy across 16 representative benchmarks, matching or exceeding publicly reported DeepSeek results.
These results are attributed to the high-bandwidth UB network (≤3% inter-node bandwidth degradation, sub-2 μs latency), fused and cache-efficient software operators, and scalable distributed caching.
4. Resource Scheduling and Allocation in CloudMatrix
CloudMatrix extends beyond AI acceleration into general-purpose cloud resource management and elasticity:
- Elastic Training Resource Allocation (2109.03389):
- Utilizes a mixed-integer programming (MIP) model for dynamic node allocation to deep learning jobs.
- The allocator is designed for near-real-time operation (<0.4 s per decision) and leverages rolling-horizon planning to maximize overall training progress, reduce queueing, and accelerate job completion.
- Empirical results in Huawei ModelArts show up to 32% lower queueing time and 24% higher throughput than greedy baselines (a toy MIP sketch follows).
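The toy model below, written with the open-source PuLP solver library rather than Huawei's production solver, illustrates the flavor of such a MIP: integer node counts per job, admission binaries, and per-job scaling bounds. The job data and the linear progress objective are invented for illustration:

```python
# Toy elastic-allocation MIP: assign identical nodes to jobs to maximize
# total marginal training progress, subject to capacity and scaling bounds.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, PULP_CBC_CMD

# job -> (min_nodes, max_nodes, progress_per_node); values are invented.
jobs = {
    "llm_pretrain": (4, 16, 1.0),
    "finetune_a":   (1, 4, 2.5),
    "finetune_b":   (0, 4, 1.8),
}
total_nodes = 16

prob = LpProblem("elastic_alloc", LpMaximize)
x = {j: LpVariable(f"x_{j}", lowBound=0, upBound=jobs[j][1], cat="Integer")
     for j in jobs}                                            # nodes for job j
run = {j: LpVariable(f"run_{j}", cat="Binary") for j in jobs}  # admit job j?

prob += lpSum(jobs[j][2] * x[j] for j in jobs)        # total training progress
prob += lpSum(x[j] for j in jobs) <= total_nodes      # cluster capacity
for j, (lo, hi, _) in jobs.items():
    prob += x[j] >= lo * run[j]   # if admitted, respect the job's minimum
    prob += x[j] <= hi * run[j]   # if not admitted, the job gets zero nodes

prob.solve(PULP_CBC_CMD(msg=0))
print({j: int(x[j].value()) for j in jobs})
```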
- Dynamic Vector Bin Packing for Resource Efficiency (2205.08769):
- Applies a data reduction algorithm that shrinks resource allocation instances by over an order of magnitude (for ε=0.02) at the cost of at most 2% optimality loss.
- Enables near-optimal or high-quality approximate scheduling over CPU, memory, and other resource vectors, which is critical for large, dynamic workloads (a rounding-based reduction sketch follows).
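The core reduction idea can be sketched in a few lines: snap each demand vector onto a coarse multiplicative grid so that near-identical requests collapse into a single instance type, then schedule types rather than raw items. The rounding scheme below is an illustrative simplification of the cited algorithm:

```python
# Data-reduction sketch: epsilon-grid rounding of demand vectors.
from collections import Counter
import math

def round_vec(vec, eps=0.02):
    # Snap each coordinate up to the nearest power of (1 + eps); requests
    # that differ by less than ~eps per dimension become the same type.
    return tuple((1 + eps) ** math.ceil(math.log(v, 1 + eps)) if v > 0 else 0.0
                 for v in vec)

def reduce_instance(items, eps=0.02):
    return Counter(round_vec(v, eps) for v in items)  # type -> multiplicity

items = [(0.250, 0.101), (0.251, 0.100), (0.249, 0.102), (0.500, 0.300)]
types = reduce_instance(items)
print(f"{len(items)} items -> {len(types)} types")
```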
- Probabilistic Capacity Allocation (2209.08820):
- Implements a multi-class queueing network model utilizing diffusion approximations of stochastic offered load.
- Heuristic, pooling-based capacity allocation can reduce resource requirements by 20% relative to siloed reservation while improving SLA compliance and utilization.
- Empirically validated on production Huawei CloudMatrix workloads (a worked pooling example follows).
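The intuition behind the pooling gain can be shown with a back-of-the-envelope normal approximation: provision each class at its mean load plus a safety factor times its standard deviation, and note that pooled variance grows sub-additively. The loads below are invented for illustration:

```python
# Siloed vs. pooled capacity under a normal (diffusion-style) load model.
import math

classes = [(100, 30), (80, 25), (60, 20)]   # (mean load, std dev) per class
z = 2.33                                    # ~99th-percentile safety factor

siloed = sum(m + z * s for m, s in classes)
pooled_mean = sum(m for m, _ in classes)
pooled_std = math.sqrt(sum(s * s for _, s in classes))  # independent classes
pooled = pooled_mean + z * pooled_std

print(f"siloed:  {siloed:.1f}")
print(f"pooled:  {pooled:.1f}")
print(f"savings: {100 * (1 - pooled / siloed):.1f}%")
```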
5. Operational Observability and Reliability
CloudMatrix incorporates advanced monitoring and reliability frameworks:
- Functional Cluster Inference with Prism (2308.07638):
- Prism enables non-intrusive inference of functional clusters among instances by combining coarse trace-based partitioning (Jaccard similarity, LSH/MinHash) with fine-grained metric-based agglomerative clustering (dynamic time warping of resource usage patterns).
- Validated by a v-measure of ~0.95 on internal datasets, and supports production use cases such as identifying vulnerable placements (e.g., all replicas on one host) and aggregating latent faults (a simplified two-stage sketch follows).
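A drastically simplified sketch of Prism's two stages follows: coarse grouping by Jaccard similarity of call traces, refined by distance between resource-usage series. The refinement here uses plain Euclidean distance where Prism uses MinHash/LSH and dynamic time warping; all data and thresholds are illustrative:

```python
# Two-stage functional clustering sketch: trace similarity, then metrics.
import numpy as np

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(instances, trace_thr=0.5, metric_thr=2.0):
    groups = []
    for name, (trace, usage) in instances.items():
        for g in groups:
            ref_trace, ref_usage = instances[g[0]]
            if (jaccard(trace, ref_trace) >= trace_thr and
                    np.linalg.norm(usage - ref_usage) <= metric_thr):
                g.append(name)   # coarse and fine stages both agree
                break
        else:
            groups.append([name])
    return groups

instances = {
    "inst-1": ({"svcA", "svcB"}, np.array([0.2, 0.8, 0.3])),
    "inst-2": ({"svcA", "svcB"}, np.array([0.25, 0.75, 0.35])),
    "inst-3": ({"svcC"},         np.array([0.9, 0.1, 0.9])),
}
print(cluster(instances))   # inst-1 and inst-2 form one functional cluster
```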
- Anomaly Detection with CMAnomaly (2308.09937):
- Employs "Collaborative Machine" (Editor's term) architecture for fast, unsupervised detection of anomalies in multivariate monitoring metrics.
- Linear-time computation via factorization machine reformulations enables processing of high-frequency data with thousands of metrics.
- Demonstrates a superior F1 score (0.9494, +6.77% to +10.68% over baselines) and strong scalability; it is already deployed for troubleshooting Huawei Cloud online services. The linear-time scoring trick is sketched below.
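The linear-time claim rests on the standard factorization-machine identity, which collapses the sum over all pairwise feature interactions from O(d²) to O(d). The sketch below verifies the identity with random stand-in weights; it is not CMAnomaly's trained model:

```python
# Factorization-machine reformulation: pairwise interactions in linear time.
import numpy as np

rng = np.random.default_rng(0)
d, k = 200, 8                        # number of metrics, latent dimension
V = rng.normal(size=(d, k))          # one latent vector per metric (stand-in)
x = rng.normal(size=d)               # one timestamp of monitoring metrics

# Naive O(d^2 * k) sum over all pairwise metric interactions.
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(d) for j in range(i + 1, d))

# Equivalent O(d * k) reformulation: the key to scoring high-frequency
# data with thousands of metrics.
vx = V.T @ x                                            # sum_i v_i * x_i
fast = 0.5 * (vx @ vx - np.sum((V * x[:, None]) ** 2))

assert np.isclose(naive, fast)
print(f"interaction score: {fast:.4f}")
```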
6. Application in Communications, AI, and Serverless Cloud
CloudMatrix integrates into 5G and serverless cloud architectures:
- 5G Network Localization and Optimization (1806.07447):
- Supports centralized, CSI-fingerprint-based localization for massive MIMO, using spatial covariance fingerprinting and learning-based regression (e.g., extreme learning machines).
- High spatial resolution and scalability are enabled by CloudMatrix's ability to aggregate and process large volumes of multi-site CSI data, automate retraining, and support network management functions such as beamforming and scheduling (a toy fingerprint-matching sketch follows).
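The sketch below illustrates the fingerprinting pipeline on synthetic data: multi-antenna CSI snapshots are summarized as a spatial covariance matrix and matched by nearest neighbor, whereas the cited work learns a regression such as an extreme learning machine over such fingerprints. The channel model and survey grid are invented:

```python
# Covariance-fingerprint localization on a toy single-path channel.
import numpy as np

def fingerprint(csi):
    # csi: (snapshots, antennas) complex matrix; the fingerprint is the
    # vectorized spatial covariance estimate.
    r = csi.conj().T @ csi / csi.shape[0]
    return np.concatenate([r.real.ravel(), r.imag.ravel()])

rng = np.random.default_rng(0)

def synth_csi(angle, snapshots=64, antennas=16):
    # Toy uplink channel arriving from direction `angle` (radians).
    steer = np.exp(1j * np.pi * np.arange(antennas) * np.sin(angle))
    gains = rng.normal(size=(snapshots, 1)) + 1j * rng.normal(size=(snapshots, 1))
    return gains * steer

db_angles = np.linspace(-1.0, 1.0, 21)               # labeled survey points
db = np.stack([fingerprint(synth_csi(a)) for a in db_angles])

query = fingerprint(synth_csi(0.33))                 # unlabeled measurement
best = db_angles[np.argmin(np.linalg.norm(db - query, axis=1))]
print(f"estimated direction: {best:.2f} rad")        # nearest grid point to 0.33
```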
- Serverless Workload Characterization and Load Management (2312.10127):
- Empirical studies on Huawei's Function-as-a-Service (FaaS) workloads reveal high variability in resource demands, cold start times, and popularity.
- CloudMatrix's resource disaggregation and unified scheduling align with the need for fine-grained adaptive autoscaling, overcommitment management, and predictive resource allocation in heterogeneous, bursty serverless environments (a toy overcommitment example follows).
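As a toy illustration of quantile-based overcommitment under bursty demand, the sketch below reserves capacity for the 99th percentile of per-function concurrency rather than its observed peak, trading a bounded throttling risk for packing density. The traces and policy are invented, not Huawei's production mechanism:

```python
# Quantile-based reservation vs. peak-based reservation for FaaS functions.
import numpy as np

rng = np.random.default_rng(0)
# Heavy-tailed per-function concurrency samples, as FaaS studies observe.
functions = {f"fn-{i}": rng.pareto(2.0, size=500) + 1 for i in range(5)}

def reservation(samples, q=0.99):
    return float(np.quantile(samples, q))

peak_based = sum(s.max() for s in functions.values())
quantile_based = sum(reservation(s) for s in functions.values())
print(f"peak-based slots:    {peak_based:.1f}")
print(f"p99-based slots:     {quantile_based:.1f}")
print(f"overcommit headroom: {100 * (1 - quantile_based / peak_based):.0f}%")
```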
7. Security Considerations
Security practices for CloudMatrix, particularly as related to cloud-integrated FPGAs (2005.04867), involve:
- Isolation of FPGA instances (single-tenant per FPGA) and shell-managed access control, in line with industry best practice.
- Runtime and static checks are applied, but limitations remain around bitstream encryption, remote attestation, and user IP confidentiality.
- The evolving landscape of side-channel, fault-injection, and covert-channel attacks necessitates ongoing research and system hardening.
CloudMatrix thus represents a comprehensive, production-ready platform for large-scale cloud, AI, and communications workloads. It combines novel hardware, resource orchestration, observability, and operational methodologies, with empirically validated gains in efficiency, reliability, and adaptability across diverse applications and deployment scenarios.