
Huawei Ascend 910C NPUs

Updated 28 January 2026
  • Huawei Ascend 910C NPUs are specialized AI accelerators defined by high-density compute and extensive memory bandwidth for efficient deep learning workloads.
  • They integrate multiple AI cores with matrix-multiply cube units, wide vector SIMD engines, and hierarchical memory to optimize tensor operations and data movement.
  • They support large-scale training via distributed system co-design, kernel innovations, and quantization-aware optimizations for multi-modal and language models.

Huawei Ascend 910C NPUs are advanced domain-specific AI accelerators within the DaVinci family, architected to provide high-density compute and memory bandwidth for deep learning workloads at scale. They are pivotal in large-scale language modeling, vision transformers, multi-modal model training, and domain-optimized kernel development, offering extensive architectural support for high arithmetic intensity, fine-grained parallelism, and flexible mixed-precision execution. The 910C has been widely studied in the context of code generation frameworks, distributed system co-design, quantization-aware kernel optimizations, and system-level partitioning for sparse and dense models.

1. Architecture and Microarchitectural Features

The Ascend 910C NPU integrates multiple AI cores, each composed of a high-throughput matrix-multiply cube unit (“AIC”), wide vector SIMD engines (“AIV”), scalar units, and multi-level scratchpad memories. A representative single-processor instantiation offers:

  • AI Cores: 8 (some sources report 32+ per chip in dual-die variants), each with:
    • 256×256 Tensor (Cube) array
    • 8×8-wide vector pipelines and 8 scalar units
  • Peak Throughput:
    • 400 TFLOPS (BF16/FP16), 100 TFLOPS (FP32), 800 TOPS (INT8) for tensor ops
    • Cube: up to 8 TFLOPS/core (FP16); vector: 48 GFLOPS/lane (FP32)
  • On-Chip Memory:
    • L1 Buffer: 128–512 KB/core
    • L2 Cache: 32 MB shared, 1 TB/s aggregate on-chip internal bandwidth
    • L0A/L0B/L0C specialized buffers for cube operand staging and partial sums
    • Unified Buffer (UB): 196 KB/core for low-latency tile operations
  • Global Memory:
    • HBM2e, up to 64 GB with 1.8 TB/s bandwidth per chip; per-die HBM on earlier revisions 32–64 GB @ ~1 TB/s
  • Interconnects:
    • 512 GB/s bidirectional mesh network across AI cores
    • PCIe 4.0 ×16 (64 GB/s) for host-NPU communication
    • Distributed HCCL (Huawei Collective Communication Library) over PCIe or TCP for multi-node clusters

Memory transfer engines facilitate fast DMA among global memory, UB, and L0/L1s, supporting overlapping of compute and data movement. Scalar controllers for each compute unit enable independent instruction streams, critical for latency hiding, pipelining, and operator fusion (Cao et al., 12 Jan 2026, Chen et al., 15 Sep 2025, Lin et al., 2024, Tang et al., 7 May 2025, He et al., 23 Jan 2026).
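The headline figures above support a quick back-of-the-envelope roofline check. A minimal sketch in Python, assuming the 400 TFLOPS FP16 peak and 1.8 TB/s HBM bandwidth quoted in the spec list (the helper function and its defaults are illustrative, not a vendor API):

```python
def roofline_bound(flops, bytes_moved,
                   peak_tflops=400.0,   # FP16 tensor peak from the spec list
                   hbm_tbps=1.8):       # HBM2e bandwidth per chip
    """Classify a kernel as compute- or bandwidth-bound on the quoted specs."""
    intensity = flops / bytes_moved                    # arithmetic intensity, FLOP/byte
    ridge = (peak_tflops * 1e12) / (hbm_tbps * 1e12)   # roofline ridge point, FLOP/byte
    kind = "compute-bound" if intensity >= ridge else "bandwidth-bound"
    return kind, intensity

# FP16 GEMM, M=N=K=4096: 2*M*N*K FLOPs; three matrices of 2-byte elements
m = n = k = 4096
flops = 2 * m * n * k
bytes_moved = 2 * (m * k + k * n + m * n)
bound, ai = roofline_bound(flops, bytes_moved)
```

At these sizes the GEMM sits well above the ~222 FLOP/byte ridge point, consistent with the cube units being the limiting resource for dense FP16 matmul rather than HBM bandwidth.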

2. Kernel and Operator Design Methodologies

High-performance kernel deployment on the 910C requires both hardware-specific scheduling and DSL optimization, due to its decoupled compute architecture:

  • Cube-Core (AIC) and Vector-Core (AIV) Separation: AIC primarily handles large-tile GEMM (16×16 or 8×8) with double-buffering; AIV is used for element-wise ops, dequantization, reductions, precision casts, and vectorized arithmetic.
  • Tiling and Double-Buffering: Kernels (e.g., FP16 GEMM) are structured to tile matrices so that each tile fits in UB/L0A/L0B, with asynchronous fetch and computation. Example pseudocode shows pointer-based staging, double buffering on UB, and overlapped compute/transfer (Cao et al., 12 Jan 2026).
  • Split-K Parallelization and Three-Phase Pipelines: In W4A16 quantized kernel design, input is split along K, vector cores perform tile-wise dequantization (INT4→FP16), cube cores process tiled GEMM, and the final vector-core stage reduces partial sums (He et al., 23 Jan 2026).
  • Event-Driven Synchronization: Compute units rely on hardware event signaling from MTEs to maximize occupancy; instruction streams remain independent until data dependencies are resolved.
  • Chain-of-Thought (CoT) Reasoning: Automated kernel generation leverages explicit CoT traces (e.g., for vectorization, predicate handling, or synchronization), which are vital for LLM-based DSL synthesis (Cao et al., 12 Jan 2026).

Operator-level fusion (e.g., fused Softmax, RMSNorm, AdamW), MatMul replacements for inefficient convolutions, and sparse MoE token routing are deployed to maximize L1/L2 locality and keep computational intensity high (Chen et al., 15 Sep 2025).
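The split-K, three-phase W4A16 pipeline described above can be emulated functionally. A NumPy sketch, with array slicing standing in for the vector-core dequantization, cube-core GEMM, and final reduction stages (the tile counts, shapes, and 0.5 scale are illustrative, and NumPy is only a functional stand-in for the on-chip engines):

```python
import numpy as np

def w4a16_splitk_matmul(x_fp16, w_int4, scale, num_splits=4):
    """Three-phase split-K sketch: per-slice INT4->FP16 dequant (vector cores),
    per-slice GEMM (cube cores), then reduction of partial sums (vector cores)."""
    K = x_fp16.shape[1]
    partials = []
    for ks in np.array_split(np.arange(K), num_splits):
        w_tile = w_int4[ks, :].astype(np.float16) * scale   # dequant stage
        partials.append(x_fp16[:, ks] @ w_tile)             # cube GEMM stage
    return np.sum(partials, axis=0)                         # reduction stage

# Integer-valued inputs and a power-of-two scale keep FP16 arithmetic exact here.
x = np.random.randint(-2, 3, size=(8, 64)).astype(np.float16)
w_q = np.random.randint(-8, 8, size=(64, 16), dtype=np.int8)  # INT4 value range
out = w4a16_splitk_matmul(x, w_q, scale=np.float16(0.5))
ref = x.astype(np.float32) @ (w_q.astype(np.float32) * 0.5)
```

The split along K is what lets the dequant, GEMM, and reduction phases of different slices overlap in the real kernel; here the loop is sequential but the data flow is the same.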

3. Performance Metrics, Bottlenecks, and Benchmarks

Comprehensive evaluation on 910C leverages:

  • Model FLOPS Utilization (MFU): $\text{MFU} = \frac{\text{achieved FLOPS}}{\text{peak FLOPS}}$, used to capture sustained efficiency. MFU values of 30–40% are typical for large-scale MoE and multimodal models, approaching the hardware roofline with optimal kernel and communication scheduling (Tang et al., 7 May 2025, Chen et al., 15 Sep 2025).
  • Compute and Memory Utilization:
    • $U_{\text{compute}} = \frac{\text{measured FLOPS}}{\text{peak FLOPS}}$
    • $U_{\text{mem}} = \frac{\text{data transfer volume}}{\text{max memory bandwidth} \times \text{execution time}}$
    • (Cao et al., 12 Jan 2026)
  • Throughput (Tokens/s): Achieved training tokens per second per NPU, with robust scaling to >1M tokens/s in 6,000-NPU MoE runs and >2M tokens/s in 128-NPU multimodal clusters (Tang et al., 7 May 2025, Chen et al., 15 Sep 2025).
  • Operator-level Speedups: FlashAttention remapped via FastAttention achieves up to 10.7× operator-level gains, with 5.16× higher LLaMA-7B throughput against standard attention, and >1.4× end-to-end speedup in 8-way 910C setups (Lin et al., 2024).
  • Kernel Success Rate (Pass@10): LLM-generated kernels see compilation success rates on complex patterns (Level-2) rising from 0% (zero-shot) to 95.49% after RL-based fine-tuning, with functional correctness climbing to 64.29% (Cao et al., 12 Jan 2026).
  • Scaling Efficiency: Strong scaling remains above 94% from 1 to 128 NPUs for distributed multimodal training (Chen et al., 15 Sep 2025).

Bottlenecks include memory bandwidth in quantized kernels—W4A16 speedup is capped at ~1.48× due to extra global memory traversals for dequantized weights (He et al., 23 Jan 2026)—and inter-device communication, addressed via hierarchical AllToAll protocols and adaptive overlapping (Tang et al., 7 May 2025).
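The utilization metrics defined above translate directly into helper functions; the function names below are ours, not part of any Ascend toolchain:

```python
def mfu(achieved_flops, peak_flops):
    """Model FLOPS Utilization: sustained achieved FLOPS over peak."""
    return achieved_flops / peak_flops

def u_compute(measured_flops, peak_flops):
    """Compute utilization, per the definition in the text."""
    return measured_flops / peak_flops

def u_mem(bytes_transferred, peak_bandwidth_bps, exec_time_s):
    """Fraction of the bandwidth roofline used over a kernel's runtime."""
    return bytes_transferred / (peak_bandwidth_bps * exec_time_s)

# Example: a run sustaining 140 TFLOPS against the 400 TFLOPS FP16 peak
example_mfu = mfu(140e12, 400e12)   # 0.35, inside the typical 30-40% band
```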

4. Large-Scale Distributed Training and System Co-Design

The 910C is central in massive distributed workloads, including trillion-parameter sparse Mixture-of-Experts (MoE) and dense multimodal models:

  • Cluster Topologies: Training is conducted on clusters up to 6,000 NPUs, wired via PCIe and HCCL, using hierarchical communication—inter-node AllGather (lightweight) and intra-node AllToAll—to distribute tokens and expert activations with minimal non-overlapped traffic (Tang et al., 7 May 2025).
  • Training Stack: MindSpeed-MLLM sits atop CANN and MindSpeed-Core/LLM/MM, providing fusion libraries, rematerialization, checkpoint/restart machinery, and hybrid 3D parallelism (data, tensor, and pipeline) (Chen et al., 15 Sep 2025).
  • Memory Management: Immediate recomputation (fine-grained, not whole-layer), tensor swapping to host, asynchronous prefetch, and kernel fusion reduce the on-NPU memory peak and keep 32–64 GB footprints viable for ultra-large models (Tang et al., 7 May 2025).
  • Quantization and Mixed-Precision: Hardware-accelerated BF16, FP16, INT8; software-optimized W4A16 has trade-offs between memory footprint and bandwidth. Dequantization is pipelined but limited by global memory (He et al., 23 Jan 2026).
  • Inference Strategies: Model weight averaging and per-sample grid search for resolution enhance modal robustness at inference, with multicore batching and AllReduce for quick dissemination (Chen et al., 15 Sep 2025).
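The hierarchical inter-node/intra-node routing described above for MoE token dispatch can be sketched as a two-phase grouping. All topology constants below (experts per node, NPUs per node) and the placement rule are illustrative assumptions, not Huawei's actual cluster layout:

```python
from collections import defaultdict

def hierarchical_dispatch(token_experts, experts_per_node, npus_per_node=8):
    """Two-level routing sketch: tokens are first grouped by destination
    node (the lightweight inter-node phase), then fanned out to the owning
    NPU inside that node (the intra-node AllToAll phase)."""
    inter_node = defaultdict(list)              # phase 1: node-level grouping
    for tok, expert in token_experts:
        inter_node[expert // experts_per_node].append((tok, expert))
    placement = defaultdict(list)               # phase 2: intra-node scatter
    for node, items in inter_node.items():
        for tok, expert in items:
            local_npu = (expert % experts_per_node) % npus_per_node
            placement[(node, local_npu)].append(tok)
    return dict(placement)

# Tokens 0 and 2 share expert 5 (node 0); token 1 targets expert 12 (node 1).
routes = hierarchical_dispatch([(0, 5), (1, 12), (2, 5)], experts_per_node=8)
```

Grouping by node first is what keeps the expensive cross-node traffic to one coalesced transfer per destination node, leaving the finer-grained scatter to the faster intra-node links.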

5. Specialized Algorithms and Software Optimizations

Several algorithmic adaptations are required for NPU efficiency:

  • Multi-Scale Deformable Attention (MSDA): Grid sampling is memory-bound and irregular; the solution involves per-channel UB gathering, DMA-aligned bursts, padding/realignment for type-unaligned data, staggered writeback to mitigate MTE3 contention, and adaptive vector length selection, yielding up to 7.3× end-to-end speedup over the grid_sample baseline (Huang et al., 20 May 2025).
  • FlashAttention Port (FastAttention): A two-level tiling scheme partitions attention into L1 (large tile) and L0 (cube tile) fits, drastic DRAM access reduction, and master-mask tricks to shrink mask memory. AllReduce on tiles is overlapped with computation to reduce communication overhead, enabling ultra-long sequence attention at high throughput (Lin et al., 2024).
  • LLM-based Kernel Generation: AscendKernelGen employs documentation/code-based chain-of-thought data, supervised fine-tuning, and RL with execution feedback on real hardware, leveraging NPUKernelBench for direct metric evaluation. Generated kernels close and sometimes surpass the performance gap with hand-optimized vendor kernels on Level-1/2 operators (Cao et al., 12 Jan 2026).
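The core of the FastAttention port is tile-wise attention with an online softmax. A single-head NumPy sketch of that inner loop (the two-level L1/L0 tiling, mask compression, and AllReduce overlap from the paper are omitted for brevity):

```python
import numpy as np

def tiled_attention(q, k, v, tile=32):
    """Attention computed tile-by-tile over the key/value sequence with an
    online softmax, so no full n-by-n score matrix is ever materialized."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    m = np.full(n, -np.inf)          # running row-wise max of scores
    l = np.zeros(n)                  # running softmax denominator
    for s in range(0, k.shape[0], tile):
        scores = (q @ k[s:s + tile].T) * scale
        m_new = np.maximum(m, scores.max(axis=1))
        alpha = np.exp(m - m_new)                  # rescale old accumulators
        p = np.exp(scores - m_new[:, None])
        l = l * alpha + p.sum(axis=1)
        out = out * alpha[:, None] + p @ v[s:s + tile]
        m = m_new
    return out / l[:, None]
```

The running max and denominator are exactly what allow each key/value tile to be staged once through the fast on-chip buffers and then discarded, which is where the DRAM-access reduction comes from.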

6. Best Practices, Practical Guidelines, and Future Implications

Effective kernel and workload deployment on 910C NPUs is governed by several empirically validated guidelines:

  • Start with a modular API and host/kernel template exposing tiling and computation schedules; build chain-of-thought traces from expert kernels to guide LLM or manual implementations (Cao et al., 12 Jan 2026).
  • Prefer split-K and aggressive double-buffering in mixed-precision (e.g., W4A16) for K-dominant GEMMs; overlap vector-based dequantization, cube computation, and data movement (He et al., 23 Jan 2026).
  • In distributed settings, select $(p_d, p_t, p_p)$ dimensions for data/tensor/pipeline parallelism such that the per-stage communication-to-compute ratio $R_{\text{comm/comp}}$ remains below 0.15, balancing memory and bandwidth (Chen et al., 15 Sep 2025).
  • Fuse pointwise and small conv ops into MatMul primitives for full cube-core utilization; apply kernel fusions for point-operations (RMSNorm, AdamW, Softmax) to minimize intermediate buffering (Chen et al., 15 Sep 2025).
  • Use hierarchical AllToAll and fine-grained recomputation for MoE in large clusters; adapt tensor swapping and tight host dispatch to mitigate memory pressure (Tang et al., 7 May 2025).

The software–hardware co-design insight shown here is crucial for future NPU developments, with focus areas including direct AIV→AIC data exchange, larger on-chip scratchpads, and extended cube-core ISAs for fused mixed-precision kernels. A plausible implication is that subsequent generations (or firmware updates) incorporating these features will further close the gap between quantized and native-precision kernel performance.


Key references: (Cao et al., 12 Jan 2026, Huang et al., 20 May 2025, Chen et al., 15 Sep 2025, Lin et al., 2024, Tang et al., 7 May 2025, He et al., 23 Jan 2026)
