Ascend NPUs: Huawei's AI Accelerator
- Ascend NPUs are specialized AI accelerators that integrate heterogeneous compute engines, advanced memory hierarchies, and precision scaling to boost deep learning performance.
- They enable scalable deployments for large-scale models such as LLMs, multimodal transformers, and MoE systems with high throughput and efficient memory utilization.
- Innovative techniques like operator fusion, mixed-precision kernels, and automatic kernel generation support state-of-the-art AI research and production workloads.
Ascend Neural Processing Units (NPUs) are specialized AI accelerators developed by Huawei to deliver high throughput, efficient memory utilization, and scalable deployments for large-scale deep learning workloads, including LLMs, multimodal transformers, and mixture-of-experts (MoE) architectures. The architecture is distinguished by heterogeneous compute engines, sophisticated memory hierarchies, and dedicated support for precision scaling and operator fusion, making it a central platform for leading-edge AI research and production workloads.
1. Hardware Architecture and Computational Model
Ascend NPUs, exemplified by the 910A and 910B series, utilize a cluster of AI cores per chip (e.g., 32 per 910A at 1 GHz), each core integrating distinct compute engines:
- Cube (Tensor) Core: Dedicated for dense matrix multiplication, typically supporting 16×16 FP16 MACs per cycle, achieving hundreds of TFLOPS per chip. In earlier versions (910A), only FP16 is natively supported; later variants (910B) add native FP32 cube support.
- Vector Core: Optimized for elementwise, type-convert, and dequantization ops, managing SIMD and low-rank operations crucial for quantized and mixed-precision workflows.
- Scalar Units: Facilitate control flow, index calculation, and pipeline dispatch.
- Memory Hierarchy: A multi-layered design, including:
  - Global memory (DRAM/HBM2) per chip (typ. 32–128 GB, up to 1.8 TB/s).
  - Local L1 and L0 buffers tightly coupled to compute engines for tile reuse and high-bandwidth streaming.
  - A Unified Buffer (UB) for in-core staging and conversion.
  - Explicit double buffering and software-managed data movement, orchestrated by dedicated DMA engines (MTE).
- Decoupled Engine Model: No direct register/caching handoff between vector and cube engines. Inter-core and inter-engine data transfers are required, often via global memory, creating distinct compute and data-movement phases.
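The double-buffering pattern above can be illustrated with a toy sequential model; the function name and tile size are illustrative, and on real hardware the MTE copy into one buffer overlaps with compute on the other rather than running sequentially as here.

```python
import numpy as np

def tiled_sum_double_buffered(data, tile=4):
    """Toy model of software-managed double buffering: while the
    compute engine consumes one local buffer, the DMA engine (MTE
    on Ascend) fills the other. Both roles are simulated here in
    sequence; on hardware they overlap."""
    buffers = [None, None]               # ping-pong local buffers
    n_tiles = len(data) // tile
    total = 0.0
    # Preload tile 0 into buffer 0 ("DMA" phase).
    buffers[0] = np.array(data[:tile], copy=True)
    for i in range(n_tiles):
        cur, nxt = i % 2, (i + 1) % 2
        # Issue the next DMA load into the other buffer.
        if i + 1 < n_tiles:
            buffers[nxt] = np.array(data[(i + 1) * tile:(i + 2) * tile], copy=True)
        # Compute phase consumes the current buffer.
        total += float(buffers[cur].sum())
    return total
```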
On the system level, Ascend clusters employ hierarchical interconnects—high-speed intra-node mesh for scaling tensor and expert parallelism, and RDMA/Ethernet for inter-node collectives (HCCL/NCCL analogs).
2. Precision Management, Quantization, and Mixed-Precision Kernels
Ascend NPUs provide native hardware support and specialized software strategies for both FP32 emulation and low-bit/mixed-precision execution:
FP32 Emulation on FP16 Hardware
The lack of FP32 tensor cores (e.g., on 910A) is addressed via decomposition-based algorithms:
- H2SGEMM decomposes every FP32 operand into two FP16 values, employing a tunable scaling factor to pack high/low-mantissa bits. The method retains up to 22 mantissa bits and reconstructs the FP32 GEMM via three FP16 GEMMs with FP16→FP32 accumulation. Term-wise accumulation schemes improve numerical stability, particularly for low-exponent inputs. Cache-aware blocking, L1/L0 double buffering, and pipeline overlap yield up to 77% of theoretical FP32-equivalent peak performance, with empirical accuracy matching or exceeding native FP32 CPU GEMMs (Xue et al., 31 Jul 2025).
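A minimal NumPy sketch of the decomposition idea (not the tuned H2SGEMM kernel): each FP32 operand is split into a high FP16 part plus a scaled low FP16 residual, and the product is rebuilt from three FP16-input GEMMs with FP32 accumulation. The scale value and function names are illustrative.

```python
import numpy as np

def split_fp32(x, scale=2.0**11):
    """Split an FP32 array into high and low FP16 parts so that
    x ~= hi + lo / scale. The scale packs residual mantissa bits
    into FP16 range (scale choice is illustrative)."""
    hi = x.astype(np.float16)
    lo = ((x - hi.astype(np.float32)) * scale).astype(np.float16)
    return hi, lo

def h2s_gemm(a, b, scale=2.0**11):
    """Rebuild an FP32 GEMM from three FP16-input GEMMs,
    mimicking the H2SGEMM decomposition; the lowest-order
    lo*lo term is dropped."""
    a_hi, a_lo = split_fp32(a, scale)
    b_hi, b_lo = split_fp32(b, scale)
    # Casting FP16 inputs up before matmul emulates the cube
    # core's FP32 accumulation of FP16 products.
    g = lambda x, y: np.matmul(x.astype(np.float32), y.astype(np.float32))
    return g(a_hi, b_hi) + (g(a_hi, b_lo) + g(a_lo, b_hi)) / scale
```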
Mixed-Precision and Quantized Kernels
For LLMs and sparse workloads requiring a small memory footprint and high throughput, Ascend supports weight-only (e.g., W4A16) and activation quantization via specialized kernel and hardware primitives:
- W4A16 Kernels: Vector cores perform on-the-fly INT4→FP16 dequantization, write tiles to DRAM, and cube cores execute high-throughput FP16 GEMM. Split-K parallelization ensures full cube core utilization for small N. Primary bottleneck is DRAM weight transfers: while dequantization is computationally cheap, additional memory round-trips cap practical speedups to ~1.48× over standard FP16×FP16 kernels in PyTorch. The approach enables >1.5× speedups for large-K workloads typical in LLM decoding (He et al., 23 Jan 2026).
- HiFloat Family (HiF8/HiF4): Ascend hardware and microkernels directly support non-IEEE, hierarchical floating-point formats (8 or 4 bits), with fused per-block scaling and dequantization, achieving near-lossless LLM inference with 2–4× compression over 16/8-bit baselines. HiF4, with three-level scale hierarchy, is superior to uniform INT4/MXFP4, particularly for long-context tasks or activation quantization (Zhao et al., 13 Feb 2026).
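The W4A16 flow above can be sketched in NumPy, assuming a simple low-nibble-first packing and per-output-row scales; both layout choices are illustrative conventions, not the actual kernel's, and the real kernel fuses and tiles these steps rather than running them sequentially.

```python
import numpy as np

def unpack_int4(packed):
    """Unpack two signed 4-bit values per uint8 byte
    (illustrative layout: low nibble first)."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    # Sign-extend nibbles from [0, 15] to [-8, 7].
    lo = np.where(lo > 7, lo - 16, lo)
    hi = np.where(hi > 7, hi - 16, hi)
    return np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)

def w4a16_matmul(x_fp16, w_packed, scales):
    """W4A16 sketch: the vector-core role (dequantize INT4
    weights to FP16 with per-row scales) followed by the
    cube-core role (FP16 GEMM with FP32 accumulation)."""
    w = unpack_int4(w_packed).astype(np.float16) * scales[:, None].astype(np.float16)
    # Cast up before matmul to emulate FP32 accumulation.
    return np.matmul(x_fp16.astype(np.float32), w.T.astype(np.float32))
```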
3. System-Level Parallelism, Operator Fusion, and Training at Scale
Ascend-based systems support a full spectrum of large-scale model training—dense Transformers, MoEs, and RL alignment—through tightly coordinated parallelism and memory strategies:
| Parallelism Type | Description | Ascend Features |
|---|---|---|
| Data (DP), Tensor (TP), Pipeline (PP), Expert (EP) | Multidimensional sharding across N NPUs; e.g., DP=128, TP=8, PP=8 for 8,192-NPU clusters | HCCS mesh, HBM, collectives |
| Virtual Pipeline (VPP) | Adds “virtual stages” per pipeline to reduce bubble ratio | Interleaved scheduling |
| Expert Parallelism (MoE) | Hierarchical inter/intra-node All-to-All for sparse model routing | Token/activation shuffling, hierarchical comm |
- Dense LLMs: Pangu Ultra (135B) applies depth-scaled sandwich normalization, interleaved pipeline parallelism, and fused all-gather/matmul kernels to achieve >90% scaling efficiency and >52% model FLOPS utilization (MFU) across 8,192 NPUs (Yin et al., 10 Apr 2025).
- MoEs: Pangu Ultra MoE (718B) scales expert parallelism using hierarchical All-to-All and activation/parameter recompute + swap, attaining 30% MFU at 1.46M tokens/s on 6K NPUs (Tang et al., 7 May 2025).
- RL Training: MindSpeed RL utilizes a “transfer dock” strategy for distributed sample flow and an allgather–swap protocol for memory-efficient parameter resharding, achieving 1.42–3.97× throughput gains over prior frameworks on 384-NPU pods (Feng et al., 25 Jul 2025).
- Multimodal Training: MindVL, with the MindSpeed-MLLM framework, leverages hybrid DP×TP×PP mapping, batch-wise operator fusion, and adaptive resource scheduling to deliver multimodal LLM training performance matching SOTA GPU setups but with only ~1/10th the training data (Chen et al., 15 Sep 2025).
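As a toy illustration of the multidimensional sharding in the table above, a flat NPU rank can be decomposed into (DP, PP, TP) coordinates for the 8,192-NPU example configuration. The dimension ordering (TP innermost, so tensor-parallel groups stay within a node) is an assumed convention for illustration, not Ascend's documented layout.

```python
def rank_to_coords(rank, tp=8, pp=8, dp=128):
    """Map a flat rank in [0, dp*pp*tp) to (dp, pp, tp)
    coordinates, with TP as the fastest-varying dimension so
    that consecutive ranks share a tensor-parallel group."""
    assert 0 <= rank < dp * pp * tp
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx
```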
4. Attention and Specialized Operator Kernels
Operator-level optimization is central to Ascend’s practical performance:
- FastAttention adapts FlashAttention2 for Ascend by employing two-level tiling that exploits the L1/L0 echelon buffers of cube cores, tiling-masks that compress memory for causal attention, and pipelined tiling-AllReduce for efficient multi-NPU inference. It achieves up to 10.7× kernel-level and 5.16× end-to-end speedup on Llama-7B (Lin et al., 2024).
- Deformable Attention implementations refactor memory/compute layout for efficient gather/scatter, fusing microkernels for bilinear sampling, and staggering vector unit/scatter-invocations to avoid memory contention. Results include up to 5.9× forward and 8.9× backward speedup compared to grid-sample baselines (Huang et al., 20 May 2025).
- Fused Operator Libraries: MindSpeed, Pangu, and other systems rely on custom CANN-registered fused kernels for RMSNorm, SwiGLU, QKV projection, and optimizer steps. These kernels are critical for matching or exceeding GPU throughput and maintaining numerical equivalence.
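The online-softmax tiling that FlashAttention2-style kernels such as FastAttention build on can be sketched as follows; this is a single-head, unmasked NumPy reference for the algorithm only, not the Ascend kernel, and the tile size is illustrative.

```python
import numpy as np

def tiled_attention(q, k, v, tile=16):
    """Online-softmax attention: K/V are streamed in tiles and the
    running softmax max/denominator are corrected incrementally,
    so the full score matrix is never materialized."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q, dtype=np.float64)
    m = np.full(n, -np.inf)              # running row-wise max
    l = np.zeros(n)                      # running softmax denominator
    for start in range(0, k.shape[0], tile):
        kt, vt = k[start:start + tile], v[start:start + tile]
        s = (q @ kt.T) * scale                    # score tile
        m_new = np.maximum(m, s.max(axis=1))
        correction = np.exp(m - m_new)            # rescale old stats
        p = np.exp(s - m_new[:, None])
        l = l * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vt
        m = m_new
    return out / l[:, None]
```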
5. Automatic Kernel Generation, DSLs, and LLM-Facilitated Programming
Ascend NPUs feature a vendor-specific C++-like DSL (“AscendC”) with stringent constraints on memory allocation, tiling, copy, and API usage. Manual kernel development is expertise-heavy, but recent work adapts LLMs to facilitate automatic code generation through several innovations:
- AscendCraft: Introduces a high-level Triton-style DSL that abstracts pipeline stages and buffer roles. A four-stage LLM transcompilation pipeline lowers DSL code to AscendC via expert-guided, constraint-enforcing passes, resulting in 98.1% compile and 90.4% correctness (Pass@1), with 46.2% of kernels matching or exceeding PyTorch eager execution (Wen et al., 30 Jan 2026).
- AscendKernelGen: Trains a Qwen3-32B LLM with a chain-of-thought (CoT) dataset derived from real AscendC kernel traces. Supervised fine-tuning (SFT) and reinforcement learning (RL) with execution feedback enable complex kernels: compilation success rates on Level 2 (structured neural primitives) improve from 0% to 95.5% (Pass@10), while functional correctness reaches 64.3% (Cao et al., 12 Jan 2026). A dedicated NPUKernelBench suite quantifies performance and correctness.
Both frameworks stress that without domain-specific constraints and reasoning, generic LLMs fail on NPU kernels; targeted datasets and staged translation pipelines are required.
6. Practical Applications and Impact
Ascend NPUs underpin state-of-the-art deployments across foundational LLMs, multimodal models, and reinforcement learning-driven alignment:
- Large Model Training: Models such as Pangu Ultra 135B/718B MoE and MindVL 8B run at scale, exploiting Ascend’s fused kernel libraries and parallelism schemes. The architecture is validated against the most demanding production and research workloads (Yin et al., 10 Apr 2025, Tang et al., 7 May 2025, Chen et al., 15 Sep 2025).
- Fine-Tuning and Inference: Efficient contrastive decoding for LoRA-adapted models (CoLD) leverages Ascend’s vector units for streamlined per-layer low-rank projection, reducing inference latency by 28% and increasing task accuracy by up to 5.54% over conventional greedy decoding (Heisler et al., 20 May 2025).
- Quantized and Memory-Bound Inference: Support for HiFloat formats and advanced W4A16 kernels enables practical deployment of massive models on modest hardware footprints (He et al., 23 Jan 2026, Zhao et al., 13 Feb 2026).
- Attention-Dense Workloads: FastAttention brings attention-kernel throughput on par with leading GPU solutions (Lin et al., 2024).
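For intuition on the contrastive-decoding idea used in fine-tuned inference, a generic token-selection step is sketched below. This is not CoLD's exact formulation; the `alpha` plausibility cutoff and function name are illustrative assumptions.

```python
import numpy as np

def contrastive_step(expert_logits, base_logits, alpha=0.1):
    """Generic contrastive decoding: restrict to tokens the
    expert (e.g., LoRA-adapted) model finds plausible, then
    score them by the expert-vs-base log-probability gap."""
    log_softmax = lambda z: z - np.log(np.exp(z - z.max()).sum()) - z.max()
    le, lb = log_softmax(expert_logits), log_softmax(base_logits)
    # Plausibility mask: within log(alpha) of the expert's best token.
    mask = le >= le.max() + np.log(alpha)
    scores = np.where(mask, le - lb, -np.inf)
    return int(np.argmax(scores))
```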
7. Design Trade-Offs and Engineering Guidelines
- Compute/Memory/Comm Balance: Ascend hardware is optimized for high compute-to-bandwidth ratios; kernel and system-level design should prioritize matmul-heavy, tile-aligned workloads and minimize memory traffic through double-buffering and fusion.
- Parallelism Granularity: DP/TP/PP/EP must be tuned to device and interconnect topology (e.g., intra-node expert shuffling, virtual pipeline staging) to minimize communication and pipeline bubbles.
- Activation/Parameter Memory Management: Aggressive recomputation and offloading (e.g., tensor swapping) trade extra computation for larger batch size or model fit, especially in sparse/MoE models (Tang et al., 7 May 2025).
- Operator Alignment: Kernel efficiency depends on tiling all shapes and weights to hardware-aligned multiples (e.g., 16×16 for cube units).
- Automation and Tuning: LLM-facilitated kernel generation relies on DSL-first interfaces and stepwise lowering, but expert intervention remains key for SOTA performance (Wen et al., 30 Jan 2026, Cao et al., 12 Jan 2026).
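The alignment guideline above amounts to padding shapes up to tile multiples before dispatch; a minimal sketch (16 matches the cube unit's tile size, and zero padding leaves GEMM results unchanged in the valid region):

```python
import numpy as np

def pad_to_multiple(x, tile=16):
    """Zero-pad a matrix so both dimensions are multiples of the
    hardware tile size; callers slice the valid region back out
    of any downstream GEMM result."""
    r = (-x.shape[0]) % tile
    c = (-x.shape[1]) % tile
    return np.pad(x, ((0, r), (0, c)))
```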
Ascend NPUs thus represent a highly tunable, high-throughput AI platform, provided that low-level kernel strategies, precision workflow, and distributed systems engineering are tightly co-designed and hardware constraints are fully exploited.