
Semi-Structured N:M Sparsity

Updated 11 March 2026
  • Semi-structured N:M sparsity is a fine-grained pruning method that enforces exactly N nonzero entries in each M-element block, balancing flexibility with hardware efficiency.
  • It enables significant computational speedups on modern accelerators such as NVIDIA Sparse Tensor Cores, achieving up to 6.3× acceleration while maintaining high model fidelity.
  • Research advances include rule-based, ADMM-based, and probabilistic optimization approaches for both weights and activations, improving performance in large language and computer vision models.

Semi-structured N:M sparsity is a fine-grained structural pruning paradigm for neural network weights or activations, in which every contiguous group (block) of M elements is constrained to have at most or exactly N nonzero entries. This approach enables a balance between the flexibility and accuracy retention of unstructured pruning and the hardware efficiency of coarse-grained structured pruning. Modern accelerators, especially NVIDIA's Sparse Tensor Cores, natively support common N:M patterns such as 2:4, allowing for significant computational speedups while maintaining high model fidelity. N:M sparsity is applicable to both weights and activations and is central to state-of-the-art compression and acceleration techniques for deep neural networks, especially LLMs and computer vision models (Kao et al., 2022, Xiang et al., 2023, Ma et al., 3 Mar 2025).

1. Mathematical Foundation and Constraint Formalism

Let $W\in\mathbb{R}^{d\times m}$ be a weight matrix of a fully connected or convolutional layer. The N:M semi-structured sparsity constraint partitions each row into $B = \lceil m/M \rceil$ disjoint, contiguous blocks of size $M$ (Kao et al., 2022, Yu et al., 2024). Within every block $b$, exactly $N$ of the $M$ weights are nonzero, enforced via a binary mask $M\in\{0,1\}^{d\times m}$:

$$\forall i \in \{1,\ldots,d\},\ \forall b \in \{1,\ldots,B\}:\quad \|M_{i,b}\|_0 = N$$

This structure is enforced for weights, and similarly, semi-structured sparsity can be defined for activations by applying the same blockwise $N$-nonzero constraint to each block in the activation vector (Alanova et al., 26 Sep 2025, An et al., 4 Aug 2025).

The semi-structured pattern allows arbitrary choice of the $N$ nonzeros within each $M$-sized block, distinguishing it from global unstructured sparsity (maximum flexibility, poor hardware mapping) and from block/channel pruning (low flexibility, high hardware efficiency).
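
As a concrete illustration, the blockwise constraint can be enforced in a few lines of NumPy. The helper `nm_project` below is a hypothetical sketch that projects a matrix onto the N:M feasible set by keeping the $N$ largest-magnitude entries per $M$-block (the simplest rule-based strategy):

```python
import numpy as np

def nm_project(W, N=2, M=4):
    """Project each row of W onto the N:M feasible set by zeroing all but
    the N largest-magnitude entries in every contiguous block of M."""
    d, m = W.shape
    assert m % M == 0, "row length must be divisible by the block size M"
    blocks = W.reshape(d, m // M, M)                      # (d, B, M)
    # indices of the (M - N) smallest-magnitude entries per block
    drop = np.argsort(np.abs(blocks), axis=-1)[..., : M - N]
    mask = np.ones_like(blocks)
    np.put_along_axis(mask, drop, 0.0, axis=-1)
    return (blocks * mask).reshape(d, m), mask.reshape(d, m)

W = np.random.randn(8, 16)
W_sparse, mask = nm_project(W, N=2, M=4)
# every 4-element block now has exactly 2 retained positions
assert (mask.reshape(8, 4, 4).sum(-1) == 2).all()
```

Ties in magnitude are broken arbitrarily by `argsort`; any such choice still satisfies the blockwise cardinality constraint.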

2. Practical Algorithms for N:M Sparsification

Contemporary methods for enforcing N:M sparsity fall into "rule-based," "combinatorial/continuous optimization," and "probabilistic" frameworks:

  • Blockwise Top-N Magnitude Pruning: The simplest strategy selects the $N$ largest-magnitude weights per $M$-block, updating the mask after each pruning cycle (Kao et al., 2022, Guo et al., 3 Sep 2025).
  • Mask Decay and Structure Decay (Decaying Pruning): Schedules such as "pruning mask decay" introduce a decay parameter $\alpha(t)$ on pruned weights, allowing gradients to flow and weights to vanish smoothly; "structure decay" (decay schedule $\beta(t)$) softens blockwise importance scores over time, enabling more exploration of block configurations in early training (Kao et al., 2022).
  • ADMM-based Constrained Optimization: The NxMTransformer framework formulates the N:M constraint as a joint optimization using Alternating Direction Method of Multipliers (ADMM). At each iteration, a step projects the weight tensor onto the feasible mask set (per-block top-N selection) (Holmes et al., 2021).
  • Combinatorial and Probabilistic Mask Learning: Divide-and-conquer approaches (LBC) enumerate all $\binom{M}{N}$ possible kept-weight combinations per block, learning scores over candidates and gradually pruning away low-scoring subsets with scheduled annealing and a straight-through estimator (STE) (Zhang et al., 2022). In large models, MaskLLM and MaskPro use learnable distributions over the mask space via Gumbel-Softmax (Fang et al., 2024) or categorical sampling with REINFORCE and variance reduction (Sun et al., 15 Jun 2025) to enable end-to-end differentiability and practical scaling.
  • Dependency-Aware Pruning: In multi-projection MLPs (e.g., LLMs with SwiGLU), importance scores are calibrated by both weight magnitude and the activation norm of corresponding dependency groups, boosting signal for weights coupled to high-energy neurons (Guo et al., 2024).
  • Multi-Axis and Hierarchical Selection: MaxQ performs sequential multi-axis block selection (e.g., filter and kernel axes in CNNs) with soft mask generation and gradual sparsity ramp-up for improved block selection and accuracy (Xiang et al., 2023). Hierarchical N:M (HiNM) pruning further composes coarse-grained vector sparsity with fine-grained rowwise N:M, necessitating specialized output/input channel permutation (gyro-permutation) for accuracy recovery (Yu et al., 2024).
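
The mask-decay idea can be sketched in NumPy as a soft mask whose pruned positions are scaled by a schedule $\alpha(t)$ that anneals from 1 to 0. The cosine schedule and the function name below are illustrative assumptions, not the exact recipe of the cited work:

```python
import numpy as np

def decayed_nm_mask(W, step, total_steps, N=2, M=4):
    """Soft N:M mask with 'mask decay': pruned positions are scaled by
    alpha(t), which anneals from 1 (fully dense) to 0 (hard N:M mask),
    so gradients keep flowing early in training and the mask hardens
    gradually. The cosine schedule is a hypothetical choice."""
    alpha = 0.5 * (1 + np.cos(np.pi * min(step / total_steps, 1.0)))
    d, m = W.shape
    blocks = np.abs(W).reshape(d, m // M, M)
    keep = np.argsort(blocks, axis=-1)[..., M - N:]   # top-N per block
    soft = np.full_like(blocks, alpha)                # pruned -> alpha(t)
    np.put_along_axis(soft, keep, 1.0, axis=-1)       # kept -> 1
    return soft.reshape(d, m)

W = np.random.randn(4, 8)
m0 = decayed_nm_mask(W, step=0, total_steps=100)    # alpha = 1: all ones
mT = decayed_nm_mask(W, step=100, total_steps=100)  # alpha = 0: hard 2:4 mask
```

Multiplying `W * decayed_nm_mask(W, t, T)` at each step recovers dense training at $t=0$ and hard 2:4 pruning at $t=T$.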

3. Activation Sparsity and Training-Free Techniques

Recent work has extended N:M sparsity to activation tensors for runtime compression and dynamic, input-adaptive sparsity. Training-free algorithms such as Amber Pruner apply calibrated channel norms—derived from weight statistics—to weight the activation magnitudes before top-N blockwise pruning, optionally skipping highly sensitive layers by precomputed perturbation error (An et al., 4 Aug 2025). Lightweight error mitigation strategies include per-block threshold tuning, layerwise scale correction, and residual Taylor expansion (to correct for pruned activation contributions in downstream layers) (Alanova et al., 26 Sep 2025).
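
A rough sketch of this training-free recipe follows, with per-column weight norms standing in for the calibrated channel statistics; the function name and the multiplicative weighting are assumptions for illustration, not the published algorithm:

```python
import numpy as np

def calibrated_activation_prune(x, w_col_norms, N=8, M=16):
    """Training-free blockwise activation pruning: score each activation
    by its magnitude weighted by the norm of the corresponding weight
    column (a stand-in for calibrated channel statistics), then keep the
    top-N scores in every M-element block."""
    scores = np.abs(x) * w_col_norms
    B = x.size // M
    blk = scores[: B * M].reshape(B, M)
    keep = np.argsort(blk, axis=-1)[:, M - N:]
    mask = np.zeros_like(blk)
    np.put_along_axis(mask, keep, 1.0, axis=-1)
    out = x.copy()
    out[: B * M] *= mask.reshape(-1)
    return out

x = np.random.randn(64)
norms = np.linalg.norm(np.random.randn(32, 64), axis=0)  # per-channel weight norms
x_sparse = calibrated_activation_prune(x, norms)
```

Because the mask depends on the input, it is recomputed per forward pass; the weight-derived norms can be precomputed once at calibration time.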

The selection of block size and the per-block $N$ (e.g., 2:4, 8:16) is tuned for hardware alignment, with 8:16 achieving a favorable accuracy/speed trade-off on 16-lane vector units. Practical guidelines suggest tuning on small held-out calibration sets and choosing a sparsity level whose expected speedup stays within roughly 20% of the hardware's theoretical acceleration.

4. Hardware Mapping and Algorithmic Acceleration

N:M sparsity is particularly well-suited for recent hardware:

  • NVIDIA Sparse Tensor Cores support 2:4 sparsity at the hardware level, accepting block masks and natively skipping zero computations, yielding speedups approaching $M/N$ in compute-bound settings (Kao et al., 2022, Ma et al., 3 Mar 2025). Support for additional block sizes (4:8, 8:16) is emerging (Yu et al., 2024, An et al., 4 Aug 2025).
  • Data Layout and Compression: Compressed storage involves retaining N values and $\log_2\binom{M}{N}$ bits of metadata per block. Hardware mapping requires prepacking weights and sometimes activations, and per-layer memory access is optimized by hierarchical blocking and decompression only as needed (Ma et al., 3 Mar 2025).
  • Training/Inference Acceleration: Structured kernels exploit regular patterns for coalesced memory access and vectorized execution. NM-SpMM, for example, implements hierarchical blocking and pipelined memory operations to achieve up to $6.3\times$ speedup over cuBLAS dense GEMM and $2.1\times$ over prior N:M-sparse libraries at high sparsity ratios (e.g., 87.5%, 1:8) (Ma et al., 3 Mar 2025).
  • Transposable N:M Masks: For accelerating both the forward and backward pass (where $W^\top$ is used), transposable N:M sparsity imposes additional row/column cardinality constraints in each $M\times M$ block. The TSENOR framework solves mask generation as an optimal transport problem with tensorized Dykstra projections and efficient rounding for scalability (Meng et al., 29 May 2025).
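
The storage arithmetic above can be made concrete. The helper `nm_storage_bits` is a hypothetical sketch assuming fp16 values and per-block metadata rounded up to whole bits (real formats may pack metadata differently):

```python
import math

def nm_storage_bits(d, m, N, M, value_bits=16):
    """Estimate compressed storage for an N:M-sparse d x m matrix:
    N values plus ceil(log2(C(M, N))) metadata bits per M-block,
    versus d*m dense values (fp16 assumed)."""
    blocks = d * (m // M)
    meta_bits = math.ceil(math.log2(math.comb(M, N)))
    sparse_bits = blocks * (N * value_bits + meta_bits)
    dense_bits = d * m * value_bits
    return sparse_bits, dense_bits

sparse, dense = nm_storage_bits(4096, 4096, N=2, M=4)
print(f"compression ratio: {dense / sparse:.2f}x")  # → compression ratio: 1.83x
```

For 2:4 the per-block cost is $2 \times 16 + \lceil\log_2 6\rceil = 35$ bits against 64 dense bits, i.e. below the naive $2\times$ because of the metadata overhead.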

5. Trade-Offs: Accuracy, Compression, and Security

N:M sparsity offers a favorable accuracy/performance trade-off at moderate sparsity. For example, Transformer Big trained with 2:4 sparsity achieves <0.1 BLEU loss versus dense at a 2× FLOP reduction (Kao et al., 2022); ResNet-50 with 1:4 or 2:4 sparsity matches or outperforms dense top-1 accuracy; and LLaMA2-7B under 2:4 with CAST incurs only +0.09 perplexity while gaining 0.36% zero-shot accuracy relative to the dense model (Huang et al., 30 Sep 2025).

Higher sparsity ratios and larger block sizes (e.g., 1:16, 16:32) are possible but risk more substantial accuracy drops unless mitigated with advanced mask learning and calibration (Zhang et al., 2022, Meng et al., 29 May 2025). Moreover, combining N:M sparsity with low-bit quantization (e.g., 1.58-bit BitNet) is synergistic: ternarized models exhibit higher compatibility, improved mask stability under joint STE optimization, and can tolerate up to 62.5% sparsity before collapse (Zhang et al., 5 Mar 2026). Empirical results indicate >1.3× throughput gains at near-dense quality.

However, N:M sparsity introduces new security vulnerabilities. The Silent Until Sparse (SUS) backdoor attack implants dormant backdoors into potential mask-retained weights, hidden pre-pruning and reliably triggered post-pruning, evading existing detection and remaining robust to fine-tuning (Guo et al., 3 Sep 2025).

6. Scaling, Transferability, and Extensions

Learnable mask distributions (MaskLLM, MaskPro) scale N:M sparsity to large LLMs and across domains via transfer learning of mask logits, with adaptation and calibration yielding near-lossless compression on downstream tasks (Fang et al., 2024, Sun et al., 15 Jun 2025). CAST demonstrates that differentiable, continuous sparsity-aware training with proportional L1 decay and row/group scaling enables convergence to optimal sparse minima, matching or exceeding prior SOTA under tight retraining budgets (Huang et al., 30 Sep 2025).

Hierarchical N:M (HiNM) and permutation strategies—combining vector and fine-grained row-wise sparsity—approach unstructured sparsity accuracy at up to 75% sparsity, facilitated by gyro-permutation and permutation-aware GPU kernels (Yu et al., 2024).

Activation-level N:M sparsity, as demonstrated by Amber Pruner and Outstanding-sparse, achieves up to 56% FLOP reduction in prefill for LLMs, approaching ~1.7× speedup on sparsity-aware hardware with negligible to sub-1% accuracy loss after simple plug-and-play error mitigation (An et al., 4 Aug 2025, Alanova et al., 26 Sep 2025).

7. Limitations and Future Directions

Current limitations include hardware restrictions to block sizes (primarily 2:4), the complexity of mask generation for large block sizes or transposable constraints, and the need for matched kernel/software support for both weights and activations. Advanced rounding, permutation, and mask optimization methods are being developed for scalability and better accuracy retention at higher sparsity (Meng et al., 29 May 2025, Yu et al., 2024).

Open directions encompass support for dynamically adaptive N, activation sparsity in both prefill and decode, co-design of network architectures and hardware for end-to-end N:M-aware computation, and integration with quantization, distillation, and continual learning paradigms (Huang et al., 30 Sep 2025, Zhang et al., 5 Mar 2026, An et al., 4 Aug 2025).

