Amber Pruner: Training-Free N:M Activation Sparsity

Updated 20 May 2026

Amber Pruner is a training-free N:M activation sparsity method that applies semi-structured masking to accelerate the prefill stage of LLM inference.
It leverages a Wanda-style norm-sensitive scoring function and selective layer skipping to reduce compute by over 55% with minimal accuracy loss.
The approach integrates with post-training quantization in the Outstanding-sparse framework, enhancing performance across various LLM architectures.

Amber Pruner is a training-free N:M activation sparsity method designed to accelerate the prefill stage of LLM inference by imposing semi-structured activation sparsity without model retraining. It addresses the inefficiencies and accuracy challenges associated with traditional weight sparsity and training-dependent activation sparsity approaches, providing a unified framework for performance-oriented model compression while preserving accuracy and offering seamless integration with post-training quantization (An et al., 4 Aug 2025).

1. Background and Motivation

The prefill stage in autoregressive LLM inference is dominated by dense linear projections (specifically the query, key, value, and MLP gating operations), which are highly compute-intensive when batched. Structured N:M activation sparsity requires that, within every block of M consecutive activation values, only the top N (by a specified importance criterion) are preserved while the rest are zeroed. Empirically, a large proportion (>50%) of the activations in LLM linear layers are near zero. This offers far greater computational savings than weight sparsity, which typically causes substantial (>20%) accuracy loss at comparable sparsity ratios when applied without retraining.

Prior activation sparsity approaches (e.g., Q-Sparse, TEAL, Squared-ReLU) require retraining, leverage nonstandard activation functions, or mainly accelerate the decoding phase. In contrast, Amber Pruner focuses on prefill, where most inference time is spent during multi-batch prompt encoding. The method exploits pre-existing activation sparsity and maps efficiently to sparsity-aware hardware via semi-structured N:M masking, minimizing additional indexing overhead.

2. Methodology

2.1 N:M Activation Sparsity Mechanism

Amber Pruner operates by generating an N:M activation mask per layer in the inference graph. Let $X \in \mathbb{R}^{B \times d_{in}}$ denote the activation matrix for a linear projection with batch size $B$ and input dimension $d_{in}$, partitioned into $G = d_{in}/M$ groups of size $M$. For each entry $X_{i,j}$ within group $g$, the robust scoring function is:

$S^*_{i,j} = |X_{i,j}| \cdot f(\widetilde{W}_{:,j})$

where $\widetilde{W}_{:,j}$ is the weight vector for column $j$, standardized by percentile-based outlier removal and normalization, and $f(\cdot)$ is a channel-wise norm (presumably the $\ell_2$ norm, $f(\cdot) = \|\cdot\|_2$, as in Wanda). This “Wanda-style” norm-sensitive scoring preferentially preserves activations linked to more salient weight channels.

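Within each group $g$, the top $N$ activations by $S^*_{i,j}$ are retained; the rest are zeroed:

$\hat{X}_{i,j} = \begin{cases} X_{i,j}, & S^*_{i,j} \geq \tau_{i,g} \\ 0, & \text{otherwise} \end{cases}$

where the threshold $\tau_{i,g}$, the $N$-th largest score in group $g$ of row $i$, ensures that exactly $N$ entries per group are kept. Sparse masking is carried out before the weight multiplication to enforce the desired sparsity pattern.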
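The scoring and top-N selection described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `nm_activation_mask` is hypothetical, $f$ is assumed to be the $\ell_2$ norm of each weight column, and the percentile-based outlier removal and normalization of the weights is omitted for brevity.

```python
import numpy as np

def nm_activation_mask(X, W, n=2, m=4):
    """Return a boolean mask keeping the top-n scores in each group of m columns.

    X: activations, shape (B, d_in).  W: weights, shape (d_out, d_in).
    """
    rows, d_in = X.shape
    if d_in % m != 0:
        raise ValueError("d_in must be divisible by m")
    # Assumption: f is the l2 norm of each input channel's weight column W[:, j].
    channel_norm = np.linalg.norm(W, axis=0)            # shape (d_in,)
    scores = np.abs(X) * channel_norm                   # shape (B, d_in)
    grouped = scores.reshape(rows, d_in // m, m)
    # Indices of the n largest scores inside every group (ties: lowest index first).
    keep = np.argsort(-grouped, axis=-1, kind="stable")[..., :n]
    mask = np.zeros(grouped.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(rows, d_in)

# Usage: mask the activations before the weight multiplication.
X = np.array([[0.1, -2.0, 0.3, 1.5, 0.0, 0.2, -0.4, 0.05]])
W = np.ones((3, 8))
mask = nm_activation_mask(X, W, n=2, m=4)
Y = (X * mask) @ W.T
```

With uniform weight norms, as in this toy input, the selection reduces to a plain magnitude top-N per group; non-uniform column norms shift the selection toward channels with larger weights.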

2.2 Sensitivity-Driven Layer Skipping

Not all layers are equally robust to activation sparsity. Amber Pruner computes the relative perturbation for each projection:

$\epsilon = \frac{\|Y - \hat{Y}\|}{\|Y\|}$

where $Y$ is the dense output and $\hat{Y}$ the output under N:M masked activations. Layers with large $\epsilon$ are considered sensitive (notably o_proj, up_proj) and exempted from pruning; q_proj and gate_proj are skipped in 5–6 layers, and k_proj/v_proj are pruned selectively, all determined via a fast layerwise scan.
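A layerwise sensitivity scan of this kind might look like the sketch below. The helper names, the Frobenius norm, and the `threshold` value are illustrative assumptions; the source does not state which norm or cut-off is used.

```python
import numpy as np

def nm_prune(X, W, n, m):
    """Zero all but the n highest-scoring entries in each group of m columns."""
    rows, d_in = X.shape
    scores = (np.abs(X) * np.linalg.norm(W, axis=0)).reshape(rows, d_in // m, m)
    keep = np.argsort(-scores, axis=-1, kind="stable")[..., :n]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return X * mask.reshape(rows, d_in)

def relative_perturbation(X, W, n=2, m=4):
    """||Y - Y_hat|| / ||Y|| for one projection (Frobenius norm assumed)."""
    Y = X @ W.T
    Y_hat = nm_prune(X, W, n, m) @ W.T
    return float(np.linalg.norm(Y - Y_hat) / (np.linalg.norm(Y) + 1e-12))

def layers_to_skip(calibration, threshold=0.1, n=2, m=4):
    """Names of projections whose perturbation exceeds a hypothetical threshold.

    calibration: dict mapping a layer name to (activations, weights).
    """
    return sorted(name for name, (X, W) in calibration.items()
                  if relative_perturbation(X, W, n, m) > threshold)

# Usage on toy calibration data: one input that is already 2:4 sparse, one dense.
W = np.ones((2, 4))
calibration = {
    "q_proj": (np.array([[0.0, 3.0, 0.0, 1.0]]), W),
    "o_proj": (np.ones((1, 4)), W),
}
skipped = layers_to_skip(calibration, threshold=0.1)
```

In this toy case only the projection with dense input is flagged, since pruning an input that already satisfies the 2:4 pattern leaves the output unchanged.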

2.3 Algorithmic Workflow

The core mask generation process in the prefill stage is:

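For each linear projection that is not exempted by the sensitivity scan, compute the scores $S^*_{i,j} = |X_{i,j}| \cdot f(\widetilde{W}_{:,j})$ for the incoming activations.
Partition each row of $X$ into $G = d_{in}/M$ groups of $M$ consecutive entries.
Within each group, keep the $N$ highest-scoring activations and zero the rest.
Multiply the masked activations by the weight matrix; exempted projections use the dense activations unchanged.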

The outlined approach is entirely training-free and is realized by lightweight per-tensor statistics and efficient top-k selection.
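One way to wire these pieces into an inference path is a thin wrapper around each linear projection that prunes only during prefill and only when the layer is not on the skip list. The class `PrefillSparseLinear` and its `skip` and `prefill` flags are hypothetical names for illustration; the channel norm again stands in for $f$, and a real deployment would dispatch to an N:M sparse kernel instead of a dense matmul on zeroed inputs.

```python
import numpy as np

def prune_nm(X, col_norm, n, m):
    """Zero all but the n highest-scoring entries in each group of m columns."""
    rows, d_in = X.shape
    scores = (np.abs(X) * col_norm).reshape(rows, d_in // m, m)
    keep = np.argsort(-scores, axis=-1, kind="stable")[..., :n]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return X * mask.reshape(rows, d_in)

class PrefillSparseLinear:
    """Linear projection that applies N:M activation pruning during prefill."""

    def __init__(self, W, n=2, m=4, skip=False):
        self.W = W                                   # shape (d_out, d_in)
        self.n, self.m, self.skip = n, m, skip
        # Per-channel weight statistic, computed once ahead of time.
        self.col_norm = np.linalg.norm(W, axis=0)

    def __call__(self, X, prefill=True):
        # Sensitive (skipped) layers and the decode phase stay dense.
        if prefill and not self.skip:
            X = prune_nm(X, self.col_norm, self.n, self.m)
        return X @ self.W.T

# Usage
W = np.arange(8.0).reshape(2, 4) + 1.0
X = np.array([[1.0, -1.0, 0.5, 2.0]])
sparse_out = PrefillSparseLinear(W)(X)
dense_out = PrefillSparseLinear(W, skip=True)(X)
```

Precomputing the column norms keeps the per-token cost to an elementwise multiply and a small top-N selection per group, in line with the lightweight statistics the text describes.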

3. Integration with Quantization: Outstanding-sparse Framework

Amber Pruner is integrated within Outstanding-sparse, a pipeline for post-training compression that synergizes N:M activation sparsity with 8-bit quantization (W8A8). The core sequence is:

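Start with a SmoothQuant-quantized model, which scales each channel via $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ for some $\alpha \in [0, 1]$.
Invert the scale for activations to amplify outliers: $\hat{s}_j = 1/s_j$ (Outstanding-sparse scaling).
Scale activations as $\hat{X} = X \, \mathrm{diag}(\hat{s})^{-1}$; quantize to 8-bit.
Generate N:M masks on the scaled activations using the Amber Pruner criterion.
Apply sparse-dense multiplication in 8-bit on W8A8 weights and activations:

$Y = Q_8\big(\mathrm{Mask}_{N:M}(\hat{X})\big) \, Q_8(\hat{W})^{\top}$

where $Q_8$ denotes 8-bit quantization and $\hat{W}$ the correspondingly rescaled weights.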

This unified pipeline boosts the efficacy of activation sparsity (by making outliers more prominent under masking) and ensures minimal loss from quantization.
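The sequence above can be sketched end to end as follows. This is one possible reading of the steps, under several assumptions that are not confirmed by the source: the SmoothQuant scale is computed from the batch itself rather than from calibration data, the inverted scale is applied by dividing activations by $\hat{s}$ and multiplying weights by $\hat{s}$ (so the product is unchanged before quantization), quantization is symmetric per-tensor int8, and all function names are hypothetical.

```python
import numpy as np

def quantize_int8(T):
    """Symmetric per-tensor int8 quantization; returns (int8 values, scale)."""
    scale = float(np.max(np.abs(T))) / 127.0
    if scale == 0.0:
        scale = 1.0
    q = np.clip(np.round(T / scale), -127, 127).astype(np.int8)
    return q, scale

def nm_mask(scores, n, m):
    """Boolean mask keeping the n largest scores in each group of m columns."""
    rows, d_in = scores.shape
    grouped = scores.reshape(rows, d_in // m, m)
    keep = np.argsort(-grouped, axis=-1, kind="stable")[..., :n]
    mask = np.zeros(grouped.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(rows, d_in)

def outstanding_sparse_linear(X, W, alpha=0.5, n=2, m=4):
    """Sketch of W8A8 matmul with N:M-masked activations. Returns (Y, mask)."""
    tiny = 1e-8
    # SmoothQuant per-channel scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).
    s = (np.abs(X).max(axis=0) + tiny) ** alpha / (np.abs(W).max(axis=0) + tiny) ** (1 - alpha)
    s_hat = 1.0 / s                 # assumed inverted (Outstanding-sparse) scale
    X_hat = X / s_hat               # amplifies high-magnitude channels
    W_hat = W * s_hat               # compensates: X_hat @ W_hat.T == X @ W.T
    # Mask is generated on the scaled activations (l2 channel norm assumed for f).
    mask = nm_mask(np.abs(X_hat) * np.linalg.norm(W_hat, axis=0), n, m)
    Xq, sx = quantize_int8(X_hat)
    Wq, sw = quantize_int8(W_hat)
    # Integer matmul with int32 accumulation, then dequantize.
    acc = (Xq * mask).astype(np.int32) @ Wq.astype(np.int32).T
    return acc * (sx * sw), mask

# Usage
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))
W = rng.normal(size=(8, 16))
Y_sparse, mask = outstanding_sparse_linear(X, W, n=2, m=4)
Y_dense, _ = outstanding_sparse_linear(X, W, n=4, m=4)   # no pruning, quantization only
```

Because symmetric quantization maps zero to zero, masking the scaled activations before or after quantization gives the same integer operand.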

4. Experimental Results

Empirical benchmarks demonstrate the following:

Linear Compute Reduction: Amber Pruner prunes roughly 56% of linear compute in LLaMA3.1-8B, Qwen2-7B, and Qwen3-30B-A3B (MoE), measured after layer skipping.
Accuracy Impact (zero-shot, BFloat16 baseline $\rightarrow$ Amber Pruner):
- 2:4: [value missing] to [value missing]
- 4:8: [value missing]
- 8:16: [value missing] to [value missing]
Outstanding-sparse integration (W8A8 quantization, zero-shot drop vs BFloat16):
- 2:4: [value missing] to [value missing]
- 4:8: [value missing] to [value missing]
- 8:16: [value missing] to [value missing]
Generative and Long-context Tasks: GSM8K (5-shot) drop [value missing] at 4:8 and 8:16, [value missing] at 2:4; LongBench average drop [value missing] (4:8, 8:16), [value missing] (2:4); MoE mode often yields improvements or [value missing] drop at moderate sparsity.
Throughput and Latency (theoretical speed-up for prefill stage):
- 2:4: [value missing]
- 4:8: [value missing]
- 8:16: [value missing]
- On 8 $\times$ Ascend 910B hardware, prompt encoding times reduce proportionally in multi-batch regimes.

5. Key Observations and Practical Implications

Activations naturally exhibit block-structured sparsity that is not present in weights, allowing substantial, hardware-efficient pruning without model retraining or notable accuracy loss ([value missing] at 8:16).
Score computation is both robust and extensible; channel-wise weight norm boosts accuracy retention by favoring more semantically critical neurons.
The pruning process and associated mask generation can be applied to both dense and MoE architectures with minimal tuning.
Outstanding-sparse scaling amplifies high-magnitude activations, synergizing with top-N masking for effective quantization and sparsity without additional accuracy penalty.
Selective layer skipping, governed by simple sensitivity metrics, is essential for limiting deleterious effects in specific projection layers.

6. Prospective Developments

Recommendations for next-generation AI systems emerging from this work include:

Development of 16-bit and 8-bit N:M sparse-dense matrix multiplication kernels in hardware and software stacks to maximize real-world speed-up.
Systematic integration of channel norm and sensitivity estimation routines into inference software for automated, model-agnostic mask scheduling.
Co-design of LLM architectures (projection group sizes, dimensions) to further exploit intrinsic activation sparsity.
Extension of similar mask generation strategies to attention pruning and dynamic decoding optimizations for broader inference acceleration.

Amber Pruner and its Outstanding-sparse extension collectively establish training-free N:M activation pruning as a high-utility, low-risk strategy for LLM prefill acceleration, offering compute reductions of over 55% at negligible loss and setting the stage for hardware–software co-design in emerging LLM pipelines (An et al., 4 Aug 2025).


References

An et al. (4 Aug 2025). Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models.
