Amber Pruner: Training-Free N:M Activation Sparsity
- Amber Pruner is a training-free N:M activation sparsity method that applies semi-structured masking to accelerate the prefill stage of LLM inference.
- It leverages a Wanda-style norm-sensitive scoring function and selective layer skipping to reduce compute by over 55% with minimal accuracy loss.
- The approach integrates with post-training quantization in the Outstanding-sparse framework, enhancing performance across various LLM architectures.
Amber Pruner is a training-free N:M activation sparsity method designed to accelerate the prefill stage of LLM inference by imposing semi-structured activation sparsity without model retraining. It addresses the inefficiencies and accuracy challenges associated with traditional weight sparsity and training-dependent activation sparsity approaches, providing a unified framework for performance-oriented model compression while preserving accuracy and offering seamless integration with post-training quantization (An et al., 4 Aug 2025).
1. Background and Motivation
The prefill stage in autoregressive LLM inference is dominated by dense linear projections (specifically the query, key, value, and MLP gating projections), which are highly compute-intensive when batched. Structured N:M activation sparsity enforces that, within every block of M consecutive activation values, only the top N (by a specified importance criterion) are preserved while the rest are zeroed; under 2:4 sparsity, for example, two of every four consecutive values survive. Empirically, a large proportion (>50%) of activations in LLM linear layers are near zero, which presents an opportunity for computational savings far greater than with weight sparsity, which typically causes substantial (>20%) accuracy loss at comparable sparsity ratios when applied without retraining.
Prior activation sparsity approaches (e.g., Q-Sparse, TEAL, Squared-ReLU) require retraining, leverage nonstandard activation functions, or mainly accelerate the decoding phase. In contrast, Amber Pruner focuses on prefill, where most inference time is spent during multi-batch prompt encoding. The method exploits pre-existing activation sparsity and maps efficiently to sparsity-aware hardware via semi-structured N:M masking, minimizing additional indexing overhead.
2. Methodology
2.1 N:M Activation Sparsity Mechanism
Amber Pruner operates by generating an N:M activation mask per layer in the inference graph. Let $X$ denote the activation matrix for a linear projection with batch size $B$ and input dimension $D$, with each row partitioned into groups of $M$ consecutive entries. Each entry $X_{i,j}$ within a group $g$ receives a robust score $S_{i,j}$ that weights the activation magnitude by the norm of the weight vector for column $j$, where that weight vector is first standardized by percentile-based outlier removal and normalization. This “Wanda-style” norm-sensitive scoring preferentially preserves activations linked to more salient weight channels.
Within each group $g$, the top $N$ activations by $S_{i,j}$ are retained and the rest are zeroed, so that every group keeps exactly $N$ entries. Sparse masking is carried out before the weight multiplication to enforce the desired sparsity pattern.
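The scoring and masking rule described above can be written compactly as follows. The notation is an assumption made for illustration: the symbols $X$, $\tilde{W}_j$, $S$, $\tau$, and $\hat{X}$, as well as the choice of the $\ell_2$ norm, are not taken from the source, so this is one plausible formalization rather than the paper's exact definition.

```latex
% Assumed notation: X_{i,j} is the activation in row i, channel j;
% \tilde{W}_j is the standardized weight vector for column j;
% \tau_{i,g} is the N-th largest score in group g of row i.
S_{i,j} = \lvert X_{i,j} \rvert \cdot \lVert \tilde{W}_j \rVert_2 ,
\qquad
\hat{X}_{i,j} =
\begin{cases}
X_{i,j}, & S_{i,j} \ge \tau_{i,g} \\
0,       & \text{otherwise}
\end{cases}
\quad \text{for } j \in g .
```

With distinct scores, the threshold $\tau_{i,g}$ leaves exactly $N$ surviving entries in every group of $M$.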
2.2 Sensitivity-Driven Layer Skipping
Not all layers are equally robust to activation sparsity. Amber Pruner computes a relative perturbation for each projection: the deviation between the dense output and the output under N:M masked activations, measured relative to the dense output. Layers with a large perturbation are considered sensitive (notably o_proj and up_proj) and are exempted from pruning; q_proj and gate_proj are skipped in 5–6 layers, and k_proj/v_proj are pruned selectively, all determined via a fast layerwise scan.
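A sensitivity scan of this kind might look like the sketch below. It assumes the perturbation is a Frobenius-norm ratio, uses plain magnitude as a stand-in importance score, and takes a hypothetical `threshold` parameter; the source specifies none of these, and the function names are illustrative only.

```python
import math


def nm_mask_row(row, n, m):
    """Keep the n largest-magnitude values in each group of m; zero the rest."""
    out = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        order = sorted(range(len(group)), key=lambda k: abs(group[k]), reverse=True)
        keep = set(order[:n])
        out.extend(v if k in keep else 0.0 for k, v in enumerate(group))
    return out


def matmul(x, w):
    """Plain (rows x d) @ (d x out) product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def relative_perturbation(x, w, n, m):
    """Assumed metric: ||Y - Y_hat||_F / ||Y||_F, Y dense, Y_hat from masked x."""
    y = matmul(x, w)
    y_hat = matmul([nm_mask_row(r, n, m) for r in x], w)
    num = math.sqrt(sum((a - b) ** 2 for ra, rb in zip(y, y_hat) for a, b in zip(ra, rb)))
    den = math.sqrt(sum(a ** 2 for ra in y for a in ra))
    return num / den if den > 0 else 0.0


def layers_to_skip(calib, n, m, threshold):
    """calib maps a projection name to (activations, weights) calibration data.

    Returns the names whose perturbation exceeds the (hypothetical) threshold,
    together with all measured scores.
    """
    scores = {name: relative_perturbation(x, w, n, m) for name, (x, w) in calib.items()}
    skip = sorted(name for name, s in scores.items() if s > threshold)
    return skip, scores
```

A projection whose activations are already sparse within each group scores near zero and stays pruned, while one with uniformly large activations scores high and would be left dense.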
2.3 Algorithmic Workflow
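The core mask generation process in the prefill stage follows from the two components above: for each linear projection that the sensitivity scan has not exempted, score the incoming activations, keep the top N entries in each group of M, zero the rest, and multiply the masked activations by the weights; exempted projections run dense.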
The outlined approach is entirely training-free and is realized by lightweight per-tensor statistics and efficient top-k selection.
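The workflow of scoring, top-N selection, and layer skipping can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: "percentile-based outlier removal" is interpreted as clipping weight magnitudes at a quantile, "normalization" as dividing the channel norms by their mean, and the weight matrix is laid out with one row per input channel. The function names and the `pct` parameter are hypothetical.

```python
import math


def clipped_channel_norms(weight, pct=0.99):
    """Per-input-channel L2 norms of `weight` (d rows, one per input channel).

    Assumption: outliers are removed by clipping magnitudes at the pct-quantile,
    and norms are normalized by their mean.
    """
    mags = sorted(abs(v) for row in weight for v in row)
    cap = mags[min(len(mags) - 1, int(pct * len(mags)))]
    norms = [math.sqrt(sum(min(abs(v), cap) ** 2 for v in row)) for row in weight]
    mean = sum(norms) / len(norms)
    return [nv / mean if mean > 0 else 1.0 for nv in norms]


def amber_mask(x, norms, n, m):
    """Keep the n highest-scoring activations per group of m, zero the rest.

    Score of entry j is |x_j| * norms[j] (Wanda-style, assumed form).
    """
    masked = []
    for row in x:
        out = [0.0] * len(row)
        for start in range(0, len(row), m):
            idx = range(start, min(start + m, len(row)))
            top = sorted(idx, key=lambda j: abs(row[j]) * norms[j], reverse=True)[:n]
            for j in top:
                out[j] = row[j]
        masked.append(out)
    return masked


def sparse_linear(x, weight, n, m, skip=False):
    """Masked activations times weights; `skip=True` runs the layer dense."""
    x_used = x if skip else amber_mask(x, clipped_channel_norms(weight), n, m)
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*weight)] for row in x_used]
```

In this sketch the weight norms depend only on the weights, so they could be computed once offline; only the per-group top-N selection runs at inference time.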
3. Integration with Quantization: Outstanding-sparse Framework
Amber Pruner is integrated within Outstanding-sparse, a pipeline for post-training compression that synergizes N:M activation sparsity with 8-bit quantization (W8A8). The core sequence is:
- Start with a SmoothQuant-quantized model, which scales each channel $j$ via $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ for some migration strength $\alpha$.
- Invert the scale for activations to amplify outliers (Outstanding-sparse scaling).
- Scale the activations by the inverted factor and quantize them to 8-bit.
- Generate N:M masks on the scaled activations using the Amber Pruner criterion.
- Apply sparse-dense multiplication in 8-bit on the W8A8 weights and activations.
This unified pipeline boosts the efficacy of activation sparsity (by making outliers more prominent under masking) and ensures minimal loss from quantization.
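The ordering of these steps can be illustrated with a small pure-Python sketch. Several details are assumptions rather than facts from the source: the inversion is modeled as multiplying activations by the SmoothQuant scale (and dividing weights by it, which leaves the product unchanged in exact arithmetic), quantization is symmetric per-tensor int8, and the mask uses plain magnitude in place of the full Amber Pruner score to keep the example short. All function names are hypothetical.

```python
def smoothquant_scales(x, w, alpha=0.5, eps=1e-8):
    """s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), one scale per input channel j."""
    d = len(w)
    x_max = [max(max(abs(row[j]) for row in x), eps) for j in range(d)]
    w_max = [max(max(abs(v) for v in w[j]), eps) for j in range(d)]
    return [(xm ** alpha) / (wm ** (1.0 - alpha)) for xm, wm in zip(x_max, w_max)]


def quantize_int8(matrix):
    """Symmetric per-tensor int8 quantization; returns (integers, step size)."""
    peak = max(abs(v) for row in matrix for v in row)
    step = peak / 127.0 if peak > 0 else 1.0
    q = [[max(-127, min(127, round(v / step))) for v in row] for row in matrix]
    return q, step


def nm_mask(q_row, n, m):
    """Keep the n largest-magnitude integers per group of m (simplified criterion)."""
    out = [0] * len(q_row)
    for start in range(0, len(q_row), m):
        idx = range(start, min(start + m, len(q_row)))
        for j in sorted(idx, key=lambda j: abs(q_row[j]), reverse=True)[:n]:
            out[j] = q_row[j]
    return out


def outstanding_sparse_linear(x, w, n, m, alpha=0.5):
    """x: rows x d activations; w: d x out weights (one row per input channel)."""
    s = smoothquant_scales(x, w, alpha)
    # Assumed inversion: activations are multiplied by s_j, weights divided by s_j.
    x_scaled = [[v * s[j] for j, v in enumerate(row)] for row in x]
    w_scaled = [[v / s[j] for v in w[j]] for j in range(len(w))]
    xq, step_x = quantize_int8(x_scaled)
    wq, step_w = quantize_int8(w_scaled)
    xq = [nm_mask(row, n, m) for row in xq]
    # Integer accumulate, then dequantize with the two step sizes.
    acc = [[sum(a * b for a, b in zip(row, col)) for col in zip(*wq)] for row in xq]
    return [[v * step_x * step_w for v in row] for row in acc]
```

Because the scaled activations have more pronounced outlier channels, a top-N selection on them tends to keep the same few dominant entries that carry most of the output.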
4. Experimental Results
Empirical benchmarks demonstrate the following:
- Linear Compute Reduction: Amber Pruner achieves over 55% pruned linear compute on LLaMA3.1-8B, Qwen2-7B, and Qwen3-30B-A3B (MoE), consistently after layer skipping.
- Accuracy Impact (zero-shot, BFloat16 baseline → Amber Pruner):
  - 2:4: 8 to 9
  - 4:8: 0
  - 8:16: 1 to 2
- Outstanding-sparse integration (W8A8 quantization, zero-shot drop vs BFloat16):
  - 2:4: 3 to 4
  - 4:8: 5 to 6
  - 8:16: 7 to 8
- Generative and Long-context Tasks: GSM8K (5-shot) drop 9 at 4:8 and 8:16, 0 at 2:4; LongBench average drop 1 (4:8, 8:16), 2 (2:4); MoE mode often yields improvements or 3 drop at moderate sparsity.
- Throughput and Latency (theoretical speed-up for prefill stage):
  - 2:4: 4
  - 4:8: 5
  - 8:16: 6
- On 8× Ascend 910B hardware, prompt encoding times reduce proportionally in multi-batch regimes.
5. Key Observations and Practical Implications
- Activations naturally exhibit block-structured sparsity that is not present in weights, allowing substantial, hardware-efficient pruning without model retraining or notable accuracy loss (8 at 8:16).
- Score computation is both robust and extensible; channel-wise weight norm boosts accuracy retention by favoring more semantically critical neurons.
- The pruning process and associated mask generation can be applied to both dense and MoE architectures with minimal tuning.
- Outstanding-sparse scaling amplifies high-magnitude activations, synergizing with top-N masking for effective quantization and sparsity without additional accuracy penalty.
- Selective layer skipping, governed by simple sensitivity metrics, is essential for limiting deleterious effects in specific projection layers.
6. Prospective Developments
Recommendations for next-generation AI systems emerging from this work include:
- Development of 16-bit and 8-bit N:M sparse-dense matrix multiplication kernels in hardware and software stacks to maximize real-world speed-up.
- Systematic integration of channel norm and sensitivity estimation routines into inference software for automated, model-agnostic mask scheduling.
- Co-design of LLM architectures (projection group sizes, dimensions) to further exploit intrinsic activation sparsity.
- Extension of similar mask generation strategies to attention pruning and dynamic decoding optimizations for broader inference acceleration.
Amber Pruner and its Outstanding-sparse extension collectively establish training-free N:M activation pruning as a high-utility, low-risk strategy for LLM prefill acceleration, offering compute reductions of over 55% at negligible loss and setting the stage for hardware–software co-design in emerging LLM pipelines (An et al., 4 Aug 2025).