Amber Pruner: Training-Free N:M Activation Sparsity

Updated 20 May 2026
  • Amber Pruner is a training-free N:M activation sparsity method that applies semi-structured masking to accelerate the prefill stage of LLM inference.
  • It leverages a Wanda-style norm-sensitive scoring function and selective layer skipping to reduce compute by over 55% with minimal accuracy loss.
  • The approach integrates with post-training quantization in the Outstanding-sparse framework, enhancing performance across various LLM architectures.

Amber Pruner is a training-free N:M activation sparsity method designed to accelerate the prefill stage of LLM inference by imposing semi-structured activation sparsity without model retraining. It addresses the inefficiencies and accuracy challenges associated with traditional weight sparsity and training-dependent activation sparsity approaches, providing a unified framework for performance-oriented model compression while preserving accuracy and offering seamless integration with post-training quantization (An et al., 4 Aug 2025).

1. Background and Motivation

The prefill stage in autoregressive LLM inference is dominated by dense linear projections—specifically, key, query, value, and MLP gating operations—which are highly compute-intensive when batched. Structured N:M activation sparsity enforces, within every block of M consecutive activation values, that only the top N (by a specified importance criterion) are preserved while the rest are zeroed. Empirically, activations in LLM linear layers tend to have a large proportion (>50%) of values near zero, presenting an opportunity for computational savings far greater than with weight sparsity, which typically causes substantial (>20%) accuracy loss at comparable sparsity ratios when applied without retraining.

Prior activation sparsity approaches (e.g., Q-Sparse, TEAL, Squared-ReLU) require retraining, leverage nonstandard activation functions, or mainly accelerate the decoding phase. In contrast, Amber Pruner focuses on prefill, where most inference time is spent during multi-batch prompt encoding. The method exploits pre-existing activation sparsity and maps efficiently to sparsity-aware hardware via semi-structured N:M masking, minimizing additional indexing overhead.

2. Methodology

2.1 N:M Activation Sparsity Mechanism

Amber Pruner operates by generating an N:M activation mask per layer in the inference graph. Let $X \in \mathbb{R}^{B \times d_{in}}$ denote the activation matrix for a linear projection with batch size $B$ and input dimension $d_{in}$, partitioned into $G = d_{in}/M$ groups of size $M$. For each entry $X_{i,j}$ within group $g$, the robust scoring function is:

$$S^*_{i,j} = |X_{i,j}| \cdot f(\widetilde{W}_{:,j})$$

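where $\widetilde{W}_{:,j}$ is the weight vector for column $j$ standardized by percentile-based outlier removal and normalization, and $f(\cdot)$ is a channel-wise weight norm. This “Wanda-style” norm-sensitive scoring preferentially preserves activations linked to more salient weight channels.

Within each group $g$, the top $N$ activations by $S^*$ are retained; the rest are zeroed:

$$\hat{X}_{i,j} = \begin{cases} X_{i,j}, & S^*_{i,j} \in \operatorname{TopN}\big(\{S^*_{i,k} : k \in g\}\big) \\ 0, & \text{otherwise} \end{cases}$$

where $\operatorname{TopN}$ selects the $N$ largest scores in the group, which ensures that every group of $M$ consecutive activations retains exactly $N$ entries. Sparse masking is carried out before the weight multiplication to enforce the desired sparsity pattern.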
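The scoring and masking steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the percentile bounds, and the choice of standard-deviation normalization followed by a column L2 norm as $f$ are all assumptions made for the example. Weights are taken to have shape `(d_out, d_in)`, so that column `j` lines up with activation channel `j`.

```python
import numpy as np

def channel_weight_score(W, lo_pct=0.5, hi_pct=99.5, eps=1e-8):
    """Stand-in for f: clip weight outliers by percentile, normalize, take column L2 norms.

    W has shape (d_out, d_in). The percentile bounds and the normalization
    by the standard deviation are assumptions for this sketch.
    """
    lo, hi = np.percentile(W, [lo_pct, hi_pct])
    W_clipped = np.clip(W, lo, hi)
    W_std = W_clipped / (W_clipped.std() + eps)
    return np.linalg.norm(W_std, axis=0)  # shape (d_in,)

def nm_mask(X, channel_score, n, m):
    """Boolean mask keeping the n highest-scoring entries in every group of m."""
    batch, d_in = X.shape
    if d_in % m != 0 or not 0 < n <= m:
        raise ValueError("need d_in divisible by m and 0 < n <= m")
    # Score = |activation| * per-channel weight score, grouped along the input dimension.
    scores = (np.abs(X) * channel_score).reshape(batch, d_in // m, m)
    # argpartition places the n largest scores of each group in the last n slots.
    keep = np.argpartition(scores, m - n, axis=-1)[..., m - n:]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(batch, d_in)

def sparse_linear(X, W, n, m):
    """Mask activations first, then multiply by the (dense) weights."""
    mask = nm_mask(X, channel_weight_score(W), n, m)
    return (X * mask) @ W.T
```

Because `argpartition` always returns exactly `n` indices per group, ties are broken arbitrarily but the N:M pattern is always satisfied.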

2.2 Sensitivity-Driven Layer Skipping

Not all layers are equally robust to activation sparsity. Amber Pruner computes the relative perturbation for each projection:

$$\delta = \frac{\lVert Y - \hat{Y} \rVert}{\lVert Y \rVert}$$

where $Y$ is the dense output and $\hat{Y}$ the output under N:M masked activations. Layers with large $\delta$ are considered sensitive (notably o_proj, up_proj) and exempted from pruning; q_proj and gate_proj are skipped in 5–6 layers, and k_proj/v_proj are pruned selectively, all determined via a fast layerwise scan.
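A layerwise sensitivity scan of this kind might look like the sketch below. It is illustrative only: the Frobenius norm, the threshold value, the magnitude-only mask (used here for brevity instead of the weighted score), and the names `relative_perturbation` and `select_layers_to_skip` are assumptions, not details taken from the paper.

```python
import numpy as np

def magnitude_nm_mask(X, n, m):
    """Keep the n largest-magnitude entries in each group of m (simplified scoring)."""
    batch, d_in = X.shape
    groups = np.abs(X).reshape(batch, d_in // m, m)
    keep = np.argpartition(groups, m - n, axis=-1)[..., m - n:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(batch, d_in)

def relative_perturbation(X, W, n, m, eps=1e-12):
    """Norm of (dense output - masked output), relative to the dense output norm."""
    Y_dense = X @ W.T
    Y_sparse = (X * magnitude_nm_mask(X, n, m)) @ W.T
    return np.linalg.norm(Y_dense - Y_sparse) / (np.linalg.norm(Y_dense) + eps)

def select_layers_to_skip(calibration, n, m, threshold):
    """calibration maps a projection name to (activations, weights) from a calibration batch.

    Returns the names whose perturbation exceeds the (assumed) threshold,
    i.e. the projections that would be left dense.
    """
    return {
        name
        for name, (X, W) in calibration.items()
        if relative_perturbation(X, W, n, m) > threshold
    }
```

For example, a projection whose activations already have two zeros in every group of four has zero perturbation at 2:4 and is never skipped, whereas one with uniformly large activations loses half of its output mass and is flagged.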

2.3 Algorithmic Workflow

The core mask generation process in the prefill stage is:

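  1. For each linear projection, check whether the sensitivity scan marked it as skipped; if so, run it densely.
  2. Otherwise, score each activation as $S^*_{i,j} = |X_{i,j}| \cdot f(\widetilde{W}_{:,j})$, using channel statistics computed once from the weights.
  3. Partition each row into groups of $M$ consecutive entries, keep the top $N$ scores per group, and zero the rest.
  4. Multiply the masked activations by the weights.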

The outlined approach is entirely training-free and is realized by lightweight per-tensor statistics and efficient top-k selection.
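One way to wire this workflow into an inference path is a thin wrapper around each projection, sketched below. The class name `SparsePrefillLinear`, the `skip` flag, and the use of a plain column L2 norm as the precomputed channel statistic are hypothetical choices for illustration; the point is only that the per-channel statistic is computed once and the runtime cost is a top-k selection per group.

```python
import numpy as np

class SparsePrefillLinear:
    """Hypothetical wrapper: a linear projection with optional N:M activation masking."""

    def __init__(self, W, n, m, skip=False):
        self.W = W                      # shape (d_out, d_in)
        self.n, self.m, self.skip = n, m, skip
        # Computed once, offline; a plain column L2 norm stands in for f here.
        self.channel_score = np.linalg.norm(W, axis=0)

    def __call__(self, X):
        if self.skip:                   # sensitive projection: stay dense
            return X @ self.W.T
        batch, d_in = X.shape
        scores = (np.abs(X) * self.channel_score).reshape(batch, d_in // self.m, self.m)
        keep = np.argpartition(scores, self.m - self.n, axis=-1)[..., self.m - self.n:]
        mask = np.zeros(scores.shape, dtype=bool)
        np.put_along_axis(mask, keep, True, axis=-1)
        return (X * mask.reshape(batch, d_in)) @ self.W.T

def build_projections(weights, n, m, skip_names):
    """weights maps projection names to weight matrices; skip_names come from a sensitivity scan."""
    return {
        name: SparsePrefillLinear(W, n, m, skip=name in skip_names)
        for name, W in weights.items()
    }
```

A real deployment would hand the masked activations to an N:M sparse matrix-multiply kernel rather than multiplying zeros densely as this sketch does.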

3. Integration with Quantization: Outstanding-sparse Framework

Amber Pruner is integrated within Outstanding-sparse, a pipeline for post-training compression that synergizes N:M activation sparsity with 8-bit quantization (W8A8). The core sequence is:

  1. Start with a SmoothQuant-quantized model, which scales each channel via $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ for some $\alpha \in [0, 1]$.
  2. Invert the scale for activations to amplify outliers: $\hat{s}_j = 1/s_j$ (Outstanding-sparse scaling).
  3. Scale activations with the inverted factor $\hat{s}_j$; quantize to 8-bit.
  4. Generate N:M masks on the scaled activations using the Amber Pruner criterion.
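  5. Apply sparse-dense multiplication in 8-bit on W8A8 weights and activations, $Y = (\mathcal{M} \odot X_q)\, W_q^{\top}$, where $\mathcal{M}$ is the N:M mask and $X_q$, $W_q$ are the 8-bit activations and weights.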

This unified pipeline boosts the efficacy of activation sparsity (by making outliers more prominent under masking) and ensures minimal loss from quantization.
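The five steps can be strung together in a small NumPy sketch. Several details here are assumptions rather than facts from the paper: the inverted scale is taken to be the reciprocal of the SmoothQuant factor (so activations are multiplied by it and weights divided, which leaves the product unchanged), quantization is symmetric per-tensor int8, and the mask score uses an unstandardized column norm of the scaled weights. All function names are hypothetical.

```python
import numpy as np

def smoothquant_scale(X, W, alpha=0.5, eps=1e-8):
    """SmoothQuant per-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    act_max = np.abs(X).max(axis=0) + eps   # per input channel, shape (d_in,)
    w_max = np.abs(W).max(axis=0) + eps     # W is (d_out, d_in)
    return act_max ** alpha / w_max ** (1.0 - alpha)

def quantize_int8(T):
    """Symmetric per-tensor int8 quantization; returns (int8 tensor, scale)."""
    scale = np.abs(T).max() / 127.0 + 1e-12
    return np.clip(np.round(T / scale), -127, 127).astype(np.int8), scale

def topn_mask(scores, n, m):
    """Keep the n largest scores in each group of m along the last axis."""
    batch, d_in = scores.shape
    groups = scores.reshape(batch, d_in // m, m)
    keep = np.argpartition(groups, m - n, axis=-1)[..., m - n:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(batch, d_in)

def sparse_w8a8_linear(X, W, n, m, alpha=0.5):
    s_hat = 1.0 / smoothquant_scale(X, W, alpha)   # assumed inversion of the scale
    X_scaled = X / s_hat                           # amplifies outlier channels
    W_scaled = W * s_hat                           # compensates, so X W^T is unchanged
    mask = topn_mask(np.abs(X_scaled) * np.linalg.norm(W_scaled, axis=0), n, m)
    Xq, sx = quantize_int8(X_scaled)
    Wq, sw = quantize_int8(W_scaled)
    Xq_sparse = np.where(mask, Xq, 0)
    # Integer matmul with int32 accumulation, then dequantize.
    acc = Xq_sparse.astype(np.int32) @ Wq.astype(np.int32).T
    return acc.astype(np.float64) * (sx * sw)
```

With `n == m` the mask keeps everything, so the output should match the dense floating-point product up to quantization error, which is a convenient sanity check before enabling sparsity.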

4. Experimental Results

Empirical benchmarks demonstrate the following:

  • Linear Compute Reduction: Amber Pruner prunes over 55% of linear compute in LLaMA3.1-8B, Qwen2-7B, and Qwen3-30B-A3B (MoE), consistently after layer skipping.
  • Accuracy Impact (zero-shot, BFloat16 baseline vs. Amber Pruner):
    • 2:4: […] to […]
    • 4:8: […]
    • 8:16: […] to […]
  • Outstanding-sparse integration (W8A8 quantization, zero-shot drop vs. BFloat16):
    • 2:4: […] to […]
    • 4:8: […] to […]
    • 8:16: […] to […]
  • Generative and Long-context Tasks: GSM8K (5-shot) drop […] at 4:8 and 8:16, […] at 2:4; LongBench average drop […] (4:8, 8:16), […] (2:4); the MoE model often yields improvements or a […] drop at moderate sparsity.
  • Throughput and Latency (theoretical speed-up for prefill stage):
    • 2:4: […]
    • 4:8: […]
    • 8:16: […]
    • On 8× Ascend 910B hardware, prompt encoding times fall proportionally in multi-batch regimes.

5. Key Observations and Practical Implications

  • Activations naturally exhibit block-structured sparsity that is not present in weights, allowing substantial, hardware-efficient pruning without model retraining or notable accuracy loss ([…] at 8:16).
  • Score computation is both robust and extensible; channel-wise weight norm boosts accuracy retention by favoring more semantically critical neurons.
  • The pruning process and associated mask generation can be applied to both dense and MoE architectures with minimal tuning.
  • Outstanding-sparse scaling amplifies high-magnitude activations, synergizing with top-N masking for effective quantization and sparsity without additional accuracy penalty.
  • Selective layer skipping, governed by simple sensitivity metrics, is essential for limiting deleterious effects in specific projection layers.

6. Prospective Developments

Recommendations for next-generation AI systems emerging from this work include:

  • Development of 16-bit and 8-bit N:M sparse-dense matrix multiplication kernels in hardware and software stacks to maximize real-world speed-up.
  • Systematic integration of channel norm and sensitivity estimation routines into inference software for automated, model-agnostic mask scheduling.
  • Co-design of LLM architectures (projection group sizes, dimensions) to further exploit intrinsic activation sparsity.
  • Extension of similar mask generation strategies to attention pruning and dynamic decoding optimizations for broader inference acceleration.

Amber Pruner and its Outstanding-sparse extension collectively establish training-free N:M activation pruning as a high-utility, low-risk strategy for LLM prefill acceleration, offering linear compute reductions of over 55% at negligible loss and setting the stage for hardware–software co-design in emerging LLM pipelines (An et al., 4 Aug 2025).

References (1)
