
Skew-Aware Dynamic Sparsity Allocation

Updated 27 November 2025
  • The paper introduces SDSA, a method that dynamically allocates sparsity based on each layer's skewness to protect critical outlier weights.
  • It employs statistical measures and a softmax-based reweighting scheme to adjust sparsity allocation, ensuring robustness at high global sparsity.
  • Empirical results show SDSA markedly lowers perplexity compared to uniform pruning, even under extreme sparsity levels.

Skew-aware Dynamic Sparsity Allocation (SDSA) is a methodology for allocating layer-wise sparsity in the pruning of LLMs, designed to counteract severe performance degradation commonly encountered under high global pruning ratios. SDSA directly addresses the "outlier value issue" arising from the highly skewed distribution of absolute weights in many transformer layers. By dynamically scaling protection against pruning based on layer-wise skewness, SDSA achieves marked improvements in perplexity (PPL) over uniform and other adaptive sparsity allocation schemes, particularly at aggressive sparsity levels. SDSA is implemented as a lightweight, training-free allocation algorithm compatible with a broad spectrum of unstructured and structured pruning approaches (Kang et al., 19 Nov 2025).

1. Motivation: Outlier Value Issue and the Need for Skew Awareness

Uniform pruning methods, which drop an identical fraction of weights from each layer, implicitly assume all layers tolerate sparsity equally. Empirically, this assumption fails in transformers, where absolute weight distributions in many layers are highly right-skewed: a small number of large-magnitude weights ("outliers") coexist with a long tail of small values. Uniformly pruning such layers can result in the accidental removal of these few critical outlier weights, causing catastrophic increases in perplexity—especially at high pruning ratios (e.g., 70 %).

Experiments reveal that layers with large positive skewness ($\gamma_\ell$) in their absolute weight distributions exhibit extreme vulnerability: uniform pruning at high levels causes their PPL to "explode" (see Figs. 3, 4 and associated results). In contrast, layers with lower (or negative) skewness can withstand more aggressive sparsity with only mild impact. This creates a demand for a sparsity allocation scheme that (i) non-uniformly distributes the pruning budget, (ii) dynamically protects outlier-prone layers, and (iii) ramps up skew sensitivity in proportion to the global sparsity target—precisely the mandate of SDSA (Kang et al., 19 Nov 2025).

2. Formal Specification and Theoretical Foundation

Let $L$ denote the number of layers, with $n_\ell$ weights in layer $\ell = 1,\dots,L$. The following definitions and formulas formalize the SDSA allocation process:

  • Layer absolute-magnitude statistics: For layer $\ell$ with weights $W_\ell = \{w_{\ell,i}\}_{i=1}^{n_\ell}$,
    • Mean: $\overline{w}_\ell = \frac{1}{n_\ell}\sum_i |w_{\ell,i}|$
    • Variance: $m_{2,\ell} = \frac{1}{n_\ell}\sum_i (|w_{\ell,i}|-\overline{w}_\ell)^2$
    • Biased skewness: $\gamma_\ell = \frac{1}{n_\ell} \sum_i \left(\frac{|w_{\ell,i}|-\overline{w}_\ell}{\sqrt{m_{2,\ell}}}\right)^3$
  • Range normalization (for cross-layer comparability):

$$\overline{\gamma} = \frac{1}{L}\sum_{j=1}^L \gamma_j, \quad \Delta\gamma = \max_j |\gamma_j - \overline{\gamma}| + \epsilon, \quad \tilde{\gamma}_\ell = \frac{\gamma_\ell - \overline{\gamma}}{\Delta\gamma}$$

  • Layer protection weights:

$$\omega_\ell = \frac{\exp\left( \beta \cdot \tilde{\gamma}_\ell \right)}{\sum_{j=1}^L \exp\left( \beta \cdot \tilde{\gamma}_j \right)}$$

where $\omega_\ell$ increases with $\tilde{\gamma}_\ell$, implementing greater protection for layers with higher positive skew.

  • Dynamic temperature schedule for $\beta$:

To prevent over-allocation, the maximal ratio of protection is bounded by $M$ (e.g., $M = 1.8$):

$$\omega_{\max} / \omega_{\min} = \exp(\beta \cdot \Delta \tilde{\gamma}) \leq M \implies \beta = \frac{\ln M}{\Delta \tilde{\gamma} + \epsilon}$$

The schedule scales with the global sparsity $S_g$:

$$\beta(S_g) = S_g \cdot \frac{\ln M}{\Delta \tilde{\gamma} + \epsilon}$$

For $S_g \to 0$, all $\omega_\ell \to 1/L$ (uniform allocation); for large $S_g$, the allocation becomes maximally skew-sensitive.

  • Translation to per-layer sparsity:

Let $r_\ell = (1-S_g)\cdot \omega_\ell$ (local retention) and $s_\ell = 1 - r_\ell$ (local sparsity). The allocation is chosen so that the global sparsity constraint is met exactly: $\sum_\ell s_\ell n_\ell = S_g N_\text{total}$.
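The allocation steps above can be sketched in NumPy. This is an illustrative reimplementation, not the authors' reference code; in particular, scaling retention by the layer count $L$ so that average retention equals $1 - S_g$ is an assumption made here to satisfy the stated global budget when layers are equally sized:

```python
import numpy as np

def sdsa_allocate(layers, S_g, M=1.8, eps=1e-8):
    """Per-layer sparsity levels from absolute-weight skewness (SDSA sketch)."""
    # Biased skewness of the absolute-weight distribution, per layer
    g = np.empty(len(layers))
    for i, W in enumerate(layers):
        a = np.abs(np.asarray(W)).ravel()
        mu = a.mean()
        m2 = ((a - mu) ** 2).mean()
        g[i] = (((a - mu) / np.sqrt(m2)) ** 3).mean()

    # Range normalization for cross-layer comparability
    g_bar = g.mean()
    delta = np.max(np.abs(g - g_bar)) + eps
    g_tilde = (g - g_bar) / delta

    # Temperature scaled by global sparsity, capped so w_max / w_min <= M
    beta = S_g * np.log(M) / ((g_tilde.max() - g_tilde.min()) + eps)

    # Softmax protection weights: higher skew -> more protection
    w = np.exp(beta * g_tilde)
    w /= w.sum()

    # Retention / sparsity per layer.  Scaling by L keeps the average
    # retention at (1 - S_g) -- an assumption adopted here so the global
    # budget comes out exactly for equally sized layers.
    L = len(layers)
    r = (1.0 - S_g) * L * w
    return 1.0 - r
```

On a toy model with one near-Gaussian layer and one heavy-tailed (high-skew) layer, the heavy-tailed layer receives the lower sparsity, as intended.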

3. Algorithmic Workflow and Pseudocode

Within AutoPrune, SDSA operates according to the following workflow:

  1. Importance-function discovery: AutoPrune employs a Graph-driven Chain-of-Thought (GCoT) module to search for per-weight importance functions $I^*$, using calibration data and examining both uniform and SDSA allocations.
  2. Layer-wise SDSA allocation:
    • Compute $\gamma_\ell$ and normalize to $\tilde{\gamma}_\ell$ for each layer.
    • Use $\beta(S_g)$ to compute softmax protection weights $\omega_\ell$.
    • Derive $s_\ell$ for each layer.
  3. Application of unstructured pruning:
    • For each layer, evaluate $I^*$.
    • Prune the $s_\ell n_\ell$ least-important weights.

Key parameters:

  • $M = 1.8$ for layer-wise allocation ($1.5$ for block-wise; see Table D.3)
  • $\epsilon = 10^{-8}$ to avoid division by zero

No training or gradient computation is required by SDSA itself; calibration (128 WikiText-2 samples) is exclusively used during GCoT search, not for SDSA allocation or pruning.
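Step 3 of the workflow can be sketched as follows. Plain weight magnitude stands in for the GCoT-discovered importance function $I^*$, which is not specified in this summary:

```python
import numpy as np

def prune_layer(W, s_layer):
    """Zero out the s_layer fraction of least-important weights in W.

    Weight magnitude is used as a stand-in for the searched importance
    function I* (an assumption; the real I* comes from GCoT discovery).
    """
    importance = np.abs(W).ravel()
    k = int(round(s_layer * W.size))       # number of weights to drop
    mask = np.ones(W.size, dtype=bool)
    if k > 0:
        # indices of the k least-important entries, found in O(n) time
        mask[np.argpartition(importance, k - 1)[:k]] = False
    return (W.ravel() * mask).reshape(W.shape)
```

Combined with `sdsa_allocate`-style per-layer sparsities $s_\ell$, this reproduces the "prune the $s_\ell n_\ell$ least-important weights" step.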

4. Implementation and Computational Considerations

SDSA requires only:

  • An $O(N_\text{total})$ scan over all weights for mean, variance, and skewness computation,
  • An $O(L)$ softmax operation for per-layer allocation,
  • Reweighting, importance scoring, and mask construction for the pruning procedure.

The minimal overhead of SDSA is negligible compared to the cost of computing activation- or Hessian-based importance metrics. The design further supports structured pruning (e.g., 2:4 or 4:8 blockwise sparsity layouts, with suitable selection of $M$), without requiring any additional calibration beyond what is used for importance scoring.
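For the block-wise layouts mentioned above, a 2:4 semi-structured mask keeps the two highest-importance weights in every group of four. A minimal sketch, with magnitude as the importance proxy (how AutoPrune combines this pattern with the SDSA budget is not detailed in this summary):

```python
import numpy as np

def mask_2_4(W):
    """Boolean 2:4 mask: keep the 2 largest-magnitude weights per group of 4.

    Assumes W's size is divisible by 4; magnitude is a stand-in
    importance score (an assumption).
    """
    groups = np.abs(W).reshape(-1, 4)
    order = np.argsort(groups, axis=1)           # ascending within each group
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, order[:, :2], False, axis=1)  # drop the 2 smallest
    return mask.reshape(W.shape)
```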

5. Empirical Evaluation: Comparative Performance

On LLaMA-1/2 models pruned to $S_g = 60\%$ sparsity and evaluated with WikiText PPL:

| Method | PPL | ΔPPL |
|---|---|---|
| SparseGPT | 10.51 | |
| + SDSA | 8.85 | ↓1.66 |
| Wanda | 10.66 | |
| + SDSA | 9.88 | ↓0.78 |
| AutoPrune (GCoT-derived) | 9.63 | |
| + SDSA | 9.42 | ↓0.21 |

Layer-wise allocation benchmarks across sparsities on LLaMA-1 7B (WikiText):

| Sparsity | Uniform | ER | ER-plus | OWL | SDSA |
|---|---|---|---|---|---|
| 30% | 5.99 | 6.02 | 6.05 | 6.01 | 5.98 |
| 40% | 6.38 | 6.55 | 6.62 | 6.42 | 6.33 |
| 50% | 7.26 | 7.74 | 8.00 | 7.41 | 7.13 |
| 60% | 10.70 | 12.16 | 14.04 | 12.04 | 9.88 |
| 70% | 85.77 | 112.03 | 229.17 | 175.06 | 66.93 |
| 80% | 3500 | 11200 | 6010 | 3240 | 892 |

Across every sparsity regime, including the extreme 70% and 80% settings, SDSA achieves the lowest perplexity. Competing uniform and simple adaptive methods incur massive PPL increases at high sparsity, while SDSA maintains stability, indicating robust protection of critical outlier weights.

6. Mechanistic Explanation and Potential Extensions

The effectiveness of SDSA is directly attributable to its characterization of "outlierness" through skewness statistics, and to the adaptive scaling of protection with the global pruning target $S_g$. At low pruning ratios ($S_g < 0.3$), the $\beta$ schedule treats all layers nearly uniformly, making full use of the pruning budget. As pruning intensifies ($S_g \geq 0.5$), SDSA sharply prioritizes preserving outlier-dominated layers, minimizing the risk of accidentally removing salient weights. This mechanism underpins the observed resilience of perplexity, preventing the characteristic "PPL explosion" even at aggressive sparsity.
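This limiting behavior can be checked numerically; the normalized skewness values in the sketch below are hypothetical:

```python
import numpy as np

def protection_weights(g_tilde, S_g, M=1.8, eps=1e-8):
    """Softmax protection weights under the sparsity-scaled temperature."""
    beta = S_g * np.log(M) / ((g_tilde.max() - g_tilde.min()) + eps)
    w = np.exp(beta * g_tilde)
    return w / w.sum()

g = np.array([-1.0, 0.2, 1.0])             # hypothetical normalized skewness
w_low = protection_weights(g, S_g=0.01)    # near-uniform: each close to 1/3
w_high = protection_weights(g, S_g=1.0)    # spread capped at w_max/w_min <= M
```

At small $S_g$ the weights are nearly uniform; at $S_g = 1$ the ratio $\omega_{\max}/\omega_{\min}$ sits at the cap $M = 1.8$.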

Potential directions for further research include:

  • Applying SDSA to structured block pruning (e.g., 2:4, 4:8 patterns),
  • Substituting skewness with more sophisticated distributional descriptors (e.g., kurtosis) for even finer allocation,
  • Extending SDSA to architectures with similar outlier phenomena (e.g., vision transformers, CNNs),
  • Integrating SDSA with dynamic, training-time pruning for further compression benefits (Kang et al., 19 Nov 2025).

7. Contextual Significance and Perspectives

SDSA constitutes a robust, principled framework for layer-wise sparsity allocation in LLMs. By leveraging easily-computed distributional metrics, it bypasses the need for expert-designed allocation heuristics or expensive retraining, making it especially valuable for post hoc model compression workflows. Its demonstrated compatibility with multiple pruning strategies, empirical robustness to protection cap MM, and generalizability to both unstructured and blockwise settings highlight its practical importance. The methodology provides a paradigm for respecting the intrinsic statistical structure of neural weights in high-compression regimes, opening avenues for further refinement and domain transfer in neural model pruning research (Kang et al., 19 Nov 2025).
