Maximum Absolute Weight (MAW) Pruning
- Maximum Absolute Weight (MAW) scores each neuron in a GLU-MLP layer by summing the maximum absolute weights of its up and gate projections; this per-neuron score guides channel pruning decisions.
- MAW pruning selectively removes low-importance neurons, adjusting expansion ratios and enabling a trade-off between factual knowledge retention and improved instruction-following capabilities.
- Empirical results show that MAW enables gradual performance degradation with moderate increases in perplexity while reducing energy use per token and latency in large transformer models.
The Maximum Absolute Weight (MAW) criterion is a structured neuron importance measure used for width pruning in Gated Linear Unit (GLU) multilayer perceptron (MLP) architectures. Within this framework, MAW enables pruning of neural channels by quantifying their representational strength based on the maximum absolute incoming weight in each expansion projection. When applied to Llama-3.2 models, MAW-guided pruning exposes a dichotomy between parametric knowledge retention and instruction-following capabilities, providing a controlled mechanism to study selective degradation and enhancement of cognitive behaviors in transformer LLMs (Martra, 27 Dec 2025).
1. Formal Definition
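MAW is defined at the channel (neuron) level within each GLU-MLP layer, which features two expansion weight matrices, the up projection and the gate projection:

$$W_{\text{up}},\ W_{\text{gate}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}},$$

where column $k$ holds the incoming weights of neuron $k$.

For neuron $k$,

$$\text{MAW}_{\text{up}}(k) = \max_i \left| W_{\text{up}}[i,k] \right|, \qquad \text{MAW}_{\text{gate}}(k) = \max_i \left| W_{\text{gate}}[i,k] \right|.$$

The combined importance score is given by:

$$\text{score}(k) = \text{MAW}_{\text{up}}(k) + \text{MAW}_{\text{gate}}(k).$$

This score guides the selection of channels to prune, always treating the paired up/gate projections jointly.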
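As a small illustration of the definition, the sketch below computes the per-neuron score (largest absolute up-projection weight plus largest absolute gate-projection weight) for a toy layer. The function name `maw_scores`, the nested-list representation, and the weight values are assumptions made for this example only; they are not taken from the source.

```python
def maw_scores(w_up, w_gate):
    """Return one MAW score per neuron (column) of the two expansion matrices.

    w_up and w_gate are nested lists of shape [d_model][d_ff]; column k holds
    the incoming weights of neuron k, matching the column convention used in
    this article.
    """
    d_ff = len(w_up[0])
    scores = []
    for k in range(d_ff):
        maw_up = max(abs(row[k]) for row in w_up)      # strongest up weight
        maw_gate = max(abs(row[k]) for row in w_gate)  # strongest gate weight
        scores.append(maw_up + maw_gate)
    return scores


# Toy layer with d_model = 2 and d_ff = 3 (made-up values).
W_UP = [[0.5, -0.125, 0.25],
        [-2.0, 0.0625, 0.5]]
W_GATE = [[0.25, 0.125, -1.5],
          [0.5, -0.0625, 0.75]]

SCORES = maw_scores(W_UP, W_GATE)
print(SCORES)  # [2.5, 0.25, 2.0]
```

Neuron 1 has the smallest score here, so it would be the first candidate for removal.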
2. Conceptual Rationale
The MAW criterion leverages several principles:
- Neuron influence via absolute weight: The magnitude of weights can proxy a neuron's representational power in the model. The largest absolute weight entering a neuron signifies its strongest coupling to any input feature.
- GLU pairing constraint: Summing MAWs over the up and gate projections respects the functional requirement of GLU layers that neurons be removed in pairs.
- Pruning low-importance channels: Small maximum absolute weights indicate channels whose removal minimally perturbs the information flow, providing a structural regularization effect.
- Parametric knowledge filtration: A plausible implication is that MAW-based pruning excises over-specialized or weakly memorized factual associations, while preserving neurons crucial for sequential or process-driven reasoning.
3. Algorithmic Procedure
Uniform MAW pruning proceeds as follows for each GLU-MLP layer to achieve a target pruning percentage $p$:
```
for each GLU-MLP layer L in the model:
    let W_up, W_gate = L.up_proj.weight, L.gate_proj.weight
    d_ff = number of neurons in W_up (columns)

    # 1. Compute per-neuron MAW scores
    for k in 1…d_ff:
        maw_up[k]   = max(abs(W_up[:,k]))
        maw_gate[k] = max(abs(W_gate[:,k]))
        score[k]    = maw_up[k] + maw_gate[k]

    # 2. Decide how many neurons to remove
    num_remove = floor(p/100 * d_ff)

    # 3. Select neurons with smallest scores
    idx_sorted = argsort(score)          # ascending
    prune_idx  = idx_sorted[:num_remove]

    # 4. Remove those neurons from both projections
    new_W_up   = remove_columns(W_up, prune_idx)
    new_W_gate = remove_columns(W_gate, prune_idx)
    new_W_down = remove_rows(L.down_proj.weight, prune_idx)

    # 5. Replace layer weights
    L.up_proj.weight, L.gate_proj.weight, L.down_proj.weight = new_W_up, new_W_gate, new_W_down
end for
```
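The procedure above can be sketched as runnable Python for a single layer. This is a minimal illustration under stated assumptions, not the source's implementation: weights are plain nested lists, `prune_glu_layer` is a hypothetical helper name, neurons are columns of the up/gate matrices and rows of the down matrix (the same convention as the pseudocode), and ties in the score are broken by neuron index.

```python
import math


def prune_glu_layer(w_up, w_gate, w_down, p):
    """Remove the floor(p/100 * d_ff) lowest-scoring neurons from one layer.

    w_up and w_gate have shape [d_model][d_ff] (neurons are columns);
    w_down has shape [d_ff][d_model] (neurons are rows).
    Returns the pruned matrices and the indices of the neurons that were kept.
    """
    d_ff = len(w_up[0])
    # Per-neuron MAW score: max |up weight| + max |gate weight|.
    scores = [
        max(abs(row[k]) for row in w_up) + max(abs(row[k]) for row in w_gate)
        for k in range(d_ff)
    ]
    num_remove = math.floor(p / 100 * d_ff)
    # Ascending by score; tie-breaking by index is an assumption of this sketch.
    order = sorted(range(d_ff), key=lambda k: (scores[k], k))
    pruned = set(order[:num_remove])
    keep = [k for k in range(d_ff) if k not in pruned]
    # Drop the same neurons from all three projections so shapes stay aligned.
    new_up = [[row[k] for k in keep] for row in w_up]
    new_gate = [[row[k] for k in keep] for row in w_gate]
    new_down = [w_down[k] for k in keep]
    return new_up, new_gate, new_down, keep


# Toy layer with d_model = 2 and d_ff = 4 (made-up values), pruned by 50%.
W_UP = [[0.5, -0.125, 0.25, 1.0],
        [-2.0, 0.0625, 0.5, -0.25]]
W_GATE = [[0.25, 0.125, -1.5, 0.5],
          [0.5, -0.0625, 0.75, 0.125]]
W_DOWN = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

NEW_UP, NEW_GATE, NEW_DOWN, KEPT = prune_glu_layer(W_UP, W_GATE, W_DOWN, 50)
print(KEPT)  # [0, 2]
```

When porting this to PyTorch, note that `torch.nn.Linear` stores its weight with shape `[out_features, in_features]`, so the neurons of an up or gate projection would be rows of the stored tensor and the neurons of the down projection would be columns.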
4. Integration with Expansion Ratio Schedules
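Pruning serves to adjust the expansion ratio $r = d_{\text{ff}} / d_{\text{model}}$, a core architectural parameter describing the width of GLU-MLP layers relative to the model dimension. For Llama-3.2 models, starting expansion ratios are $4.0\times$ for the 1B and $2.67\times$ for the 3B parameter variants. The final expansion ratio after pruning $p$ percent is

$$r_{\text{final}} = r_0 \left(1 - \frac{p}{100}\right),$$

where $r_0$ is the starting ratio; the result is exact up to rounding of the remaining neuron count to an integer.

The study applied MAW-guided pruning at seven levels, $p \in \{0, 10, 20, 30, 40, 50, 60\}\%$, producing discrete $r$ values per model size as follows: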
| Model | p (%) | r | d_ff | Notes |
|---|---|---|---|---|
| 1B | 0 | 4.0× | 8192 | baseline |
| 1B | 10 | 3.6× | 7373 | |
| 1B | 20 | 3.2× | 6554 | |
| 1B | 30 | 2.8× | 5735 | ... |
| 1B | 40 | 2.4× | 4916 | equilibrium |
| 1B | 50 | 2.0× | 4096 | |
| 1B | 60 | 1.6× | 3277 | |
| 3B | 0 | 2.67× | 8192 | baseline |
| 3B | 10 | 2.40× | 7373 | equilibrium |
| 3B | 20 | 2.13× | 6554 | |
| 3B | 30 | 1.87× | 5735 | |
| 3B | 40 | 1.60× | 4916 | |
| 3B | 50 | 1.33× | 4096 | |
| 3B | 60 | 1.07× | 3277 | |
Uniform MAW pruning across all GLU-MLP layers enabled systematic investigation into capability transitions at each configuration.
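The widths and ratios in the schedule can be recomputed with a few lines of Python. The sketch below assumes model dimensions of 2048 (1B) and 3072 (3B), chosen because they reproduce the quoted baseline ratios of 4.0× and 2.67× for a width of 8192; the helper name `pruned_width` is hypothetical.

```python
def pruned_width(d_ff, d_model, p):
    """Return (remaining d_ff, expansion ratio) after pruning p percent.

    p is an integer percentage; the number of removed neurons is
    floor(p/100 * d_ff), computed here with integer arithmetic.
    """
    num_remove = (p * d_ff) // 100
    remaining = d_ff - num_remove
    return remaining, remaining / d_model


# d_model values are assumptions consistent with the baseline ratios above.
SCHEDULE = {
    name: [pruned_width(8192, d_model, p) for p in range(0, 70, 10)]
    for name, d_model in (("1B", 2048), ("3B", 3072))
}

WIDTHS_1B = [width for width, _ in SCHEDULE["1B"]]
RATIOS_1B = [round(ratio, 2) for _, ratio in SCHEDULE["1B"]]
RATIOS_3B = [round(ratio, 2) for _, ratio in SCHEDULE["3B"]]
print(WIDTHS_1B)  # [8192, 7373, 6554, 5735, 4916, 4096, 3277]
```

Both model sizes share the same remaining widths because they start from the same intermediate size; only the ratio differs.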
5. Empirical and Theoretical Justifications
Empirical evidence from Martra (27 Dec 2025) shows:
- Graceful capability degradation: MAW pruning produces moderate increases in LLM perplexity (Lambada: +259% at 10%, WikiText: +51% at 10%) compared to catastrophic collapse observed with alternative criteria (VOW: +9,207%, PON: +35,440%).
- Capability dichotomy: Factual knowledge tasks (MMLU, GSM8K, perplexity) degrade predictably with width reduction. Conversely, instruction-following (IFEval) and model truthfulness (TruthfulQA-MC2) increase, often substantially (IFEval +46–75%).
- Inverse task correlation: The source reports a robust inverse correlation between factual knowledge and truthfulness scores for the Llama-3.2 3B model, linking pruning-induced knowledge decay to an improved ability to discriminate misconceptions.
Theoretical justifications include:
- Representational strength: Absolute weight magnitude is a proxy for a neuron's contribution. Pruning low-magnitude neurons acts as a regularizer that preferentially excises "weak" facts or features.
- GLU suitability: The summation over up/gate maxima provides a natural, non-fragmenting pairing suitable for GLU architectures.
6. Experimental Results
Key results covering baseline, equilibrium, and aggressive pruning configurations (condensed from Table 3 of Martra, 27 Dec 2025):
| Benchmark | 1B (4.0×) | 1B (2.4×) | 1B (1.6×) | 3B (2.67×) | 3B (2.4×) | 3B (1.07×) |
|---|---|---|---|---|---|---|
| MMLU | 0.311 | 0.269 | 0.255 | 0.561 | 0.433 | 0.259 |
| GSM8K | 0.064 | 0.009 | 0.007 | 0.264 | 0.135 | 0.011 |
| IFEval | 0.104 | 0.152 | 0.137 | 0.094 | 0.131 | 0.133 |
| TruthfulQA-MC2 | 0.377 | 0.430 | 0.466 | 0.392 | 0.377 | 0.457 |
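At the approximate equilibrium expansion ratio ($r \approx 2.4\times$, reached at 40% pruning for 1B and 10% pruning for 3B):
- IFEval scores increased by 46% (1B) and 39% (3B) over baseline.
- MMLU scores decreased to 86% (1B) and 77% (3B) of baseline.

Other headline changes:
- TruthfulQA-MC2 improved by +23.6% (1B) and +16.7% (3B) over baseline; these figures correspond to the most aggressive configurations in the table (1B at 1.6×, 3B at 1.07×), not to the equilibrium point.
- Energy use per token dropped by up to 23% for single requests and was up to 4.6× lower at batch size 8, at a latency cost of up to 18% in interactive settings.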
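This suggests that MAW pruning selectively weakens parametric knowledge storage while strengthening instruction-following behavior. The GSM8K column indicates that multi-step arithmetic reasoning degrades along with factual knowledge rather than improving.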
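The IFEval and MMLU percentages quoted for the 2.4× configurations can be recomputed directly from the baseline and 2.4× columns of the results table. The helper names below are hypothetical; the scores are copied from the table.

```python
# (baseline score, score at the 2.4x configuration), copied from the table.
RESULTS = {
    ("IFEval", "1B"): (0.104, 0.152),
    ("IFEval", "3B"): (0.094, 0.131),
    ("MMLU", "1B"): (0.311, 0.269),
    ("MMLU", "3B"): (0.561, 0.433),
}


def percent_change(baseline, pruned):
    """Relative change of the pruned score versus baseline, in percent."""
    return 100.0 * (pruned - baseline) / baseline


def percent_retained(baseline, pruned):
    """Pruned score expressed as a percentage of the baseline score."""
    return 100.0 * pruned / baseline


IFEVAL_GAIN = {
    m: round(percent_change(*RESULTS[("IFEval", m)])) for m in ("1B", "3B")
}
MMLU_KEPT = {
    m: round(percent_retained(*RESULTS[("MMLU", m)])) for m in ("1B", "3B")
}
print(IFEVAL_GAIN)  # {'1B': 46, '3B': 39}
print(MMLU_KEPT)    # {'1B': 86, '3B': 77}
```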
7. Comparison to Alternative Criteria
Three criteria were evaluated:
| Criterion | Lambada PPL @10% | Δ vs base | WikiText PPL @10% | Δ vs base |
|---|---|---|---|---|
| Baseline | 5.75 | — | 11.57 | — |
| MAW | 20.59 | +259% | 17.50 | +51% |
| VOW | 532.36 | +9,207% | 50.56 | +337% |
| PON | 2,032.80 | +35,440% | 72.52 | +527% |
Only MAW permitted controlled, gradual pruning. Variance of Weights (VOW) and Product of Norms (PON) induced immediate and catastrophic loss of basic language modeling capabilities, making them unsuitable for architectural sweeps or capability trajectory studies.
A plausible implication is that maximum-based criteria are robust to pruning-induced instability, whereas aggregate-statistic criteria can amplify aberrant channel-level effects.
In summary, the Maximum Absolute Weight criterion operationalizes structural pruning via per-channel “max-abs” weight analysis in GLU-MLP layers, summing importance across paired projections to respect architectural coherence. Its empirical utility lies in enabling non-catastrophic capability modulation, revealing a dichotomy between factual knowledge retention and instruction-following alignment, and providing systematic control over expansion ratio adjustments in large-scale transformer models (Martra, 27 Dec 2025).