Maximum Absolute Weight (MAW) Pruning

Updated 3 January 2026

Maximum Absolute Weight (MAW) is defined as the sum of the maximum absolute weights from the up and gate projections in each GLU-MLP layer, guiding channel pruning decisions.
MAW pruning selectively removes low-importance neurons, adjusting expansion ratios and enabling a trade-off between factual knowledge retention and improved instruction-following capabilities.
Empirical results show that MAW pruning degrades performance gradually, with moderate increases in perplexity, while reducing energy use per token in large transformer models at some latency cost in interactive settings.

The Maximum Absolute Weight (MAW) criterion is a structured neuron importance measure used for width pruning in Gated Linear Unit (GLU) multilayer perceptron (MLP) architectures. Within this framework, MAW enables pruning of neural channels by quantifying their representational strength based on the maximum absolute incoming weight in each expansion projection. When applied to Llama-3.2 models, MAW-guided pruning exposes a dichotomy between parametric knowledge retention and instruction-following capabilities, providing a controlled mechanism to study selective degradation and enhancement of cognitive behaviors in transformer LLMs (Martra, 27 Dec 2025).

1. Formal Definition

MAW is defined at the channel (neuron) level within each GLU-MLP layer, which features two expansion weight matrices:

$W_{up} \in \mathbb{R}^{d_{model} \times d_{ff}}$
$W_{gate} \in \mathbb{R}^{d_{model} \times d_{ff}}$

For neuron $k \in \{1, \dots, d_{ff}\}$ ,

$\mathrm{MAW}_{up}(k) = \max_{i \in [1, d_{model}]} |[W_{up}]_{i,k}|$

$\mathrm{MAW}_{gate}(k) = \max_{i \in [1, d_{model}]} |[W_{gate}]_{i,k}|$

The combined importance score $s_k$ is given by:

$s_k = \mathrm{MAW}_{up}(k) + \mathrm{MAW}_{gate}(k)$

This score guides the selection of channels to prune, always treating the paired up/gate projections jointly.
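As a minimal sketch of the score defined above, the following NumPy snippet computes $s_k$ for a toy layer. The function name `maw_scores` and the example matrices are illustrative assumptions, not taken from the source; weights are laid out as (d_model, d_ff) so that neuron $k$ is column $k$.

```python
import numpy as np


def maw_scores(w_up: np.ndarray, w_gate: np.ndarray) -> np.ndarray:
    """Return s_k = MAW_up(k) + MAW_gate(k) for every neuron k.

    Both matrices are assumed to have shape (d_model, d_ff), so neuron k
    corresponds to column k, matching the definition in the text.
    """
    if w_up.shape != w_gate.shape:
        raise ValueError("up and gate projections must have the same shape")
    maw_up = np.abs(w_up).max(axis=0)      # max over input features i
    maw_gate = np.abs(w_gate).max(axis=0)
    return maw_up + maw_gate


# Toy layer (hypothetical values): d_model = 2, d_ff = 3
w_up = np.array([[0.5, -0.125, 0.0],
                 [-2.0, 0.25, 0.25]])
w_gate = np.array([[1.0, 0.0625, -0.5],
                   [0.25, -0.125, 0.125]])

scores = maw_scores(w_up, w_gate)
print(scores)                   # one score per neuron
print(int(np.argmin(scores)))   # index of the first neuron that would be pruned
```

In this toy case the second neuron has small weights in both projections, so it receives the lowest score and would be removed first.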

2. Conceptual Rationale

The MAW criterion leverages several principles:

Neuron influence via absolute weight: The magnitude of weights can proxy a neuron's representational power in the model. The largest absolute weight entering a neuron signifies its strongest coupling to any input feature.
GLU pairing constraint: Summing MAWs over the up and gate projections respects the functional requirement of GLU layers that neurons be removed in pairs.
Pruning low-importance channels: Small maximum absolute weights indicate channels whose removal minimally perturbs the information flow, providing a structural regularization effect.
Parametric knowledge filtration: A plausible implication is that MAW-based pruning excises over-specialized or weakly memorized factual associations, while preserving neurons crucial for sequential or process-driven reasoning.

3. Algorithmic Procedure

To reach a target pruning percentage $p$, uniform MAW pruning applies the following steps to each GLU-MLP layer:

for each GLU-MLP layer L in the model:
    let W_up, W_gate = L.up_proj.weight, L.gate_proj.weight
    d_ff = number of neurons in W_up (columns)
    # 1. Compute per-neuron MAW scores
    for k in 1…d_ff:
        maw_up[k]   = max(abs(W_up[:,k]))
        maw_gate[k] = max(abs(W_gate[:,k]))
        score[k]    = maw_up[k] + maw_gate[k]
    # 2. Decide how many neurons to remove
    num_remove = floor(p/100 * d_ff)
    # 3. Select neurons with smallest scores
    idx_sorted = argsort(score)        # ascending
    prune_idx  = idx_sorted[:num_remove]
    # 4. Remove those neurons from both projections
    new_W_up   = remove_columns(W_up,   prune_idx)
    new_W_gate = remove_columns(W_gate, prune_idx)
    new_W_down = remove_rows(   L.down_proj.weight, prune_idx)
    # 5. Replace layer weights
    L.up_proj.weight, L.gate_proj.weight, L.down_proj.weight = new_W_up, new_W_gate, new_W_down
end for

This protocol is applied identically in every GLU-MLP layer, so the same fraction $p$ of neurons is removed from each layer. Scores are ranked within a layer, not across layers.
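The per-layer steps above can be sketched in runnable form with NumPy. This is an illustration under stated assumptions, not the source's implementation: the function name `maw_prune_layer` is hypothetical, the weights follow the layout used in the formal definition ((d_model, d_ff) for the up and gate projections, (d_ff, d_model) for the down projection), and the random weights are placeholders.

```python
import math

import numpy as np


def maw_prune_layer(w_up, w_gate, w_down, p):
    """Remove the p percent of neurons with the smallest MAW scores.

    Assumed shapes: w_up and w_gate are (d_model, d_ff); w_down is
    (d_ff, d_model). Returns pruned copies plus the kept neuron indices.
    """
    d_ff = w_up.shape[1]
    # 1. Per-neuron MAW score: column-wise max-abs, summed over up and gate
    scores = np.abs(w_up).max(axis=0) + np.abs(w_gate).max(axis=0)
    # 2. Number of neurons to remove: floor(p/100 * d_ff)
    num_remove = math.floor(p * d_ff / 100)
    # 3. Ascending sort; everything after the first num_remove entries is kept
    order = np.argsort(scores, kind="stable")
    keep = np.sort(order[num_remove:])  # keep the original neuron ordering
    # 4. Drop the same neurons from all three projections
    return w_up[:, keep], w_gate[:, keep], w_down[keep, :], keep


# Placeholder weights for a tiny layer (d_model = 4, d_ff = 10)
rng = np.random.default_rng(0)
d_model, d_ff = 4, 10
w_up = rng.normal(size=(d_model, d_ff))
w_gate = rng.normal(size=(d_model, d_ff))
w_down = rng.normal(size=(d_ff, d_model))

up_p, gate_p, down_p, keep = maw_prune_layer(w_up, w_gate, w_down, p=30)
print(up_p.shape, gate_p.shape, down_p.shape)  # (4, 7) (4, 7) (7, 4)
```

Note that PyTorch's `nn.Linear` stores its weight as (out_features, in_features), so weights read from such modules would be the transpose of the layout assumed here, and the column and row roles would swap accordingly.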

4. Integration with Expansion Ratio Schedules

Pruning adjusts the expansion ratio $r = d_{ff} / d_{model}$, a core architectural parameter describing the width of GLU-MLP layers relative to the model dimension. For Llama-3.2 models, the starting expansion ratios are $r_0 = 4.0$ for the 1B variant and $r_0 \approx 2.67$ for the 3B variant. The final expansion ratio after pruning $p$ percent of the neurons is

$r = (1 - p/100) \cdot r_0$

The study applied MAW-guided pruning at seven levels: $p \in \{0\%, 10\%, 20\%, 30\%, 40\%, 50\%, 60\%\}$ , producing discrete $r$ values per model size as follows:

Model	p (%)	r	d_ff	Notes
1B (4.0×)	0	4.0×	8192	baseline
	10	3.6×	7373
	20	3.2×	6554
	30	2.8×	5735	...
	40	2.4×	4916	equilibrium
	50	2.0×	4096
	60	1.6×	3277
3B (2.67×)	0	2.67×	8192	baseline
	10	2.40×	7373	eq. point
	20	2.13×	6554
	30	1.87×	5735
	40	1.60×	4916
	50	1.33×	4096
	60	1.07×	3277

Uniform MAW pruning across all GLU-MLP layers enabled systematic investigation into capability transitions at each configuration.
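The widths and ratios in the table can be reproduced with a few lines of arithmetic. In this sketch the `d_model` values (2048 and 3072) are assumptions inferred from the baseline width of 8192 and the stated $r_0$ values, and the helper names are hypothetical. Because the number of removed neurons is rounded down, the resulting ratio only approximately equals $(1 - p/100) \cdot r_0$.

```python
def pruned_width(d_ff: int, p: int) -> int:
    """Neurons kept after removing floor(p/100 * d_ff) of them."""
    return d_ff - (p * d_ff) // 100


def expansion_ratio(d_ff: int, d_model: int) -> float:
    """Expansion ratio r = d_ff / d_model."""
    return d_ff / d_model


# Assumed model dimensions, inferred from d_ff = 8192 and the stated r_0
MODELS = {"1B": 2048, "3B": 3072}
D_FF = 8192

for name, d_model in MODELS.items():
    for p in range(0, 70, 10):
        kept = pruned_width(D_FF, p)
        r = expansion_ratio(kept, d_model)
        print(f"{name}  p={p:2d}%  d_ff={kept}  r={r:.2f}")
```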

5. Empirical and Theoretical Justifications

Empirical evidence from (Martra, 27 Dec 2025) demonstrates:

Graceful capability degradation: At 10% pruning, MAW raises perplexity by 259% on Lambada and 51% on WikiText, compared with the catastrophic collapse observed on Lambada with alternative criteria (VOW: +9,207%, PON: +35,440%).
Capability dichotomy: Factual knowledge tasks (MMLU, GSM8K, perplexity) degrade predictably with width reduction. Conversely, instruction-following (IFEval) and model truthfulness (TruthfulQA-MC2) increase, often substantially (IFEval +46–75%).
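Inverse task correlation: There is a strong inverse correlation (correlation coefficient $r = -0.864$, p-value $p = 0.012$, in the 3B model) between factual knowledge and truthfulness, linking pruning-induced knowledge decay to an improved ability to discriminate misconceptions. Here $r$ and $p$ denote the correlation statistic and its significance level, not the expansion ratio and pruning percentage used elsewhere.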

Theoretical justifications include:

Representational strength: Absolute weight magnitude is a proxy for a neuron's contribution. Pruning low-magnitude neurons acts as a regularizer that preferentially excises "weak" facts or features.
GLU suitability: The summation over up/gate maxima provides a natural, non-fragmenting pairing suitable for GLU architectures.

6. Experimental Results

Key results covering baseline, equilibrium, and aggressive pruning configurations (condensed from Table 3 in (Martra, 27 Dec 2025)):

Benchmark	1B (4.0×)	1B (2.4×)	1B (1.6×)	3B (2.67×)	3B (2.4×)	3B (1.07×)
MMLU	0.311	0.269	0.255	0.561	0.433	0.259
GSM8K	0.064	0.009	0.007	0.264	0.135	0.011
IFEval	0.104	0.152	0.137	0.094	0.131	0.133
TruthfulQA-MC2	0.377	0.430	0.466	0.392	0.377	0.457

At the approximate equilibrium point $r \approx 2.4$ (40% pruning for 1B, 10% pruning for 3B):

IFEval scores increased by 46% (1B) and 39% (3B) over baseline.
MMLU scores decreased to 86% (1B) and 77% (3B) of baseline.
TruthfulQA-MC2 changed by +14.1% (1B) and -3.8% (3B) according to the table above; the reported gains of +23.6% (1B) and +16.7% (3B) over baseline correspond to the most aggressive configurations (60% pruning).
Energy use per token dropped by up to 23% in single-request serving and by up to 4.6× at batch size 8, at a latency cost of up to 18% in interactive settings.

This suggests that MAW pruning selectively weakens parametric knowledge storage while preserving or strengthening behavioral alignment. In the table above, GSM8K, a multi-step arithmetic benchmark, degrades alongside MMLU.
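The relative changes can be recomputed directly from the table values. The sketch below copies the baseline and $r \approx 2.4$ columns for each model; the dictionary layout and the helper name `relative_change` are illustrative assumptions.

```python
# (baseline score, score at r = 2.4), copied from the results table
SCORES = {
    "1B": {
        "MMLU": (0.311, 0.269),
        "GSM8K": (0.064, 0.009),
        "IFEval": (0.104, 0.152),
        "TruthfulQA-MC2": (0.377, 0.430),
    },
    "3B": {
        "MMLU": (0.561, 0.433),
        "GSM8K": (0.264, 0.135),
        "IFEval": (0.094, 0.131),
        "TruthfulQA-MC2": (0.392, 0.377),
    },
}


def relative_change(base: float, new: float) -> float:
    """Percentage change of `new` relative to `base`."""
    return 100.0 * (new - base) / base


for model, benchmarks in SCORES.items():
    for bench, (base, pruned) in benchmarks.items():
        print(f"{model} {bench}: {relative_change(base, pruned):+.1f}%")
```

Expressing every benchmark as a signed percentage of its own baseline makes the opposite directions of the knowledge and instruction-following metrics easy to compare across the two model sizes.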

7. Comparison to Alternative Criteria

Three criteria were evaluated:

Criterion	Lambada PPL @10%	Δ vs base	WikiText PPL @10%	Δ vs base
Baseline	5.75	—	11.57	—
MAW	20.59	+259%	17.50	+51%
VOW	532.36	+9,207%	50.56	+337%
PON	2,032.80	+35,440%	72.52	+527%

Only MAW permitted controlled, gradual pruning. Variance of Weights (VOW) and Product of Norms (PON) induced immediate and catastrophic loss of basic language modeling capabilities, making them unsuitable for architectural sweeps or capability trajectory studies.

A plausible implication is that maximum-based criteria are robust to pruning-induced instability, whereas aggregate-statistic criteria can amplify aberrant channel-level effects.

In summary, the Maximum Absolute Weight criterion operationalizes structural pruning via per-channel "max-abs" weight analysis in GLU-MLP layers, summing importance across paired projections to respect architectural coherence. Its empirical utility lies in enabling non-catastrophic capability modulation, revealing a dichotomy between factual knowledge retention and instruction-following alignment, and providing systematic control over expansion ratio adjustments in large-scale transformer models (Martra, 27 Dec 2025).


References

Martra (27 Dec 2025). Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2.
