Maximum Absolute Weight (MAW) Pruning

Updated 3 January 2026
  • The Maximum Absolute Weight (MAW) score of a neuron is the sum of its maximum absolute weights in the up and gate projections of a GLU-MLP layer; this score guides channel pruning decisions.
  • MAW pruning selectively removes low-importance neurons, adjusting expansion ratios and enabling a trade-off between factual knowledge retention and improved instruction-following capabilities.
  • Empirical results show that MAW pruning degrades performance gradually, with moderate increases in perplexity, while reducing energy use per token at some latency cost in interactive settings.

The Maximum Absolute Weight (MAW) criterion is a structured neuron importance measure used for width pruning in Gated Linear Unit (GLU) multilayer perceptron (MLP) architectures. Within this framework, MAW enables pruning of neural channels by quantifying their representational strength based on the maximum absolute incoming weight in each expansion projection. When applied to Llama-3.2 models, MAW-guided pruning exposes a dichotomy between parametric knowledge retention and instruction-following capabilities, providing a controlled mechanism to study selective degradation and enhancement of cognitive behaviors in transformer LLMs (Martra, 27 Dec 2025).

1. Formal Definition

MAW is defined at the channel (neuron) level within each GLU-MLP layer, which features two expansion weight matrices:

  • $W_{up} \in \mathbb{R}^{d_{model} \times d_{ff}}$
  • $W_{gate} \in \mathbb{R}^{d_{model} \times d_{ff}}$

For neuron $k \in \{1, \dots, d_{ff}\}$,

$$\mathrm{MAW}_{up}(k) = \max_{i \in [1, d_{model}]} |[W_{up}]_{i,k}|$$

$$\mathrm{MAW}_{gate}(k) = \max_{i \in [1, d_{model}]} |[W_{gate}]_{i,k}|$$

The combined importance score $s_k$ is given by:

$$s_k = \mathrm{MAW}_{up}(k) + \mathrm{MAW}_{gate}(k)$$

This score guides the selection of channels to prune, always treating the paired up/gate projections jointly.
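As an illustration of the definition, the score can be computed with a column-wise max of absolute values. This is a minimal NumPy sketch, assuming the $d_{model} \times d_{ff}$ layout given above (one column per neuron); the function name `maw_scores` and the toy matrices are hypothetical and chosen only so the result can be checked by hand.

```python
import numpy as np

def maw_scores(w_up, w_gate):
    """Per-neuron score s_k = MAW_up(k) + MAW_gate(k); neurons are columns."""
    maw_up = np.abs(w_up).max(axis=0)      # max over the d_model inputs
    maw_gate = np.abs(w_gate).max(axis=0)
    return maw_up + maw_gate

# Toy layer (hypothetical values) with d_model = 2 and d_ff = 3.
w_up = np.array([[0.5, -0.1, 0.2],
                 [-2.0, 0.3, 0.1]])
w_gate = np.array([[0.1, 0.4, -0.05],
                   [0.2, -1.0, 0.0]])

scores = maw_scores(w_up, w_gate)
# Neuron 0: 2.0 + 0.2, neuron 1: 0.3 + 1.0, neuron 2: 0.2 + 0.05
print(scores)
```

In this toy case neuron 2 has the smallest score (0.25), so it would be the first candidate for removal.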

2. Conceptual Rationale

The MAW criterion leverages several principles:

  • Neuron influence via absolute weight: The magnitude of weights can proxy a neuron's representational power in the model. The largest absolute weight entering a neuron signifies its strongest coupling to any input feature.
  • GLU pairing constraint: Summing MAWs over the up and gate projections respects the functional requirement of GLU layers that neurons be removed in pairs.
  • Pruning low-importance channels: Small maximum absolute weights indicate channels whose removal minimally perturbs the information flow, providing a structural regularization effect.
  • Parametric knowledge filtration: A plausible implication is that MAW-based pruning excises over-specialized or weakly memorized factual associations, while preserving neurons crucial for sequential or process-driven reasoning.
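The pairing constraint can be checked numerically. The sketch below assumes a SwiGLU-style layer with a SiLU gate, which is the usual form in Llama-family models but is not stated in the text above; the helper names and the random toy weights are hypothetical. It shows that dropping column $k$ from both expansion matrices and row $k$ from the down projection is equivalent to zeroing channel $k$ in the full-width layer.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def glu_mlp(x, w_up, w_gate, w_down):
    """GLU-MLP forward pass: (silu(x @ W_gate) * (x @ W_up)) @ W_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 4, 6
x = rng.normal(size=(3, d_model))
w_up = rng.normal(size=(d_model, d_ff))
w_gate = rng.normal(size=(d_model, d_ff))
w_down = rng.normal(size=(d_ff, d_model))

keep = np.array([0, 2, 3, 5])  # drop channels 1 and 4 (arbitrary choice)

# Structured removal: same channel indices in all three matrices.
pruned = glu_mlp(x, w_up[:, keep], w_gate[:, keep], w_down[keep, :])

# Reference: zero the dropped channels in the full-width hidden activation.
mask = np.zeros(d_ff)
mask[keep] = 1.0
hidden = silu(x @ w_gate) * (x @ w_up)
reference = (hidden * mask) @ w_down

print(np.allclose(pruned, reference))
```

Removing a column from only one of the two expansion matrices would leave their widths mismatched, which is why a single joint score per channel is needed.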

3. Algorithmic Procedure

Uniform MAW pruning proceeds as follows for each GLU-MLP layer to achieve a target pruning percentage $p$:

for each GLU-MLP layer L in the model:
    W_up, W_gate = L.up_proj.weight, L.gate_proj.weight   # each of shape d_model x d_ff
    d_ff = number of columns in W_up                      # one column per neuron
    # 1. Compute per-neuron MAW scores
    for k in 1..d_ff:
        maw_up[k]   = max(abs(W_up[:,k]))
        maw_gate[k] = max(abs(W_gate[:,k]))
        score[k]    = maw_up[k] + maw_gate[k]
    # 2. Decide how many neurons to remove
    num_remove = floor(p/100 * d_ff)
    # 3. Select neurons with smallest scores
    idx_sorted = argsort(score)        # ascending
    prune_idx  = idx_sorted[:num_remove]
    # 4. Remove those neurons from both expansion projections,
    #    and the matching rows from the down projection
    new_W_up   = remove_columns(W_up,   prune_idx)
    new_W_gate = remove_columns(W_gate, prune_idx)
    new_W_down = remove_rows(   L.down_proj.weight, prune_idx)
    # 5. Replace layer weights
    L.up_proj.weight, L.gate_proj.weight, L.down_proj.weight = new_W_up, new_W_gate, new_W_down
end for

This protocol is applied identically in every GLU-MLP layer to realize global pruning schedules.
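The per-layer procedure can be sketched in runnable form. This is a minimal NumPy version on plain arrays, not the study's implementation; `prune_glu_layer` is a hypothetical name, and the weights follow the $d_{model} \times d_{ff}$ convention of the Formal Definition section, with the down projection assumed to be $d_{ff} \times d_{model}$.

```python
import numpy as np

def prune_glu_layer(w_up, w_gate, w_down, p):
    """Remove the floor(p/100 * d_ff) lowest-scoring neurons from one layer.

    Shapes: w_up and w_gate are (d_model, d_ff); w_down is (d_ff, d_model).
    Returns the pruned matrices and the indices of the kept neurons.
    """
    d_ff = w_up.shape[1]
    score = np.abs(w_up).max(axis=0) + np.abs(w_gate).max(axis=0)
    num_remove = int(p * d_ff // 100)          # floor(p/100 * d_ff)
    order = np.argsort(score, kind="stable")   # ascending by score
    keep = np.sort(order[num_remove:])         # keep original neuron order
    return w_up[:, keep], w_gate[:, keep], w_down[keep, :], keep

# Toy layer with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, d_ff = 8, 20
w_up = rng.normal(size=(d_model, d_ff))
w_gate = rng.normal(size=(d_model, d_ff))
w_down = rng.normal(size=(d_ff, d_model))

new_up, new_gate, new_down, keep = prune_glu_layer(w_up, w_gate, w_down, p=40)
print(new_up.shape, new_gate.shape, new_down.shape)  # (8, 12) (8, 12) (12, 8)
```

When adapting this to PyTorch, note that `torch.nn.Linear` stores its weight as `(out_features, in_features)`, so the neurons of the up and gate projections lie along rows rather than columns, and those of the down projection along columns.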

4. Integration with Expansion Ratio Schedules

Pruning serves to adjust the expansion ratio $r$, a core architectural parameter describing the width of GLU-MLP layers relative to the model dimension. For Llama-3.2 models, starting expansion ratios are $r_0 = 4.0$ for the 1B and $r_0 \approx 2.67$ for the 3B parameter variants. The final expansion ratio after pruning $p$ percent is

$$r = (1 - p/100) \cdot r_0$$

The study applied MAW-guided pruning at seven levels, $p \in \{0\%, 10\%, 20\%, 30\%, 40\%, 50\%, 60\%\}$, producing discrete $r$ values per model size as follows:

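| Model | p (%) | r | d_ff | Notes |
|---|---|---|---|---|
| 1B (4.0×) | 0 | 4.0× | 8192 | baseline |
| | 10 | 3.6× | 7373 | |
| | 20 | 3.2× | 6554 | |
| | 30 | 2.8× | 5735 | ... |
| | 40 | 2.4× | 4916 | equilibrium |
| | 50 | 2.0× | 4096 | |
| | 60 | 1.6× | 3277 | |
| 3B (2.67×) | 0 | 2.67× | 8192 | baseline |
| | 10 | 2.40× | 7373 | eq. point |
| | 20 | 2.13× | 6554 | |
| | 30 | 1.87× | 5735 | |
| | 40 | 1.60× | 4916 | |
| | 50 | 1.33× | 4096 | |
| | 60 | 1.07× | 3277 | |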

Uniform MAW pruning across all GLU-MLP layers enabled systematic investigation into capability transitions at each configuration.
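The schedule in the table can be reproduced with integer arithmetic. In this sketch the `d_model` values 2048 (1B) and 3072 (3B) are assumptions inferred from $d_{ff} = 8192$ and the stated $r_0$ values; they are not given in the text. The number of removed neurons uses the same floor rule as the algorithm in Section 3, and the function name `schedule` is hypothetical.

```python
def schedule(d_model, d_ff, levels=(0, 10, 20, 30, 40, 50, 60)):
    """Return (p, remaining d_ff, realized expansion ratio) per pruning level."""
    rows = []
    for p in levels:
        kept = d_ff - (p * d_ff) // 100   # remove floor(p/100 * d_ff) neurons
        rows.append((p, kept, kept / d_model))
    return rows

# d_model values are assumed (inferred from 8192 / r_0), not taken from the text.
for name, d_model in (("1B", 2048), ("3B", 3072)):
    for p, kept, r in schedule(d_model, 8192):
        print(f"{name}  p={p:2d}%  d_ff={kept}  r={r:.2f}")
```

Because the removed count is floored, the realized ratio `kept / d_model` can differ from $(1 - p/100) \cdot r_0$ in the fourth decimal place, which is below the precision shown in the table.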

5. Empirical and Theoretical Justifications

Empirical evidence from (Martra, 27 Dec 2025) demonstrates:

  • Graceful capability degradation: MAW pruning produces moderate increases in LLM perplexity (Lambada: +259% at 10%, WikiText: +51% at 10%) compared to the catastrophic collapse observed with the alternative criteria Variance of Weights (VOW: +9,207%) and Product of Norms (PON: +35,440%).
  • Capability dichotomy: Knowledge-dependent benchmarks (MMLU, GSM8K) and perplexity worsen predictably with width reduction. Conversely, instruction-following (IFEval) and model truthfulness (TruthfulQA-MC2) increase, often substantially (IFEval +46–75%).
  • Inverse task correlation: There is a robust inverse correlation (correlation coefficient $r = -0.864$, $p = 0.012$ for the Llama-3.2 3B model) between factual knowledge and truthfulness, directly linking pruning-induced knowledge decay to enhanced capacity for discriminating misconceptions. Here $r$ and $p$ denote the correlation coefficient and its p-value, not the expansion ratio and pruning percentage.

Theoretical justifications include:

  • Representational strength: Absolute weight magnitude is a proxy for a neuron's contribution. Pruning low-magnitude neurons acts as a regularizer that preferentially excises "weak" facts or features.
  • GLU suitability: The summation over up/gate maxima provides a natural, non-fragmenting pairing suitable for GLU architectures.

6. Experimental Results

Key results covering baseline, equilibrium, and aggressive pruning configurations (condensed from Table 3 in (Martra, 27 Dec 2025)):

| Benchmark | 1B (4.0×) | 1B (2.4×) | 1B (1.6×) | 3B (2.67×) | 3B (2.4×) | 3B (1.07×) |
|---|---|---|---|---|---|---|
| MMLU | 0.311 | 0.269 | 0.255 | 0.561 | 0.433 | 0.259 |
| GSM8K | 0.064 | 0.009 | 0.007 | 0.264 | 0.135 | 0.011 |
| IFEval | 0.104 | 0.152 | 0.137 | 0.094 | 0.131 | 0.133 |
| TruthfulQA-MC2 | 0.377 | 0.430 | 0.466 | 0.392 | 0.377 | 0.457 |

At the approximate equilibrium point $r \sim 2.4$ (40% pruning for 1B, 10% pruning for 3B):

  • IFEval scores increased by 46% (1B) and 39% (3B) over baseline.
  • MMLU scores decreased to 86% (1B) and 77% (3B) of baseline.
  • TruthfulQA-MC2 improved by about 14% for 1B (0.377 to 0.430) and dipped slightly for 3B (0.392 to 0.377); the larger gains of +23.6% (1B) and +16.7% (3B) over baseline appear at 60% pruning.
  • Energy use per token dropped by up to 23% (single-request) and was up to 4.6× lower at batch size 8, at a latency cost of up to 18% in interactive settings.

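This suggests that MAW pruning selectively weakens parametric knowledge storage while strengthening behavioral alignment, as measured by IFEval and TruthfulQA-MC2. GSM8K scores fall sharply at every pruned configuration in the table, so these results do not indicate a gain in multi-step arithmetic reasoning.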
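The relative changes can be recomputed directly from the condensed results table. The sketch below uses only the values listed there (baseline, equilibrium at $r \sim 2.4$, and the most aggressive configuration for each model); the dictionary layout and the helper `rel_change` are hypothetical conveniences.

```python
# Values copied from the condensed results table:
# (baseline, equilibrium r ~ 2.4, most aggressive pruning) per model.
results = {
    "MMLU":           {"1B": (0.311, 0.269, 0.255), "3B": (0.561, 0.433, 0.259)},
    "GSM8K":          {"1B": (0.064, 0.009, 0.007), "3B": (0.264, 0.135, 0.011)},
    "IFEval":         {"1B": (0.104, 0.152, 0.137), "3B": (0.094, 0.131, 0.133)},
    "TruthfulQA-MC2": {"1B": (0.377, 0.430, 0.466), "3B": (0.392, 0.377, 0.457)},
}

def rel_change(new, base):
    """Percentage change of `new` relative to `base`."""
    return 100.0 * (new - base) / base

for bench, per_model in results.items():
    for model, (base, equilibrium, aggressive) in per_model.items():
        print(f"{bench:15s} {model}: "
              f"equilibrium {rel_change(equilibrium, base):+.1f}%, "
              f"aggressive {rel_change(aggressive, base):+.1f}%")
```

Since the table values are rounded to three decimals, recomputed percentages can differ from the reported ones by about a tenth of a percentage point.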

7. Comparison to Alternative Criteria

Three criteria were evaluated:

| Criterion | Lambada PPL @10% | Δ vs base | WikiText PPL @10% | Δ vs base |
|---|---|---|---|---|
| Baseline | 5.75 | | 11.57 | |
| MAW | 20.59 | +259% | 17.50 | +51% |
| VOW | 532.36 | +9,207% | 50.56 | +337% |
| PON | 2,032.80 | +35,440% | 72.52 | +527% |

Only MAW permitted controlled, gradual pruning. Variance of Weights (VOW) and Product of Norms (PON) induced immediate and catastrophic loss of basic language modeling capabilities, making them unsuitable for architectural sweeps or capability trajectory studies.

A plausible implication is that maximum-based criteria are robust to pruning-induced instability, whereas aggregate-statistic criteria can amplify aberrant channel-level effects.


In summary, the Maximum Absolute Weight criterion operationalizes structural pruning via per-channel “max-abs” weight analysis in GLU-MLP layers, summing importance across paired projections to respect architectural coherence. Its empirical utility lies in enabling non-catastrophic capability modulation, revealing a dichotomy between factual knowledge retention and instruction-following alignment, and providing systematic control over expansion ratio adjustments in large-scale transformer models (Martra, 27 Dec 2025).
