Pangu Light: Accelerated LLM Pruning

Updated 21 December 2025
  • Pangu Light is a framework that employs multi-axis structured pruning to compress LLMs while maintaining key reasoning performance.
  • It integrates targeted weight re-initialization methods, namely CLAP and SLNP, to stabilize and recover pruned network statistics.
  • Hardware-aware optimizations and norm absorption enable significant inference speed gains with under 2% accuracy degradation.

Pangu Light is a framework for accelerating LLMs through multi-axis structured pruning accompanied by targeted weight re-initialization. Its design addresses the trade-off between inference throughput and reasoning performance, facilitating aggressive model compression with minimal degradation in accuracy. Distinctive contributions include the development of Cross-Layer Attention Pruning (CLAP), Stabilized LayerNorm Pruning (SLNP), and NPU-specific optimizations for post-RMSNorm absorption. The framework demonstrates a higher accuracy–efficiency curve relative to contemporary pruning methods and similarly sized LLM baselines (Chen et al., 26 May 2025).

1. Design Principles and Compression Goals

Pangu Light targets the compression of transformer-based LLMs such as Pangu-38B, whose inference costs inhibit practical deployment. The framework’s primary insight is that aggressive, joint pruning of both depth and width precipitates substantial accuracy loss unless mitigated by dedicated re-initialization. Accordingly, the framework enforces the following design principles:

  • Multi-axis structured pruning encompassing model width (hidden channels), depth (layers), attention heads, and RMSNorm parameters.
  • Forward-activation–driven importance metrics to determine prune candidates.
  • Dedicated weight re-initialization—CLAP for depth and SLNP for width—to restore pruned network statistics.
  • Post-RMSNorm absorption to eliminate redundant normalization at inference, maximizing hardware efficiency.
  • Online knowledge distillation from the reference teacher model (Pangu-38B) during performance recovery.
  • Hardware co-design targeting Ascend NPUs, leveraging fused scaling operations for increased throughput.

A plausible implication is that systematic integration of pruning and re-initialization enables aggressive acceleration regimes—such as 2×–4× speedup—with sub-2 % average-score degradation across reasoning tasks, outperforming Nemotron, Qwen3, and PUZZLE baselines (Chen et al., 26 May 2025).

2. Structured Pruning Axes and Importance Metrics

Pangu Light evaluates four structured pruning axes on a calibration corpus of 10 K–50 K token sequences:

| Axis | Importance Metric | Pruning Target |
|------|-------------------|----------------|
| Width (channel) | $S_{\mathrm{chan}}(k)$ | Hidden channels |
| Attention head | $S_{\mathrm{head},l,j}$ | Attention heads |
| FFN neuron | $S_{\mathrm{ffn},l,m}$ | FFN neurons |
| Depth (layer) | $S_{\mathrm{layer}}(l)$ | Transformer layers |

Width pruning: Channel importance $S_{\mathrm{chan}}(k)$ is computed by summing token-level pre-RMSNorm activations across all layers. Channels with the lowest scores are uniformly pruned from embedding tables, $\gamma_l$ scales, and linear projections.
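A minimal sketch of this channel score, assuming pre-RMSNorm activations have been collected per layer on the calibration corpus (function and variable names are illustrative, not from the paper):

```python
import torch

def channel_importance(pre_norm_acts: list[torch.Tensor]) -> torch.Tensor:
    """Score each hidden channel by summing the magnitude of its
    pre-RMSNorm activations over all tokens and all layers.

    pre_norm_acts: one (num_tokens, hidden_dim) tensor per layer,
    collected on the calibration corpus.
    """
    hidden_dim = pre_norm_acts[0].shape[-1]
    scores = torch.zeros(hidden_dim)
    for acts in pre_norm_acts:
        scores += acts.abs().sum(dim=0)  # accumulate token-level magnitudes
    return scores

def channels_to_keep(scores: torch.Tensor, keep_dim: int) -> torch.Tensor:
    # Keep the highest-scoring channels; the rest are pruned uniformly
    # from embeddings, gamma scales, and linear projections.
    return torch.topk(scores, keep_dim).indices.sort().values
```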

Attention-head pruning: Within each grouped-query attention (GQA) block, attention-head importance $S_{\mathrm{head},l,j}$ aggregates activation norms; the lowest-scoring heads within each KV group are pruned, and the corresponding Q/K/V and output-projection dimensions are removed.
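A sketch of the per-group selection, assuming per-head activation-norm scores for one layer and a contiguous GQA layout in which consecutive heads share a KV group (illustrative names, not the paper's exact procedure):

```python
import torch

def prune_heads_per_kv_group(head_scores: torch.Tensor,
                             heads_per_group: int,
                             heads_to_keep_per_group: int) -> list[int]:
    """head_scores: (num_heads,) activation-norm scores for one layer,
    laid out so that consecutive heads share a KV group (GQA)."""
    kept = []
    for g_start in range(0, head_scores.numel(), heads_per_group):
        group = head_scores[g_start:g_start + heads_per_group]
        top = torch.topk(group, heads_to_keep_per_group).indices
        kept.extend((g_start + top).tolist())
    # Indices of heads to retain; Q/K/V and output-projection slices for the
    # remaining heads are removed.
    return sorted(kept)
```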

FFN-neuron pruning: Neuron score $S_{\mathrm{ffn},l,m}$ sums the $\ell_2$ norm of gated FFN activations for each neuron across the calibration corpus, guiding bulk removal from the FFN matrices.
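An analogous sketch for the neuron score, assuming a SwiGLU-style gate and calibration-time gate/up projection outputs (illustrative):

```python
import torch
import torch.nn.functional as F

def ffn_neuron_importance(gate_out: torch.Tensor, up_out: torch.Tensor) -> torch.Tensor:
    """gate_out, up_out: (num_tokens, ffn_dim) outputs of the gate and up
    projections for one layer on the calibration corpus."""
    gated = F.silu(gate_out) * up_out  # SwiGLU-style gated activation
    # l2 norm per neuron over the calibration tokens.
    return torch.linalg.vector_norm(gated, ord=2, dim=0)

# Lowest-scoring neurons are removed from the gate/up projection rows and
# the corresponding down-projection columns.
```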

Depth pruning: Block importance $S_{\mathrm{layer}}(l)$ correlates the input and output vectors of each layer; the lowest-scoring layers are pruned, while their informative attention groups are preserved via CLAP.
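One way to realize such an input-output correlation score is a cosine-similarity criterion, sketched below under the assumption that near-identity blocks are the least important; the exact formula in the paper may differ:

```python
import torch
import torch.nn.functional as F

def layer_importance(layer_in: torch.Tensor, layer_out: torch.Tensor) -> float:
    """layer_in, layer_out: (num_tokens, hidden_dim) hidden states entering
    and leaving one transformer block on the calibration corpus.
    A block whose output is highly correlated with its input changes the
    representation little and is a candidate for removal."""
    cos = F.cosine_similarity(layer_in, layer_out, dim=-1)  # per-token similarity
    return 1.0 - cos.mean().item()  # low score => near-identity block => prune first
```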

This methodology realizes flexible architectural compression across multiple axes, maximizing the speed–accuracy trade space.

3. Weight Re-Initialization: CLAP and SLNP

Conventional pruning induces instability in pruned networks. Pangu Light introduces two weight re-initialization algorithms that directly address this issue:

3.1 Cross-Layer Attention Pruning (CLAP)

CLAP redistributes the top-$K$ most informative KV attention groups from a pruned layer ($l+1$) to its predecessor ($l$), leveraging the joint importance $S_{\mathrm{kv}}(g)$:

$$S_{\mathrm{kv}}(g) = \frac{1}{|\mathrm{Heads}_g|}\sum_{j\in \mathrm{Heads}_g} S_{\mathrm{head},\,\mathrm{ori}(j)}$$

CLAP merges Q/K/V matrices and output projections for selected groups, preserving salient cross-layer computations. This procedure stabilizes post-pruning performance, as demonstrated by Minitron + CLAP yielding a +2.9 point improvement over the pruning baseline.
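A minimal sketch of the group selection and merge, assuming per-head scores for the pruned layer $l+1$ and a fused QKV weight whose rows are grouped contiguously by KV group (names and merge details are illustrative):

```python
import torch

def select_top_kv_groups(head_scores: torch.Tensor,
                         heads_per_group: int,
                         k: int) -> torch.Tensor:
    """Compute S_kv(g) as the mean head score within each KV group of the
    pruned layer l+1 and return the indices of the top-K groups."""
    group_scores = head_scores.view(-1, heads_per_group).mean(dim=1)  # S_kv(g)
    return torch.topk(group_scores, k).indices

def merge_into_predecessor(pred_qkv: torch.Tensor,
                           pruned_qkv: torch.Tensor,
                           top_groups: torch.Tensor,
                           group_width: int) -> torch.Tensor:
    """Append the selected groups' Q/K/V rows from the pruned layer to the
    predecessor layer's projection (output projection handled analogously)."""
    rows = torch.cat([pruned_qkv[g * group_width:(g + 1) * group_width]
                      for g in top_groups.tolist()], dim=0)
    return torch.cat([pred_qkv, rows], dim=0)
```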

3.2 Stabilized LayerNorm Pruning (SLNP)

Width pruning truncates each RMSNorm's scale parameter $\gamma_l$ along the pruned channels. SLNP rescales the surviving entries of $\gamma_l$:

$$c_l = \frac{\|\gamma_l^{\mathrm{orig}}\|_2}{\|\gamma_l^{\mathrm{pruned}}\|_2}$$

$$\gamma_l^{\mathrm{new}} = c_l\,\gamma_l^{\mathrm{pruned}}$$

This operation maintains the pre-pruning output norm, restoring normalization statistics and accelerating convergence during fine-tuning. Parameter analysis confirms that the mean and variance of the RMSNorm scales remain nearly unchanged after SLNP.
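SLNP reduces to a single rescaling of the surviving $\gamma_l$ entries; a sketch:

```python
import torch

def slnp_rescale(gamma_orig: torch.Tensor, keep_idx: torch.Tensor) -> torch.Tensor:
    """Rescale the surviving RMSNorm scale vector so its l2 norm matches the
    pre-pruning norm, per the SLNP rule gamma_new = c_l * gamma_pruned."""
    gamma_pruned = gamma_orig[keep_idx]
    c = torch.linalg.vector_norm(gamma_orig) / torch.linalg.vector_norm(gamma_pruned)
    return c * gamma_pruned
```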

4. Post-RMSNorm Absorption and Hardware Co-Optimization

Pangu Light is adapted to the “Sandwich-Norm” architecture, which applies RMSNorm after both the attention and FFN blocks. These post-RMSNorm layers can consume 6 % of throughput on Ascend NPUs. Absorption is realized by:

  1. Precomputing the average inverse scale $\bar s_{\mathrm{inv}}$ on the calibration corpus.
  2. Absorbing it into $\gamma$: $\gamma_{\mathrm{abs}} = \bar s_{\mathrm{inv}}\,\gamma$.
  3. Replacing the projection matrix $W_{\mathrm{proj}}$ with $(W_{\mathrm{proj}}')_{:,j} = (W_{\mathrm{proj}})_{:,j}\,(\gamma_{\mathrm{abs}})_j$.

This design eliminates extra normalization overhead at inference, producing a single fused linear layer optimized for NPU execution. Ablation studies show that norm absorption recovers all but 0.9 points of the full Sandwich-Norm performance and matches the more complex DyT (Dynamic Tanh) method.
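A sketch of the three-step absorption, assuming the average inverse RMS scale is estimated from calibration-time inputs to the norm (names are illustrative):

```python
import torch

def absorb_post_rmsnorm(w_proj: torch.Tensor,
                        gamma: torch.Tensor,
                        calib_acts: torch.Tensor,
                        eps: float = 1e-6) -> torch.Tensor:
    """Fold a post-RMSNorm into the following projection W_proj.

    w_proj:     (out_dim, hidden_dim) weight of the next linear layer
    gamma:      (hidden_dim,) RMSNorm scale
    calib_acts: (num_tokens, hidden_dim) inputs to the norm on calibration data
    """
    rms = calib_acts.pow(2).mean(dim=-1).add(eps).sqrt()  # per-token RMS
    s_inv = (1.0 / rms).mean()                            # step 1: average inverse scale
    gamma_abs = s_inv * gamma                             # step 2: absorb into gamma
    return w_proj * gamma_abs.unsqueeze(0)                # step 3: scale W_proj columns
```

The folded matrix replaces the norm-plus-projection pair with a single linear layer, which is the fused operator executed on the NPU.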

5. Experimental Protocol and Hyperparameters

The base model for evaluation is Pangu-38B. Three principal pruning targets are selected according to simulation-optimized throughput goals: 1.6× (“Pangu Light-32B”), 2.1×, and 4.2× speedup. Importance-driven pruning ratios are determined by jointly minimizing a loss metric and an NPU speed simulation. Performance recovery entails continued training on 300 B tokens (a mix including 20 % instruction-following and 36 % reasoning data), with knowledge distillation from the unpruned teacher's logits and a cosine learning-rate schedule ($1\times10^{-5} \rightarrow 1\times10^{-7}$).

Fine-tuning uses 2 M chain-of-thought (CoT) examples in two six-epoch stages, with batch size descending $64 \rightarrow 32$ and initial learning rate $8\times10^{-6} \rightarrow 3\times10^{-6}$, each decaying by 90 %.
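For concreteness, the recovery and fine-tuning schedules can be summarized as a configuration sketch; the field names are illustrative, only the values come from the description above:

```python
recovery_config = {
    "continue_training_tokens": 300e9,           # 300 B tokens
    "data_mix": {"instruction_following": 0.20, "reasoning": 0.36},  # remainder: other data
    "distillation_teacher": "unpruned Pangu-38B logits",
    "lr_schedule": {"type": "cosine", "start": 1e-5, "end": 1e-7},
}

finetune_config = {
    "cot_examples": 2_000_000,
    "stages": 2,
    "epochs_per_stage": 6,
    "batch_size": [64, 32],                      # descending across stages
    "initial_lr": [8e-6, 3e-6],                  # each decaying by 90 %
}
```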

6. Quantitative Performance and Ablation Analyses

6.1 Throughput on Ascend NPUs

| Model | Throughput (tokens/s) |
|-------|-----------------------|
| Pangu-38B | 1631 |
| Pangu Light-1.6× | 2585 (+58.5 %) |
| Pangu Light-2.1× | 3403 |
| Pangu Light-4.2× | 6831 |
| Qwen2.5-32B | 2316 |
| Qwen3-32B | 2225 |
| Qwen3-14B | 6254 |

At the 32 B parameter scale, Pangu Light-1.6× yields a 16 % throughput gain over Qwen3-32B.

6.2 Reasoning Benchmarks

Performance on six reasoning-oriented benchmarks (AIME 2024, MATH-500, GPQA, LiveCodeBench, ArenaHard, MMLU-Pro):

| Model | Avg. Score |
|-------|------------|
| GPT-4o-0513 | 53.3 |
| DeepSeek-R1 | 81.8 |
| Hunyuan-T1 | 81.6 |
| Qwen3-32B | 80.9 |
| Pangu-38B | 82.0 |
| Pangu Light-1.6× | 81.6 |
| Pangu Light-2.1× | 81.1 |
| Pangu Light-4.2× | 79.6 |

Pangu Light-1.6× retains 99.5 % of Pangu-38B's performance and exceeds Qwen3-32B by 0.7 points; Pangu Light-2.1× incurs only a 1.1 % accuracy drop at 2.1× speedup, surpassing PUZZLE's 98.4 % retention at similar acceleration.

6.3 Ablation Studies

  • CLAP and SLNP: On the Minitron baseline pruned to 11 B, CLAP improves the average score by +2.9, SLNP by a further +0.7.
  • Post-Norm Absorption: Matches DyT in restoring performance (59.0 % vs Sandwich-Norm’s 59.9 %), substantially outperforming direct pruning (51.2 %).
  • Parameter statistics: SLNP preserves RMSNorm scale statistics post-pruning.

This suggests that the combined use of CLAP, SLNP, and norm absorption is critical for successful deep pruning and accurate recovery.

7. Accuracy–Efficiency Trade-Off and Framework Impact

Pangu Light constructs a consistently higher accuracy–speed trade-off curve compared to Qwen3 and PUZZLE series. The framework reliably enables 2×–4× acceleration of a 38 B-parameter LLM—empirically with under 2 % average-score degradation on demanding reasoning tasks. Its integrated approach to multi-axis pruning and weight re-initialization, coupled with hardware-aware software fusion, sets a technical precedent for scalable LLM deployment under strict computational constraints (Chen et al., 26 May 2025). A plausible implication is that future LLM compression paradigms may trend toward more integrated co-design, coupling architectural, statistical, and hardware-aware methods.
