Pangu Light: Accelerated LLM Pruning
- Pangu Light is a framework that employs multi-axis structured pruning to compress LLMs while maintaining key reasoning performance.
- It integrates targeted weight re-initialization methods, namely CLAP and SLNP, to stabilize and recover pruned network statistics.
- Hardware-aware optimizations and norm absorption enable significant inference speed gains with under 2% accuracy degradation.
Pangu Light is a framework for accelerating LLMs through multi-axis structured pruning accompanied by targeted weight re-initialization. Its design addresses the trade-off between inference throughput and reasoning performance, facilitating aggressive model compression with minimal degradation in accuracy. Distinctive contributions include the development of Cross-Layer Attention Pruning (CLAP), Stabilized LayerNorm Pruning (SLNP), and NPU-specific optimizations for post-RMSNorm absorption. The framework demonstrates a higher accuracy–efficiency curve relative to contemporary pruning methods and similarly sized LLM baselines (Chen et al., 26 May 2025).
1. Design Principles and Compression Goals
Pangu Light targets the compression of transformer-based LLMs such as Pangu-38B, whose inference costs inhibit practical deployment. The framework’s primary insight is that aggressive, joint pruning of both depth and width precipitates substantial accuracy loss unless mitigated by dedicated re-initialization. Accordingly, the framework enforces the following design principles:
- Multi-axis structured pruning encompassing model width (hidden channels), depth (layers), attention heads, and RMSNorm parameters.
- Forward-activation–driven importance metrics to determine prune candidates.
- Dedicated weight re-initialization—CLAP for depth and SLNP for width—to restore pruned network statistics.
- Post-RMSNorm absorption to eliminate redundant normalization at inference, maximizing hardware efficiency.
- Online knowledge distillation from the reference teacher model (Pangu-38B) during performance recovery.
- Hardware co-design targeting Ascend NPUs, leveraging fused scaling operations for increased throughput.
A plausible implication is that systematic integration of pruning and re-initialization enables aggressive acceleration regimes—such as 2×–4× speedup—with sub-2 % average-score degradation across reasoning tasks, outperforming Nemotron, Qwen3, and PUZZLE baselines (Chen et al., 26 May 2025).
2. Structured Pruning Axes and Importance Metrics
Pangu Light evaluates four structured pruning axes on a calibration corpus of 10 K–50 K token sequences:
| Axis | Importance Metric | Pruning Target |
|---|---|---|
| Width (Channel) | Summed token-level pre-RMSNorm activations across layers | Hidden channels |
| Attention Head | Aggregated activation norms per head within KV groups | Attention heads |
| FFN Neuron | Summed norms of gated FFN activations per neuron | FFN neurons |
| Depth (Layer) | Input–output correlation per layer | Transformer layers |
Width pruning: Channel importance is computed by summing token-level pre-RMSNorm activations across all layers. Channels with the lowest scores are pruned uniformly from the embedding tables, normalization scales, and linear projections.
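A minimal PyTorch sketch of this activation-driven scoring pattern (the same pattern underlies the head and FFN-neuron metrics below). The `model.layers` attribute and the magnitude-based aggregation are assumptions for illustration, not the paper's exact implementation:

```python
import torch

def channel_importance(model, calib_loader, hidden_size, device="cpu"):
    """Accumulate per-channel importance from pre-RMSNorm hidden states.

    Assumes `model.layers` exposes the transformer blocks and that each block
    receives the pre-norm hidden state as its first input; magnitude
    aggregation is an assumption, not necessarily the paper's exact metric.
    """
    scores = torch.zeros(hidden_size, device=device)

    def hook(_module, inputs, _output):
        hidden = inputs[0].detach()                  # (batch, seq, hidden_size)
        scores.add_(hidden.abs().sum(dim=(0, 1)))    # accumulate per channel

    handles = [layer.register_forward_hook(hook) for layer in model.layers]
    with torch.no_grad():
        for batch in calib_loader:
            model(batch.to(device))
    for h in handles:
        h.remove()
    return scores        # prune torch.argsort(scores)[:num_pruned] channels
```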
Attention-head pruning: Within each grouped-query attention (GQA) block, head importance aggregates activation norms; the lowest-scoring heads within each KV group are pruned, and the corresponding Q/K/V and output-projection dimensions are removed.
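A sketch of the in-group head selection, assuming a hypothetical contiguous head-to-KV-group layout (real GQA implementations may order heads differently):

```python
import torch

def heads_to_prune(head_scores, heads_per_group, n_prune_per_group):
    """Pick the lowest-scoring attention heads inside each KV group.

    Assumes heads are laid out contiguously per KV group, i.e. heads
    [g*heads_per_group, (g+1)*heads_per_group) share KV group g.
    Returns global head indices to drop from the Q and output projections.
    """
    grouped = head_scores.view(-1, heads_per_group)                # (groups, heads_per_group)
    local = torch.argsort(grouped, dim=1)[:, :n_prune_per_group]   # lowest within each group
    offsets = torch.arange(grouped.size(0)).unsqueeze(1) * heads_per_group
    return (local + offsets).flatten()
```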
FFN-neuron pruning: Neuron score sums the norm of gated FFN activations for each neuron across the calibration corpus, guiding bulk removal from FFN matrices.
Depth pruning: Block importance correlates each layer's input and output representations; the lowest-scoring layers are removed, while their informative attention groups are preserved via CLAP.
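A compact sketch of a block-importance score of this kind; the cosine-based correlation is an assumption about the exact measure used in the paper:

```python
import torch
import torch.nn.functional as F

def block_importance(layer_in, layer_out):
    """Score one transformer layer by how much it transforms its input.

    layer_in / layer_out: (num_tokens, hidden_size) activations collected on
    the calibration corpus. A layer whose output nearly copies its input
    (high cosine similarity) is a depth-pruning candidate.
    """
    cos = F.cosine_similarity(layer_in, layer_out, dim=-1)  # per-token similarity
    return 1.0 - cos.mean().item()                          # low score => prune first
```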
This methodology enables flexible architectural compression across multiple axes, broadening the achievable speed–accuracy trade-off space.
3. Weight Re-Initialization: CLAP and SLNP
Conventional pruning induces instability in pruned networks. Pangu Light introduces two weight re-initialization algorithms that directly address this issue:
3.1 Cross-Layer Attention Pruning (CLAP)
CLAP redistributes the top-$K$ most informative KV attention groups from a pruned layer ($l$) to its predecessor ($l-1$), ranked by a joint importance score.
CLAP merges Q/K/V matrices and output projections for selected groups, preserving salient cross-layer computations. This procedure stabilizes post-pruning performance, as demonstrated by Minitron + CLAP yielding a +2.9 point improvement over the pruning baseline.
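A schematic sketch of this transplant step, assuming hypothetical `wq/wk/wv/wo` weight attributes and a contiguous per-group weight layout; it illustrates the selection-and-concatenation pattern rather than Pangu Light's actual modules:

```python
import torch

def clap_merge(prev_layer, pruned_layer, group_importance, top_k, head_dim, heads_per_group):
    """Transplant the top_k most informative KV groups of a pruned layer into
    its retained predecessor by concatenating the matching Q/K/V and
    output-projection slices. Attribute names and layouts are hypothetical."""
    keep = torch.topk(group_importance, top_k).indices.tolist()
    kv_rows = torch.cat([torch.arange(g * head_dim, (g + 1) * head_dim) for g in keep])
    q_rows = torch.cat([torch.arange(g * heads_per_group * head_dim,
                                     (g + 1) * heads_per_group * head_dim) for g in keep])

    prev_layer.wk = torch.cat([prev_layer.wk, pruned_layer.wk[kv_rows]], dim=0)
    prev_layer.wv = torch.cat([prev_layer.wv, pruned_layer.wv[kv_rows]], dim=0)
    prev_layer.wq = torch.cat([prev_layer.wq, pruned_layer.wq[q_rows]], dim=0)
    # The output projection consumes head outputs along its input dimension.
    prev_layer.wo = torch.cat([prev_layer.wo, pruned_layer.wo[:, q_rows]], dim=1)
    return prev_layer
```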
3.2 Stabilized LayerNorm Pruning (SLNP)
Width pruning truncates each RMSNorm's scale parameter $\boldsymbol{\gamma}$ to the surviving channels. SLNP rescales the surviving vector so that its norm matches the pre-pruning value:

$$\boldsymbol{\gamma}' = \frac{\lVert \boldsymbol{\gamma} \rVert_2}{\lVert \boldsymbol{\gamma}_{\text{kept}} \rVert_2}\, \boldsymbol{\gamma}_{\text{kept}}$$
This operation maintains the pre-pruning output norm, restoring normalization statistics and accelerating convergence in finetuning. Parameter analysis confirms near-identity in mean and variance of RMSNorm scales after SLNP.
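A minimal sketch of this rescaling, assuming the matched quantity is the L2 norm of the RMSNorm scale vector:

```python
import torch

def slnp_rescale(gamma_full, kept_idx):
    """SLNP-style re-initialization: rescale the surviving RMSNorm scale
    vector so its norm matches the pre-pruning value (L2 norm assumed)."""
    gamma_kept = gamma_full[kept_idx]
    return gamma_kept * (gamma_full.norm(p=2) / gamma_kept.norm(p=2))
```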
4. Post-RMSNorm Absorption and Hardware Co-Optimization
Pangu Light is adapted to the “Sandwich-Norm” architecture, which applies an additional RMSNorm after both the attention and FFN blocks. These post-RMSNorm layers can consume roughly 6 % of inference throughput on Ascend NPUs. Absorption is realized by:
- Precomputing the average inverse RMS statistic $\bar{r} \approx \mathbb{E}_x\!\left[1/\mathrm{RMS}(x)\right]$ on the calibration corpus.
- Absorbing the RMSNorm scale $\boldsymbol{\gamma}$ into this statistic: $\tilde{\boldsymbol{\gamma}} = \bar{r}\,\boldsymbol{\gamma}$.
- Replacing the subsequent projection matrix $W$ with the fused matrix $W' = W\,\mathrm{diag}(\tilde{\boldsymbol{\gamma}})$.
This design eliminates extra normalization overhead at inference, producing a single fused linear layer optimized for NPU execution. Ablation studies show that norm absorption recovers all but 0.9 points of the full Sandwich-Norm performance and matches the more complex DyT (Dynamic Tanh) method.
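A minimal sketch of the fusion step, assuming a (out_features, in_features) weight layout and a precomputed scalar average inverse RMS:

```python
import torch

def absorb_post_rmsnorm(weight, gamma, inv_rms_mean):
    """Fuse a post-RMSNorm into the following linear projection.

    weight:       (out_features, in_features) projection matrix W
    gamma:        (in_features,) RMSNorm scale
    inv_rms_mean: scalar average of 1/RMS(x) estimated on the calibration corpus

    Returns W' = W @ diag(inv_rms_mean * gamma), so W' x approximates
    W (gamma * x / RMS(x)) with no explicit normalization at inference.
    The (out, in) layout is an assumption; adapt to the real modules.
    """
    return weight * (inv_rms_mean * gamma).unsqueeze(0)  # scale each input column
```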
5. Experimental Protocol and Hyperparameters
The base model for evaluation is Pangu-38B. Three principal configurations are selected according to simulation-optimized throughput targets: 1.6× (“Pangu Light-32B”), 2.1×, and 4.2× speedup. Importance-driven pruning ratios are determined by jointly minimizing a loss metric under NPU speed simulation. Performance recovery entails continued training on 300 B tokens (a mixture with 20 % instruction-following and 36 % reasoning data), with knowledge distillation from the unpruned teacher’s logits and a cosine learning-rate schedule.
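A minimal sketch of a recovery objective of this form, combining cross-entropy with logit distillation from the teacher; the weighting `alpha` and `temperature` are illustrative, not values reported in the paper:

```python
import torch
import torch.nn.functional as F

def recovery_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=1.0):
    """Cross-entropy on the training mixture plus KL distillation against the
    unpruned teacher's logits."""
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1))
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    return (1.0 - alpha) * ce + alpha * kd
```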
Fine-tuning involves 2 M chain-of-thought (CoT) examples in two six-epoch stages, with batch sizes descending across stages and the initial learning rate decaying by 90 %.
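For concreteness, a toy cosine-schedule training loop using PyTorch's `CosineAnnealingLR`; the model, peak learning rate, floor, and step count are placeholders, not the paper's settings:

```python
import torch

model = torch.nn.Linear(16, 16)                       # stand-in for the pruned LLM
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1_000, eta_min=1e-5)

for step in range(1_000):
    x = torch.randn(4, 16)
    loss = model(x).pow(2).mean()                     # placeholder for CE + distillation loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
```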
6. Quantitative Performance and Ablation Analyses
6.1 Throughput on Ascend NPUs
| Model | Throughput (tokens/s) |
|---|---|
| Pangu-38B | 1631 |
| Pangu Light-1.6× | 2585 (+58.5 %) |
| Pangu Light-2.1× | 3403 |
| Pangu Light-4.2× | 6831 |
| Qwen2.5-32B | 2316 |
| Qwen3-32B | 2225 |
| Qwen3-14B | 6254 |
At the 32 B parameter scale, Pangu Light-1.6× yields a 16 % throughput gain over Qwen3-32B.
6.2 Reasoning Benchmarks
Performance on six reasoning-oriented benchmarks (AIME 2024, MATH-500, GPQA, LiveCodeBench, ArenaHard, MMLU-Pro):
| Model | Avg. Score |
|---|---|
| GPT-4o-0513 | 53.3 |
| DeepSeek-R1 | 81.8 |
| Hunyuan-T1 | 81.6 |
| Qwen3-32B | 80.9 |
| Pangu-38B | 82.0 |
| Pangu Light-1.6× | 81.6 |
| Pangu Light-2.1× | 81.1 |
| Pangu Light-4.2× | 79.6 |
Pangu Light-1.6× retains 99.5 % of Pangu-38B’s average score and exceeds Qwen3-32B by 0.7 points; Pangu Light-2.1× delivers 2.1× speedup with only a 1.1 % relative drop, surpassing PUZZLE’s 98.4 % retention at similar acceleration.
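As a quick check, the retention figures follow directly from the table above:

$$\frac{81.6}{82.0} \approx 99.5\%, \qquad \frac{81.1}{82.0} \approx 98.9\% \;\;(\text{a } 1.1\% \text{ relative drop}).$$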
6.3 Ablation Studies
- CLAP and SLNP: On the Minitron baseline pruned to 11 B, CLAP improves the average score by +2.9, SLNP by a further +0.7.
- Post-Norm Absorption: Matches DyT in restoring performance (59.0 % vs Sandwich-Norm’s 59.9 %), substantially outperforming direct pruning (51.2 %).
- Parameter statistics: SLNP preserves RMSNorm scale statistics post-pruning.
This suggests that the combined use of CLAP, SLNP, and norm absorption is critical for successful deep pruning and accurate recovery.
7. Accuracy–Efficiency Trade-Off and Framework Impact
Pangu Light constructs a consistently higher accuracy–speed trade-off curve compared to Qwen3 and PUZZLE series. The framework reliably enables 2×–4× acceleration of a 38 B-parameter LLM—empirically with under 2 % average-score degradation on demanding reasoning tasks. Its integrated approach to multi-axis pruning and weight re-initialization, coupled with hardware-aware software fusion, sets a technical precedent for scalable LLM deployment under strict computational constraints (Chen et al., 26 May 2025). A plausible implication is that future LLM compression paradigms may trend toward more integrated co-design, coupling architectural, statistical, and hardware-aware methods.