Wanda++: Pruning Large Language Models via Regional Gradients (2503.04992v4)
Abstract: LLM pruning seeks to remove unimportant weights to speed up inference with minimal accuracy impact. However, existing methods often suffer accuracy degradation without full-model, sparsity-aware fine-tuning. This paper presents Wanda++, a novel pruning framework that outperforms state-of-the-art methods by utilizing decoder-block-level **regional** gradients. Specifically, Wanda++ is the first to incorporate regional gradients into the pruning score, and it proposes an efficient regional optimization method that minimizes pruning-induced discrepancies between dense and sparse decoder-block outputs. Notably, Wanda++ improves perplexity by up to 32% over Wanda on language modeling and generalizes effectively to downstream tasks. Moreover, despite updating weights through regional optimization, Wanda++ remains orthogonal to sparsity-aware fine-tuning, and combining it with LoRA fine-tuning reduces perplexity even further. Our approach is lightweight, pruning a 7B LLaMA model in under 10 minutes on a single H100 GPU.
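The two ingredients described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's exact implementation: it assumes a pruning score of the form (α·|∇W| + ‖X‖₂)·|W| (Wanda's |W|·‖X‖₂ metric augmented with a regional-gradient magnitude term) and a plain MSE objective between dense and sparse decoder-block outputs; the helper names, the `alpha` factor, and the choice of Adam are all assumptions for illustration.

```python
# Hedged sketch of a regional-gradient pruning score plus block-level
# regional optimization. Score form, alpha, and helper names are assumed.
import torch


def regional_gradient_score(weight, act_norm, grad, alpha=1.0):
    """Per-weight importance: Wanda's |W| * ||X||_2 term plus an assumed
    regional-gradient magnitude term alpha * |grad|."""
    # act_norm: per-input-feature L2 norm over the calibration batch,
    # shape (in_features,); broadcast across output rows.
    return (alpha * grad.abs() + act_norm.unsqueeze(0)) * weight.abs()


def prune_rowwise(weight, score, sparsity=0.5):
    """Zero the lowest-scoring weights within each output row
    (per-output comparison group, as in Wanda)."""
    k = int(weight.shape[1] * sparsity)
    idx = torch.topk(score, k, dim=1, largest=False).indices
    mask = torch.ones_like(weight, dtype=torch.bool)
    mask.scatter_(1, idx, False)
    return weight * mask, mask


def regional_optimize(block, dense_out, calib_in, reapply_masks,
                      steps=10, lr=1e-5):
    """Nudge a pruned decoder block so its output matches the dense block's
    output on calibration inputs (assumed MSE objective and optimizer)."""
    opt = torch.optim.Adam(block.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # dense_out is precomputed from the unpruned block and detached.
        loss = torch.nn.functional.mse_loss(block(calib_in), dense_out)
        loss.backward()
        opt.step()
        reapply_masks(block)  # keep pruned weights at zero after each step


if __name__ == "__main__":
    W = torch.randn(8, 16)            # toy weight matrix
    X = torch.randn(32, 16)           # toy calibration activations
    G = torch.randn_like(W)           # stand-in for a regional gradient
    score = regional_gradient_score(W, X.norm(dim=0), G, alpha=0.5)
    W_sparse, mask = prune_rowwise(W, score, sparsity=0.5)
    print("kept fraction:", mask.float().mean().item())
```

In this reading, the regional gradient is taken with respect to a decoder-block-level loss rather than a full-model loss, which is what keeps the method lightweight enough to prune a 7B model in minutes.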
- Yifan Yang (578 papers)
- Kai Zhen (18 papers)
- Bhavana Ganesh (5 papers)
- Aram Galstyan (142 papers)
- Goeric Huybrechts (15 papers)
- Markus Müller (114 papers)
- Jonas M. Kübler (10 papers)
- Rupak Vignesh Swaminathan (10 papers)
- Athanasios Mouchtaris (31 papers)
- Sravan Babu Bodapati (7 papers)
- Nathan Susanj (12 papers)
- Zheng Zhang (488 papers)
- Jack FitzGerald (11 papers)
- Abhishek Kumar (172 papers)