Mitigation strategies in pruning calibration pipelines

Develop effective mitigation strategies within the calibration pipeline used by unstructured pruning algorithms for large language models—specifically Magnitude pruning, SparseGPT, and Wanda—that reliably prevent pruning-triggered attacks while minimizing degradation on standard utility benchmarks. Concretely, design security-aware calibration procedures (e.g., dataset selection and scoring routines) that suppress attack activation across models, sparsity levels, and pruning configurations without incurring substantial performance loss.
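To make the design space concrete, the following is a minimal sketch of how a security-aware calibration set could be assembled: ordinary calibration text is mixed with jailbreak queries paired with refusals before the data is handed to a calibration-based pruner. The function name, mixing ratio, and refusal string are illustrative assumptions, not the paper's recipe.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CalibSample:
    text: str

def build_security_aware_calibration(
    benign_texts: List[str],
    jailbreak_queries: List[str],
    refusal: str = "I can't help with that request.",
    security_fraction: float = 0.25,  # assumed mixing ratio, not from the paper
) -> List[CalibSample]:
    """Mix standard calibration text with (jailbreak query, refusal) pairs.

    The resulting list would be tokenized and fed to a calibration-based
    pruner (e.g., SparseGPT or Wanda) in place of its usual calibration data.
    """
    n_security = int(security_fraction * len(benign_texts))
    security_samples = [
        CalibSample(text=f"User: {q}\nAssistant: {refusal}")
        for q in jailbreak_queries[:n_security]
    ]
    benign_samples = [CalibSample(text=t) for t in benign_texts]
    return benign_samples + security_samples
```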

Background

The paper demonstrates a pruning-activated attack in which a model appears benign before pruning but exhibits malicious behavior after users apply common unstructured pruning algorithms (Magnitude, SparseGPT, Wanda) as implemented in vLLM. To explore defenses, the authors evaluate security-aware calibration, where the pruning calibration dataset contains jailbreak queries paired with refusals.
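The calibration set is a plausible lever because calibration-based pruners score weights using activations gathered from that data; Wanda, for instance, scores each weight by its magnitude times the norm of the corresponding input activations. The sketch below illustrates this dependence for a single linear layer; it is a simplified, didactic version, not the vLLM or paper implementation.

```python
import torch

def wanda_prune_layer(weight: torch.Tensor,
                      calib_inputs: torch.Tensor,
                      sparsity: float = 0.5) -> torch.Tensor:
    """Simplified Wanda-style pruning of one linear layer.

    weight:       (out_features, in_features)
    calib_inputs: (n_tokens, in_features) activations from the calibration set
    """
    # Per-input-channel activation norm over the calibration tokens; this is
    # where the choice of calibration data enters the pruning decision.
    act_norm = calib_inputs.norm(p=2, dim=0)            # (in_features,)
    scores = weight.abs() * act_norm.unsqueeze(0)       # (out, in)

    # Keep the top-(1 - sparsity) weights per output row, zero the rest.
    k = int(weight.shape[1] * (1.0 - sparsity))
    topk_idx = scores.topk(k, dim=1).indices
    mask = torch.zeros_like(weight, dtype=torch.bool)
    mask.scatter_(1, topk_idx, True)
    return weight * mask
```

Because the scores depend on calibration activations, swapping in refusal-paired jailbreak data changes which weights survive; Magnitude pruning, by contrast, ignores calibration data entirely, which is consistent with calibration-based defenses acting only on SparseGPT and Wanda.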

Empirically, security-aware calibration substantially reduces attack success rates for SparseGPT but has limited effect for Wanda and introduces notable utility degradation. Consequently, the authors conclude that calibration alone is insufficient and explicitly pose the development of improved calibration-based mitigation methods as an open question.

References

Overall, security-aware calibration by itself is insufficient to reliably prevent pruning-triggered attacks in our setting. We leave methods for a better mitigation strategy in a calibration pipeline as an interesting and important open question for future work.

Fewer Weights, More Problems: A Practical Attack on LLM Pruning (2510.07985 - Egashira et al., 9 Oct 2025) in Section 6.3 (Potential Defenses)