NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models (2402.09773v2)
Abstract: The considerable size of LLMs presents notable deployment challenges, particularly on resource-constrained hardware. Structured pruning offers an effective means to compress LLMs, reducing storage costs and enhancing inference speed. In this work, we study data-efficient and resource-efficient structured pruning methods to obtain smaller yet still powerful models. Knowledge distillation is well suited to pruning, as the intact model can serve as an excellent teacher for pruned students; for LLMs, however, it becomes challenging due to memory constraints. To address this, we propose NutePrune, an efficient progressive numerous-teacher pruning method. NutePrune mitigates excessive memory costs by loading only one intact model and combining it with various masks and LoRA modules, enabling it to switch seamlessly between teacher and student roles. This approach allows us to leverage numerous teachers with varying capacities to progressively guide the pruned model, enhancing overall performance. Extensive experiments across various tasks demonstrate the effectiveness of NutePrune. In LLaMA-7B zero-shot experiments, NutePrune retains 97.17% of the original model's performance at 20% sparsity and 95.07% at 25% sparsity. Our code is available at https://github.com/Lucius-lsr/NutePrune.
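As a rough illustration of how a single loaded model can alternate between teacher and student roles, the PyTorch sketch below wraps a frozen linear layer with a learnable channel mask and a LoRA adapter, and distills the unmasked (teacher) output into the masked-plus-LoRA (student) output. The names `MaskedLoRALinear` and `kd_step`, the sigmoid mask relaxation, and the MSE distillation loss are illustrative assumptions, not the paper's actual implementation, which uses hard-concrete L0 masks and progressively schedules teachers of intermediate sparsity.

```python
# Minimal sketch of single-model teacher/student role switching (assumed design,
# not the NutePrune codebase): one frozen base layer, a learnable mask, and LoRA.
import torch
import torch.nn.functional as F

class MaskedLoRALinear(torch.nn.Module):
    """A frozen linear layer with an output-channel mask and a LoRA adapter.

    Disabling the mask and LoRA recovers the intact (teacher) behavior;
    enabling them yields the pruned, adapted (student) behavior.
    """
    def __init__(self, base: torch.nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # the intact model stays frozen
        out_f, in_f = base.out_features, base.in_features
        self.mask_logits = torch.nn.Parameter(torch.zeros(out_f))        # learnable mask
        self.lora_a = torch.nn.Parameter(torch.randn(rank, in_f) * 0.01)  # LoRA down-proj
        self.lora_b = torch.nn.Parameter(torch.zeros(out_f, rank))        # LoRA up-proj
        self.student_mode = True

    def forward(self, x):
        y = self.base(x)
        if not self.student_mode:            # teacher role: intact weights only
            return y
        y = y + x @ self.lora_a.T @ self.lora_b.T   # student role: add LoRA update
        mask = torch.sigmoid(self.mask_logits)      # relaxed binary channel mask
        return y * mask                             # suppress pruned channels


def kd_step(layer: MaskedLoRALinear, x: torch.Tensor) -> torch.Tensor:
    """One distillation step: the same module plays teacher, then student."""
    layer.student_mode = False
    with torch.no_grad():
        teacher_out = layer(x)               # intact model as teacher, no second copy in memory
    layer.student_mode = True
    student_out = layer(x)                   # masked + LoRA model as student
    return F.mse_loss(student_out, teacher_out)


# Usage: only the mask and LoRA parameters receive gradients.
layer = MaskedLoRALinear(torch.nn.Linear(16, 16))
opt = torch.optim.AdamW([p for p in layer.parameters() if p.requires_grad], lr=1e-3)
loss = kd_step(layer, torch.randn(4, 16))
loss.backward()
opt.step()
```

In the full progressive scheme, partially pruned versions of the same masked model would additionally serve as intermediate-capacity teachers for sparser students, so memory usage still corresponds to a single loaded model.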
Authors: Shengrui Li, Xueting Han, Jing Bai, Junzhe Chen