Pruning as a Domain-specific LLM Extractor (2405.06275v1)

Published 10 May 2024 in cs.CL

Abstract: LLMs have exhibited remarkable proficiency across a wide array of NLP tasks. However, the escalation in model size also engenders substantial deployment costs. While a few efforts have explored model pruning techniques to reduce the size of LLMs, they mainly center on general or task-specific weights, leading to suboptimal performance on domain-specific challenges: the pruned models lack either specificity for the target domain or generality across tasks. This work introduces D-Pruner, an unstructured dual-pruning methodology for domain-specific compression of LLMs. It extracts a compressed, domain-specific, and task-agnostic LLM by identifying the weights that are pivotal both for general capabilities, such as linguistic competence and multi-task solving, and for domain-specific knowledge. More specifically, we first assess general weight importance by quantifying the error incurred upon each weight's removal, using an open-domain calibration dataset. We then use this general importance to refine the training loss so that it preserves generality when fitting to a specific domain. Finally, by efficiently approximating weight importance under the refined training loss on a domain-specific calibration dataset, we obtain a pruned model that emphasizes both generality and specificity. Our comprehensive experiments across various tasks in the healthcare and legal domains demonstrate the effectiveness of D-Pruner for domain-specific compression. Our code is available at https://github.com/psunlpgroup/D-Pruner.
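
The pruning flow described in the abstract can be illustrated with a short sketch: score every weight on an open-domain calibration set, score it again on a domain-specific calibration set, combine the two scores, and zero out the lowest-scoring weights. This is a minimal sketch under stated assumptions rather than the paper's implementation: the diagonal, gradient-based importance formula, the mixing weight alpha, the toy feed-forward model, and the random stand-in calibration batches are all illustrative choices, not taken from D-Pruner.

```python
# Minimal sketch of a dual-importance, unstructured pruning pass (assumptions noted above).
import torch
import torch.nn as nn

def importance_scores(model, loss_fn, batches):
    # Diagonal, gradient-based importance: accumulate (gradient * weight)^2 per weight
    # over the calibration batches, as a cheap proxy for the loss increase if the
    # weight were set to zero. Only weight matrices (dim >= 2) are scored.
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.dim() >= 2}
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if n in scores and p.grad is not None:
                scores[n] += (p.grad.detach() * p.detach()) ** 2
    return scores

def dual_prune(model, general_scores, domain_scores, sparsity=0.5, alpha=0.5):
    # Combine the two importance maps and zero the fraction `sparsity` of weights
    # with the lowest combined score in each weight matrix (unstructured pruning).
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n not in general_scores:
                continue
            combined = alpha * general_scores[n] + (1 - alpha) * domain_scores[n]
            k = int(sparsity * combined.numel())
            if k == 0:
                continue
            threshold = combined.flatten().kthvalue(k).values
            p[combined <= threshold] = 0.0

def make_batches():
    # Random tensors standing in for a calibration set; a real run would use
    # an LLM with open-domain / domain-specific text batches instead.
    return [(torch.randn(16, 32), torch.randint(0, 8, (16,))) for _ in range(4)]

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))
loss_fn = nn.CrossEntropyLoss()

general = importance_scores(model, loss_fn, make_batches())  # open-domain calibration
domain = importance_scores(model, loss_fn, make_batches())   # domain-specific calibration
dual_prune(model, general, domain, sparsity=0.5, alpha=0.5)

zeros = sum((p == 0).sum().item() for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"overall sparsity after pruning: {zeros / total:.2f}")
```

Note that, per the abstract, D-Pruner computes the domain-side importance under a refined training loss that is regularized by the general importance scores; that regularization term is omitted above to keep the sketch short.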

Authors (8)
  1. Nan Zhang (144 papers)
  2. Yanchi Liu (41 papers)
  3. Xujiang Zhao (26 papers)
  4. Wei Cheng (175 papers)
  5. Runxue Bao (18 papers)
  6. Rui Zhang (1138 papers)
  7. Prasenjit Mitra (58 papers)
  8. Haifeng Chen (99 papers)
Citations (5)