Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications (2310.00867v3)

Published 2 Oct 2023 in cs.CL and cs.AI

Abstract: Compressing LLMs often leads to reduced performance, especially for knowledge-intensive tasks. In this work, we dive into how compression damages LLMs' inherent knowledge and the possible remedies. We start by proposing two conjectures on the nature of the damage: one is certain knowledge being forgotten (or erased) after LLM compression, hence necessitating the compressed model to (re)learn from data with additional parameters; the other presumes that knowledge is internally displaced and hence one requires merely "inference re-direction" with input-side augmentation such as prompting, to recover the knowledge-related performance. Extensive experiments are then designed to (in)validate the two conjectures. We observe the promise of prompting in comparison to model tuning; we further unlock prompting's potential by introducing a variant called Inference-time Dynamic Prompting (IDP), that can effectively increase prompt diversity without incurring any inference overhead. Our experiments consistently suggest that compared to the classical re-training alternatives such as LoRA, prompting with IDP leads to better or comparable post-compression performance recovery, while saving the extra parameter size by 21x and reducing inference latency by 60%. Our experiments hence strongly endorse the conjecture of "knowledge displaced" over "knowledge forgotten", and shed light on a new efficient mechanism to restore compressed LLM performance. We additionally visualize and analyze the different attention and activation patterns between prompted and re-trained models, demonstrating they achieve performance recovery in two different regimes.
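
The abstract describes Inference-time Dynamic Prompting (IDP) only at a high level: a pool of prompts provides diversity, yet inference cost does not grow with that diversity. The sketch below is a minimal, hypothetical illustration of one way such a scheme could work, assuming IDP keeps a small pool of pre-tuned soft prompts and picks one per input by similarity to the input embedding, so only a single prompt's worth of extra tokens is ever processed. The class name `DynamicPromptSelector`, the pool size, and the cosine-similarity selection rule are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of inference-time dynamic prompting (IDP).
# Assumption: a pool of soft prompts (e.g., tuned to recover a compressed
# LLM's knowledge) is scored against each input, and only the best-matching
# prompt is prepended. The paper's actual selection mechanism may differ.
import torch
import torch.nn.functional as F


class DynamicPromptSelector(torch.nn.Module):
    def __init__(self, num_prompts: int, prompt_len: int, hidden_dim: int):
        super().__init__()
        # Pool of soft prompts in the model's embedding space.
        self.prompt_pool = torch.nn.Parameter(
            torch.randn(num_prompts, prompt_len, hidden_dim) * 0.02
        )

    @torch.no_grad()
    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden_dim) token embeddings of the query.
        query = input_embeds.mean(dim=1)              # (batch, hidden_dim)
        keys = self.prompt_pool.mean(dim=1)           # (num_prompts, hidden_dim)
        scores = F.cosine_similarity(                 # (batch, num_prompts)
            query.unsqueeze(1), keys.unsqueeze(0), dim=-1
        )
        best = scores.argmax(dim=-1)                  # (batch,)
        chosen = self.prompt_pool[best]               # (batch, prompt_len, hidden_dim)
        # Prepend the selected soft prompt; the extra sequence length is one
        # prompt long regardless of how many prompts are in the pool.
        return torch.cat([chosen, input_embeds], dim=1)
```

Under this reading, prompt diversity comes from the size of the pool, while the per-input cost matches that of a single static soft prompt; how the paper actually realizes "no inference overhead" may differ from this similarity-based selection.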

References (30)
  1. Semantic parsing on Freebase from question-answer pairs. In Conference on Empirical Methods in Natural Language Processing, 2013. URL https://api.semanticscholar.org/CorpusID:6401679.
  2. PIQA: Reasoning about physical commonsense in natural language, 2019.
  3. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023. URL https://api.semanticscholar.org/CorpusID:258564349.
  4. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018. URL https://api.semanticscholar.org/CorpusID:3922816.
  5. SparseGPT: Massive language models can be accurately pruned in one-shot. arXiv preprint arXiv:2301.00774, 2023. URL https://api.semanticscholar.org/CorpusID:255372747.
  6. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022. URL https://api.semanticscholar.org/CorpusID:253237200.
  7. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
  8. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, 2019. URL https://api.semanticscholar.org/CorpusID:59599816.
  9. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. URL https://api.semanticscholar.org/CorpusID:235458009.
  10. Accelerated sparse neural training: A provable and efficient method to find N:M transposable masks. arXiv preprint arXiv:2102.08124, 2021. URL https://api.semanticscholar.org/CorpusID:231934142.
  11. Accurate post training quantization with small calibration sets. In International Conference on Machine Learning, 2021. URL https://api.semanticscholar.org/CorpusID:235825979.
  12. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension, 2017.
  13. The power of scale for parameter-efficient prompt tuning. In Conference on Empirical Methods in Natural Language Processing, 2021. URL https://api.semanticscholar.org/CorpusID:233296808.
  14. Prefix-Tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021. URL https://api.semanticscholar.org/CorpusID:230433941.
  15. Deja Vu: Contextual sparsity for efficient LLMs at inference time. In International Conference on Machine Learning, 2023. URL https://api.semanticscholar.org/CorpusID:260815690.
  16. Pointer sentinel mixture models, 2016.
  17. OpenAI. GPT-4 technical report, 2023.
  18. The LAMBADA dataset, August 2016. URL https://doi.org/10.5281/zenodo.2630551.
  19. AdapterFusion: Non-destructive task composition for transfer learning. arXiv preprint arXiv:2005.00247, 2020. URL https://api.semanticscholar.org/CorpusID:218470208.
  20. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
  21. Learning multiple visual domains with residual adapters. arXiv preprint arXiv:1705.08045, 2017. URL https://api.semanticscholar.org/CorpusID:215826266.
  22. WinoGrande: An adversarial Winograd Schema Challenge at scale, 2019.
  23. High-throughput generative inference of large language models with a single GPU. In International Conference on Machine Learning, 2023. URL https://api.semanticscholar.org/CorpusID:257495837.
  24. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. URL https://api.semanticscholar.org/CorpusID:257219404.
  25. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209, 2017. URL https://api.semanticscholar.org/CorpusID:1553193.
  26. SmoothQuant: Accurate and efficient post-training quantization for large language models. arXiv preprint arXiv:2211.10438, 2022. URL https://api.semanticscholar.org/CorpusID:253708271.
  27. Compress, then prompt: Improving accuracy-efficiency trade-off of LLM inference with transferable prompt. arXiv preprint arXiv:2305.11186, 2023. URL https://api.semanticscholar.org/CorpusID:258823240.
  28. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers, 2022.
  29. HellaSwag: Can a machine really finish your sentence?, 2019.
  30. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022. URL https://api.semanticscholar.org/CorpusID:248496292.