Data-free Weight Compress and Denoise for Large Language Models (2402.16319v1)

Published 26 Feb 2024 in cs.CL

Abstract: LLMs are reshaping the research landscape in artificial intelligence, particularly as model parameters scale up significantly, unlocking remarkable capabilities across various domains. Nevertheless, scaling model parameters is constrained by limits on GPU memory and computational speed. To address these constraints, various weight compression methods have emerged, such as pruning and quantization. Given the low-rank nature of weight matrices in LLMs, reducing weights through matrix decomposition holds significant potential and promise. In this paper, drawing upon the intrinsic structure of LLMs, we propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices. Notably, our method requires no additional corpus and remains orthogonal to pruning and quantization methods. We prune 80% of the parameters while retaining 93.43% of the original performance without any calibration data. Additionally, we explore the fundamental properties of LLM weight matrices that have undergone Rank-k Approximation and conduct comprehensive experiments to elucidate our hypotheses.
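
Below is a minimal, illustrative sketch of the underlying idea of rank-k weight compression via truncated SVD. It is not the paper's Data-free Joint Rank-k Approximation (which the abstract describes as a joint, corpus-free compression of the parameter matrices); the matrix `W`, the rank `k`, and the helper `rank_k_approx` are hypothetical names used only for illustration, and PyTorch is assumed.

```python
# Illustrative sketch only: generic truncated-SVD rank-k compression of a
# single weight matrix. Not the paper's joint, data-free method.
import torch

def rank_k_approx(W: torch.Tensor, k: int) -> torch.Tensor:
    """Best rank-k approximation of W in Frobenius norm (Eckart-Young)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

# Hypothetical example: a 4096x4096 projection matrix truncated to rank 512.
W = torch.randn(4096, 4096)
k = 512
W_k = rank_k_approx(W, k)

# Storing the factors U_k (4096 x k), S_k (k), and Vh_k (k x 4096) instead of
# W reduces the parameter count from 4096*4096 to roughly 2*4096*k + k.
ratio = (2 * 4096 * k + k) / W.numel()
rel_err = (torch.linalg.norm(W - W_k) / torch.linalg.norm(W)).item()
print(f"parameter ratio: {ratio:.2%}, relative error: {rel_err:.4f}")
```

Because such a factorization operates on the weights alone, it needs no calibration data and can be combined with pruning or quantization, which is the orthogonality the abstract claims for the proposed method.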

Authors (7)
  1. Runyu Peng (4 papers)
  2. Yunhua Zhou (27 papers)
  3. Qipeng Guo (72 papers)
  4. Yang Gao (761 papers)
  5. Hang Yan (86 papers)
  6. Xipeng Qiu (257 papers)
  7. Dahua Lin (336 papers)
Citations (1)