LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation (2306.11222v2)

Published 20 Jun 2023 in cs.LG and cs.CL

Abstract: Transformer models have achieved remarkable results in various natural language tasks, but they are often prohibitively large, requiring massive memories and computational resources. To reduce the size and complexity of these models, we propose LoSparse (Low-Rank and Sparse approximation), a novel model compression technique that approximates a weight matrix by the sum of a low-rank matrix and a sparse matrix. Our method combines the advantages of both low-rank approximations and pruning, while avoiding their limitations. Low-rank approximation compresses the coherent and expressive parts in neurons, while pruning removes the incoherent and non-expressive parts in neurons. Pruning enhances the diversity of low-rank approximations, and low-rank approximation prevents pruning from losing too many expressive neurons. We evaluate our method on natural language understanding, question answering, and natural language generation tasks. We show that it significantly outperforms existing compression methods.

An Overview of "LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation"

The computational demands imposed by the vast parameter space of large transformer-based models necessitate approaches that reduce model size without a significant loss in performance. In "LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation," the authors introduce LoSparse (Low-Rank and Sparse approximation), a model compression technique designed to address this challenge.

Technical Approach

LoSparse combines low-rank approximation with structured pruning to compress transformer models. The method decomposes each weight matrix into the sum of a low-rank representation and a sparse component (a minimal sketch follows the list below). This dual approach confers several benefits:

  • Expressive Compression: The low-rank matrix captures and compresses the coherent, expressive parts of the weight matrices. This is crucial as it preserves the model's ability to generalize and maintain performance across various tasks.
  • Structured Pruning: The sparse component is obtained by structurally pruning away the non-expressive parts of the weights, retaining only the neurons that the low-rank factorization cannot capture. This targets redundancy and reduces model size while avoiding the outright removal of intrinsically valuable neurons.
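
To make the decomposition concrete, here is a minimal sketch in PyTorch. It is illustrative only, not the authors' implementation: the low-rank factors are initialized from a truncated SVD, the residual is pruned column-wise, and the rank, the number of retained columns, and the column-norm importance score are placeholder choices.

```python
# Minimal, illustrative sketch of a low-rank + structured-sparse weight
# approximation in the spirit of LoSparse. Not the authors' implementation:
# rank, kept_columns, and the column-norm importance score are placeholders.
import torch

def low_rank_plus_sparse(W: torch.Tensor, rank: int, kept_columns: int):
    """Approximate W (d_out x d_in) as U @ V + S, where U @ V has rank `rank`
    and S is column-sparse (structured sparsity over input neurons)."""
    # Low-rank factors from a truncated SVD of W.
    U_full, sigma, Vh = torch.linalg.svd(W, full_matrices=False)
    U = U_full[:, :rank] * sigma[:rank]   # (d_out, rank), singular values folded in
    V = Vh[:rank, :]                      # (rank, d_in)

    # Residual that the low-rank part cannot capture.
    residual = W - U @ V

    # Structured pruning of the residual: keep only the top-k columns,
    # scored here by their L2 norm as a stand-in importance measure.
    scores = residual.norm(dim=0)
    keep = torch.topk(scores, kept_columns).indices
    S = torch.zeros_like(residual)
    S[:, keep] = residual[:, keep]
    return U, V, S

# Example: compress a 768x768 weight into rank 32 plus 64 retained columns.
W = torch.randn(768, 768)
U, V, S = low_rank_plus_sparse(W, rank=32, kept_columns=64)
rel_error = torch.norm(W - (U @ V + S)) / torch.norm(W)
print(f"relative reconstruction error: {rel_error.item():.3f}")
```

In the paper the sparse component is not fixed in one shot like this; its columns are pruned gradually during fine-tuning based on estimated importance scores.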

Evaluation and Results

The performance of LoSparse is evaluated across a diverse set of natural language processing tasks, including natural language understanding (NLU), question answering (QA), and natural language generation (NLG). The paper reports significant improvements over existing pruning and low-rank approximation methods on several key benchmarks:

  • Natural Language Understanding: On the GLUE benchmark, LoSparse achieves marked improvements over iterative and movement pruning. For instance, on MNLI with only 10% of the model parameters retained, LoSparse improves accuracy by more than 2 percentage points over the best existing methods.
  • Question Answering: On SQuAD v1.1, LoSparse consistently outperforms existing techniques, indicating robustness in scenarios that demand high sparsity. With a mere 5% of parameters retained, it still beats iterative pruning by roughly 3 F1 points.
  • Natural Language Generation: For summarization on XSum, LoSparse gains nearly 3 ROUGE-1 points over the best-performing baseline at a 30% remaining ratio (a rough parameter-accounting sketch follows this list).
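
For intuition about what a given remaining ratio implies for this factorization, the back-of-the-envelope accounting below (ours, not the paper's exact bookkeeping) counts the parameters of the low-rank factors plus the retained columns of the sparse matrix for a single layer.

```python
# Rough parameter accounting for one d_out x d_in layer compressed as U @ V + S.
# Illustrative only; the paper's exact counting may differ.
def remaining_ratio(d_out: int, d_in: int, rank: int, kept_columns: int) -> float:
    dense = d_out * d_in
    compressed = d_out * rank + rank * d_in + d_out * kept_columns
    return compressed / dense

# Example: a 768x768 layer with rank 32 and 64 kept columns keeps about 17% of its parameters.
print(f"{remaining_ratio(768, 768, 32, 64):.2%}")
```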

Theoretical Implications

Theoretically, LoSparse illustrates how a low-rank approximation captures the coherent part of neuron behavior through a shared subspace, while the structured-sparse component compensates for the limitations of low-rank methods in approximating more diverse, neuron-specific behaviors. This synergy is essential for balancing model compression against the retention of critical task-specific capabilities.
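
Stated in our own notation (a paraphrase of the decomposition described above, not the paper's exact formulation), each weight matrix is approximated as

```latex
W \approx UV + S, \qquad
U \in \mathbb{R}^{d_1 \times r}, \quad
V \in \mathbb{R}^{r \times d_2}, \quad
r \ll \min(d_1, d_2),
```

where the columns of U span the shared low-rank subspace that captures coherent neuron behavior, and S is a column-sparse matrix retaining the few neurons whose behavior falls outside that subspace. Dropping either term hurts: the low-rank term alone cannot represent the outlying neurons, while a sparse term alone must discard most of the coherent signal to stay within its budget.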

Practical Implications and Future Directions

Practically, LoSparse offers a promising direction for deploying LLMs in resource-constrained environments, where computational efficiency and memory footprint are crucial. The method's ability to pair effectively with other techniques, such as knowledge distillation and CoFi, highlights its flexibility and potential for broader application in model optimization strategies.

Looking forward, further work could explore adaptive or dynamic adjustment of the balance between the low-rank and sparse components during training or deployment, tuning it to task complexity or resource constraints. Extending the approach beyond NLP, for example to computer vision and speech recognition, could further establish LoSparse as a versatile framework for model compression.

In summary, LoSparse represents a significant step toward more efficient deployment of large-scale models. Its integration of low-rank and sparse approximations demonstrates that high compression rates need not come at the expense of performance, pointing toward more scalable, efficient transformer models.

Authors (7)
  1. Yixiao Li (14 papers)
  2. Yifan Yu (18 papers)
  3. Qingru Zhang (15 papers)
  4. Chen Liang (140 papers)
  5. Pengcheng He (60 papers)
  6. Weizhu Chen (128 papers)
  7. Tuo Zhao (131 papers)