
A Comprehensive Survey of Compression Algorithms for Language Models (2401.15347v1)

Published 27 Jan 2024 in cs.CL and cs.AI

Abstract: How can we compress LLMs without sacrificing accuracy? The number of compression algorithms for LLMs is rapidly growing to benefit from remarkable advances of recent LLMs without side effects due to the gigantic size of LLMs, such as increased carbon emissions and expensive maintenance fees. While numerous compression algorithms have shown remarkable progress in compressing LLMs, it ironically becomes challenging to capture emerging trends and identify the fundamental concepts underlying them due to the excessive number of algorithms. In this paper, we survey and summarize diverse compression algorithms including pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design. We not only summarize the overall trend of diverse compression algorithms but also select representative algorithms and provide in-depth analyses of them. We discuss the value of each category of compression algorithms, and the desired properties of low-cost compression algorithms which have a significant impact due to the emergence of LLMs. Finally, we introduce promising future research topics based on our survey results.

Overview

The landscape of language model (LM) compression is vast, with an array of algorithms vying to reduce the size and computational demands of these models without compromising their accuracy. This paper presents a comprehensive overview of such algorithms, including pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design. The analysis explores the intricacies of each approach, evaluates their performance, and compares their effectiveness. Distinguishing between high-cost and low-cost approaches, the paper also underscores critical attributes that successful LM compression algorithms should possess.

Representative Compression Algorithms

Among the several algorithms surveyed, a few stand out for their contributions to the field. SparseGPT makes significant strides in pruning methodology, successfully handling LLMs and extending its pruning technique to accommodate semi-structured sparsity patterns such as 2:4 sparsity. The algorithm frames pruning as a layer-wise reconstruction problem in the spirit of optimal brain surgeon (OBS)-style weight removal and notably curtails the computational demands of the required Hessian inversion.
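
To make the 2:4 pattern concrete, the sketch below zeroes the two smallest-magnitude weights in every group of four consecutive weights. It is only an illustration of the sparsity pattern itself, written in NumPy with a hypothetical prune_2_4 helper; SparseGPT's actual procedure additionally updates the surviving weights using approximate second-order information rather than selecting by magnitude alone.

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero the 2 smallest-magnitude entries in every group of 4.

    Illustrative magnitude-based selection only; SparseGPT also adjusts the
    surviving weights to compensate for the removed ones. The number of
    weights must be divisible by 4.
    """
    flat = weights.copy().reshape(-1, 4)                 # groups of 4 consecutive weights
    idx = np.argsort(np.abs(flat), axis=1)[:, :2]        # indices of the 2 smallest per group
    np.put_along_axis(flat, idx, 0.0, axis=1)            # zero them out
    return flat.reshape(weights.shape)

W = np.random.randn(8, 16).astype(np.float32)
W_sparse = prune_2_4(W)
assert (W_sparse.reshape(-1, 4) != 0).sum(axis=1).max() <= 2
```

The appeal of this pattern is that sparse tensor cores on recent GPUs can execute it directly, so the nominal 50% sparsity translates into real speedups in a way that unstructured pruning often does not.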

In quantization, OPTQ emerges as a potent tool for compressing the colossal parameter matrices of LLMs. The key to OPTQ's success is that it goes beyond plain round-to-nearest quantization: weights are quantized sequentially, and the remaining weights are adjusted after each step to compensate for the accumulated rounding error, mitigating the degradation in precision. Its strength is further bolstered by subsequent works that refine the approach to minimize accuracy loss, especially in dealing with activations.
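
For reference, the snippet below implements plain per-group round-to-nearest weight quantization, the simple baseline that OPTQ improves upon; the function name, bit width, and group size are illustrative choices, not taken from the paper. OPTQ itself quantizes weights column by column and uses inverse-Hessian information from a layer-wise reconstruction objective to adjust the not-yet-quantized weights and absorb the rounding error.

```python
import numpy as np

def quantize_rtn(w: np.ndarray, n_bits: int = 4, group_size: int = 64):
    """Per-group symmetric round-to-nearest (RTN) weight quantization.

    This is the baseline that OPTQ refines with error compensation.
    Assumes the number of weights is divisible by group_size.
    """
    qmax = 2 ** (n_bits - 1) - 1                          # 7 for 4-bit symmetric
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    w_hat = (q * scales).reshape(w.shape)                 # dequantized weights used at inference
    return q.reshape(w.shape), scales, w_hat

W = np.random.randn(128, 256).astype(np.float32)
q, scales, W_hat = quantize_rtn(W)
print("mean absolute quantization error:", np.abs(W - W_hat).mean())
```

At 4 bits this already shrinks weight storage to roughly a quarter of its FP16 size; OPTQ's error-compensation step is what keeps accuracy acceptable at such low bit widths.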

Low-Rank Adaptation (LoRA) is identified as a pivotal method for fine-tuning LMs while updating only a small number of parameters, thereby reducing the memory overhead traditionally associated with fine-tuning large models. By freezing the pretrained weights and training only injected low-rank matrices, LoRA shrinks the optimizer state and gradient computation required during fine-tuning, marking it as essential for adapting LMs at low cost.
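
A minimal sketch of the mechanism, assuming a PyTorch-style module and illustrative shapes, is shown below: the pretrained weight is frozen, and only two small matrices A and B are trained, so the low-rank update B(Ax) is added to the frozen layer's output.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update.

    Forward pass: y = Wx + (alpha / r) * B(Ax), where W is frozen and only
    A (r x in_features) and B (out_features x r) receive gradients.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                 # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # BA = 0 at init: no-op start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable} of {total}")
```

Because only A and B carry gradients, the optimizer state scales with the rank r rather than with the full weight matrix, which is where the memory savings come from.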

Desired Properties

The paper highlights two critical properties that successful low-cost LM compression algorithms must possess. Firstly, direct incorporation of task-specific objective functions is vital; proxy objectives such as local layer-wise reconstruction error can lead to suboptimal results. Secondly, an iterative compression process proves advantageous, as the error introduced at each step stays small enough to be absorbed, thereby preserving the knowledge acquired during pre-training.
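
The second property can be pictured as a loop that raises the compression ratio in small steps and monitors the task-level loss, rather than a layer-wise proxy, after each step. The sketch below is a generic iterative magnitude-pruning loop under that reading, not an algorithm taken from the survey; task_loss is a placeholder for the end-task objective being preserved.

```python
import numpy as np

def iterative_prune(weights, task_loss, target_sparsity=0.5, steps=5):
    """Generic iterative magnitude pruning: sparsity is raised gradually and a
    task-level loss (not a layer-wise proxy) is monitored at every step, so each
    iteration only has to absorb a small amount of additional error."""
    masks = {name: np.ones_like(w) for name, w in weights.items()}
    pruned = dict(weights)
    for step in range(1, steps + 1):
        sparsity = target_sparsity * step / steps               # e.g. 10%, 20%, ..., 50%
        for name, w in weights.items():
            magnitude = np.abs(w)
            threshold = np.quantile(magnitude, sparsity)        # prune the smallest weights
            masks[name] = (magnitude > threshold).astype(w.dtype)
        pruned = {name: w * masks[name] for name, w in weights.items()}
        # In practice a brief round of task-aware fine-tuning would run here.
        print(f"step {step}: sparsity={sparsity:.0%}, task loss={task_loss(pruned):.4f}")
    return pruned

# Toy usage with random weights and a placeholder loss.
weights = {"layer1": np.random.randn(64, 64), "layer2": np.random.randn(64, 64)}
toy_loss = lambda ws: float(sum(np.square(w).sum() for w in ws.values())) / 1e4
compressed = iterative_prune(weights, toy_loss)
```

In practice, each step would be followed by a short round of fine-tuning (or a low-cost surrogate), so the small error introduced per step never accumulates into a large accuracy drop.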

Future Research

Looking ahead, several promising research areas have been identified. The quest for efficient iterative algorithms that could further enhance the accuracy of compressed models remains critical, especially for LLMs, where traditional retraining processes are resource-prohibitive. Effective strategies for directly optimizing the target objective function, quantizing the activations of LLMs, and unifying diverse compression algorithms pave the way for future innovation. The fusion of parameter-efficient fine-tuning (PEFT) with traditional high-cost algorithms holds particular promise for reducing the cost of fine-tuning while maintaining accuracy.

Conclusion

The survey concludes that the amalgamation of various compression techniques could lead to unprecedented compression rates for language models, particularly for the increasingly relevant LLMs. The findings and discussions encapsulated in this paper aim to steer future developments in the field, promoting both cost-effective and performance-optimized compression avenues. The ultimate aim is to democratize access to advanced AI capabilities by making LLMs more resource-efficient and, thus, more widely deployable.

References (174)
  1. Abien Fred Agarap. 2018. Deep Learning using Rectified Linear Units (ReLU). arXiv (2018).
  2. Knowledge Distillation from Internal Representations. In AAAI 2020. 7350–7357.
  3. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv (2023).
  4. Program Synthesis with Large Language Models. arXiv (2021).
  5. Layer Normalization. arXiv (2016).
  6. Towards Efficient Post-training Quantization of Pre-trained Language Models. In NeurIPS.
  7. BinaryBERT: Pushing the Limit of BERT Quantization. In IJCNLP 2021, Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). ACL, 4334–4348.
  8. QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm. arXiv (2023).
  9. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv (2013).
  10. Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model. arXiv (2019).
  11. Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing. arXiv (2023).
  12. Language Models are Few-Shot Learners. In NeurIPS 2020.
  13. SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation. arXiv (2017).
  14. INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation. arXiv (2023).
  15. QuIP: 2-Bit Quantization of Large Language Models With Guarantees. arXiv (2023).
  16. AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search. In IJCAI 2020, Christian Bessiere (Ed.). 2463–2469.
  17. Parameter-Efficient Fine-Tuning Design Spaces. In ICLR.
  18. Evaluating Large Language Models Trained on Code. arXiv (2021).
  19. DRONE: Data-aware Low-rank Compression for Large NLP Models. In NeurIPS 2021, Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (Eds.).
  20. A Simple Framework for Contrastive Learning of Visual Representations. In ICML.
  21. Long Short-Term Memory-Networks for Machine Reading. In EMNLP 2016, Jian Su, Xavier Carreras, and Kevin Duh (Eds.). ACL.
  22. Generating Long Sequences with Sparse Transformers. arXiv (2019).
  23. Ikhyun Cho and U Kang. 2022. Pea-KD: Parameter-efficient and accurate Knowledge Distillation on BERT. PLOS ONE 17, 2 (02 2022), 1–12.
  24. A comprehensive survey on model compression and acceleration. AIR 53, 7 (2020), 5113–5155.
  25. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In ICLR 2020.
  26. Training Verifiers to Solve Math Word Problems. arXiv (2021).
  27. Multi-Head Attention: Collaborate Instead of Concatenate. arXiv (2020).
  28. Raj Dabre and Atsushi Fujita. 2019. Recurrent Stacking of Layers for Compact Neural Machine Translation Models. In AAAI 2019. 6292–6299.
  29. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing. In NeurIPS 2020.
  30. Analyzing Redundancy in Pretrained Transformer Models. In EMNLP, Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.).
  31. Universal Transformers. In ICLR 2019.
  32. Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey. IEEE 108, 4 (2020), 485–532.
  33. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv (2022).
  34. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv (2023).
  35. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. arXiv (2023).
  36. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT 2019, Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). ACL, 4171–4186.
  37. William B. Dolan and Chris Brockett. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In IWP 2005. AFNLP.
  38. EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation. In EMNLP 2021. ACL, 1424–1437.
  39. Jonathan Frankle and Michael Carbin. 2019. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In ICLR.
  40. Elias Frantar and Dan Alistarh. 2023. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv (2023).
  41. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv (2022).
  42. OPTQ: Accurate Quantization for Generative Pre-trained Transformers. In ICLR 2023.
  43. LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding. In AAAI 2021. 12830–12838.
  44. Compressing Large-Scale Transformer-Based Models: A Case Study on BERT. TACL 9 (2021), 1061–1080.
  45. A Survey of Quantization Methods for Efficient Neural Network Inference. arXiv (2021).
  46. PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models. In ACL 2023, Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (Eds.). 8065–8079.
  47. PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination. In ICML 2020 (PMLR, Vol. 119). 3690–3699.
  48. Knowledge Distillation of Large Language Models. arXiv (2023).
  49. Manish Gupta and Puneet Agrawal. 2022. Compression of Deep Learning Models for Text: A Survey. ACM TKDD 16, 4 (2022), 61:1–55.
  50. RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation. In NAACL 2022. ACL, 1389–1400.
  51. Learning both Weights and Connections for Efficient Neural Networks. arXiv (2015).
  52. Babak Hassibi and David G. Stork. 1992. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon. In NeurIPS 1992, Stephen Jose Hanson, Jack D. Cowan, and C. Lee Giles (Eds.). Morgan Kaufmann, 164–171.
  53. Measuring Mathematical Problem Solving With the MATH Dataset. In NeurIPS Datasets and Benchmarks 2021, Joaquin Vanschoren and Sai-Kit Yeung (Eds.).
  54. Dan Hendrycks and Kevin Gimpel. 2016. Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units. arXiv (2016).
  55. Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks. JMLR 22 (2021), 241:1–124.
  56. DynaBERT: Dynamic BERT with Adaptive Width and Depth. In NeurIPS 2020, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.).
  57. Parameter-Efficient Transfer Learning for NLP. In ICML 2019. PMLR, 2790–2799.
  58. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv (2017).
  59. Language model compression with weighted low-rank factorization. In ICLR 2022.
  60. LoRA: Low-Rank Adaptation of Large Language Models. In ICLR 2022.
  61. Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML, Francis R. Bach and David M. Blei (Eds.).
  62. Mr.BiQ: Post-Training Non-Uniform Quantization based on Minimizing the Reconstruction Error. In CVPR 2022. IEEE, 12319–12328.
  63. How To Train Your (Compressed) Large Language Model. arXiv (2023).
  64. On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers. In IJCNLP 2021, Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). 4147–4157.
  65. TinyBERT: Distilling BERT for Natural Language Understanding. In EMNLP 2020 (ACL), Trevor Cohn, Yulan He, and Yang Liu (Eds.). 4163–4174.
  66. KDLSQ-BERT: A Quantized Bert Combining Knowledge Distillation with Learned Step Size Quantization. arXiv (2021).
  67. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In ACL 2017, Regina Barzilay and Min-Yen Kan (Eds.). 1601–1611.
  68. Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization. arXiv (2023).
  69. Tutoring Helps Students Learn Better: Improving Knowledge Distillation for BERT with Tutor Network. In EMNLP 2022. ACL, 7371–7382.
  70. I-BERT: Integer-only BERT Quantization. In ICML 2021 (PMLR, Vol. 139), Marina Meila and Tong Zhang (Eds.). 5506–5518.
  71. SqueezeLLM: Dense-and-Sparse Quantization. arXiv (2023).
  72. Full Stack Optimization of Transformer Inference: a Survey. arXiv (2023).
  73. Learned Token Pruning for Transformers. In KDD 2022, Aidong Zhang and Huzefa Rangwala (Eds.). ACM, 784–794.
  74. FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs. arXiv (2023).
  75. Reformer: The Efficient Transformer. In ICLR 2020.
  76. The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models. In EMNLP 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). ACL, 4163–4181.
  77. ZipLM: Hardware-Aware Structured Pruning of Language Models. arXiv (2023).
  78. FP8 Quantization: The Power of the Exponent. In NeurIPS.
  79. Natural Questions: a Benchmark for Question Answering Research. TACL 7 (2019), 452–466.
  80. AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models. In EMNLP 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). 3288–3305.
  81. A Fast Post-Training Pruning Framework for Transformers. In NeurIPS.
  82. Block Pruning For Faster Transformers. In EMNLP 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). ACL, 10619–10629.
  83. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In ICLR 2020.
  84. Optimal Brain Damage. In NeurIPS, David S. Touretzky (Ed.). Morgan Kaufmann.
  85. Masking Adversarial Damage: Finding Adversarial Saliency for Robust and Sparse Network. In CVPR 2022.
  86. OWQ: Lessons learned from activation outliers for weight quantization in large language models. arXiv (2023).
  87. AUBER: Automated BERT Regularization. arXiv abs/2009.14409 (2020).
  88. BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover’s Distance. In EMNLP 2020, Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). ACL, 3009–3018.
  89. Norm Tweaking: High-performance Low-bit Quantization of Large Language Models. arXiv (2023).
  90. Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In ACL/IJCNLP 2021. ACL, 4582–4597.
  91. LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models. arXiv (2023).
  92. Less is More: Task-aware Layer-wise Distillation for Language Model Compression. In ICML 2023 (PMLR, Vol. 202). 20852–20867.
  93. MixKD: Towards Efficient Distillation of Large-scale Language Models. In ICLR 2021.
  94. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv (2023).
  95. Understanding Parameter Sharing in Transformers. arXiv (2023).
  96. Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior. In EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). ACL, 719–730.
  97. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv (2019).
  98. Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation. In CVPR 2022. IEEE, 4932–4942.
  99. EBERT: Efficient BERT Inference with Dynamic Structured Pruning. In IJCNLP 2021, Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). ACL, 4814–4823.
  100. BiT: Robustly Binarized Multi-distilled Transformer. In NeurIPS.
  101. LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. arXiv (2023).
  102. Learning Sparse Neural Networks through L0 Regularization. arXiv (2017).
  103. LightFormer: Light-weight Transformer Using SVD-based Weight Transfer and Parameter Sharing. In ACL 2023, Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (Eds.). 10323–10335.
  104. LLM-Pruner: On the Structural Pruning of Large Language Models. arXiv (2023).
  105. UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning. In ACL 2022. ACL, 6253–6264.
  106. Are Sixteen Heads Really Better than One?. In NeurIPS 2019, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (Eds.). 14014–14024.
  107. FP8 Formats for Deep Learning. arXiv (2022).
  108. Matan Ben Noach and Yoav Goldberg. 2020. Compressing Pre-trained Language Models by Matrix Decomposition. In IJCNLP 2020, Kam-Fai Wong, Kevin Knight, and Hua Wu (Eds.). ACL, 884–889.
  109. Gradient-Free Structured Pruning with Unlabeled Data. In ICML (PMLR, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). 26326–26341.
  110. A Decomposable Attention Model for Natural Language Inference. In EMNLP 2016, Jian Su, Xavier Carreras, and Kevin Duh (Eds.). ACL.
  111. Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models. In ICLR.
  112. ALP-KD: Attention-Based Layer Projection for Knowledge Distillation. In AAAI 2021. 13657–13665.
  113. SensiMix: Sensitivity-Aware 8-bit index & 1-bit value mixed precision quantization for BERT compression. PLOS ONE 17, 4 (2022).
  114. BiBERT: Accurate Fully Binarized BERT. In ICLR 2022.
  115. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR 21 (2020), 140:1–67.
  116. Know What You Don’t Know: Unanswerable Questions for SQuAD. In ACL 2018, Iryna Gurevych and Yusuke Miyao (Eds.). 784–789.
  117. SQuAD: 100, 000+ Questions for Machine Comprehension of Text. In EMNLP 2016, Jian Su, Xavier Carreras, and Kevin Duh (Eds.). ACL.
  118. Subformer: Exploring Weight Sharing for Parameter Efficiency in Generative Transformers. In EMNLP 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). ACL, 4081–4090.
  119. On the effect of dropping layers of pre-trained transformer models. CSL 77 (2023), 101429.
  120. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv (2019).
  121. Movement Pruning: Adaptive Sparsity by Fine-Tuning. In NeurIPS 2020, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.).
  122. Winning the Lottery with Continuous Sparsification. In NeurIPS, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.).
  123. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv (2022).
  124. Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need. arXiv (2019).
  125. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT. In AAAI 2020. 8815–8821.
  126. The Evolved Transformer. In ICML 2019 (PMLR, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). 5877–5886.
  127. Training with Quantization Noise for Extreme Model Compression. In ICLR 2021.
  128. A Simple and Effective Pruning Approach for Large Language Models. arXiv (2023).
  129. Patient Knowledge Distillation for BERT Model Compression. In IJCNLP 2019, Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). ACL, 4322–4331.
  130. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices. In ACL 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.). 2158–2170.
  131. Sho Takase and Shun Kiyono. 2021. Lessons on Parameter Sharing across Layers in Transformers. arXiv (2021).
  132. MKQ-BERT: Quantized BERT with 4-bits Weights and Activations. arXiv (2022).
  133. Structured Pruning for Efficient Generative Pre-trained Language Models. In Findings of ACL 2023, Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (Eds.). ACL, 10880–10895.
  134. Compression of Generative Pre-trained Language Models via Quantization. In ACL 2022, Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). 4821–4836.
  135. LLaMA: Open and Efficient Foundation Language Models. arXiv (2023).
  136. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv (2023).
  137. DyLoRA: Parameter-Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation. In EACL 2023, Andreas Vlachos and Isabelle Augenstein (Eds.). ACL, 3266–3279.
  138. Attention is All you Need. In NeurIPS 2017, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.).
  139. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In ICLR 2019.
  140. Exploring extreme parameter compression for pre-trained language models. In ICLR 2022.
  141. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing. In ACL 2020. 7675–7688.
  142. Linformer: Self-Attention with Linear Complexity. arXiv (2020).
  143. MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers. In IJCNLP 2021 (ACL), Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). 2140–2151.
  144. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. In NeurIPS 2020, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.).
  145. Structured Pruning of Large Language Models. In EMNLP 2020, Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). ACL, 6151–6162.
  146. Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models. In NeurIPS.
  147. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In NAACL-HLT 2018, Marilyn A. Walker, Heng Ji, and Amanda Stent (Eds.). ACL, 1112–1122.
  148. One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers. In IJCNLP 2021 (ACL), Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). 4408–4413.
  149. Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation. arXiv (2020).
  150. Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases. arXiv (2023).
  151. ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats. arXiv (2023).
  152. Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers. In EMNLP 2020, Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). ACL, 1016–1021.
  153. Universal-KD: Attention-based Output-Grounded Intermediate Layer Knowledge Distillation. In EMNLP 2021. ACL, 7649–7661.
  154. Structured Pruning Learns Compact and Accurate Models. In ACL 2022, Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). 1513–1528.
  155. Tied Transformers: Neural Machine Translation with Shared Encoder and Decoder. In AAAI 2019. 5466–5473.
  156. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. In ICML 2023 (PMLR, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). 38087–38099.
  157. NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search. In KDD 2021, Feida Zhu, Beng Chin Ooi, and Chunyan Miao (Eds.). ACM, 1933–1943.
  158. From Dense to Sparse: Contrastive Pruning for Better Pre-trained Language Model Compression. In AAAI 2022. AAAI Press, 11547–11555.
  159. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. In NeurIPS.
  160. ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation. arXiv (2023).
  161. LEAP: Learnable Pruning for Transformer-based Models. arXiv (2022).
  162. Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity. arXiv (2023).
  163. AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models. In IJCNLP 2021, Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). ACL, 5146–5157.
  164. RPTQ: Reorder-based Post-training Quantization for Large Language Models. arXiv (2023).
  165. NUPES: Non-Uniform Post-Training Quantization via Power Exponent Search. arXiv (2023).
  166. PowerQuant: Automorphism Search for Non-Uniform Quantization. In ICLR 2023.
  167. GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference. In MICRO 2020. IEEE, 811–824.
  168. Q8BERT: Quantized 8Bit BERT. In NeurIPS 2019. 36–39.
  169. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. In ACL 2022. ACL.
  170. LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning. arXiv (2023).
  171. Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning. arXiv (2023).
  172. OPT: Open Pre-trained Transformer Language Models. arXiv (2022).
  173. TernaryBERT: Distillation-aware Ultra-low Bit BERT. In EMNLP 2020, Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). ACL, 509–521.
  174. An Investigation on Different Underlying Quantization Schemes for Pre-trained Language Models. In NLPCC 2020 (LNCS, Vol. 12430), Xiaodan Zhu, Min Zhang, Yu Hong, and Ruifang He (Eds.). Springer, 359–371.
Authors (4)
  1. Seungcheol Park (5 papers)
  2. Jaehyeon Choi (3 papers)
  3. Sojin Lee (5 papers)
  4. U Kang (43 papers)
Citations (9)