A Survey on Model Compression for Large Language Models (2308.07633v4)

Published 15 Aug 2023 in cs.CL and cs.AI

Abstract: LLMs have transformed natural language processing tasks successfully. Yet, their large size and high computational needs pose challenges for practical use, especially in resource-limited settings. Model compression has emerged as a key research area to address these challenges. This paper presents a survey of model compression techniques for LLMs. We cover methods like quantization, pruning, and knowledge distillation, highlighting recent advancements. We also discuss benchmarking strategies and evaluation metrics crucial for assessing compressed LLMs. This survey offers valuable insights for researchers and practitioners, aiming to enhance efficiency and real-world applicability of LLMs while laying a foundation for future advancements.

Overview of Model Compression for LLMs

This paper provides a comprehensive survey on the methodologies specifically devised for the compression of LLMs. It thoroughly investigates different techniques such as quantization, pruning, and knowledge distillation, which aim to mitigate the significant computational demands and storage requirements characteristic of LLMs. The survey further introduces a nuanced taxonomy that organizes these methodologies, offering insights into recent successes and emerging approaches in LLM compression.

Model Compression Techniques Discussed in the Paper

  1. Pruning: The paper covers both unstructured and structured pruning. Unstructured pruning removes individual weights and achieves substantial size reduction with minimal performance decline, as demonstrated by methods such as SparseGPT and Wanda (a minimal magnitude-pruning sketch follows this list). Structured pruning removes entire structural components such as neurons or channels; it also shows promise, though more work is needed to adapt these methods to the particular characteristics of LLMs.
  2. Knowledge Distillation (KD): The paper classifies KD into white-box and black-box categories, detailing white-box approaches that exploit access to the teacher model's parameters or hidden states, and black-box methods that transfer emergent abilities such as in-context learning and chain-of-thought reasoning using only the teacher's outputs.
  3. Quantization: This technique is categorized into Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ). Both approaches reduce the precision of model weights and activations; the survey discusses notable progress in both 8-bit and lower-bit quantization through methods such as QLoRA and GPTQ (a simple PTQ sketch appears after this list).
  4. Low-Rank Factorization: Although less prevalent, low-rank factorization is explored for its efficacy in compressing models without considerable performance compromise. The TensorGPT project exemplifies this approach with promising results on embedding layers.

Evaluation and Benchmarking

The effectiveness of compressed LLMs is assessed with metrics such as model size, number of parameters, inference time, and floating-point operations (FLOPs). Widely used NLP datasets such as GLUE and LAMBADA, along with task-specific benchmarks like BIG-Bench and unseen-instruction datasets, support benchmarking and enable thorough performance comparisons against uncompressed models.

Impact and Implications

Model compression holds strong promise for making LLMs more accessible and feasible to deploy in resource-constrained environments. The survey points researchers towards important insights and future opportunities, emphasizing the need for better performance-size trade-offs and more dynamically adaptive model architectures, potentially guided by methods like Neural Architecture Search (NAS).

Furthermore, the survey acknowledges the emerging need for explainable model compression, particularly pivotal for understanding and validating the impact of changes on LLMs' performance and ensuring their reliability in practical applications.

Concluding Remarks

The document serves as a valuable reference for scholars and practitioners navigating the complex landscape of LLM compression. By charting current methodologies and outlining future research directions, it lays a comprehensive foundation for advancing LLM efficiency while maintaining formidable natural language processing capabilities. The survey encourages continued exploration of LLMs' inherent challenges, promoting eco-friendly and inclusive AI development and deployment.

Authors (5)
  1. Xunyu Zhu
  2. Jian Li
  3. Yong Liu
  4. Can Ma
  5. Weiping Wang
Citations (129)