Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information (2405.17470v1)

Published 24 May 2024 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs have significantly advanced natural language processing tasks such as machine translation, text generation, and sentiment analysis. However, their large size, often consisting of billions of parameters, poses challenges for storage, computation, and deployment, particularly in resource-constrained environments like mobile devices and edge computing platforms. Effective compression and quantization techniques are crucial for addressing these issues, reducing memory footprint and computational requirements without significantly compromising performance. Traditional methods that uniformly map parameters to compressed spaces fail to account for the uneven distribution of parameters, leading to substantial accuracy loss. In this work, we propose Athena, a novel algorithm for efficient block-wise post-training quantization of LLMs. Athena leverages Second-Order Matrix Derivative Information to guide the quantization process using the curvature information of the loss landscape. By grouping parameters by columns or rows and iteratively optimizing the quantization process, Athena updates the model parameters and Hessian matrix to achieve significant compression while maintaining high accuracy. This makes Athena a practical solution for deploying LLMs in various settings.
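The abstract outlines a Hessian-guided, column-wise quantization loop: quantize one group of weights at a time, then use second-order (curvature) information to update the remaining parameters before continuing. Below is a minimal Python sketch of that general idea, in the spirit of OBQ/GPTQ-style error compensation; the function names, the round-to-nearest scheme, the damping constant, and the toy shapes are illustrative assumptions, not Athena's actual algorithm.

```python
import numpy as np

def quantize_rtn(w, n_bits=4):
    """Symmetric round-to-nearest quantization of a weight column (assumed scheme)."""
    scale = np.max(np.abs(w)) / (2 ** (n_bits - 1) - 1) + 1e-12
    return np.round(w / scale) * scale

def blockwise_quantize(W, H, n_bits=4):
    """Quantize W (d_out x d_in) one input column at a time.

    H is a (d_in x d_in) Hessian proxy, e.g. H = X.T @ X built from calibration
    activations X. After quantizing column j, the not-yet-quantized columns are
    adjusted to compensate the induced error via the inverse Hessian.
    """
    W = W.astype(np.float64).copy()
    H_inv = np.linalg.inv(H + 1e-6 * np.eye(H.shape[0]))  # damped for numerical stability
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        Q[:, j] = quantize_rtn(W[:, j], n_bits)
        err = (W[:, j] - Q[:, j]) / H_inv[j, j]
        # Propagate the quantization error to the remaining columns.
        W[:, j + 1:] -= np.outer(err, H_inv[j, j + 1:])
    return Q

# Toy usage with random calibration data (shapes are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 16))   # calibration activations: samples x d_in
W = rng.normal(size=(64, 16))    # layer weights: d_out x d_in
H = X.T @ X                      # second-order (curvature) proxy
Q = blockwise_quantize(W, H, n_bits=4)
print("output reconstruction error:", np.linalg.norm(X @ W.T - X @ Q.T))
```

In the method the abstract describes, parameters can be grouped by rows or columns and the Hessian itself is updated as blocks are quantized; the sketch only illustrates the basic column-wise compensation step.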

Authors (3)
  1. Yanshu Wang (6 papers)
  2. Wenyang He (4 papers)
  3. Tong Yang (154 papers)
