Post Training Quantization of Large Language Models with Microscaling Formats (2405.07135v3)

Published 12 May 2024 in cs.LG and cs.AI

Abstract: LLMs have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges. This paper explores the potential of quantization to mitigate these challenges. We systematically study the combined application of three well-known post-training techniques, SmoothQuant, AWQ, and GPTQ, and provide a comprehensive analysis of their interactions and implications for advancing LLM quantization. We enhance the versatility of these methods by enabling quantization to microscaling (MX) formats, extending the applicability of these PTQ algorithms beyond their original fixed-point format targets. We show that combining different PTQ methods enables us to quantize models to 4-bit weights and 8-bit activations using the MXINT format with negligible accuracy loss compared to the uncompressed baseline.
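
To make the MX idea concrete, below is a minimal, illustrative sketch of block-wise MXINT-style quantization in Python: each block of elements shares a single power-of-two scale, and the elements are stored as low-bit signed integers. The block size of 32, the 4-bit element width, and the function name are assumptions for illustration only; this is not the paper's reference implementation.

```python
import numpy as np

def mxint_quantize(x, block_size=32, element_bits=4):
    """Illustrative block-wise MXINT-style quantization (hypothetical sketch).

    Each block of `block_size` elements shares one power-of-two scale;
    elements are rounded to signed integers with `element_bits` bits.
    """
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    qmax = 2 ** (element_bits - 1) - 1          # e.g. 7 for 4-bit signed ints
    max_abs = np.max(np.abs(blocks), axis=1, keepdims=True)
    max_abs = np.where(max_abs == 0, 1.0, max_abs)  # avoid log2(0) on all-zero blocks

    # Shared power-of-two scale per block, chosen so the largest element fits in range.
    scale = 2.0 ** np.ceil(np.log2(max_abs / qmax))
    q = np.clip(np.round(blocks / scale), -qmax - 1, qmax)

    dequant = (q * scale).reshape(-1)[: len(x)]
    return q, scale, dequant

if __name__ == "__main__":
    w = np.random.randn(128).astype(np.float32)
    q, scale, w_hat = mxint_quantize(w, block_size=32, element_bits=4)
    print("max abs quantization error:", np.max(np.abs(w - w_hat)))
```

The key design point the sketch highlights is that the scale is shared across a small block rather than a whole tensor, which bounds the quantization error per block; methods such as SmoothQuant, AWQ, and GPTQ then adjust the values being quantized so that this low-bit representation loses less accuracy.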

Authors (6)
  1. Sayeh Sharify (14 papers)
  2. Zifei Xu (7 papers)
  3. Wanzin Yazar (5 papers)
  4. Xin Wang (1306 papers)
  5. Utkarsh Saxena (7 papers)
  6. Ilya Soloveychik (29 papers)
Citations (1)