
WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More (2402.12065v2)

Published 19 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs face significant deployment challenges due to their substantial memory requirements and the computational demands of the autoregressive text generation process. This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers. We critically analyze existing quantization approaches, identifying their limitations in balancing the accuracy and efficiency of quantized LLMs. To move beyond these limitations, we propose WKVQuant, a post-training quantization (PTQ) framework designed specifically for quantizing the weights and the key/value (KV) cache of LLMs. Specifically, we incorporate past-only quantization to improve the computation of attention. Additionally, we introduce a two-dimensional quantization strategy to handle the distribution of the KV cache, along with a cross-block reconstruction regularization for parameter optimization. Experiments show that WKVQuant achieves memory savings almost comparable to weight-activation quantization while approaching the performance of weight-only quantization.
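The abstract names two mechanisms: past-only quantization (only the cached past keys/values are quantized, while the current token's key/value participate in attention at full precision) and a two-dimensional quantization strategy for the KV cache. Below is a minimal PyTorch sketch of how these ideas might fit together in a single decoding step. The function names, tensor shapes, the min-max fake quantizer, and the per-channel-then-per-token ordering are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def two_dim_quant(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Illustrative "two-dimensional" fake quantization of a KV tensor.

    x: (heads, T, d). Channels are smoothed first (statistics taken along
    the token axis), then asymmetric min-max quantization is applied per
    token. This is an assumed reading of the strategy, not the paper's code.
    """
    # Channel dimension: divide each channel by its max magnitude over tokens.
    ch_scale = x.abs().amax(dim=-2, keepdim=True).clamp(min=1e-8)
    xs = x / ch_scale
    # Token dimension: asymmetric min-max quantization over the channel axis.
    qmax = 2 ** n_bits - 1
    lo = xs.amin(dim=-1, keepdim=True)
    hi = xs.amax(dim=-1, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / qmax
    q = ((xs - lo) / scale).round().clamp(0, qmax)
    return (q * scale + lo) * ch_scale  # dequantized back to float

def past_only_attention(q_t, k_cache, v_cache, k_t, v_t, n_bits=4):
    """One decoding step with past-only quantization (hypothetical sketch).

    Only the cached past keys/values are (fake-)quantized; the current
    token's key/value enter attention at full precision and would be
    quantized only when appended to the cache.
    q_t, k_t, v_t: (heads, 1, d); k_cache, v_cache: (heads, T, d).
    """
    k = torch.cat([two_dim_quant(k_cache, n_bits), k_t], dim=1)
    v = torch.cat([two_dim_quant(v_cache, n_bits), v_t], dim=1)
    d = q_t.shape[-1]
    attn = torch.softmax(q_t @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v

# Tiny smoke test with random tensors.
heads, T, d = 8, 16, 64
q_t, k_t, v_t = (torch.randn(heads, 1, d) for _ in range(3))
k_cache, v_cache = torch.randn(heads, T, d), torch.randn(heads, T, d)
out = past_only_attention(q_t, k_cache, v_cache, k_t, v_t)
print(out.shape)  # torch.Size([8, 1, 64])
```

The point of keeping the current token's key/value unquantized is that they would otherwise suffer quantization error in the very step where they matter most; deferring their quantization until they become "past" entries trades a negligible amount of memory for accuracy in the attention computation.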

Authors (6)
  1. Yuxuan Yue (4 papers)
  2. Zhihang Yuan (45 papers)
  3. Haojie Duanmu (5 papers)
  4. Sifan Zhou (24 papers)
  5. Jianlong Wu (38 papers)
  6. Liqiang Nie (191 papers)
Citations (25)