Attention-aware Post-training Quantization without Backpropagation (2406.13474v1)

Published 19 Jun 2024 in cs.LG and cs.AI

Abstract: Quantization is a promising solution for deploying large-scale LLMs on resource-constrained devices. Existing quantization approaches, however, rely on gradient-based optimization, whether post-training quantization (PTQ) or quantization-aware training (QAT), which becomes problematic for hyper-scale LLMs with billions of parameters. This overhead can be alleviated via recently proposed backpropagation-free PTQ methods; however, their performance is limited because they do not account for inter-layer dependencies. In this paper, we therefore propose a novel PTQ algorithm that considers inter-layer dependencies without relying on backpropagation. The key idea is the development of attention-aware Hessian matrices, which facilitate the consideration of inter-layer dependencies within the attention module. Extensive experiments demonstrate that the proposed algorithm significantly outperforms conventional PTQ methods, particularly at low bit-widths.
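
For readers unfamiliar with the setting, here is a minimal sketch (assumed background, not taken from the abstract) of the layer-wise objective that backpropagation-free PTQ methods typically minimize, where $W$ are the full-precision weights of one linear layer, $\widehat{W}$ their quantized counterpart, and $X$ the calibration inputs to that layer:

```latex
% Per-layer proxy objective solved by conventional backprop-free PTQ:
% find quantized weights that best reconstruct the layer's outputs
% on calibration data, with curvature captured by a local Hessian.
\min_{\widehat{W}} \; \bigl\| W X - \widehat{W} X \bigr\|_F^2,
\qquad
H = X X^{\top}
```

The Hessian $H = X X^{\top}$ reflects only a single layer's input statistics, which is the lack of inter-layer dependencies the abstract refers to; the proposed attention-aware Hessian presumably measures reconstruction error at the level of the attention module instead, so that the query, key, value, and output projections are quantized against a shared objective (see the paper for the exact formulation).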

Authors (6)
  1. Junhan Kim (42 papers)
  2. Ho-young Kim (8 papers)
  3. Eulrang Cho (4 papers)
  4. Chungman Lee (3 papers)
  5. Joonyoung Kim (6 papers)
  6. Yongkweon Jeon (8 papers)