
Additive Quantization for Language Models (AQLM)

Updated 31 December 2025
  • The paper presents AQLM, which generalizes classic additive quantization to LLMs, achieving Pareto-optimal trade-offs at 2–3 bits per parameter.
  • It employs input-adaptive code assignment and joint block-wise optimization via an EM-style process to minimize output distortion during calibration.
  • The method enables up to 8× model size reduction with notable speedups on both GPU and CPU, facilitating efficient on-device inference.

Additive Quantization for LLMs (AQLM) is an advanced post-training compression technique developed for extreme quantization of LLMs, such as transformer-based architectures. AQLM generalizes the classic Additive Quantization (AQ) approach, traditionally used in information retrieval, to quantize weight matrices in LLMs to exceptionally low bit-counts—specifically targeting the 2 to 3 bits-per-parameter regime. By integrating input-adaptive code assignment and joint block-wise optimization of quantization parameters, AQLM achieves Pareto-optimal trade-offs between accuracy and model size, making it practical for deployment on resource-constrained devices (Egiazarian et al., 2024).

1. Formal Problem Statement

AQLM addresses the problem of compressing pretrained transformer LLMs by replacing the floating-point weight matrices $W \in \mathbb{R}^{d_{out} \times d_{in}}$ with quantized approximations $\hat{W}$ using only $B$ bits per parameter, with the principal focus on $B \approx 2\dots 3$. This compression yields up to an $8\times$ reduction in model size compared to FP16 baselines.

Classic AQ encodes groups of model weights as sums of $M$ codebook vectors (centroids) chosen from learned codebooks $\{C^{(m)}\}_{m=1}^M$, with assignments governed by one-hot vectors $b^{(m)}$. Row $w$ is approximated as $w \approx \sum_{m=1}^M C^{(m)} b^{(m)}$, and the total bit cost is determined by codebook size and group granularity. The AQ layer-level reconstruction objective is:

$$E_Q(C, b) = \sum_{i=1}^{d_{out}} \left\| w_i - \sum_{m=1}^M C^{(m)} b_i^{(m)} \right\|_2^2.$$

AQLM reframes the objective to preserve layer outputs on a calibration set:

$$\| W X - \hat{W} X \|_F^2 = \left\| \left( W - \sum_{m=1}^M C^{(m)} b^{(m)} \right) X \right\|_F^2,$$

where $X$ is a matrix of calibration inputs.
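
For concreteness, the sketch below is a minimal NumPy illustration (not the reference implementation); it assumes one group per output row, i.e., group size equal to $d_{in}$, and uses random codes and codebooks purely to show the bookkeeping. It reconstructs $\hat{W}$ as a sum of codebook entries per row and evaluates both the weight-level objective $E_Q$ and the output-preserving objective on calibration inputs $X$.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration
d_in, d_out, n_calib = 64, 32, 128
M, K = 2, 256                                  # codebooks and entries per codebook

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))             # original FP weights
X = rng.normal(size=(d_in, n_calib))           # calibration inputs
codebooks = rng.normal(size=(M, K, d_in))      # C^(m): K centroids per codebook
codes = rng.integers(0, K, size=(d_out, M))    # b^(m): one index per row and codebook

# Reconstruct each row as a sum of M codebook vectors: w_i ~ sum_m C^(m) b_i^(m)
W_hat = codebooks[np.arange(M)[None, :], codes].sum(axis=1)

# Weight-level AQ objective E_Q versus the output-preserving AQLM objective
e_weights = np.sum((W - W_hat) ** 2)
e_outputs = np.linalg.norm(W @ X - W_hat @ X, ord="fro") ** 2
print(f"E_Q = {e_weights:.2f}, ||WX - W_hat X||_F^2 = {e_outputs:.2f}")
```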

2. Algorithmic Innovations

AQLM advances AQ via two central mechanisms:

  • Input-adaptive quantization: Code assignments $b$ are data-aware, chosen to minimize output distortion for a specific set of calibration inputs $X$ rather than purely weight-level reconstruction.
  • Joint block-wise codebook optimization: Quantization errors from multiple linear layers in a transformer block are addressed collectively by fine-tuning codebooks, scaling parameters $s$, and the remaining small parameters $\theta$ to minimize output mismatch at the block level.

The loss for block-level optimization is:

$$L_{block} = \| F_{block}(X) - \hat{F}_{block}(X; C, b, s) \|_F^2$$

and for the full model,

$$L_{AQLM} = \sum_{\ell} \left\| F^{(\ell)}(X^{(\ell)}) - \hat{F}^{(\ell)}(X^{(\ell)}; C^{(\ell)}, b^{(\ell)}, s^{(\ell)}) \right\|_F^2 + \lambda \sum_{\ell} \left\| W^{(\ell)} - \sum_{m=1}^M C^{(\ell,m)} b^{(\ell,m)} \right\|_F^2,$$

where $\lambda$ is typically small or zero.

Optimization proceeds via an EM-style process:

  • E-step: Updates code assignments $b$ via beam search in a Markov Random Field (MRF) formulation, leveraging precomputed Gram matrices.
  • M-step: Refines codebooks $C$, scales $s$, and the small non-quantized parameters $\theta$ using the Adam optimizer.

Pseudo-code succinctly describes calibrating a block:

| Step | Description |
|---|---|
| Codebook initialization | Residual K-means on weight matrix rows |
| Gram matrix | Precompute $G = X X^\top$ |
| E-step | Beam-search code assignment per output unit/group |
| M-step | Adam updates on $C$, $s$, $\theta$ to minimize the block loss |
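
A minimal sketch of this alternating procedure is given below. It is illustrative only: it assumes one group per output row, replaces the beam-search E-step over the MRF with a greedy per-codebook coordinate update, and substitutes a plain gradient step for the Adam-based M-step on $(C, s, \theta)$; all sizes and names are hypothetical.

```python
import numpy as np

def calibrate_layer(W, X, M=2, K=16, iters=5, lr=1e-2, seed=0):
    """Toy AQLM-style alternating optimization for one linear layer (a sketch,
    not the paper's algorithm): greedy E-step instead of beam search, plain
    gradient descent instead of Adam, one weight group per output row."""
    rng = np.random.default_rng(seed)
    d_out, d_in = W.shape
    C = rng.normal(scale=W.std(), size=(M, K, d_in))   # codebooks C^(m)
    codes = rng.integers(0, K, size=(d_out, M))        # assignments b^(m)

    def reconstruct():
        # Sum the selected entry from each codebook for every output row
        return C[np.arange(M)[None, :], codes].sum(axis=1)

    for _ in range(iters):
        # E-step: for each row and codebook, pick the entry minimizing the
        # output error ||(w_i - w_hat_i) X||^2 with the other codes fixed.
        for i in range(d_out):
            for m in range(M):
                others = sum(C[mm, codes[i, mm]] for mm in range(M) if mm != m)
                residual = W[i] - others
                errs = np.linalg.norm((residual[None, :] - C[m]) @ X, axis=1)
                codes[i, m] = int(np.argmin(errs))

        # M-step: one gradient step on the codebooks against the output loss.
        W_hat = reconstruct()
        grad_W_hat = 2.0 * (W_hat - W) @ X @ X.T   # d/dW_hat of ||(W_hat - W)X||_F^2
        for m in range(M):
            np.add.at(C[m], codes[:, m], -lr * grad_W_hat)

    return C, codes, reconstruct()
```

In this simplified form, each E-step sweep cannot increase the output error of the row being updated, while the M-step nudges the selected codebook entries toward reducing $\|(\hat{W} - W)X\|_F^2$.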

3. Theoretical Analysis

  • Reconstruction Bounds: For $M$ codebooks of size $K$ and assignments minimizing MSE,

$$\mathbb{E}\left[\left\| w - \sum_m C^{(m)} b^{(m)} \right\|_2^2\right] \leq c(M,K)\, \mathbb{E}\left[\| w - w' \|_2^2\right]$$

with $w'$ the closest of $MK$ prototypes and $c(M,K) \to 0$ as $M, K$ grow (cf. Babenko & Lempitsky, 2014). Empirically, sub-percent layer-level error arises with $B = 2\dots 3$, $M \approx 2$.

  • Pareto Frontier: For LLaMA 2-7B on WikiText-2,
    • FP16 baseline: 5.12 PPL @ 16 bits/param
    • QuIP# (2-bit): 8.22 PPL @ 2.02 bits/param
    • AQLM (2-bit): 6.64 PPL @ 2.02 bits/param

In terms of perplexity versus model size, AQLM is strictly Pareto-optimal relative to all prior 2-bit methods, and it also outperforms certain higher-bit baselines (e.g., 4-bit GPTQ) on smaller models.

4. Implementation and Empirical Performance

AQLM supports high-throughput inference on both GPU and CPU via codebook lookup tables:

  • GPU kernel: Precomputes $M \times K$ lookups for each group, with $O(M)$ additions per group; achieves $\sim 1.2\times$ FP16 speed on an RTX 3090 (LLaMA 2-70B).
  • CPU kernel: Splits each 16-bit codebook into 8-bit sub-codebooks so lookups reside in L1/L2 cache; up to $4\times$ FP32 speedup on a 16-core Intel i9.

Model footprint is reduced by $8\times$ at 2 bits/parameter relative to FP16, while maintaining or exceeding inference speed.
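
The kernel idea can be illustrated with the following toy matrix–vector routine (a sketch under assumed shapes and names, not the actual GPU or CPU kernel): for each group of input features, the dot products of all $K$ codebook entries with that input slice are tabulated once, after which every output element needs only $M$ table lookups and additions per group.

```python
import numpy as np

def aqlm_style_matvec(codes, codebooks, scales, x):
    """Toy lookup-table matvec illustrating the kernel idea.

    codes:     (d_out, n_groups, M) integer indices into the codebooks
    codebooks: (M, K, g) centroids of group size g
    scales:    (d_out,) per-row scales s
    x:         (d_in,) input vector, with d_in = n_groups * g
    """
    d_out, n_groups, M = codes.shape
    g = codebooks.shape[-1]
    x_groups = x.reshape(n_groups, g)

    # Per-group lookup tables: dot product of every codebook entry with the
    # matching input slice -> shape (n_groups, M, K).
    luts = np.einsum("mkg,ng->nmk", codebooks, x_groups)

    # Each output element is a sum of M table lookups per group, times a scale.
    y = np.empty(d_out)
    for i in range(d_out):
        acc = 0.0
        for n in range(n_groups):
            for m in range(M):
                acc += luts[n, m, codes[i, n, m]]
        y[i] = scales[i] * acc
    return y
```

Because the per-group tables depend only on the input, their cost is amortized across all $d_{out}$ output rows, which is what keeps the low-bit format competitive with FP16 despite the extra indirection.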

Summary of token generation rates:

| Device | FP16 | AQLM (2-bit) | AQLM (2×8-bit) |
|---|---|---|---|
| RTX 3090, LLaMA-2 7B | 41.5 tok/s | 32.2 tok/s | 32.6 tok/s |
| Intel i9, LLaMA-2 7B | 3.1 tok/s | 7.0 tok/s | 6.8 tok/s |

5. Compression–Accuracy Trade-Offs and Calibration

Several operational variables modulate AQLM’s effectiveness:

  • Calibration set size: Gains saturate around 2,000 calibration sequences (useful range: 512–4,096).
  • Codebook number $M$, bits $B$, group size $g$: Increasing $M$ improves accuracy at a fixed $B \cdot M / g$ budget but incurs higher E-step computational cost.
  • Block-wise fine-tuning: 100–300 Adam steps increase calibration time by 10–30% while securing a 5–10% PPL reduction.

AQLM thus generalizes well with modest one-shot calibration effort, especially compared to direct PTQ methods.
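
As a rough illustration of how the bit budget arises (an assumed accounting, not the paper's exact formula), the snippet below counts the code indices plus amortized FP16 codebook entries and per-row scales for a hypothetical 4096×4096 layer:

```python
import math

def bits_per_param(M, K, g, d_in, d_out, scale_bits=16):
    """Approximate storage cost per weight for a multi-codebook scheme:
    code indices, amortized FP16 codebook entries, and per-row scales."""
    n_groups = d_out * (d_in // g)
    code_bits = n_groups * M * math.log2(K)   # indices b^(m), log2(K) bits each
    codebook_bits = M * K * g * 16            # FP16 codebook entries
    scale_bits_total = d_out * scale_bits     # per-row scales s
    return (code_bits + codebook_bits + scale_bits_total) / (d_out * d_in)

# Example: 2 codebooks with 2^8 entries over groups of 8 weights (a "2x8" setup)
print(f"{bits_per_param(M=2, K=256, g=8, d_in=4096, d_out=4096):.2f} bits/param")
# -> roughly 2 bits per parameter once codebooks and scales are amortized
```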

6. Limitations and Future Prospects

AQLM is the first post-training quantization scheme to reach Pareto optimality below 3 bits/parameter on open LLMs, yielding state-of-the-art perplexity and zero-shot accuracy in the extreme quantization regime. Noted limitations:

  • Calibration cost (beam-search E-step) exceeds direct PTQ approaches (e.g., RTN/GPTQ), but remains practical for one-shot application.
  • Homogeneous codebooks: Current versions employ fixed codebook architectures; incorporating sparsity or non-uniform (layer-dependent) bit allocation could permit further gains.
  • Activation quantization: Extending AQLM to quantized activation flows (quantization-aware inference) could push bit-efficiency below 2 bits.

These results indicate the scalable adaptation of multi-codebook quantization for extreme LLM compression, facilitating efficient on-device inference at high fidelity (Egiazarian et al., 2024).

References

Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., and Alistarh, D. (2024). Extreme Compression of Large Language Models via Additive Quantization.
