Additive Quantization for Language Models (AQLM)
- The paper presents AQLM, which generalizes classic additive quantization to LLMs, achieving Pareto-optimal trade-offs at 2–3 bits per parameter.
- It employs input-adaptive code assignment and joint block-wise optimization via an EM-style process to minimize output distortion during calibration.
- The method enables up to 8× model size reduction with notable speedups on both GPU and CPU, facilitating efficient on-device inference.
Additive Quantization for LLMs (AQLM) is an advanced post-training compression technique developed for extreme quantization of LLMs, such as transformer-based architectures. AQLM generalizes the classic Additive Quantization (AQ) approach, traditionally used in information retrieval, to quantize weight matrices in LLMs to exceptionally low bit-counts—specifically targeting the 2 to 3 bits-per-parameter regime. By integrating input-adaptive code assignment and joint block-wise optimization of quantization parameters, AQLM achieves Pareto-optimal trade-offs between accuracy and model size, making it practical for deployment on resource-constrained devices (Egiazarian et al., 2024).
1. Formal Problem Statement
AQLM addresses the problem of compressing pretrained transformer LLMs by replacing the floating-point weight matrices with quantized approximations using only a few bits per parameter, with the principal focus on the 2–3 bit regime. This compression yields up to an 8× reduction in model size compared to FP16 baselines.
Classic AQ encodes groups of $g$ consecutive weights as sums of codebook vectors (centroids) chosen from $M$ learned codebooks $C_1,\dots,C_M \in \mathbb{R}^{g \times 2^B}$, with assignments governed by one-hot code vectors $b_{i,j,m} \in \{0,1\}^{2^B}$. The $j$-th group of row $W_i$ is approximated as $\sum_{m=1}^{M} C_m b_{i,j,m}$, and the total bit cost (roughly $MB/g$ bits per weight plus codebook storage) is determined by the codebook size $2^B$, the number of codebooks $M$, and the group granularity $g$. The AQ layer-level reconstruction objective is:

$$\arg\min_{C,\,b}\;\Big\lVert W - \sum_{m=1}^{M} C_m b_m \Big\rVert_2^2,$$

where $\sum_m C_m b_m$ is shorthand for the group-wise reconstruction above.
AQLM reframes the objective to preserve layer outputs on a calibration set:

$$\arg\min_{C,\,b}\;\Big\lVert W X - \Big(\sum_{m=1}^{M} C_m b_m\Big) X \Big\rVert_2^2,$$

where $X \in \mathbb{R}^{d_{\mathrm{in}} \times n}$ is a matrix of calibration inputs.
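For concreteness, the following NumPy sketch reconstructs a weight matrix from additive codes and evaluates both the weight-level (AQ) and output-level (AQLM) objectives; all shapes, sizes, and the helper `reconstruct_weights` are illustrative assumptions rather than the paper's reference implementation.

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's configuration):
# a d_out x d_in weight matrix, groups of g consecutive input weights,
# and M additive codebooks with 2**B centroids each.
d_out, d_in, g, M, B = 64, 128, 8, 2, 8
num_groups = d_in // g
K = 2 ** B

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)).astype(np.float32)      # original weights
X = rng.standard_normal((d_in, 256)).astype(np.float32)        # calibration inputs
codebooks = rng.standard_normal((M, K, g)).astype(np.float32)  # C_1 ... C_M
codes = rng.integers(0, K, size=(d_out, num_groups, M))        # argmax of one-hot b

def reconstruct_weights(codebooks, codes):
    """Each group of g weights is the sum of its M selected codebook vectors."""
    # codebooks[m, codes[:, :, m]] has shape (d_out, num_groups, g) for each m.
    groups = sum(codebooks[m, codes[:, :, m]] for m in range(M))
    return groups.reshape(d_out, num_groups * g)

W_hat = reconstruct_weights(codebooks, codes)

aq_loss = np.linalg.norm(W - W_hat) ** 2            # classic AQ: weight reconstruction
aqlm_loss = np.linalg.norm(W @ X - W_hat @ X) ** 2  # AQLM: output reconstruction on X
print(aq_loss, aqlm_loss)
```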
2. Algorithmic Innovations
AQLM advances AQ via two central mechanisms:
- Input-adaptive quantization: Code assignments are data-aware, chosen to minimize output distortion for a specific set of calibration inputs rather than purely weight-level reconstruction.
- Joint block-wise codebook optimization: Quantization errors from multiple linear layers in a transformer block are addressed collectively by fine-tuning codebooks $C_m$, scaling parameters $s$, and remaining small non-quantized parameters $\theta$ to minimize output mismatch at the block level.
The loss for block-level optimization is:

$$\mathcal{L}_{\text{block}} = \big\lVert B(X) - \hat{B}(X)\big\rVert_2^2,$$

and for the full model,

$$\mathcal{L}_{\text{model}} = \big\lVert F(X) - \hat{F}(X)\big\rVert_2^2,$$

where $B$ and $F$ denote the original transformer block and model, $\hat{B}$ and $\hat{F}$ their quantized counterparts (parameterized by codebooks $C$, scales $s$, and non-quantized parameters $\theta$), and the set of remaining non-quantized parameters $\theta$ is typically small or empty.
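A minimal PyTorch sketch of the block-level term, assuming `block` is the frozen original transformer block and `qblock` its quantized counterpart; it simply restates the output-matching loss above and is not the paper's training code.

```python
import torch

def block_calibration_loss(block, qblock, X_block):
    """MSE between original and quantized block outputs on calibration activations.

    block:   frozen FP16/FP32 transformer block (teacher)
    qblock:  quantized copy whose codebooks, scales, and small non-quantized
             parameters (norms, biases) are the only trainable tensors
    X_block: activations entering this block, shape (batch, seq_len, hidden)
    """
    with torch.no_grad():
        target = block(X_block)  # teacher output, no gradients needed
    return torch.nn.functional.mse_loss(qblock(X_block), target)
```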
Optimization proceeds via an EM-style process:
- E-step: Updates the discrete code assignments $b$ through beam search in a Markov Random Field (MRF) formulation, leveraging the precomputed Gram matrix $XX^\top$.
- M-step: Refines codebooks $C_m$, scales $s$, and small non-quantized parameters $\theta$ using the Adam optimizer.
The per-block calibration procedure can be summarized as follows:
| Step | Description |
|---|---|
| Codebook Initialization | Residual K-means on weight matrix rows |
| Gram Matrix | Precompute $XX^\top$ from the calibration inputs |
| E-step | Beam-search code assignment per output unit/group |
| M-step | Adam updates on $C_m$, $s$, $\theta$ to minimize the block loss |
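Read as code, the table above corresponds to roughly the following PyTorch-style loop; this is a hedged sketch in which `beam_search_codes` and the `quantized_layers()` accessor are assumed placeholders for the paper's MRF beam search and layer bookkeeping, not real library calls.

```python
import torch

def calibrate_block(block, qblock, X_block, beam_search_codes,
                    em_rounds=3, adam_steps=100, lr=1e-4):
    """EM-style calibration of one quantized transformer block (illustrative sketch).

    block:  frozen original block (teacher); qblock: its quantized copy.
    beam_search_codes(layer, X_block) is an assumed helper returning new integer
    code assignments for one quantized linear layer (the MRF beam-search E-step).
    """
    # Continuous parameters (codebooks, scales, norms, biases) receive Adam updates;
    # the discrete integer codes are handled separately in the E-step.
    opt = torch.optim.Adam(
        [p for p in qblock.parameters() if p.requires_grad], lr=lr
    )

    with torch.no_grad():
        target = block(X_block)  # teacher output on the calibration activations

    for _ in range(em_rounds):
        # E-step: reassign codes per output unit/group via beam search; a real
        # implementation precomputes the Gram matrix X X^T per layer for this.
        for layer in qblock.quantized_layers():  # assumed accessor
            layer.codes = beam_search_codes(layer, X_block)

        # M-step: refine codebooks, scales, and small non-quantized parameters
        # against the block-level output-matching loss.
        for _ in range(adam_steps):
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(qblock(X_block), target)
            loss.backward()
            opt.step()

    return qblock
```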
3. Theoretical Analysis
- Reconstruction Bounds: With $M$ codebooks of size $2^B$ and assignments chosen to minimize MSE, each weight group is approximated by its closest sum of $M$ prototypes, and the expected reconstruction error decreases as $M$ and $B$ grow (cf. Babenko & Lempitsky, 2014). Empirically, layer-level relative error drops below one percent at practical values of $M$ and $B$.
- Pareto Frontier: For LLaMA 2-7B on WikiText2,
- FP16 baseline: 5.12 PPL @ 16 bits/param
- QuIP# (2 bit): 8.22 PPL @ 2.02 bits/param
- AQLM (2 bit): 6.64 PPL @ 2.02 bits/param
AQLM is strictly Pareto-optimal versus all prior 2-bit methods in the perplexity-versus-model-size trade-off, and it also outperforms certain higher-bit baselines (e.g., 4-bit GPTQ) on smaller models.
4. Implementation and Empirical Performance
AQLM supports high-throughput inference on both GPU and CPU via codebook lookup tables (a minimal sketch of the lookup path follows this list):
- GPU kernel: Precomputes codebook lookups for each weight group, so each group requires only $M$ lookup-and-add operations; generation speed on an RTX 3090 is approximately on par with FP16 (LLaMA 2-70B).
- CPU kernel: Splits each 16-bit codebook into two $8$-bit sub-codebooks so lookups reside in L1/L2 cache, yielding a better-than-2× FP32 speedup on a 16-core Intel i9.
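To make the lookup-table idea concrete, here is a hedged NumPy sketch of the decode-and-multiply path for a single linear layer at batch size 1; the function name `aqlm_matvec` and the per-row scale granularity are illustrative assumptions, and real AQLM kernels fuse these steps in CUDA/C++ with cache-resident sub-codebooks rather than Python loops.

```python
import numpy as np

def aqlm_matvec(codes, codebooks, scales, x, g):
    """Compute y = W_hat @ x for one AQLM-quantized layer via table lookups.

    codes:     (d_out, num_groups, M) integer indices into each codebook
    codebooks: (M, 2**B, g) learned codebook vectors
    scales:    (d_out,) per-output-row scale (assumed granularity)
    x:         (d_in,) input vector with d_in = num_groups * g
    """
    d_out, num_groups, M = codes.shape
    x_groups = x.reshape(num_groups, g)
    y = np.zeros(d_out, dtype=x.dtype)
    for j in range(num_groups):
        # Build the per-group lookup table <codebook entry, x_group> once ...
        luts = codebooks @ x_groups[j]  # shape (M, 2**B)
        # ... then each output row needs only M table lookups and additions.
        for m in range(M):
            y += luts[m, codes[:, j, m]]
    return scales * y
```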
Model footprint is reduced by roughly 8× at 2 bits/parameter relative to FP16, while maintaining comparable or better token-generation speed.
Summary of token generation rates:
| Device | FP16/FP32 baseline | AQLM (2-bit) | AQLM (2×8-bit) |
|---|---|---|---|
| RTX 3090, LLaMA-2 7B | 41.5 tok/s | 32.2 tok/s | 32.6 tok/s |
| Intel i9, LLaMA-2 7B | 3.1 tok/s | 7.0 tok/s | 6.8 tok/s |
5. Compression–Accuracy Trade-Offs and Calibration
Several operational variables modulate AQLM’s effectiveness:
- Calibration set size: Gains saturate around 2,000 calibration sequences (useful range: 512–4,096).
- Codebook number $M$, codebook bits $B$, group size $g$: Increasing $M$ improves accuracy at a fixed bit budget but incurs higher E-step computational cost (see the worked bit-budget example after this list).
- Block-wise fine-tuning: 100–300 Adam steps add on the order of 10% or more to calibration time while securing a perplexity reduction of several percent.
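As a worked example of the bit-budget arithmetic behind these knobs, the sketch below computes average bits per parameter as code bits ($M\!\cdot\!B/g$) plus amortized codebook and scale storage; the overhead model (FP16 codebooks, one FP16 scale per output row) is an assumption for illustration, not the paper's exact accounting.

```python
def bits_per_param(M, B, g, d_in, d_out, scale_bits=16, codebook_bits=16):
    """Average storage cost per weight of one linear layer (illustrative accounting).

    Code storage:     M * B bits per group of g weights.
    Codebook storage: M * 2**B entries of g values each, stored in FP16,
                      shared across the whole layer.
    Scale storage:    one FP16 scale per output row (assumed granularity).
    """
    n_weights = d_in * d_out
    code_bits = n_weights * M * B / g
    cbook_bits = M * (2 ** B) * g * codebook_bits
    scale_total = d_out * scale_bits
    return (code_bits + cbook_bits + scale_total) / n_weights

# Two roughly 2-bit configurations for a 4096x4096 layer (sizes are illustrative):
print(bits_per_param(M=2, B=8,  g=8, d_in=4096, d_out=4096))  # ~2.01 bits/param
print(bits_per_param(M=1, B=16, g=8, d_in=4096, d_out=4096))  # ~2.50 bits/param here;
# the large 2^16-entry codebook amortizes better over wider layers and bigger models.
```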
AQLM thus generalizes well with modest one-shot calibration effort, especially compared to direct PTQ methods.
6. Limitations and Future Prospects
AQLM is the first post-training quantization scheme to reach Pareto optimality below 3 bits/parameter on open LLMs, yielding state-of-the-art perplexity and zero-shot accuracy in the extreme quantization regime. Noted limitations:
- Calibration cost (beam-search E-step) exceeds direct PTQ approaches (e.g., RTN/GPTQ), but remains practical for one-shot application.
- Homogeneous codebooks: Current versions employ fixed codebook architectures; incorporating sparsity or non-uniform (layer-dependent) bit allocation could permit further gains.
- Activation quantization: Extending AQLM to quantized activation flows (quantization-aware inference) could push bit-efficiency below 2 bits.
These results indicate that multi-codebook quantization adapts scalably to extreme LLM compression, facilitating efficient on-device inference at high fidelity (Egiazarian et al., 2024).