- The paper introduces any4, a novel learned 4-bit quantization scheme that leverages per-row codebook learning and group-wise scaling to minimize reconstruction error.
- It employs an activation-weighted K-means method for calibration, enabling efficient quantization with a single, diverse prompt and reducing implementation overhead.
- Empirical results across various LLM families demonstrate that any4 improves perplexity and task accuracy while supporting low-latency inference with the tinygemm library.
any4: Learned 4-bit Numeric Representation for LLMs
The paper introduces any4, a learned 4-bit quantization scheme for LLMs, and demonstrates its superiority over existing 4-bit numeric formats and quantization algorithms. The authors also present tinygemm, a GPU-optimized matrix multiplication library supporting any4 and other quantization methods, targeting low-latency inference scenarios.
Motivation and Context
Efficient inference of LLMs is constrained by memory bandwidth and storage, especially in datacenter and edge deployments. Reducing model parameter precision to 4 bits is a common strategy, but existing numeric formats—int4, fp4, and nf4—are suboptimal in matching the true distribution of neural network weights, which are typically heavy-tailed and non-uniform. Furthermore, many state-of-the-art quantization methods require pre-processing of weights or activations, increasing implementation complexity and calibration cost.
any4: Algorithmic Overview
any4 departs from fixed-format quantization by learning a per-row codebook of 16 floating-point values (for 4 bits) that best reconstructs the original weights under a group-wise scaling regime. The quantization process is as follows:
- Group-wise Scaling: Each group of weights (default group size 128) is scaled and offset to match the dynamic range of the quantized format.
- Per-row Codebook Learning: For each row, a codebook of 16 values is learned using a weighted K-means clustering procedure. The optimization minimizes the expected mean squared error in the output activations, not just the weights, by incorporating representative input activations during calibration.
- Assignment and LUT Construction: Each weight is assigned to its nearest codebook value, and the codebook is stored as a lookup table (LUT) per row.
- Calibration: Unlike prior work that requires hundreds of calibration samples, any4 matches or exceeds that accuracy using a single, hand-curated, diverse prompt.
The process is formalized as an alternating minimization (E-step and M-step) akin to K-means, but with cluster centroids updated using activation-weighted means. The codebook learning is parallelized across rows, enabling rapid quantization of large models.
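To make the alternating minimization concrete, here is a minimal sketch of one activation-weighted K-means iteration for a single row. It is illustrative only: the function and variable names are ours, and the importance weights are simply assumed to be derived from calibration activations.

```python
import torch

def weighted_kmeans_step(w_row, codebook, importance):
    """One E/M iteration of activation-weighted K-means for a single weight row.

    w_row:      (n,) scaled weights of one output row
    codebook:   (16,) current centroids (4 bits -> 16 representable values)
    importance: (n,) per-weight importance, e.g. derived from calibration activations
    """
    # E-step: assign each weight to its nearest codebook value.
    assign = torch.argmin((w_row[:, None] - codebook[None, :]).abs(), dim=1)

    # M-step: move each centroid to the importance-weighted mean of its members,
    # so errors on weights that multiply large activations count for more.
    new_codebook = codebook.clone()
    for k in range(codebook.numel()):
        mask = assign == k
        if mask.any():
            wk = importance[mask]
            new_codebook[k] = (wk * w_row[mask]).sum() / wk.sum()
    return new_codebook, assign
```

Iterating these two steps until the assignments stabilize yields the 16-entry codebook for that row; as noted above, this runs independently (and in parallel) across all rows of a weight matrix.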
Pseudocode for any4 Quantization
A cleaned-up version of the quantization loop is shown below; `scale`, `kmeans`, and `calibration_activations` are placeholders for the group-wise scaling routine, the activation-weighted K-means solver, and cached per-module calibration statistics.

```python
import torch

for module in model.modules():
    if not isinstance(module, torch.nn.Linear):
        continue
    w = module.weight.data
    wQ = torch.zeros_like(w)
    alpha, beta = [], []
    for i in range(w.shape[0]):  # one codebook per output row
        # Scale/offset the row so it fits the quantized dynamic range.
        wSi, alphai, betai = scale(w[i, :])
        # Calibration activations recorded for this module/row.
        xi = calibration_activations[module][i]
        # Snap each scaled weight to its nearest learned codebook value,
        # weighting the K-means objective by the activation statistics.
        wQ[i, :] = kmeans(
            samples=wSi,
            sample_weight=alphai * abs(xi.mean()),
        )
        alpha.append(alphai)
        beta.append(betai)
    module.weight.data = wQ
    module.alpha = alpha
    module.beta = beta
```
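As a follow-up to the loop above, the sketch below shows the assignment-and-LUT-construction step in isolation: converting a codebook-snapped row into 4-bit indices plus a 16-entry lookup table, which is what is actually stored. The helper is our own illustration, not code from the paper.

```python
import torch

def to_codes_and_lut(wq_row):
    """Turn a row of codebook-snapped values into 4-bit codes and a 16-entry LUT.

    wq_row: (n,) row after K-means snapping (each element is one of <= 16 values)
    Returns (codes, lut): codes is (n,) uint8 in [0, 15], lut is (16,) float.
    """
    lut, codes = torch.unique(wq_row, return_inverse=True)  # lut is sorted ascending
    assert lut.numel() <= 16, "a 4-bit row may use at most 16 distinct values"
    # Pad the LUT to exactly 16 entries so every row stores a table of the same size.
    lut = torch.cat([lut, lut.new_zeros(16 - lut.numel())])
    return codes.to(torch.uint8), lut
```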
tinygemm: Efficient Inference Implementation
tinygemm is a CUDA-based GEMM library optimized for small batch sizes (1–16), which are typical in LLM inference. It supports int4, nf4, and any4 quantization, and achieves high throughput by:
- Laying out matrices in memory to match tensor core tile formats, avoiding shared memory transpositions.
- Dequantizing weights on-the-fly using per-row LUTs, leveraging GPU warp shuffle instructions for efficient lookup.
- Packing quantized data to maximize memory bandwidth utilization.
Benchmarks show that int4 achieves up to 3× speedup over bfloat16, while any4 and nf4 reach up to 2×, with any4 incurring minimal additional overhead despite per-row LUTs.
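To make the on-the-fly LUT dequantization concrete, here is a rough pure-PyTorch reference of the idea (this is not tinygemm's CUDA kernel or API, and it simplifies the paper's group-wise scaling to a single scale and offset per row): two 4-bit codes are unpacked from each byte, each code gathers its row's LUT value, the scale and offset are applied, and a small-batch matmul follows.

```python
import torch

def any4_matmul_reference(x, packed, lut, alpha, beta):
    """Slow reference emulation of LUT-based dequantization followed by a GEMM.

    x:      (batch, in_features) activations, small batch
    packed: (out_features, in_features // 2) uint8, two 4-bit codes per byte
    lut:    (out_features, 16) per-row codebook values
    alpha:  (out_features, 1) per-row scale (simplified; the paper scales per group)
    beta:   (out_features, 1) per-row offset
    """
    # Unpack low and high nibbles and interleave them back to full row width.
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    codes = torch.stack([low, high], dim=-1).reshape(packed.shape[0], -1).long()

    # Per-row LUT gather: each 4-bit code indexes its own row's 16-entry table.
    w_scaled = torch.gather(lut, 1, codes)

    # Undo the scaling (assumes w ~ alpha * lut[code] + beta), then multiply.
    w_hat = alpha * w_scaled + beta
    return x @ w_hat.t()
```

In the real kernels this dequantization is fused into the tensor-core GEMM, with LUT lookups performed via warp shuffles, so a full-precision weight matrix is never materialized.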
Empirical Results
The paper provides extensive evaluation across Llama2, Llama3, Mistral, and Mixtral model families and sizes (1B–70B parameters). Key findings include:
- Perplexity and Task Accuracy: any4 consistently achieves lower perplexity and higher downstream task accuracy than int4, fp4, and nf4 across all tested models and datasets.
- Comparison with Orthogonal Methods: any4 is competitive with, and sometimes outperforms, advanced quantization algorithms such as AWQ, GPTQ, and QuIP, despite not requiring pre-processing.
- Calibration Efficiency: Using a single, diverse prompt for calibration matches or exceeds the performance of using large calibration datasets.
- Robustness to Group Size: any4 maintains stable performance even as quantization group size increases, unlike fp4 and nf4, which degrade significantly at large group sizes.
Selected Numerical Results (Llama3 8B, C4 Perplexity)
| Format | Perplexity (↓) |
|--------|----------------|
| FP16   | 8.93           |
| int4   | 9.89           |
| fp4    | 10.22          |
| nf4    | 9.52           |
| any4   | 9.40           |
Ablation Studies
- Calibration Data: A single, hand-written prompt yields better or equivalent results compared to hundreds of samples from standard datasets.
- K-means Initialization: k-means++ initialization outperforms random or fixed-value seeding.
- Optimization Target: Minimizing activation-weighted error (not just weight error) is essential for optimal quantization.
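To illustrate the last point, here is a small, simplified sketch (our own code) contrasting the two objectives: two quantized candidates can have identical weight-space MSE yet very different output error, and it is the output error that the activation-weighted K-means targets.

```python
import torch

def weight_mse(w, w_hat):
    # Plain weight-space reconstruction error.
    return ((w - w_hat) ** 2).mean()

def output_mse(w, w_hat, x_calib):
    """Error in the layer's outputs over calibration activations.

    w, w_hat: (out_features, in_features) original and quantized weights
    x_calib:  (samples, in_features) calibration activations
    """
    # || x (w - w_hat)^T ||^2: errors on weights that meet large activations dominate.
    return ((x_calib @ (w - w_hat).t()) ** 2).mean()
```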
Practical Implications
- Deployment: any4 enables high-accuracy, low-memory LLM inference without the need for complex calibration pipelines or hardware-specific numeric formats.
- Integration: The open-sourced tinygemm library can be integrated into existing inference stacks (e.g., PyTorch, Hugging Face Transformers) with minimal changes; a rough stand-in illustrating the integration point is sketched after this list.
- Scalability: The method is parallelizable and efficient for models up to 70B parameters.
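As a rough illustration of the integration point above, here is a hypothetical pure-PyTorch stand-in for a LUT-quantized linear layer. It does not use tinygemm's real API; it materializes the dequantized weights and falls back to a dense matmul, which an optimized deployment would replace with tinygemm's fused kernels.

```python
import torch
from torch import nn

class LutQuantLinear(nn.Module):
    """Hypothetical drop-in replacement for nn.Linear backed by a per-row LUT."""

    def __init__(self, codes, lut, alpha, beta, bias=None):
        super().__init__()
        self.register_buffer("codes", codes)  # (out, in) int64 codes in [0, 15]
        self.register_buffer("lut", lut)      # (out, 16) per-row codebook
        self.register_buffer("alpha", alpha)  # (out, 1) per-row scale (simplified)
        self.register_buffer("beta", beta)    # (out, 1) per-row offset
        self.register_buffer("bias", bias)    # optional (out,) bias or None

    def forward(self, x):
        # Dequantize on the fly, then do a dense matmul (slow reference path).
        w_hat = self.alpha * torch.gather(self.lut, 1, self.codes) + self.beta
        return torch.nn.functional.linear(x, w_hat, self.bias)
```

Swapping each nn.Linear for such a module after running the quantization loop is the kind of minimal change the authors describe; the reported speedups come from tinygemm's kernels rather than this dense fallback.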
Limitations and Future Directions
- LUT Overhead: any4 introduces a small per-row LUT overhead (about 0.0625 bits per weight entry once the 16-entry table is amortized over the row), which is negligible compared to the overall memory savings.
- Orthogonal Techniques: While any4 is competitive with pre-processing-based methods, combining any4 with such techniques (e.g., AWQ, GPTQ) may yield further gains.
- Generalization: The impact of calibration data on bias and truthfulness remains an open question.
Theoretical and Future AI Implications
any4 demonstrates that learned, data-driven quantization formats can outperform fixed numeric representations, especially as model sizes and deployment constraints grow. The approach suggests a broader trend toward adaptive, model-specific quantization in both research and production. Future work may explore:
- Joint optimization of quantization and model architecture.
- Extension to activation and gradient quantization for training efficiency.
- Hardware co-design for LUT-based quantization schemes.
In summary, any4 provides a practical, high-accuracy, and efficient solution for 4-bit LLM quantization, with strong empirical results and a clear path for integration and further research.