- The paper introduces any4, a novel learned 4-bit quantization scheme that leverages per-row codebook learning and group-wise scaling to minimize reconstruction error.
- It employs an activation-weighted K-means method for calibration, enabling efficient quantization with a single, diverse prompt and reducing implementation overhead.
- Empirical results across various LLM families demonstrate that any4 improves perplexity and task accuracy while supporting low-latency inference with the tinygemm library.
any4: Learned 4-bit Numeric Representation for LLMs
The paper introduces any4, a learned 4-bit quantization scheme for LLMs, and demonstrates its superiority over existing 4-bit numeric formats and quantization algorithms. The authors also present tinygemm, a GPU-optimized matrix multiplication library supporting any4 and other quantization methods, targeting low-latency inference scenarios.
Motivation and Context
Efficient inference of LLMs is constrained by memory bandwidth and storage, especially in datacenter and edge deployments. Reducing model parameter precision to 4 bits is a common strategy, but existing numeric formats—int4, fp4, and nf4—are suboptimal in matching the true distribution of neural network weights, which are typically heavy-tailed and non-uniform. Furthermore, many state-of-the-art quantization methods require pre-processing of weights or activations, increasing implementation complexity and calibration cost.
any4: Algorithmic Overview
any4 departs from fixed-format quantization by learning a per-row codebook of 16 floating-point values (for 4 bits) that best reconstructs the original weights under a group-wise scaling regime. The quantization process is as follows:
- Group-wise Scaling: Each group of weights (default group size 128) is scaled and offset to match the dynamic range of the quantized format.
- Per-row Codebook Learning: For each row, a codebook of 16 values is learned using a weighted K-means clustering procedure. The optimization minimizes the expected mean squared error in the output activations, not just the weights, by incorporating representative input activations during calibration.
- Assignment and LUT Construction: Each weight is assigned to its nearest codebook value, and the codebook is stored as a lookup table (LUT) per row.
- Calibration: Unlike prior work that requires hundreds of calibration samples, any4 matches or exceeds that accuracy using a single, hand-curated, diverse prompt.
The process is formalized as an alternating minimization (E-step and M-step) akin to K-means, but with cluster centroids updated using activation-weighted means. The codebook learning is parallelized across rows, enabling rapid quantization of large models.
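To make the alternating minimization concrete, here is a minimal sketch of one activation-weighted K-means iteration for a single row. It is illustrative only: the function and variable names are ours, and the importance weights are simply assumed to be derived from calibration activations.

```python
import torch

def weighted_kmeans_step(w_row, codebook, importance):
    """One E/M iteration of activation-weighted K-means for a single weight row.

    w_row:      (n,) scaled weights of one output row
    codebook:   (16,) current centroids (4 bits -> 16 representable values)
    importance: (n,) per-weight importance, e.g. derived from calibration activations
    """
    # E-step: assign each weight to its nearest codebook value.
    assign = torch.argmin((w_row[:, None] - codebook[None, :]).abs(), dim=1)

    # M-step: move each centroid to the importance-weighted mean of its members,
    # so errors on weights that multiply large activations count for more.
    new_codebook = codebook.clone()
    for k in range(codebook.numel()):
        mask = assign == k
        if mask.any():
            wk = importance[mask]
            new_codebook[k] = (wk * w_row[mask]).sum() / wk.sum()
    return new_codebook, assign
```

Iterating these two steps until the assignments stabilize yields the 16-entry codebook for that row; as noted above, this runs independently (and in parallel) across all rows of a weight matrix.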
Pseudocode for any4 Quantization
A cleaned-up version of the quantization loop is shown below; `scale`, `kmeans`, and `calibration_activations` are placeholders for the group-wise scaling routine, the activation-weighted K-means solver, and cached per-module calibration statistics.

```python
import torch

for module in model.modules():
    if not isinstance(module, torch.nn.Linear):
        continue
    w = module.weight.data
    wQ = torch.zeros_like(w)
    alpha, beta = [], []
    for i in range(w.shape[0]):  # one codebook per output row
        # Scale/offset the row so it fits the quantized dynamic range.
        wSi, alphai, betai = scale(w[i, :])
        # Calibration activations recorded for this module/row.
        xi = calibration_activations[module][i]
        # Snap each scaled weight to its nearest learned codebook value,
        # weighting the K-means objective by the activation statistics.
        wQ[i, :] = kmeans(
            samples=wSi,
            sample_weight=alphai * abs(xi.mean()),
        )
        alpha.append(alphai)
        beta.append(betai)
    module.weight.data = wQ
    module.alpha = alpha
    module.beta = beta
```
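As a follow-up to the loop above, the sketch below shows the assignment-and-LUT-construction step in isolation: converting a codebook-snapped row into 4-bit indices plus a 16-entry lookup table, which is what is actually stored. The helper is our own illustration, not code from the paper.

```python
import torch

def to_codes_and_lut(wq_row):
    """Turn a row of codebook-snapped values into 4-bit codes and a 16-entry LUT.

    wq_row: (n,) row after K-means snapping (each element is one of <= 16 values)
    Returns (codes, lut): codes is (n,) uint8 in [0, 15], lut is (16,) float.
    """
    lut, codes = torch.unique(wq_row, return_inverse=True)  # lut is sorted ascending
    assert lut.numel() <= 16, "a 4-bit row may use at most 16 distinct values"
    # Pad the LUT to exactly 16 entries so every row stores a table of the same size.
    lut = torch.cat([lut, lut.new_zeros(16 - lut.numel())])
    return codes.to(torch.uint8), lut
```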
tinygemm: Efficient Inference Implementation
tinygemm is a CUDA-based GEMM library optimized for small batch sizes (1–16), which are typical in LLM inference. It supports int4, nf4, and any4 quantization, and achieves high throughput by:
- Laying out matrices in memory to match tensor core tile formats, avoiding shared memory transpositions.
- Dequantizing weights on-the-fly using per-row LUTs, leveraging GPU warp shuffle instructions for efficient lookup.
- Packing quantized data to maximize memory bandwidth utilization.
Benchmarks show that int4 achieves up to 3× speedup over bfloat16, while any4 and nf4 reach up to 2×, with any4 incurring minimal additional overhead despite per-row LUTs.
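To make the on-the-fly LUT dequantization concrete, here is a rough pure-PyTorch reference of the idea (this is not tinygemm's CUDA kernel or API, and it simplifies the paper's group-wise scaling to a single scale and offset per row): two 4-bit codes are unpacked from each byte, each code gathers its row's LUT value, the scale and offset are applied, and a small-batch matmul follows.

```python
import torch

def any4_matmul_reference(x, packed, lut, alpha, beta):
    """Slow reference emulation of LUT-based dequantization followed by a GEMM.

    x:      (batch, in_features) activations, small batch
    packed: (out_features, in_features // 2) uint8, two 4-bit codes per byte
    lut:    (out_features, 16) per-row codebook values
    alpha:  (out_features, 1) per-row scale (simplified; the paper scales per group)
    beta:   (out_features, 1) per-row offset
    """
    # Unpack low and high nibbles and interleave them back to full row width.
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    codes = torch.stack([low, high], dim=-1).reshape(packed.shape[0], -1).long()

    # Per-row LUT gather: each 4-bit code indexes its own row's 16-entry table.
    w_scaled = torch.gather(lut, 1, codes)

    # Undo the scaling (assumes w ~ alpha * lut[code] + beta), then multiply.
    w_hat = alpha * w_scaled + beta
    return x @ w_hat.t()
```

In the real kernels this dequantization is fused into the tensor-core GEMM, with LUT lookups performed via warp shuffles, so a full-precision weight matrix is never materialized.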
Empirical Results
The paper provides extensive evaluation across Llama2, Llama3, Mistral, and Mixtral model families and sizes (1B–70B parameters). Key findings include:
- Perplexity and Task Accuracy: any4 consistently achieves lower perplexity and higher downstream task accuracy than int4, fp4, and nf4 across all tested models and datasets.
- Comparison with Orthogonal Methods: any4 is competitive with, and sometimes outperforms, advanced quantization algorithms such as AWQ, GPTQ, and QuIP, despite not requiring pre-processing.
- Calibration Efficiency: Using a single, diverse prompt for calibration matches or exceeds the performance of using large calibration datasets.
- Robustness to Group Size: any4 maintains stable performance even as quantization group size increases, unlike fp4 and nf4, which degrade significantly at large group sizes.
Selected Numerical Results (Llama3 8B, C4 Perplexity)
| Format | Perplexity (↓) |
|--------|----------------|
| FP16   | 8.93           |
| int4   | 9.89           |
| fp4    | 10.22          |
| nf4    | 9.52           |
| any4   | 9.40           |
Ablation Studies
- Calibration Data: A single, hand-written prompt yields better or equivalent results compared to hundreds of samples from standard datasets.
- K-means Initialization: k-means++ initialization outperforms random or fixed-value seeding.
- Optimization Target: Minimizing activation-weighted error (not just weight error) is essential for optimal quantization.
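To illustrate the last point, here is a small, simplified sketch (our own code) contrasting the two objectives: two quantized candidates can have identical weight-space MSE yet very different output error, and it is the output error that the activation-weighted K-means targets.

```python
import torch

def weight_mse(w, w_hat):
    # Plain weight-space reconstruction error.
    return ((w - w_hat) ** 2).mean()

def output_mse(w, w_hat, x_calib):
    """Error in the layer's outputs over calibration activations.

    w, w_hat: (out_features, in_features) original and quantized weights
    x_calib:  (samples, in_features) calibration activations
    """
    # || x (w - w_hat)^T ||^2: errors on weights that meet large activations dominate.
    return ((x_calib @ (w - w_hat).t()) ** 2).mean()
```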
Practical Implications
- Deployment: any4 enables high-accuracy, low-memory LLM inference without the need for complex calibration pipelines or hardware-specific numeric formats.
- Integration: The open-sourced tinygemm library can be integrated into existing inference stacks (e.g., PyTorch, Hugging Face Transformers) with minimal changes; a rough stand-in illustrating the integration point is sketched after this list.
- Scalability: The method is parallelizable and efficient for models up to 70B parameters.
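As a rough illustration of the integration point above, here is a hypothetical pure-PyTorch stand-in for a LUT-quantized linear layer. It does not use tinygemm's real API; it materializes the dequantized weights and falls back to a dense matmul, which an optimized deployment would replace with tinygemm's fused kernels.

```python
import torch
from torch import nn

class LutQuantLinear(nn.Module):
    """Hypothetical drop-in replacement for nn.Linear backed by a per-row LUT."""

    def __init__(self, codes, lut, alpha, beta, bias=None):
        super().__init__()
        self.register_buffer("codes", codes)  # (out, in) int64 codes in [0, 15]
        self.register_buffer("lut", lut)      # (out, 16) per-row codebook
        self.register_buffer("alpha", alpha)  # (out, 1) per-row scale (simplified)
        self.register_buffer("beta", beta)    # (out, 1) per-row offset
        self.register_buffer("bias", bias)    # optional (out,) bias or None

    def forward(self, x):
        # Dequantize on the fly, then do a dense matmul (slow reference path).
        w_hat = self.alpha * torch.gather(self.lut, 1, self.codes) + self.beta
        return torch.nn.functional.linear(x, w_hat, self.bias)
```

Swapping each nn.Linear for such a module after running the quantization loop is the kind of minimal change the authors describe; the reported speedups come from tinygemm's kernels rather than this dense fallback.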
Limitations and Future Directions
- LUT Overhead: any4 introduces a small per-row LUT overhead (about 0.0625 bits per weight entry once the 16-entry table is amortized over the row), which is negligible compared to the overall memory savings.
- Orthogonal Techniques: While any4 is competitive with pre-processing-based methods, combining any4 with such techniques (e.g., AWQ, GPTQ) may yield further gains.
- Generalization: The impact of calibration data on bias and truthfulness remains an open question.
Theoretical and Future AI Implications
any4 demonstrates that learned, data-driven quantization formats can outperform fixed numeric representations, especially as model sizes and deployment constraints grow. The approach suggests a broader trend toward adaptive, model-specific quantization in both research and production. Future work may explore:
- Joint optimization of quantization and model architecture.
- Extension to activation and gradient quantization for training efficiency.
- Hardware co-design for LUT-based quantization schemes.
In summary, any4 provides a practical, high-accuracy, and efficient solution for 4-bit LLM quantization, with strong empirical results and a clear path for integration and further research.