
Quantized GEMM with Learned Low-Bit Formats

Updated 12 December 2025
  • The paper introduces learned low-bit quantization methods that dynamically allocate bitwidths using group statistics to maintain model accuracy.
  • It details techniques such as microscaling floating-point formats and grouped lattice vector quantization to minimize quantization error.
  • Empirical evaluations show significant throughput gains, memory savings, and competitive performance compared to static uniform quantization.

Quantized GEMM (General Matrix-Matrix Multiplication) with Learned Low-Bit Formats encompasses a set of algorithmic and hardware techniques for accelerating large neural network inference by compressing weights and activations into low-bit numeric representations whose mappings are derived through data-driven, learning-based, or adaptive methods. In contrast to static quantization (e.g., uniform INT4/INT8), these approaches exploit local statistics or optimization objectives to allocate bitwidths, determine quantization centroids, or jointly select scaling/format parameters, often in a mixed-precision fashion. The result is efficient matrix multiplication that maintains high accuracy at substantially reduced precision and memory cost.
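
For reference, the static baseline these methods improve on can be sketched in a few lines of NumPy: per-group uniform INT4 weight quantization with one symmetric scale per group, followed by a GEMM that dequantizes on the fly. The group size, scaling rule, and function names are illustrative choices, not a specific library's API.

```python
import numpy as np

def quantize_int4_groups(W, group=128):
    """Static uniform INT4 quantization: one symmetric scale per group of 128 weights."""
    rows, cols = W.shape
    Wg = W.reshape(rows, cols // group, group)
    scales = np.abs(Wg).max(axis=-1, keepdims=True) / 7.0   # INT4 symmetric codes in [-7, 7]
    scales[scales == 0] = 1.0
    codes = np.clip(np.rint(Wg / scales), -7, 7).astype(np.int8)
    return codes, scales

def gemm_dequant(codes, scales, X):
    """Dequantize group-by-group and multiply; real kernels fuse this into the MMA loop."""
    W_hat = (codes * scales).reshape(codes.shape[0], -1)
    return W_hat @ X

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 512)).astype(np.float32)
X = rng.normal(size=(512, 8)).astype(np.float32)
codes, scales = quantize_int4_groups(W)
print("max error vs FP32 GEMM:", np.abs(W @ X - gemm_dequant(codes, scales, X)).max())
```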

1. Principles of Learned and Adaptive Low-Bit Quantization

Low-bit quantization refers to representing matrix entries with fewer than 8 bits, often as few as 2, 3, or 4 bits, while learned or adaptive quantization implies that the quantization mappings (scales, centroids, codebooks) are not fixed but optimized using properties of the model, channel, group, or block. Techniques include:

  • Microscaling floating-point formats: Block-scaled low-bit floating point, where each block shares a learned exponent and each value is stored as a truncated mantissa. The choice of bitwidth per channel is guided by quantization error thresholds derived from exact error bounds (Liu et al., 4 Aug 2025).
  • Grouped Lattice Vector Quantization (GLVQ): Assigns each weight group a learnable lattice codebook (generation matrix) and applies nearest-lattice-point search (Babai rounding) to minimize distortion, with the possibility of mixed or fractional bit budgets across groups (Zhang et al., 23 Oct 2025).
  • Adaptive activation formats: Variable-length mantissa truncation and group-wise shared exponents (e.g., Anda), with mantissa bit allocation determined by post-training module-wise search to satisfy accuracy constraints (Fang et al., 24 Nov 2024).
  • Learned row-wise codebooks: Example: any4/any3, where each weight row learns its own codebook of reproduction values, found via weighted k-means to directly minimize output error, with calibration driven by recorded per-channel statistics on a single diverse prompt (Elhoushi et al., 7 Jul 2025).
  • Lookup-table-based/BCQ methods: Use groupwise binary-coding quantization where bitplane coefficients are adaptively optimized, and GEMM is performed using precomputed lookups indexed by compressed codes (Park et al., 2022).

These approaches generalize or subsume static uniform quantization, yielding lower error at a given bit budget via model-aware format assignment or learned centroids.
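
As a concrete illustration of the block-scaled format in the first bullet, the following NumPy sketch quantizes weights into 32-value blocks that share a power-of-two scale, with each magnitude rounded to a 4-bit E2M1-style grid. The grid, the shared-exponent convention, and the rounding rule are simplifying assumptions rather than the exact behavior of the cited kernels.

```python
import numpy as np

# Representable magnitudes of a 4-bit E2M1 element (MXFP4-style); the sign is kept separately.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mx_blocks(x, block=32):
    """Block-scaled low-bit quantization: each block of 32 values shares one
    power-of-two scale; magnitudes are rounded to the nearest grid value."""
    xb = x.reshape(-1, block)
    amax = np.abs(xb).max(axis=1, keepdims=True)
    # Choose the shared exponent so the block maximum lands near the top of the
    # element range (6.0 for E2M1). This is one convention, not the only one.
    exp = np.floor(np.log2(np.maximum(amax, 1e-12))) - np.floor(np.log2(FP4_GRID[-1]))
    scales = 2.0 ** exp
    mags = np.abs(xb) / scales
    codes = np.abs(mags[..., None] - FP4_GRID).argmin(axis=-1)   # nearest representable value
    return codes.astype(np.uint8), np.sign(xb), scales

def dequantize_mx_blocks(codes, signs, scales, shape):
    return (signs * FP4_GRID[codes] * scales).reshape(shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 128)).astype(np.float32)
codes, signs, scales = quantize_mx_blocks(W)
W_hat = dequantize_mx_blocks(codes, signs, scales, W.shape)
print("mean |W - W_hat|:", np.abs(W - W_hat).mean())
```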

2. Algorithmic Workflow: Quantization, Format Assignment, and GEMM Execution

A typical pipeline for learned low-bit quantized GEMM comprises the following:

  1. Partitioning and grouping: Matrices (weights and/or activations) are divided into fixed-size blocks, groups, or channels (e.g., 32 or 128 entries per group).
  2. Statistics collection: Compute per-group/channel max, mean absolute value, or covariance (used for format selection or seed codebook/generation matrix).
  3. Format/bitwidth assignment: Assign the most aggressive bitwidth (smallest b) such that quantization error is below a pre-specified envelope (e.g., INT8's error), using closed-form thresholds on group statistics (Liu et al., 4 Aug 2025).
  4. Learning quantization mappings:
    • For GLVQ: Update groupwise generation matrices and companding parameters via alternating Babai rounding/code assignment and gradient-based updates (Zhang et al., 23 Oct 2025); a toy sketch of the Babai rounding step follows this list.
    • For any4: Run weighted k-means to learn per-row codebooks minimizing output error, using mean |activation| as weights from calibration samples (Elhoushi et al., 7 Jul 2025).
  5. Encoding: Quantize model weights (and, optionally, activations) into block/group codes using learned mapping.
  6. Supporting formats: For activations, utilize adaptive formats with group-shared exponents and truncation (e.g. Anda), with group mantissa widths chosen via post-training search to meet accuracy constraints (Fang et al., 24 Nov 2024).
  7. GEMM kernel:
    • On hardware with native low-bit support (NVIDIA Blackwell, Ampere): Partition input matrices, load format-specific segments, and perform MMA with on-the-fly dequantization and fused scaling (Liu et al., 4 Aug 2025).
    • On generic GPU/CPU: Use LUT-based kernels, streaming precomputed codebooks and codes from memory and fusing dequantization efficiently in-register (Elhoushi et al., 7 Jul 2025, Park et al., 2022).
    • Blockwise or stream-ordered operation for multi-precision input, with results concatenated post-GEMM.
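
The nearest-lattice-point step in item 4 (Babai rounding) reduces to an integer rounding problem per weight group. The sketch below uses a single toy generation matrix; in the actual method each group's matrix is learnable and companding is applied, neither of which is shown here.

```python
import numpy as np

def babai_round(w_group, B):
    """Nearest-lattice-point approximation via Babai rounding:
    solve B z ≈ w for integer z, then reconstruct as B z."""
    z = np.rint(np.linalg.solve(B, w_group))      # integer lattice coordinates (the stored code)
    return z, B @ z

# Toy example: 4-dimensional groups, one shared (hypothetical) generation matrix.
rng = np.random.default_rng(0)
d = 4
B = 0.1 * np.eye(d) + 0.02 * rng.normal(size=(d, d))   # learnable per group in the real method
w = rng.normal(scale=0.05, size=(8, d))                 # 8 weight groups of size d

codes, recon = zip(*(babai_round(g, B) for g in w))
recon = np.stack(recon)
print("mean squared distortion:", np.mean((w - recon) ** 2))
```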

3. Quantization Error Analysis and Format Selection

Key to adaptive format assignment is rigorous control of quantization error:

  • Error bounds: For floating-point microscaling, the per-channel quantization error is bounded by $E_b = \gamma s_b$, with a format-dependent constant $\gamma$. A channel is promoted to higher precision if its error exceeds a threshold derived from INT8's maximal error, ensuring that even the lowest-bitwidth channels never underperform standard INT8 (Liu et al., 4 Aug 2025); a toy sketch of this assignment rule appears after this list.

The closed-form threshold

$T(b) = 2^{(b+b-1)} \cdot \frac{\max|X|}{254 \, q_{\max}}$

is used to assign MXFP4, MXFP6, or MXFP8 per channel.

  • Calibration: Both for GLVQ and any4, a small calibration set (as little as one hand-crafted "kitchen sink" prompt for any4 (Elhoushi et al., 7 Jul 2025), or ~4M tokens for GLVQ (Zhang et al., 23 Oct 2025)) suffices, as optimization converges rapidly if group/row statistics are representative.
  • Bit allocation search: For adaptive activation representations (e.g., Anda), a greedy or heap-based search minimizes bit-ops per module under an admitted (user-specified) accuracy loss δ, converging in a small number of iterations (Fang et al., 24 Nov 2024).
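
A toy version of the error-driven assignment rule from the first bullet is sketched below. It substitutes a uniform block-scaled quantizer for the closed-form MXFP error bounds and assumes the per-tensor INT8 rounding bound, max|W|/254, as the reference envelope; both are illustrative simplifications.

```python
import numpy as np

def block_quant_error(ch, bits, block=32):
    """Max error of symmetric uniform quantization with one scale per 32-value block.
    A stand-in for the closed-form, format-dependent error bounds of the cited work."""
    qmax = 2 ** (bits - 1) - 1
    blocks = ch.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0
    deq = np.clip(np.rint(blocks / scales), -qmax, qmax) * scales
    return np.abs(blocks - deq).max()

def assign_bitwidths(W, candidates=(4, 6, 8)):
    """Per channel, pick the smallest bitwidth whose block-scaled error stays within
    an INT8-derived envelope (assumed here: per-tensor INT8 rounding bound, max|W|/254)."""
    envelope = np.abs(W).max() / 254.0
    out = []
    for ch in W:                                   # channels along the first axis
        chosen = candidates[-1]
        for b in candidates:
            if block_quant_error(ch, b) <= envelope:
                chosen = b
                break
        out.append(chosen)
    return np.array(out)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 256)).astype(np.float32)
W[2] *= 20.0                                       # an outlier-heavy channel
print(assign_bitwidths(W))   # the outlier channel is forced to 8 bits; others stay low-bit
```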

4. Hardware and Kernel Implementations

Optimized quantized GEMM leverages both software and hardware design:

  • Blackwell Tensor Cores: Native support for MXFP4/6/8, implementing fused quantization, scaling, and accumulation paths. Kernels are instantiated per format and concatenated in the output (Liu et al., 4 Aug 2025).
  • tinygemm and LUT-GEMM: GPU-optimized CUDA libraries that fuse lookup, dequant, and multiplication in tensor-core matrix multiply-accumulate pipelines, with specialized data layouts for maximizing register and cache coherence (Elhoushi et al., 7 Jul 2025, Park et al., 2022).
  • SIMD CPU kernels: Ultra low-precision (≤4-bit) kernels employ register- or cache-resident look-up tables for SIMD architectures (e.g., AVX2), maximizing throughput by replacing multiplications with branch-free LUT lookups and efficient bit unpacking (Ganji et al., 2023); a simplified NumPy illustration of the LUT idea follows this list.
  • Anda APU: Custom bit-serial processing unit that accommodates runtime-determined mantissa width M, grouped exponent handling, and plane-wise accumulation, significantly reducing energy and area compared to FP16 baselines (Fang et al., 24 Nov 2024).
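
The lookup-table idea behind the tinygemm/LUT-GEMM and SIMD kernels above can be emulated in NumPy as below. The evenly spaced per-row codebooks are placeholders for the learned codebooks (e.g., from activation-weighted k-means), and the dense gather stands in for the fused in-register lookups of real kernels.

```python
import numpy as np

def lut_gemm(codes, codebooks, x):
    """Matrix-vector product with 4-bit codebook-quantized weights.
    codes:      (rows, cols) uint8 codes in [0, 16)
    codebooks:  (rows, 16)   reconstruction values, one table per row
    x:          (cols,)      activation vector
    Real kernels keep the 16-entry table in registers/shared memory and fuse the
    lookup with the multiply-accumulate; here the lookup is plain fancy indexing."""
    W_hat = np.take_along_axis(codebooks, codes.astype(np.int64), axis=1)
    return W_hat @ x

rng = np.random.default_rng(0)
rows, cols = 4, 64
W = rng.normal(size=(rows, cols)).astype(np.float32)
x = rng.normal(size=cols).astype(np.float32)

# Toy per-row "codebooks": 16 evenly spaced values spanning each row's range.
# (The cited methods learn these instead, e.g. with activation-weighted k-means.)
codebooks = np.stack([np.linspace(r.min(), r.max(), 16) for r in W]).astype(np.float32)
codes = np.abs(W[:, :, None] - codebooks[:, None, :]).argmin(axis=-1).astype(np.uint8)

print("max |full - LUT| :", np.abs(W @ x - lut_gemm(codes, codebooks, x)).max())
```

Because the table has only 16 entries, it fits in registers or shared memory, so the lookup adds no memory traffic beyond the packed 4-bit codes themselves.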

5. Empirical Results and Trade-offs

The impact of learned low-bit quantized GEMM is evidenced by several key metrics:

Approach | Accuracy relative to FP16 | Throughput improvement | Memory/area savings | Typical use cases
MicroMix (MXFP4/6/8) | ≥95% zero-shot, ≥90% 5-shot (Liu et al., 4 Aug 2025) | 8–46% over TensorRT-FP8 | ~20% memory saved | LLMs (Llama, Qwen) on GPU
GLVQ (4-bit, group) | 10–20% lower perplexity at extreme low bits (Zhang et al., 23 Oct 2025) | +2–3% latency vs 4-bit PTQ | Negligible overhead | LLM weight compression (GPU/CPU)
Anda (adaptive) | <0.2% PPL loss at 3× BOP savings (Fang et al., 24 Nov 2024) | 2.1–2.5× over FP16 | 3–4× area/energy | Activation quantization (FPINT GEMM)
any4/tinygemm | 1–2% downstream drop vs FP16, lower than INT4/NF4 (Elhoushi et al., 7 Jul 2025) | 1.8–2× over BF16 | <5% storage overhead | Transformer LLMs (small-batch GPU)
DeepGEMM/LUT-GEMM | ~2% drop (CNNs); 1.5–1.8× vs 8-bit INT (Ganji et al., 2023; Park et al., 2022) | 2.1× over dequantized FP16 | See tables above | CPU, edge deployment

Across architectures and precision formats, learned low-bit quantization ensures competitive or superior accuracy versus static uniform formats, often matching FP8 or outperforming INT4, with substantial throughput and memory benefits.

6. Design Advantages, Limitations, and Practical Recommendations

Advantages:

  • Statistical or learned format assignment mitigates catastrophic quantization error on outlier channels or groups.
  • Group- and row-specific codebooks (GLVQ, any4) adapt to local structure, yielding lower distortion than fixed quantization grids.
  • Fusion of quantization and arithmetic (e.g., via hardware MMA) amortizes the cost of scaling and dequantization, maximizing throughput.
  • Flexible support for heterogeneous precision budgets and mixed/fractional bitwidths.
  • Minimal calibration overhead for learning group/row statistics.

Limitations:

  • GPU kernels must be carefully engineered to avoid shared memory bottlenecks and to optimize for tile size and register usage.
  • Overhead for managing per-group/row codebooks and scales is negligible for standard group sizes (g=128) but grows if much finer partitioning is used.
  • Current methods (GLVQ) generally focus on weight quantization; full dual-path low-bit quantized activations (as in MicroMix/Anda) require additional hardware/software support.
  • Some formats (GLVQ, any4) require direct calibration statistics from the target data domain.

Practical recommendations:

  • For LLMs on Blackwell-class GPUs, use mixed-precision MXFP4/6/8 with error-driven channel assignment (Liu et al., 4 Aug 2025).
  • To aggressively compress weights under minimal quality loss, groupwise lattice or learned codebook formats with Babai rounding and k-means codebook learning are preferred (Zhang et al., 23 Oct 2025, Elhoushi et al., 7 Jul 2025).
  • For FPINT GEMM inference on custom hardware, pair INT4 weights with adaptive grouped-activation formats (e.g. Anda) using hardware bit-serial datapaths (Fang et al., 24 Nov 2024).
  • For CPU and legacy GPU, LUT-based BCQ or codebook GEMM kernels deliver near-optimal memory and arithmetic efficiency (Ganji et al., 2023, Park et al., 2022).

7. Comparative Context and Future Directions

Quantized GEMM with learned low-bit formats has become a critical enabler for deploying large models under strict resource constraints, such as edge AI, low-latency inference, and large-scale deployment in datacenters. Compared to static INTn/PTQ, these approaches consistently offer better trade-offs in the extreme low-bit regime, approaching full-precision accuracy at 2–4 bits per element.

A plausible implication is that future research will continue to integrate format and kernel co-design with system architecture, increase support for joint weight-and-activation quantization, and further automate bitwidth allocation through meta-learning or fast calibration. With growing hardware-native support for sub-8-bit formats and efficient LUT-based arithmetic, quantized GEMM with learned low-bit formats is likely to become the default for high-performance LLM and vision inference (Liu et al., 4 Aug 2025, Zhang et al., 23 Oct 2025, Fang et al., 24 Nov 2024, Elhoushi et al., 7 Jul 2025, Ganji et al., 2023, Park et al., 2022).
