MR-GPTQ: Optimized FP4 Quantization
- MR-GPTQ is a quantization algorithm that enhances FP4 inference through block-wise Hadamard rotations and format-specific scale optimization.
- It reduces quantization error and improves LLM inference speed, achieving up to 96.1% recovery of FP16 accuracy across diverse benchmarks.
- The integration into GPU GEMM kernels minimizes overhead, enabling efficient deployment on NVIDIA and AMD platforms for real-world applications.
Micro-Rotated-GPTQ (MR-GPTQ) is a specialized post-training quantization (PTQ) algorithm developed for the emerging class of “microscaling” 4-bit floating-point (FP4) formats, specifically MXFP4 (E2M1 elements with E8M0 group scales, G = 32) and NVFP4 (E2M1 elements with E4M3 scales, G = 16), as supported in NVIDIA Blackwell and contemporary AMD GPUs. MR-GPTQ augments the classic GPTQ quantization framework with block-wise Hadamard rotations and format-specific optimizations to address the unique challenges posed by FP4 inference, achieving significant improvements in both accuracy and speed for LLMs under these hardware regimes (Egiazarian et al., 27 Sep 2025).
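For concreteness, the E2M1 element grid and the group-scaling parameters of the two formats can be summarized in a short sketch (a minimal illustration of the format definitions above; the identifiers are hypothetical and not taken from the paper's code):

```python
import numpy as np

# Magnitudes representable by an E2M1 (FP4) element:
# 1 sign bit, 2 exponent bits, 1 mantissa bit.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

# Group-scaling parameters of the two microscaling formats discussed here.
FP4_FORMATS = {
    "MXFP4": {"group_size": 32, "scale_format": "E8M0 (power of two)"},
    "NVFP4": {"group_size": 16, "scale_format": "E4M3 (FP8)"},
}

def snap_to_e2m1(x: np.ndarray) -> np.ndarray:
    """Round each element to the nearest signed E2M1 value."""
    idx = np.argmin(np.abs(np.abs(x)[..., None] - E2M1_GRID), axis=-1)
    return np.sign(x) * E2M1_GRID[idx]
```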
1. Motivation and Background
The deployment of LLMs on recent GPU platforms has been accelerated by hardware-native support for 4-bit floating-point operations using FP4 microscaling formats. MXFP4 and NVFP4 aim to provide high throughput and reduced memory footprint, but in practice, standard quantization schemes (originally designed for integer or higher-precision floating-point formats) do not realize their theoretical performance and accuracy. Two principal challenges are identified: NVFP4's small group size (G = 16) neutralizes traditional outlier mitigation via block-diagonal transforms, while MXFP4's power-of-two scaled quantization (E8M0) incurs large rounding errors, increasing both the average and top-element mean-squared error (MSE). These limitations necessitate format-specialized PTQ algorithms that can exploit the architectural properties of FP4 hardware (Egiazarian et al., 27 Sep 2025).
2. Mathematical Principles and Algorithmic Workflow
MR-GPTQ employs the following mathematical core:
- Let $W$ be the weight matrix and $X$ the calibration activations.
- The algorithm partitions $W$ (and $X$) into contiguous blocks of $b$ columns, where $b$ is a power of two (commonly 16 or 32).
- Each block undergoes a block-wise Hadamard rotation $R = H_b / \sqrt{b}$, where $H_b$ is the $b \times b$ Hadamard matrix.
- The rotated weight submatrix in block $j$ is $\tilde{W}_j = W_j R$. During inference, activations are similarly rotated, $\tilde{x}_j = R^{\top} x_j$, so that $\tilde{W}_j \tilde{x}_j = W_j x_j$.
- Quantization follows via absmax scaling on each group of $G$ elements: $s = \max_i |\tilde{w}_i| / q_{\max}$ (with $q_{\max} = 6$ for E2M1); the values $\tilde{w}_i / s$ are quantized onto the E2M1 element grid, and $s$ is quantized onto the group-scale grid specific to NVFP4 (E4M3) or MXFP4 (E8M0).
- An alternating minimization optimizes the per-tensor (global) scale and the per-group scales for minimal quantization error $\lVert \tilde{W} - Q(\tilde{W}) \rVert_F^2$.
- GPTQ's second-order step refines each column by minimizing the output reconstruction error $\lVert X W^{\top} - X \widehat{W}^{\top} \rVert_F^2$, utilizing a block-diagonal inverse-Hessian shortcut.
Key loss functions include the block MSE in the rotated space, a relative MSE variant for numerical stability, and the overall output error measured on calibration activations. The rotation disperses outlier magnitudes, yielding an approximately Gaussian error profile and reducing both the average and top-element MSE (Egiazarian et al., 27 Sep 2025).
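A simplified NumPy sketch of the rotate-then-quantize step described above is given below. It is illustrative only: it assumes the Hadamard block size equals the group size, omits the alternating scale search and GPTQ's second-order column updates, and uses hypothetical function names.

```python
import numpy as np
from scipy.linalg import hadamard

# E2M1 helpers, as in the earlier sketch.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_MAX = 6.0

def snap_to_e2m1(x):
    """Round each element to the nearest signed E2M1 value."""
    idx = np.argmin(np.abs(np.abs(x)[..., None] - E2M1_GRID), axis=-1)
    return np.sign(x) * E2M1_GRID[idx]

def mr_quantize_weights(W: np.ndarray, fmt: str = "NVFP4") -> np.ndarray:
    """Block-wise Hadamard rotation followed by per-group absmax FP4
    quantization (sketch; the rotation block size is taken equal to G)."""
    G = 16 if fmt == "NVFP4" else 32
    d_out, d_in = W.shape
    assert d_in % G == 0
    R = hadamard(G) / np.sqrt(G)                 # orthonormal rotation, R @ R.T = I
    W_hat = np.empty((d_out, d_in))
    for j in range(0, d_in, G):
        Wj = W[:, j:j + G] @ R                   # rotated block  W_j R
        s = np.abs(Wj).max(axis=1, keepdims=True) / E2M1_MAX   # absmax group scale
        s = np.maximum(s, 1e-12)
        if fmt == "MXFP4":                       # E8M0 scale grid: powers of two only
            s = 2.0 ** np.ceil(np.log2(s))
        # (NVFP4's E4M3 rounding of s is omitted here for brevity.)
        W_hat[:, j:j + G] = snap_to_e2m1(Wj / s) * s
    # W_hat lives in the rotated space; at inference, activation blocks are
    # multiplied by R.T, so (W_j R)(R.T x_j) = W_j x_j up to quantization error.
    return W_hat
```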
3. FP4 Format-Specific Strategies
NVFP4 and MXFP4 require distinct quantization strategies:
- NVFP4 uses E4M3 scales and supports a global scale search (a per-tensor scale combined with per-group E4M3 scales), leveraging the finer scale granularity.
- MXFP4 enforces E8M0 (power-of-two) scales, meaning no global scale is available and only the per-group scales can be optimized, leading to greater quantization error that is partially alleviated by the Hadamard rotation.
- Outlier handling is minimal in NVFP4, as the small group size (G=16) renders absmax scaling nearly equivalent to clipping, precluding further benefit from block-diagonal transforms.
- In both cases, offline fusion of the inverse Hadamard rotation into the stored weights enables efficient inference (Egiazarian et al., 27 Sep 2025).
The following table summarizes salient features of these two FP4 formats under MR-GPTQ:
| Format | Scale Type | Group Size (G) | Rotation Strategy | Scale Optimization |
|---|---|---|---|---|
| NVFP4 | E4M3 | 16 | Block Hadamard | Global and per-group |
| MXFP4 | E8M0 (power-of-2) | 32 | Block Hadamard | Per-group only |
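The practical consequence of the scale-format difference can be illustrated by comparing how a group's absmax-derived scale is rounded onto each scale grid. The sketch below is an approximation under stated assumptions: it ignores E4M3 subnormals and saturation, and rounds E8M0 scales upward so that scaled elements never exceed the E2M1 range (implementations may differ).

```python
import numpy as np

def round_scale_e8m0(s: np.ndarray) -> np.ndarray:
    """MXFP4-style scale rounding: powers of two only (no mantissa bits).
    Rounded up so scaled elements stay within the E2M1 maximum."""
    return 2.0 ** np.ceil(np.log2(s))

def round_scale_e4m3(s: np.ndarray) -> np.ndarray:
    """NVFP4-style scale rounding with 3 mantissa bits (approximation that
    ignores E4M3 subnormals and its saturation range)."""
    e = np.floor(np.log2(s))
    m = np.round(s / 2.0 ** e * 8.0) / 8.0      # quantize the mantissa to 1/8 steps
    return m * 2.0 ** e

group_scales = np.array([0.07, 0.9, 1.3, 5.0])  # example absmax-derived scales
for name, rounder in [("E8M0", round_scale_e8m0), ("E4M3", round_scale_e4m3)]:
    rel_err = np.abs(rounder(group_scales) - group_scales) / group_scales
    print(f"{name} relative scale error: {np.round(rel_err, 3)}")
```

On examples like these, the power-of-two grid typically misses the true group scale by tens of percent, whereas the E4M3 grid stays within a few percent; this scale-rounding error is what the Hadamard rotation helps dampen in the MXFP4 case.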
4. GPU Kernel Implementation and Efficiency
MR-GPTQ integrates format-specific rotations and quantization directly into CUDA GEMM kernels using a lightweight epilogue based on CUTLASS. This approach, referred to as QuTLASS, executes block-diagonal Hadamard transforms, quantization, and 4-bit element packing as part of the GEMM epilogue, incurring an overhead of less than 5% relative to ideal 4-bit throughput. Because the Hadamard block size is small relative to the GEMM dimensions, these rotations remain memory-bound and introduce negligible computational overhead compared with the matrix-multiplication FLOPs. The supported block sizes depend on the GPU SM generation (SM100/SM120). By fusing the pre-rotation into the weights offline, only activations are transformed online, minimizing additional memory traffic (Egiazarian et al., 27 Sep 2025).
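The offline-fusion idea can be illustrated with a small example: because the block-diagonal Hadamard rotation is orthogonal, folding it into the weights ahead of time and rotating only the activations at run time leaves the GEMM result unchanged. This is a conceptual sketch with hypothetical dimensions, not the QuTLASS kernel code:

```python
import numpy as np
from scipy.linalg import hadamard

b, d_in, d_out = 16, 64, 8                      # small illustrative dimensions
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal(d_in)

R = hadamard(b) / np.sqrt(b)
R_full = np.kron(np.eye(d_in // b), R)          # block-diagonal rotation over column blocks

W_fused = W @ R_full        # computed once offline and stored (then quantized)
x_rotated = R_full.T @ x    # computed online, e.g. in the GEMM epilogue

# Orthogonality guarantees the product is unchanged (before quantization error).
assert np.allclose(W_fused @ x_rotated, W @ x)
```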
5. Empirical Results and Performance Analysis
The introduction of MR-GPTQ results in substantial speed and accuracy improvements across a range of LLM benchmarks:
Layer-wise Speedup (W4A4 vs. FP16):
- NVIDIA B200 (Blackwell): up to 3.6× (ideal: 4×)
- NVIDIA RTX 5090: up to 6× (ideal: 8×)
End-to-End Speedup in vLLM (Llama-3.3-70B-Instruct):
- Batch 1–16 on B200: 2.0×–2.2× vs BF16 baseline
- Batch 1–16 on RTX 5090: 3.5×–4.0× vs BF16 baseline
Accuracy on Llama-3.1-8B-Instruct (W4A4, zero-shot):
- FP16: avg. 78.93
- INT4 (GPTQ+HT): avg. 75.72 (95.9% recovery)
- NVFP4 (MR-GPTQ): avg. 75.84 (96.1% recovery)
- MXFP4 (MR-GPTQ): avg. 73.65 (93.3% recovery)
Recovery rates across model sizes:
- NVFP4+GPTQ: 94–99% of FP16 accuracy
- MXFP4+MR-GPTQ: 90–98%, closing the prior 8–10 point gap
These results empirically confirm that MR-GPTQ, via format-specialized pre-rotations and second-order updates, enables FP4 deployment with minimal loss in accuracy while achieving large-scale inference speedups (Egiazarian et al., 27 Sep 2025).
6. Context, Implications, and Trade-Offs
MR-GPTQ demonstrates how to navigate the accuracy–performance trade-offs of FP4 formats, unlocking the practical utility of microscaling quantization in LLMs. Its robust post-training quantization is contingent on format-specific optimizations: block-wise Hadamard rotations for error dispersion and tailored scale searches aligned with each FP4 format's group-scaling behavior. Empirical evidence shows that while FP4 is not an automatic upgrade over INT4, MR-GPTQ closes accuracy gaps previously considered intractable for power-of-two scaled formats like MXFP4.
A plausible implication is that the combination of classic second-order PTQ schemes and block-structured orthogonal transforms (Hadamard) constitutes a generally applicable strategy for quantization on low-precision floating-point hardware. However, the optimality of such techniques is tightly coupled to the scaling granularity and format characteristics present in hardware designs. The findings highlight the necessity of algorithm–hardware co-design in actualizing efficient, accurate LLM inference at ultralow precision (Egiazarian et al., 27 Sep 2025).