DuQuant++: Efficient 4-bit Quantization
- The paper introduces DuQuant++, a PTQ method that optimizes outlier suppression via a fine-grained, block-diagonal orthogonal rotation aligned with MXFP4 groups to enhance quantization accuracy.
- It reduces computational cost by applying a single 32×32 rotation per group, halving the rotation overhead compared to dual-rotation methods.
- Empirical evaluations on LLaMA-3 variants demonstrate state-of-the-art performance with lower perplexity and improved QA accuracy among 4-bit quantization approaches.
DuQuant++ is a post-training quantization (PTQ) method that enhances the efficiency and accuracy of neural network inference under the MXFP4 microscaling format, which is natively supported by NVIDIA Blackwell Tensor Cores for LLMs. By adapting outlier-aware, fine-grained rotation mechanisms to precisely match the microscaling group structure of MXFP4, DuQuant++ delivers state-of-the-art accuracy and a substantial reduction in online computational overhead for 4-bit weight/activation quantization (W4A4) (Lin et al., 20 Apr 2026).
1. Background: Microscaling Quantization and Outlier-Induced Error
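The MXFP4 microscaling format (FP4 E2M1 elements with a shared E8M0 scale) partitions weight or activation tensors into contiguous groups of $B = 32$ values (microscaling groups). Each group shares a single 8-bit scale with eight exponent bits and zero mantissa bits, i.e., a power of two. Given a tensor $X$, it is divided into blocks of $B$ elements ($B = 32$), and each element $x_i$ of a block is quantized as

$$\hat{x}_i = s \cdot \mathrm{clamp}\!\left(\mathrm{round}_{\mathrm{FP4}}\!\left(\frac{x_i}{s}\right),\ q_{\min},\ q_{\max}\right),$$

where $s$ is the block scale, and $q_{\min}$ and $q_{\max}$ span the FP4 dynamic range.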
The microscaling mechanism achieves high throughput and hardware efficiency but is highly susceptible to group-local outliers (single elements with unusually large magnitude), which inflate the shared scale $s$ and compress the quantization range for all non-outlier elements in the block. This effect is especially pronounced in LLM activations and weights, which routinely display both “normal” (broadly distributed) and “massive” (rare, extreme) outliers. Conventional rotation-based quantization stabilization methods (randomized Hadamard, learnable data-agnostic rotations) fail to target channels with concentrated outlier activity (Lin et al., 20 Apr 2026).
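The outlier effect described above can be illustrated with a minimal NumPy sketch of MXFP4 quantize-dequantize. The function name `mxfp4_fake_quant` is hypothetical, and the scale rule $s = 2^{\lfloor \log_2 \max|x| \rfloor - 2}$ is an assumption made for illustration; the source does not spell out how the scale is chosen.

```python
import numpy as np

# Non-negative magnitudes representable by FP4 E2M1.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_fake_quant(x, group=32):
    """Quantize-dequantize a 1-D array with one power-of-two scale per group."""
    x = np.asarray(x, dtype=np.float64).reshape(-1, group)
    amax = np.abs(x).max(axis=1, keepdims=True)
    amax = np.where(amax == 0, 1.0, amax)          # avoid log2(0) for all-zero groups
    # Assumed scale rule: shared exponent = floor(log2(amax)) - 2.
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = np.clip(np.abs(x) / scale, 0.0, FP4_GRID[-1])
    # Round each scaled magnitude to the nearest FP4 grid point.
    idx = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(x) * FP4_GRID[idx] * scale).reshape(-1)

# One group of 32 small values, with and without a single large outlier.
clean = np.linspace(-0.3, 0.3, 32)
with_outlier = clean.copy()
with_outlier[0] = 50.0

q_clean = mxfp4_fake_quant(clean)
q_out = mxfp4_fake_quant(with_outlier)

# Mean squared error on the 31 elements that are identical in both groups.
err_clean = np.mean((clean[1:] - q_clean[1:]) ** 2)
err_outlier = np.mean((with_outlier[1:] - q_out[1:]) ** 2)
```

In this toy group the single outlier pushes the shared scale up to 8, so every other element rounds to zero and `err_outlier` is far larger than `err_clean`.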
2. DuQuant++ Methodology: Outlier-Aware Rotation Aligned with Microscaling Groups
DuQuant++ extends the core technique of DuQuant, optimizing outlier suppression by aligning the rotation block size directly with the MXFP4 microscaling group ($B = 32$). The method proceeds as follows:
- Affine Rebalancing (SmoothQuant-style): Preceding rotation, an affine transformation rebalances the contributions of weights and activations using

$$Y = XW = (X\Lambda^{-1})(\Lambda W), \qquad \Lambda = \mathrm{diag}(\lambda_j), \quad \lambda_j = \frac{\max|X_{:,j}|^{\alpha}}{\max|W_{j,:}|^{1-\alpha}},$$

where $\alpha \in [0, 1]$ trades off outlier handling between activations and weights.
- Block-Diagonal Orthogonal Rotation: A single block-diagonal orthogonal rotation $R = \mathrm{BlockDiag}(R_1, \dots, R_K)$, with $R_k \in \mathbb{R}^{32 \times 32}$, is applied. Each rotation block is specifically optimized (greedily over Givens rotations) to minimize MXFP4 quantization error:

$$\min_{R}\ \sum_{j} \lVert z_j - \hat{z}_j \rVert_2^2,$$

where $z_j$ is the $j$-th column of the rotated tensor and $\hat{z}_j$ is its quantized reconstruction.
A key distinction from the original DuQuant is the removal of the second rotation and permutation: since each MXFP4 group is now scaled independently, the need to redistribute block-level variance using permutations is eliminated.
The composite quantized linear layer thus takes the form

$$Y = Q\!\left(X\Lambda^{-1}R\right)\, Q\!\left(R^{\top}\Lambda W\right),$$

with $X\Lambda^{-1}R$ and $R^{\top}\Lambda W$ fully quantized to MXFP4 (denoted $Q(\cdot)$), and only a single inference-time matrix multiply by $R$ required (as $R^{\top}$ is absorbed offline into the weights).
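Before quantization, the rebalancing and rotation leave the layer output unchanged, which a short NumPy check can confirm. This is a sketch under assumptions: the helper names are hypothetical, a random orthogonal matrix from a QR decomposition stands in for the optimized rotation, and the SmoothQuant-style scale formula with $\alpha = 0.5$ is used for illustration.

```python
import numpy as np

def smoothing_scales(X, W, alpha=0.5):
    """Per-input-channel scales: max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    act_max = np.abs(X).max(axis=0)   # over tokens, one value per input channel
    w_max = np.abs(W).max(axis=1)     # over output features, one value per input channel
    return act_max ** alpha / w_max ** (1 - alpha)

def block_diag_rotation(R_block, dim):
    """Tile one (b x b) orthogonal block along the diagonal of a (dim x dim) matrix."""
    b = R_block.shape[0]
    return np.kron(np.eye(dim // b), R_block)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64))          # (tokens, input channels)
X[:, 5] *= 40.0                       # inject an outlier channel
W = rng.normal(size=(64, 16))         # (input channels, output features)

lam = smoothing_scales(X, W)
R_block, _ = np.linalg.qr(rng.normal(size=(32, 32)))  # random orthogonal stand-in
R = block_diag_rotation(R_block, 64)

X_t = (X / lam) @ R                   # online side: X * Lambda^{-1} * R
W_t = R.T @ (lam[:, None] * W)        # offline side: R^T * Lambda * W
Y_ref, Y_t = X @ W, X_t @ W_t
```

Because $R R^{\top} = I$ and $\Lambda^{-1}\Lambda = I$, `Y_t` matches `Y_ref` up to floating-point error; only the subsequent quantization of `X_t` and `W_t` introduces approximation.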
3. Computational Efficiency and Rotation Cost Analysis
DuQuant++ substantially reduces the online cost of quantization stabilization in comparison to previous methods:
- Original DuQuant (integer quantization): Two $B \times B$ rotations plus a permutation per block: roughly $2 \cdot O(B^2)$ operations per block, plus the permutation.
- DuQuant++ (MXFP4): Single $32 \times 32$ rotation per block: $O(B^2)$ operations per block with $B = 32$.
Given $O(B^2)$ FLOPs for a $B \times B$ matrix-vector multiply, DuQuant++ approximately halves the constant factor in rotation cost and entirely eliminates the permutation overhead. Empirical analysis shows a 2× reduction in total rotation overhead for GPU kernels (Lin et al., 20 Apr 2026).
Efficient implementation is enabled by reusing the same $32 \times 32$ rotation block across all activation groups, allowing the matrix to reside in shared memory (1 KiB) and minimizing branching or kernel launch costs.
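The reuse of one rotation block across groups can be sketched in NumPy as a reshape followed by a batched matrix multiply, so the full $d \times d$ block-diagonal matrix is never materialized. The function names and the FLOP-counting convention (two FLOPs per multiply-add) are assumptions for illustration, not the paper's kernel.

```python
import numpy as np

def rotate_groups(X, R_block):
    """Apply the same (b x b) rotation to every contiguous group of b channels."""
    n, d = X.shape
    b = R_block.shape[0]
    return (X.reshape(n, d // b, b) @ R_block).reshape(n, d)

def rotation_flops_per_token(d, b=32, num_rotations=1):
    """(d / b) groups, each a (b x b) matrix-vector product at 2 FLOPs per multiply-add."""
    return num_rotations * (d // b) * 2 * b * b

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 128))
R_block, _ = np.linalg.qr(rng.normal(size=(32, 32)))  # random orthogonal stand-in

X_rot = rotate_groups(X, R_block)
single = rotation_flops_per_token(4096)                    # one rotation per group
double = rotation_flops_per_token(4096, num_rotations=2)   # two rotations per group
```

Under this counting, a second rotation per group doubles the per-token cost, which is the constant factor the single-rotation design removes.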
4. Implementation Details and Algorithmic Procedure
The method is deployed atop NVIDIA Blackwell Tensor Cores, which offer native support for MXFP4 GEMM operations and scaling. The quantization pipeline is:
- For each linear layer, compute the outlier-aware rotation $R$ using a greedy Givens rotation algorithm:
  - Initialize $R = I$.
  - For $t = 1$ to $T$:
    - Find the index pair $(i, j)$ maximizing the peak-magnitude criterion over all blocks/indices.
    - Apply a Givens rotation $G_t$ in plane $(i, j)$ to reduce peak magnitude.
    - Update $R \leftarrow R\,G_t$.
  - Return $R$.
- Fuse $R^{\top}$ offline into the quantized weights, so only activations undergo online transformation.
- The $32 \times 32$ rotation matrices are reused across groups to economize on memory and computation.
The design ensures that no additional CUDA kernel calls are needed at runtime for weights, and online rotation is localized to a single efficient kernel invocation (Lin et al., 20 Apr 2026).
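The greedy loop above can be sketched as follows. This is an illustrative approximation rather than the authors' implementation: the function name, the pairing rule (largest-peak channel with smallest-peak channel), the grid search over angles, and the use of peak magnitude as a proxy for MXFP4 error are all assumptions.

```python
import numpy as np

def greedy_givens_rotation(X, block=32, steps=64, n_angles=33):
    """Build one shared (block x block) rotation that lowers per-group peak magnitudes."""
    Z = np.asarray(X, dtype=np.float64).reshape(-1, block).copy()  # rows are groups
    R = np.eye(block)
    angles = np.linspace(-np.pi / 2, np.pi / 2, n_angles)
    for _ in range(steps):
        peaks = np.abs(Z).max(axis=0)              # peak magnitude per in-group position
        i, j = int(peaks.argmax()), int(peaks.argmin())
        if i == j:
            break
        best_peak, best_G = peaks[i], None
        for theta in angles:                       # assumed: coarse grid search over angles
            c, s = np.cos(theta), np.sin(theta)
            G = np.array([[c, -s], [s, c]])
            cand = np.abs(Z[:, [i, j]] @ G).max()
            if cand < best_peak:
                best_peak, best_G = cand, G
        if best_G is None:
            break                                  # no candidate angle lowers the peak
        Z[:, [i, j]] = Z[:, [i, j]] @ best_G       # rotate the data in plane (i, j)
        R[:, [i, j]] = R[:, [i, j]] @ best_G       # accumulate R <- R G
    return R

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 64))
X[:, 3] *= 30.0                                    # inject an outlier channel
R = greedy_givens_rotation(X)
peak_before = np.abs(X).max()
peak_after = np.abs(X.reshape(-1, 32) @ R).max()
```

Each accepted step is a product of orthogonal matrices, so `R` stays orthogonal, and the largest magnitude never increases because a step is kept only when it lowers the current peak.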
5. Experimental Evaluation and Comparative Performance
DuQuant++ was benchmarked on four LLaMA-3 variants, both pre-trained (8B, 3B) and instruction-tuned (8B, 8B.1), under MXFP4 W4A4 quantization. The evaluation employed 128 WikiText2 calibration sequences and assessed both language modeling perplexity (WikiText2, C4) and aggregate QA accuracy on seven tasks. In all tests, DuQuant++ (including its GPTQ-enhanced form, "DuQuant++*") achieved state-of-the-art performance among MXFP4 baselines:
| Method | WikiText2 ↓ | C4 ↓ | Avg QA ↑ |
|---|---|---|---|
| FP16 | 6.14 | 9.46 | 69.1 |
| QuaRot | 9.46 | 15.06 | 62.9 |
| MR-GPTQ | 7.29 | 11.41 | 66.1 |
| DuQuant++ | 7.07 | 11.14 | 66.5 |
| DuQuant++* | 6.88 | 11.06 | 67.1 |

| Method | WikiText2 ↓ | C4 ↓ | Avg QA ↑ |
|---|---|---|---|
| FP16 | 8.31 | 13.03 | 67.8 |
| FlatQuant | 9.25 | 15.27 | 64.9 |
| MR-GPTQ | 9.25 | 14.62 | 65.2 |
| DuQuant++ | 8.91 | 14.30 | 65.9 |
| DuQuant++* | 8.75 | 14.12 | 65.9 |
Figure 1 in the source publication demonstrates that DuQuant++ yields the lowest per-group ($B = 32$) quantization error across all transformer layers at key positions, outperforming both original activations and Hadamard-rotated activations. Histogram analysis confirms that the rotation $R$ consistently redistributes outlier mass and smooths long-tailed distributions.
6. Technical Significance, Limitations, and Future Directions
The alignment of rotation blocks ($B = 32$) with MXFP4 groups is critical for both error mitigation and computational efficiency. This design eliminates inter-block leakage and obviates global permutations. The single outlier-aware rotation suffices under MXFP4’s independent per-block scaling, differing fundamentally from the dual-rotation pipeline required for integer quantization with a global scale.
A core sensitivity of DuQuant++ is its strict alignment with hardware microscaling group size; non-matching block sizes would undermine both hardware compatibility and error control. Extension of the approach to other formats such as NVFP4 ($B = 16$, E4M3 scales) is plausible, as is integration into hybrid or future microscaling variants.
Current limitations include the unexplored option of per-block, data-dependent rotations (which would introduce additional memory overhead) and the lack of direct quantization-aware training for the rotation matrices $R$. Avenues for further work include jointly optimizing $R$ with fine-tuning and generalizing the method to novel microscaling quantization schemes (Lin et al., 20 Apr 2026).