DuQuant++: Efficient 4-bit Quantization

Updated 21 April 2026
  • The paper introduces DuQuant++, a PTQ method that optimizes outlier suppression via a fine-grained, block-diagonal orthogonal rotation aligned with MXFP4 groups to enhance quantization accuracy.
  • It reduces computational cost by applying a single 32×32 rotation per group, halving the rotation overhead compared to dual-rotation methods.
  • Empirical evaluations on LLaMA-3 variants demonstrate state-of-the-art performance with lower perplexity and improved QA accuracy among 4-bit quantization approaches.

DuQuant++ is a post-training quantization (PTQ) method that enhances the efficiency and accuracy of neural network inference under the MXFP4 microscaling format, which is natively supported by NVIDIA Blackwell Tensor Cores for LLMs. By adapting outlier-aware, fine-grained rotation mechanisms to precisely match the microscaling group structure of MXFP4, DuQuant++ delivers state-of-the-art accuracy and a substantial reduction in online computational overhead for 4-bit weight/activation quantization (W4A4) (Lin et al., 20 Apr 2026).

1. Background: Microscaling Quantization and Outlier-Induced Error

The MXFP4 format partitions weight or activation tensors into contiguous groups of $B = 32$ values (microscaling groups). Each group stores its elements in 4-bit floating point (FP4) and shares a single 8-bit, exponent-only ("E8M0") scale with no mantissa bits. Given $X\in\mathbb{R}^{m\times n}$, the tensor is divided into blocks $\{X_j\}_{j=1}^N$ ($N = mn/32$), and each element $x_i \in X_j$ is quantized as

$$Q(x_i) = \operatorname{nearest}\left(\left\lfloor\frac{x_i}{s_j}\right\rceil, q_{\min}, q_{\max}\right),$$

where $s_j = 2^{\left\lfloor\log_2(\max|X_j|)\right\rfloor - b}$ is the block scale, and $[q_{\min}, q_{\max}]$ span the FP4 dynamic range.

The microscaling mechanism achieves high throughput and hardware efficiency but is highly susceptible to group-local outliers (single elements with unusually large magnitude), which inflate $s_j$ and compress the quantization range for all non-outlier elements in the block. This effect is especially pronounced in LLM activations and weights, which routinely display both "normal" (broadly distributed) and "massive" (rare, extreme) outliers. Conventional rotation-based quantization stabilization methods (randomized Hadamard, learnable data-agnostic rotations) fail to target channels with concentrated outlier activity (Lin et al., 20 Apr 2026).
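The scale rule and the outlier effect can be illustrated with a minimal pure-Python sketch. It assumes FP4 elements in the E2M1 layout (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and an exponent offset $b = 2$. Both are assumptions made for illustration, since the text above does not fix them, and the helper names (`mx_scale`, `quantize_group`, `mse`) are hypothetical.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (assumed element format).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def mx_scale(group, b=2):
    """Power-of-two shared scale s_j = 2^(floor(log2(max|x|)) - b); b=2 is assumed."""
    amax = max(abs(v) for v in group)
    if amax == 0.0:
        return 1.0  # any scale works for an all-zero group
    return 2.0 ** (math.floor(math.log2(amax)) - b)


def quantize_group(group, b=2):
    """Quantize then dequantize one microscaling group onto the FP4 grid."""
    s = mx_scale(group, b)
    out = []
    for v in group:
        mag = min(abs(v) / s, FP4_GRID[-1])            # clamp to the FP4 range
        q = min(FP4_GRID, key=lambda g: abs(g - mag))  # nearest grid point
        out.append(math.copysign(q * s, v))
    return out


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# A group of 32 small values, and the same group with one massive outlier.
clean = [0.05 * (i % 8 - 4) for i in range(32)]
spiky = clean[:31] + [48.0]

print(mx_scale(clean))  # 0.03125
print(mx_scale(spiky))  # 8.0
print(mse(clean[:31], quantize_group(clean)[:31]))
print(mse(spiky[:31], quantize_group(spiky)[:31]))
```

In this toy group the single outlier raises the shared scale from $2^{-5}$ to $2^{3}$, so every other element falls below half of the smallest nonzero FP4 step and rounds to zero.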

2. DuQuant++ Methodology: Outlier-Aware Rotation Aligned with Microscaling Groups

DuQuant++ extends the core technique of DuQuant, optimizing outlier suppression by aligning the rotation block size directly with the MXFP4 microscaling group ($B = 32$). The method proceeds as follows:

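  1. Affine Rebalancing (SmoothQuant-style): Preceding rotation, an affine transformation rebalances the contributions of weights and activations using a per-channel smoothing scale, written here in the standard SmoothQuant form:

$$\Lambda_k = \frac{\max|X_{:,k}|^{\alpha}}{\max|W_{k,:}|^{1-\alpha}}, \qquad X \leftarrow X\Lambda^{-1}, \quad W \leftarrow \Lambda W,$$

where $\Lambda = \operatorname{diag}(\Lambda_k)$ and $\alpha$ trades off outlier handling between activations and weights.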
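The rebalancing step can be sketched as follows, assuming the standard SmoothQuant per-channel scale $\Lambda_k = \max|X_{:,k}|^{\alpha} / \max|W_{k,:}|^{1-\alpha}$ and nonzero channels. The function names are hypothetical and plain nested lists stand in for tensors.

```python
def matmul(a, b):
    """Dense product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smoothing_scales(x, w, alpha=0.5):
    """Per-input-channel scale: max|X[:,k]|^alpha / max|W[k,:]|^(1-alpha)."""
    act_max = [max(abs(v) for v in col) for col in zip(*x)]
    w_max = [max(abs(v) for v in row) for row in w]
    return [a ** alpha / m ** (1.0 - alpha) for a, m in zip(act_max, w_max)]


def smooth(x, w, alpha=0.5):
    """Return (X diag(lam)^-1, diag(lam) W); their product equals X W."""
    lam = smoothing_scales(x, w, alpha)
    x_s = [[v / l for v, l in zip(row, lam)] for row in x]
    w_s = [[v * l for v in row] for row, l in zip(w, lam)]
    return x_s, w_s


# Channel 1 of the activations carries an outlier (64.0).
x = [[1.0, 64.0], [-2.0, 32.0]]
w = [[0.5, -1.0], [0.25, 0.125]]
x_s, w_s = smooth(x, w, alpha=0.5)
print(x_s)  # activation outlier shrinks
print(w_s)  # the matching weight row grows by the same factor
print(matmul(x, w), matmul(x_s, w_s))
```

With $\alpha = 0.5$ the outlier channel ends up with the same peak magnitude on both sides, which is the sense in which $\alpha$ splits the difficulty between activations and weights.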

  2. Block-Diagonal Orthogonal Rotation: A single block-diagonal orthogonal rotation $R$, with blocks of size $32\times 32$, is applied. Each rotation block is specifically optimized (greedily over Givens rotations) to minimize MXFP4 quantization error:

$$\min_{R}\ \sum_{k} \left\lVert x_k - \hat{x}_k \right\rVert_2^2,$$

where $x_k$ is the $k$-th column and $\hat{x}_k$ is its quantized reconstruction.

A key distinction from the original DuQuant is the removal of the second rotation and permutation: since each MXFP4 group is scaled independently, there is no need to redistribute block-level variance using permutations.

The composite quantized linear layer thus takes the form

$$Y = Q\!\left(X\Lambda^{-1}R\right)\, Q\!\left(R^{\top}\Lambda W\right),$$

where $\Lambda$ is the diagonal smoothing matrix and $R$ the block-diagonal rotation, with $X\Lambda^{-1}R$ and $R^{\top}\Lambda W$ fully quantized to MXFP4, and only a single inference-time matrix multiply by $R$ required (as $R^{\top}\Lambda$ is absorbed offline into the weights).
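A small sketch can show why a block-diagonal rotation leaves the layer output unchanged before quantization and why it never needs to be materialized as a full matrix. It uses a block size of 4 and a normalized $4\times 4$ Hadamard matrix as a stand-in orthogonal block. These are illustrative assumptions, not the optimized $32\times 32$ rotation of the method, and the function names are hypothetical.

```python
def apply_blockwise(x, r):
    """Multiply row vector x by blockdiag(r, ..., r) one group at a time."""
    b = len(r)
    assert len(x) % b == 0, "vector length must be a multiple of the block size"
    out = []
    for start in range(0, len(x), b):
        g = x[start:start + b]
        out.extend(sum(g[i] * r[i][j] for i in range(b)) for j in range(b))
    return out


def rotate_weights(w, r):
    """Compute blockdiag(r, ..., r)^T @ W for W given as a list of rows."""
    cols = [apply_blockwise(list(col), r) for col in zip(*w)]
    return [list(row) for row in zip(*cols)]


def vecmat(x, w):
    """Row vector times matrix."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


# Normalized 4x4 Hadamard matrix: orthogonal, used here only as a placeholder.
R = [[0.5 * v for v in row] for row in
     [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]]

x = [8.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]        # two groups of 4
w = [[float((3 * i + j) % 5 - 2) for j in range(2)] for i in range(8)]

x_rot = apply_blockwise(x, R)      # online: rotate the activations
w_rot = rotate_weights(w, R)       # offline: fold R^T into the weights

print(x_rot)                       # [4.0, 4.0, 4.0, 4.0, 2.0, 0.0, 0.0, 0.0]
print(vecmat(x, w), vecmat(x_rot, w_rot))
```

The first group shows the intended effect (the outlier 8.0 is spread into four values of 4.0), while the second shows that a data-agnostic Hadamard block can also concentrate mass, which is one motivation for an outlier-aware rotation.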

3. Computational Efficiency and Rotation Cost Analysis

DuQuant++ substantially reduces the online cost of quantization stabilization in comparison to previous methods:

  • Original DuQuant (integer quantization): Two $B\times B$ rotations plus a permutation per block, for block size $B$: $2\cdot O(B^2)$.
  • DuQuant++ (MXFP4): Single $32\times 32$ rotation per block: $O(B^2)$.

Given $O(B^2)$ FLOPs for a $B\times B$ multiply, DuQuant++ approximately halves the constant factor in rotation cost and entirely eliminates the permutation overhead. Empirical analysis shows a 2× reduction in total rotation overhead for GPU kernels (Lin et al., 20 Apr 2026).

Efficient implementation is enabled by reusing the same $32\times 32$ rotation block across all activation groups, allowing the matrix to reside in shared memory (a few KiB: a $32\times 32$ matrix occupies 2 KiB in FP16 or 4 KiB in FP32) and minimizing branching or kernel launch costs.
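The cost argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes 2 FLOPs per multiply-accumulate, the same block size for both pipelines, and an illustrative hidden size of 4096. None of these figures come from the paper, and the function names are hypothetical.

```python
def rotation_flops(hidden, block=32, num_rotations=1):
    """FLOPs to rotate one token: (hidden/block) groups, 2*block^2 per group."""
    assert hidden % block == 0
    groups = hidden // block
    return num_rotations * groups * 2 * block * block


def rotation_bytes(block=32, bytes_per_elem=2):
    """Storage for one block x block rotation matrix."""
    return block * block * bytes_per_elem


single = rotation_flops(4096, num_rotations=1)   # one rotation per group
dual = rotation_flops(4096, num_rotations=2)     # two rotations per group
print(single, dual, dual / single)               # 262144 524288 2.0
print(rotation_bytes(bytes_per_elem=2))          # 2048 bytes in FP16
print(rotation_bytes(bytes_per_elem=4))          # 4096 bytes in FP32
```

Because the per-token cost is $2 \cdot \text{hidden} \cdot B$ FLOPs per rotation, it grows linearly in the hidden size, and dropping one of two rotations halves it regardless of model width.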

4. Implementation Details and Algorithmic Procedure

The method is deployed atop NVIDIA Blackwell Tensor Cores, which offer native support for MXFP4 GEMM operations and scaling. The quantization pipeline is:

  • For each linear layer, compute the outlier-aware rotation $R$ using a greedy Givens rotation algorithm:
    1. Initialize $R = I$.
    2. For $t = 1$ to $T$ (a fixed step budget):
      • Find the index pair $(i, j)$ with the largest outlier magnitude over all blocks/indices.
      • Apply a Givens rotation $G(i, j, \theta)$ in the plane $(i, j)$ to reduce the peak magnitude.
      • Update $R \leftarrow R\,G(i, j, \theta)$.
    3. Return $R$.
  • Fuse $R^{\top}$ offline into the quantized weights, so only activations undergo online transformation.
  • The same $32\times 32$ rotation matrix is reused across groups to economize on memory and computation.

The design ensures that no additional CUDA kernel calls are needed at runtime for weights, and online rotation is localized to a single efficient kernel invocation (Lin et al., 20 Apr 2026).
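The greedy Givens loop can be sketched in a few lines. This is a simplified illustration rather than the paper's exact procedure: the pairing rule (coordinate with the largest calibration peak paired with the coordinate with the smallest), the fixed 45° angle, and the stopping test are all assumptions, and `greedy_givens` is a hypothetical name.

```python
import math


def givens_apply(rows, i, j, theta):
    """Right-multiply every row by the Givens rotation G(i, j, theta), in place."""
    c, s = math.cos(theta), math.sin(theta)
    for r in rows:
        ri, rj = r[i], r[j]
        r[i] = c * ri - s * rj
        r[j] = s * ri + c * rj


def greedy_givens(samples, steps=64, theta=math.pi / 4):
    """Build an orthogonal R (x_rot = x @ R) that lowers the peak magnitude."""
    b = len(samples[0])
    x = [list(v) for v in samples]
    rot = [[1.0 if a == k else 0.0 for k in range(b)] for a in range(b)]
    for _ in range(steps):
        peaks = [max(abs(v[k]) for v in x) for k in range(b)]
        i = max(range(b), key=peaks.__getitem__)   # coordinate with the outlier
        j = min(range(b), key=peaks.__getitem__)   # quietest coordinate
        if i == j:
            break
        trial = [list(v) for v in x]
        givens_apply(trial, i, j, theta)
        pair_peak = max(max(abs(v[i]), abs(v[j])) for v in trial)
        if pair_peak >= peaks[i]:
            break                                   # rotation no longer helps
        x = trial
        givens_apply(rot, i, j, theta)              # accumulate R <- R G
    return rot, x


# Two calibration vectors for a toy group of size 4; the first has an outlier.
samples = [[16.0, 0.0, 0.0, 0.0], [0.0, 1.0, -1.0, 0.5]]
rot, rotated = greedy_givens(samples)
print(max(abs(t) for v in rotated for t in v))      # peak after rotation
```

On this toy input the peak drops from 16 to about 8, the value reached when the outlier's energy is spread evenly over four coordinates. The transpose of the returned matrix would be folded into the weights offline.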

5. Experimental Evaluation and Comparative Performance

DuQuant++ was benchmarked on four LLaMA-3-family variants, both pre-trained (LLaMA-3 8B, LLaMA-3.2 3B) and instruction-tuned (LLaMA-3 8B, LLaMA-3.1 8B), under MXFP4 W4A4 quantization. The evaluation employed 128 WikiText2 calibration sequences and assessed both language modeling perplexity (WikiText2, C4) and aggregate QA accuracy on seven tasks. In all tests, DuQuant++ (including its GPTQ-enhanced form, "DuQuant++*") achieved state-of-the-art performance among MXFP4 baselines:

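| Method | WikiText2 ↓ | C4 ↓ | Avg QA ↑ |
|---|---|---|---|
| FP16 | 6.14 | 9.46 | 69.1 |
| QuaRot | 9.46 | 15.06 | 62.9 |
| MR-GPTQ | 7.29 | 11.41 | 66.1 |
| DuQuant++ | 7.07 | 11.14 | 66.5 |
| DuQuant++* | 6.88 | 11.06 | 67.1 |

| Method | WikiText2 ↓ | C4 ↓ | Avg QA ↑ |
|---|---|---|---|
| FP16 | 8.31 | 13.03 | 67.8 |
| FlatQuant | 9.25 | 15.27 | 64.9 |
| MR-GPTQ | 9.25 | 14.62 | 65.2 |
| DuQuant++ | 8.91 | 14.30 | 65.9 |
| DuQuant++* | 8.75 | 14.12 | 65.9 |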

Figure 1 in the source publication demonstrates that DuQuant++ yields the lowest per-group quantization error across all transformer layers at key positions, outperforming both original activations and Hadamard-rotated activations. Histogram analysis confirms that the rotation consistently redistributes outlier mass and smooths long-tailed distributions.

6. Technical Significance, Limitations, and Future Directions

The alignment of rotation blocks ($32\times 32$) with MXFP4 groups is critical for both error mitigation and computational efficiency. This design eliminates inter-block leakage and obviates global permutations. The single outlier-aware rotation suffices under MXFP4’s independent per-block scaling, differing fundamentally from the dual-rotation pipeline required for integer quantization with a global scale.

A core sensitivity of DuQuant++ is its strict alignment with hardware microscaling group size; non-matching block sizes would undermine both hardware compatibility and error control. Extension of the approach to other formats such as NVFP4 ($B = 16$, E4M3) is plausible, as is integration into hybrid or future microscaling variants.

Current limitations include the absence of per-block, data-dependent rotations, which would introduce additional memory overhead, and the lack of direct quantization-aware training for the rotation matrices $R$. Avenues for further work include jointly optimizing $R$ with fine-tuning and generalizing the method to novel microscaling quantization schemas (Lin et al., 20 Apr 2026).
