MXFP4+ (MX+): Advanced 4-Bit Quantization

Updated 19 December 2025
  • MXFP4+ is a suite of 4-bit quantization enhancements that optimize neural network activations and weights using asymmetric scaling, precision boosting for outliers, and mixed-precision strategies.
  • It incorporates techniques like Q-EMA and Q-Ramping to stabilize training and reduce weight oscillations, leading to marked accuracy improvements in language and vision models.
  • Hardware implementations on NVIDIA and FPGA platforms validate MXFP4+ by delivering superior performance, memory efficiency, and competitive accuracy compared to higher precision methods.

MXFP4+, also referred to as MX+, is a collective term for a suite of enhancements, extensions, and variants of the 4-bit Microscaling Floating-Point (MXFP4) format. These techniques address the critical trade-offs inherent to quantizing activations and weights to ultra-low precision (4 bits) in modern neural networks, especially for LLMs and vision transformers. MXFP4+ encompasses asymmetric scaling (AMXFP4), precision boosting for outlier elements, oscillation-mitigating quantization strategies, and mixed-precision deployment for improved accuracy and throughput on hardware platforms supporting native MX instructions.

1. Fundamental Structure and Variants

The canonical MXFP4 format encodes each element in a tensor block (typically 32 elements per block) using a 4-bit floating-point representation—E2M1, with a shared 8-bit E8M0 block exponent. The MXFP4+ family introduces several key enhancements:

  • AMXFP4 (Asymmetric scaling): Each 32-element block is equipped with two 8-bit E5M2 shared scales, one for positive and one for negative values. Each element ([s:E:M]=[1:2:1]) is reconstructed according to its sign, exponent, mantissa, and corresponding scale. This layout mitigates high quantization error due to asymmetrically distributed outliers and improves statistical fit for groupwise values (Lee et al., 15 Nov 2024).
  • Outlier Repurposing (MX+): In MX+, the block maximum (outlier) element repurposes its exponent field as additional mantissa bits, effectively expanding its mantissa from 1 to 3 bits while the rest of the block remains at 1+2+1 bits. The index of the outlier is tracked per block (8 bits), and storage overhead is minimal (0.25 bits/element). This innovation significantly reduces the quantization error and loss due to poorly represented extreme values (Lee et al., 16 Oct 2025).
  • Oscillation-Reduced Training: Methods such as Q-EMA (Exponential Moving Average Quantizer) and Q-Ramping (Adaptive Ramping Optimizer) are introduced to stabilize the quantization process during training by damping weight oscillations near quantization thresholds, thus recovering lost convergence and preserving accuracy (Chen et al., 28 Feb 2025).
  • Mixed-Precision Assignment: MicroMix and similar frameworks partition channels or blocks across MXFP4, MXFP6, and MXFP8 formats, guided by quantization error thresholds. The chosen allocation achieves a favorable balance of memory, compute, and accuracy across heterogeneous model components (Liu et al., 4 Aug 2025).

The following table summarizes principal MXFP4+ representations and features:

| Variant | Key Feature(s) | Group Scale(s) | Outlier Handling | Storage (bits/element) |
|---|---|---|---|---|
| MXFP4 | Symmetric E2M1, 1× 8-bit E8M0 scale | 1 per block (absmax) | None | 4 |
| AMXFP4 | Asymmetric E2M1, 2× 8-bit E5M2 scales | 2 per block (± split) | None | 4.5 |
| MX+ | Extended outlier mantissa (3 bits) | 1 per block, plus BM index | Outlier mantissa boost | 4.25 |
| MXFP4+ (MicroMix style) | Mixed MXFP4/6/8 formats per channel | 1 per block/channel | Thresholded format assignment | variable |
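
As a structural illustration, the following Python sketch (field names are hypothetical, not taken from the cited papers) captures the block layouts summarized above: the canonical MXFP4 block with one shared E8M0 exponent, and the MX+ variant that additionally records the 8-bit index of the block-maximum element.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MXFP4Block:
    """Canonical MXFP4 block: 32 E2M1 codewords sharing one E8M0 power-of-two exponent."""
    codes: List[int]   # 32 x 4-bit elements, each [sign : 2-bit exponent : 1-bit mantissa]
    shared_exp: int    # 8-bit E8M0 block exponent (shared power-of-two scale)

@dataclass
class MXPlusBlock(MXFP4Block):
    """MX+ block: same element storage, but the block-maximum element reuses its exponent
    field as two extra mantissa bits; its position is kept in an 8-bit index, adding
    roughly 0.25 bits per element for a 32-element block."""
    bm_index: int      # index of the block-maximum (outlier) element within the block
```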

2. Quantization and Dequantization Schemes

All MXFP4+ variants operate by normalizing real-valued elements using groupwise shared scales and then rounding to the nearest representable codeword. For standard MXFP4 and MX+ (Lee et al., 16 Oct 2025, Samson et al., 1 Jul 2024):

  • Scale selection: $s = 2^{\lfloor \log_2 \max_{i \in \text{block}} |x_i| \rfloor - b}$ (with bias $b$).
  • Element encoding (non-outlier): $x_i = \left(1 + \frac{m_i}{2^{b_m}}\right) \times 2^{e_i - \text{bias}} \times s$.
  • Element encoding (outlier, MX+): $x_{BM} = m_{BM}^{\text{ext}} \times 2^{E}$, where $m_{BM}^{\text{ext}} = 1 + \frac{M_{BM}}{2} + \frac{X}{8}$ (1 mantissa bit plus 2 recycled exponent bits).
  • AMXFP4 encoding: Each element is rescaled by either $S^{+}$ or $S^{-}$ according to its sign, with groupwise exponent rounding and 2-bit mantissa selection for each scale (Lee et al., 15 Nov 2024).

Quantization error and rounding are controlled via tie-to-even or stochastic rounding, per the hardware target and training methodology.
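
As a concrete illustration of these formulas, the NumPy sketch below fake-quantizes one block under simple assumptions: the E2M1 magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}, a bias of $b = 2$ so that the block maximum lands in the top element binade, and plain round-to-nearest in place of tie-to-even or stochastic rounding. The MX+ refinement step mirrors the outlier formula above; it is a sketch, not the kernel implementation from the cited papers.

```python
import numpy as np

# Magnitudes representable by an E2M1 element (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(x, bias=2):
    """Quantize one block to MXFP4: a shared power-of-two scale plus E2M1 values."""
    amax = np.max(np.abs(x))
    if amax == 0.0:
        return np.zeros_like(x), 1.0
    # Scale selection: s = 2^(floor(log2 max|x_i|) - b), representable as an E8M0 exponent.
    scale = 2.0 ** (np.floor(np.log2(amax)) - bias)
    # Round each normalized magnitude to the nearest E2M1 codeword
    # (magnitudes beyond 6 saturate to the largest representable value).
    mags = np.abs(x) / scale
    idx = np.argmin(np.abs(mags[:, None] - E2M1_GRID[None, :]), axis=1)
    return np.sign(x) * E2M1_GRID[idx], scale

def dequantize_mxfp4_block(q, scale):
    """Reconstruct real values from the scale-normalized E2M1 values and the shared scale."""
    return q * scale

def mxplus_refine_outlier(x, q, scale, bias=2):
    """Illustrative MX+ step: re-encode only the block-maximum element with a 3-bit
    mantissa (1 + k/8, k = 0..7) in its own binade, i.e. x_BM ~ m_ext * 2^E."""
    bm = int(np.argmax(np.abs(x)))
    two_pow_E = scale * 2.0 ** bias                           # 2^E = 2^floor(log2 max|x_i|)
    k = np.clip(np.round((np.abs(x[bm]) / two_pow_E - 1.0) * 8.0), 0, 7)
    q = q.copy()
    q[bm] = np.sign(x[bm]) * (1.0 + k / 8.0) * 2.0 ** bias    # still in scale-normalized units
    return q, bm

# Example: a 32-element block with one large outlier.
block = np.random.randn(32).astype(np.float32)
block[7] *= 20.0
q, s = quantize_mxfp4_block(block)
q_plus, bm_index = mxplus_refine_outlier(block, q, s)
x_hat = dequantize_mxfp4_block(q_plus, s)
```

Dequantizing `q_plus` recovers the outlier with a 3-bit mantissa while the rest of the block keeps the standard E2M1 resolution.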

3. Outlier, Asymmetry, and Oscillation Handling

Quantizing to 4 bits accentuates the vulnerability to extreme activation/weight outliers:

  • Groupwise Outliers: In the canonical BFP/MXFP4 scheme, the block maximum (BM) dominates the shared scale. Smaller-magnitude elements are compressed into the lowest few representable values, amplifying quantization error.
  • Precision Recovery via Outlier Repurposing (MX+): The BM element encodes more mantissa bits by repurposing its exponent, rectifying the representational "bottleneck" for outlier elements with negligible impact on the block's overall bitwidth and computation (Lee et al., 16 Oct 2025).
  • Asymmetric Scaling (AMXFP4): Splitting positive and negative sub-blocks allows smaller maximal scales, reduces mean squared error for each sub-distribution, and achieves empirical gains in perplexity and accuracy over symmetric MXFP4 (Lee et al., 15 Nov 2024).
  • Oscillation Suppression: Training in low-bitwidth (FP4) regimes can induce persistent weight oscillation ("flip-flop" dynamics), where weights repeatedly cross static quantization thresholds and fail to converge. Methods such as Q-EMA (anchoring the quantized value to a slow-moving average) and Q-Ramping (batch and learning-rate scaling for high-oscillation elements) have been shown to lower the oscillation ratio $R_w$, substantially decreasing accuracy loss in vision transformer training (Chen et al., 28 Feb 2025); a minimal sketch of the damping idea follows this list.
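
As a toy illustration of the damping idea referenced above (a hedged sketch, not the exact Q-EMA or Q-Ramping algorithms of the cited paper), the snippet below quantizes a slow exponential moving average of the latent weights instead of the raw weights, and measures how many quantized values flip between steps as a simple proxy for the oscillation ratio $R_w$.

```python
import numpy as np

def qema_step(w_latent, w_ema, quantize_fn, decay=0.999):
    """One hypothetical Q-EMA-style step: anchor the quantized weights to a slow-moving
    average so that small updates near a quantization threshold do not flip them.
    quantize_fn is any fake-quantization map from an array to its quantized values,
    e.g. a blockwise MXFP4 quantize/dequantize pair."""
    w_ema = decay * w_ema + (1.0 - decay) * w_latent
    return quantize_fn(w_ema), w_ema

def oscillation_ratio(q_prev, q_curr):
    """Fraction of elements whose quantized value changed between consecutive steps,
    a simple proxy for the oscillation ratio R_w discussed above."""
    return float(np.mean(q_prev != q_curr))
```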

4. Hardware Implementations and Integration

MXFP4+ formats have been realized across multiple hardware and software regimes:

  • NVIDIA Blackwell and B200 Architectures: Native MXFP4 and its MX+ outlier extension are implemented with minimal overhead. BM element identification and extended mantissa datapath add <1% area and sub-1% latency relative to vanilla MXFP4, with a performance uplift due to the accuracy benefits (Lee et al., 16 Oct 2025).
  • FPGA Designs: Open Compute Project MX standards, extended to MXFP4+ (MX⁺), are supported in open-source FPGA IP. Quantization and compute pipelines leverage pipelined Dot and DotGeneral blocks with MX-aware arithmetic, achieving a demonstrable resource/accuracy trade-off on tasks such as ResNet-18/ImageNet (Samson et al., 1 Jul 2024).
  • MicroMix Mixed-Precision GEMM: All MXFP4+ variants are compatible with mixed-precision GEMM kernels, which split tensor channels across MXFP4/6/8 formats as dictated by quantization error thresholds (see the assignment sketch after this list). These kernels issue fused MMA instructions for each group, accumulate in BF16 or FP32, and outperform FP8-based GEMM baselines in both throughput and memory pressure (Liu et al., 4 Aug 2025).
  • Software Support: CUDA, CUTLASS, Triton, PyTorch (Brevitas) host quantization and dequantization kernels designed to handle MXFP4+ packing, extended mantissa logic, BM indexing, and scale computation.
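
To make the threshold-guided assignment concrete, here is a schematic Python sketch (the format names, error metric, and thresholds are illustrative assumptions, not the exact MicroMix criterion): each channel keeps the cheapest format whose calibration-time quantization error stays below a per-format threshold.

```python
import numpy as np

def assign_channel_formats(calib, quantizers, thresholds):
    """
    Schematic MicroMix-style format assignment.

    calib:      (channels, n) array of calibration activations
    quantizers: dict mapping a format name to a quantize-dequantize callable
    thresholds: dict mapping a format name to the tolerated relative MSE
    """
    assignment = []
    for ch in calib:
        chosen = "mxfp8"                              # widest format as the fallback
        for fmt in ("mxfp4", "mxfp6", "mxfp8"):       # cheapest format first
            x_hat = quantizers[fmt](ch)
            rel_mse = np.mean((ch - x_hat) ** 2) / (np.mean(ch ** 2) + 1e-12)
            if rel_mse <= thresholds[fmt]:
                chosen = fmt
                break
        assignment.append(chosen)
    return assignment
```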

5. Empirical Impact and Performance

MXFP4+ techniques consistently demonstrate substantial improvements in both accuracy and efficiency:

  • LLM Inference: MX+ (outlier repurposing) improves perplexity on LLaMA-3.1-8B from 27.7 (MXFP4) to 9.5 (MX+), and increases zero-shot accuracy for OPT-66B from 35% (MXFP4) to 62% (MX+), matching or exceeding MXFP6 at lower bitwidth (Lee et al., 16 Oct 2025).
  • Activation Outlier Scenarios: AMXFP4 achieves 3–10 point accuracy gains on tasks such as VQA-T, DocVQA, and CSQA, and 3% absolute improvement on MMLU benchmarks compared to symmetric MXFP4 (Lee et al., 15 Nov 2024).
  • Vision Transformer Training: TetraJet with Q-EMA and Q-Ramping roughly halves the MXFP4-induced accuracy drop (improving Top-1 on DeiT-Small from 71.03% to 72.25%, more than a 50% reduction of the accuracy gap to full precision) (Chen et al., 28 Feb 2025).
  • Resource Efficiency: On FPGA, MXFP4+ achieves a Pareto frontier balance: ≈2.6 pp off FP32 for ResNet-18 (QAT), with a 20% LUT area reduction versus FP6 (Samson et al., 1 Jul 2024).
  • Mixed-Precision Deployment: MicroMix (MXFP4+) achieves >95% of FP16 accuracy in zero/few-shot LLM tasks and yields 8–46% kernel-level speedup over TensorRT-FP8, with ~20% memory savings (Liu et al., 4 Aug 2025).

6. Limitations and Open Research Questions

Several open problems remain in the deployment and further optimization of MXFP4+:

  • Dynamic Group Sizing: All MXFP4+ variants fix group size at 32; the impact of smaller or adaptive groupings, particularly for skewed modalities (e.g. speech), remains unexplored (Lee et al., 15 Nov 2024).
  • Hardware Adoption: Asymmetric scale and outlier handling require minor, though non-zero, hardware modifications; area and energy impact are small but are not always negligible (Lee et al., 15 Nov 2024, Lee et al., 16 Oct 2025).
  • Format Specialization and Rotation: While MR-GPTQ and blockwise Hadamard rotation can bridge accuracy gaps, their combined effectiveness is format- and group-size-dependent (Egiazarian et al., 27 Sep 2025).
  • Unexplored Formats: Extensions to even lower precision (e.g., FP3) or alternative mantissa/exponent partitionings may yield further area/accuracy trade-offs but currently lack empirical evidence (Lee et al., 15 Nov 2024).
  • Task and Domain Generality: Most results are presented for language and vision domains; highly skewed or adversarial data distributions may expose further weaknesses or demand additional enhancements.

7. Summary and Outlook

MXFP4+ (MX+) formats collectively represent the state-of-the-art in 4-bit microscaling quantization for neural network deployments, combining custom element layouts (asymmetric, extended mantissa), quantization-aware training, outlier compensation, and hardware-software co-design. By optimally handling the representational limitations of block floating-point in the presence of outliers, MXFP4+ recovers accuracy losses previously considered inherent to FP4, while maximizing performance on emerging hardware such as NVIDIA Blackwell, FPGAs, and custom ASICs. Ongoing research continues to refine block-wise algorithms and hardware micro-architectures, and to broaden empirical validation across new data regimes (Chen et al., 28 Feb 2025, Lee et al., 15 Nov 2024, Lee et al., 16 Oct 2025, Liu et al., 4 Aug 2025, Samson et al., 1 Jul 2024, Egiazarian et al., 27 Sep 2025).
