
MXFP4: Efficient 4-bit Data Format

Updated 30 July 2025
  • MXFP4 is a low-precision data format based on 4-bit FP4 values with shared block-level scaling, enabling efficient hardware implementation.
  • It drastically reduces memory and compute demands for scientific computing and deep neural network inference while preserving model fidelity.
  • Advanced techniques like stochastic rounding and comparator trees in MXFP4 pave the way for adaptive quantization innovations and robust training.

MXFP4 is a low-precision data format in the “microscaling” (MX) family, enabling efficient hardware implementation and fine-grained quantization for both scientific computing (e.g., electronic-structure data management) and deep neural network applications. Its core principle is to encode blocks of data as 4-bit floating-point (FP4) values that share a block-level scale, thereby maximizing memory efficiency without drastically impairing numerical range or model fidelity.

1. Structural Definition and Mathematical Foundations

MXFP4 adopts an E2M1 numerical format: each element is a 4-bit FP4 value (2 exponent bits, 1 mantissa bit, 1 sign bit) accompanied by a block-shared scaling factor. The canonical block size is $k = 32$. The shared scale $S$ per block is typically restricted to a power-of-two representation (“E8M0,” i.e., only an 8-bit exponent, no mantissa), so $S \in \{2^n : n \in \mathbb{Z},\, n < 2^8\}$. The block’s quantized values $\mathbf{X}_q$ are mapped to real values as

$\mathbf{X} = S \cdot (\mathbf{X}_q - Z)$

where $Z = 0$ in standard MXFP4, and all quantized values within a block share the same $S$ and $Z$ (Samson et al., 1 Jul 2024, Lee et al., 15 Nov 2024).

The quantization process proceeds by:

  1. Determining $S$ for each block:
    • $S$ is set such that, after $S^{-1}$ scaling, the block’s maximal magnitude aligns with the representable FP4 max (per the E2M1 spec).
  2. Rescaling and quantizing:
    • Each element $a_i$ in a block is converted via $a'_i = \text{Quantize}(a_i / S)$, using “round-to-nearest-even” or stochastic rounding (Tseng et al., 27 Feb 2025).
  3. Storage:
    • Each $k$-element block requires 4 bits per FP4 value plus a single 8-bit scale.

This strategy allows a reasonably wide dynamic range (bounded only by the scaling factor and the limited FP4 dynamic range) while keeping both storage and compute requirements low.
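
To make the procedure concrete, below is a minimal NumPy sketch of blockwise E2M1 quantization with a shared power-of-two scale. It assumes a block size of $k = 32$ and the E2M1 magnitude grid $\{0, 0.5, 1, 1.5, 2, 3, 4, 6\}$; the function names and the tie handling of the rounding step are illustrative rather than taken from any particular implementation.

```python
# Minimal sketch of MXFP4-style block quantization (E2M1 elements, shared
# power-of-two scale). Assumes NumPy, block size k = 32, and the E2M1 magnitude
# grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}; names and tie handling are illustrative.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
BLOCK = 32

def quantize_block(a):
    """Quantize one block of BLOCK floats to FP4 grid values plus a shared scale."""
    amax = np.max(np.abs(a))
    # Power-of-two scale aligning the block maximum with the FP4 max magnitude (6.0).
    exp = int(np.floor(np.log2(amax))) if amax > 0 else 0
    scale = 2.0 ** (exp - 2)                      # 2 is the largest E2M1 exponent
    scaled = a / scale
    # Nearest point on the magnitude grid; values beyond 6 saturate to 6.
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]), axis=1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scale

def dequantize_block(q, scale):
    return q * scale

a = np.random.randn(BLOCK).astype(np.float32)
q, s = quantize_block(a)
print("max abs error:", np.max(np.abs(a - dequantize_block(q, s))))
```

At $k = 32$, each block costs $32 \times 4$ element bits plus one 8-bit scale, i.e., 4.25 bits per value on average.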

2. Hardware Implementation and Algorithmic Operations

The MXFP4 arithmetic is tailored for efficient FPGA and deep learning accelerator implementation (Samson et al., 1 Jul 2024):

  • Dot Product: For two blocks $A, B$ and their respective scales $s, t$, the dot product is given by

$\text{Dot}(A, B, s, t) = (s\,t) \sum_{i=1}^{k} A_i B_i$

The scaling makes multiplication within the block effectively integer arithmetic, and the final scale multiplication—being a power-of-two—can be implemented via fast shift operations.

  • Conversion Circuits: To support conversion between FP32/BF16 tensors and MXFP4, blocks are normalized by their shared scale, quantized to FP4 with prescribed rounding, and optionally compressed.
  • Comparator Trees: When determining the optimal scale $S$ for a block, a pipelined comparator tree (depth $\lceil \log_2 k \rceil$) is used to find the largest exponent for all $k$ elements efficiently.

The MXFP4 design thus minimizes resource utilization on FPGAs and provides deterministic mapping between real-valued and quantized domains. Area and power requirements are further reduced since scaling can be performed with bitwise shifts.
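
As a sketch of the block dot product above, the snippet below accumulates the FP4-grid element products at wider precision and applies the two shared power-of-two scales once at the end, where adding the scale exponents stands in for the hardware shift. The names are illustrative, and a compact quantizer repeating the Section 1 sketch is included so the example is self-contained.

```python
# Sketch of the MXFP4 block dot product. Element products from the FP4 grid are
# accumulated at wider precision; the two shared power-of-two scales are applied
# once at the end, so adding their exponents corresponds to a single shift in
# hardware rather than two general multiplies. Names are illustrative.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(a):
    """Compact repeat of the Section 1 sketch: FP4 grid values plus a power-of-two scale."""
    exp = int(np.floor(np.log2(np.max(np.abs(a))))) if np.any(a) else 0
    scale = 2.0 ** (exp - 2)
    s = a / scale
    q = np.sign(s) * FP4_GRID[np.argmin(np.abs(np.abs(s)[:, None] - FP4_GRID[None, :]), axis=1)]
    return q, scale

def mx_block_dot(qa, scale_a, qb, scale_b):
    acc = np.dot(qa, qb)                                    # FP4-grid products, wide accumulation
    shift = int(np.log2(scale_a)) + int(np.log2(scale_b))   # exact, since both scales are powers of two
    return acc * (2.0 ** shift)                             # one exponent add / shift at the end

a, b = np.random.randn(32), np.random.randn(32)
qa, sa = quantize_block(a)
qb, sb = quantize_block(b)
print("exact:", float(np.dot(a, b)), "mxfp4:", float(mx_block_dot(qa, sa, qb, sb)))
```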

3. Quantization in Neural Network Inference and Training

3.1 Inference

MXFP4 is employed in neural network inference to enable blockwise quantization of both weights and activations, yielding drastic memory and bandwidth reductions (Lee et al., 15 Nov 2024, Jang et al., 2 Jan 2025):

  • Block granularity (e.g., $k = 32$) allows outlier suppression; individual outlier values do not dictate the scaling of an entire tensor or channel.
  • When compared to alternatives (per-channel or per-tensor quantization), MXFP4 delivers lower FPGA area costs and flexible hardware–software co-design.
  • Quantization-aware training (QAT) can recover accuracy lost due to aggressive quantization—especially critical for smaller FP4 representations.

Empirical results show that:

  • MXFP4, although outperforming naive per-tensor quantization, suffers from performance degradation in the presence of group-wise data asymmetry or severe activation outliers (Lee et al., 15 Nov 2024).
  • Advanced alternatives, such as AMXFP4 and DialectFP4, directly address these issues using asymmetric shared scales or per-block formatbooks, further improving inference robustness and accuracy (Jang et al., 2 Jan 2025).
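
A toy experiment (not drawn from the cited papers) makes the block-granularity point above concrete: a single large activation forces a per-tensor scale that crushes all small entries, whereas per-block scaling confines the damage to the outlier's own block.

```python
# Toy illustration of why block granularity helps with outliers: one injected
# large activation forces a per-tensor scale so coarse that ordinary values
# collapse to zero, whereas per-block (k = 32) scaling limits the damage to the
# outlier's own block. Values and names are illustrative only.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_roundtrip(v, scale):
    """Quantize to the E2M1 grid under a given power-of-two scale, then dequantize."""
    s = v / scale
    mags = FP4_GRID[np.argmin(np.abs(np.abs(s)[:, None] - FP4_GRID[None, :]), axis=1)]
    return np.sign(s) * mags * scale

def pot_scale(v):
    """Power-of-two scale aligning the max magnitude with the FP4 maximum (as in Section 1)."""
    amax = np.max(np.abs(v))
    return 2.0 ** (int(np.floor(np.log2(amax))) - 2) if amax > 0 else 1.0

x = np.random.randn(1024).astype(np.float32)
x[7] = 500.0                                      # a single injected activation outlier

per_tensor = fp4_roundtrip(x, pot_scale(x))       # one scale for the whole tensor
per_block = np.concatenate([fp4_roundtrip(blk, pot_scale(blk))
                            for blk in x.reshape(-1, 32)])

mse = lambda y: float(np.mean((x - y) ** 2))
print("per-tensor MSE:", mse(per_tensor), "per-block MSE:", mse(per_block))
```

Running this typically shows a much lower reconstruction error for the per-block variant, since only one of the 32 blocks inherits the coarse scale.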

3.2 Training

Direct use of MXFP4 for DNN training leads to convergence degradation unless sophisticated stochastic algorithms are employed (Tseng et al., 27 Feb 2025, Chen et al., 28 Feb 2025):

  • Stochastic Rounding (SR): Applying unbiased stochastic rounding ensures that, in expectation, the quantized value is equal to the original pre-quantized value—mitigating bias in weight and gradient updates.
  • Random Hadamard Transform (RHT): To curb variance magnification due to block-level outliers, the RHT disperses energy evenly across the block before quantization, making the quantization error's variance depend only logarithmically on block size rather than on any single large entry.
  • Oscillation Mitigation: Techniques such as EMA Quantizer (Q-EMA) and Q-Ramping reduce parameter oscillations near quantization thresholds by quantizing using an EMA of the weights or by adaptively increasing the batch size for frequently oscillating weights (Chen et al., 28 Feb 2025).

When these strategies are combined, models (e.g., GPT with up to 6.7B parameters) trained using MXFP4 can approach the quality of mixed-precision BF16 runs.
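
The unbiasedness of stochastic rounding can be seen in a small sketch: each value is rounded to one of its two neighbouring E2M1 grid points with probability proportional to proximity, so the expected rounded value equals the input. This illustrates the rounding rule in isolation (no block scaling, no Hadamard transform) and is not the training kernel of Tseng et al.

```python
# Conceptual sketch of unbiased stochastic rounding onto the E2M1 grid: each
# value is rounded to one of its two neighbouring grid points with probability
# proportional to proximity, so E[rounded value] equals the input.
import numpy as np

FP4_GRID = np.array([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                     0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round_fp4(x, rng):
    x = np.clip(x, FP4_GRID[0], FP4_GRID[-1])
    hi_idx = np.clip(np.searchsorted(FP4_GRID, x), 1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
    p_up = np.where(hi > lo, (x - lo) / (hi - lo), 0.0)   # P(round up) grows with distance from lo
    return np.where(rng.random(x.shape) < p_up, hi, lo)

rng = np.random.default_rng(0)
x = np.full(100_000, 0.7)                                  # lies between grid points 0.5 and 1.0
print("mean of rounded values:", stochastic_round_fp4(x, rng).mean())  # close to 0.7
```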

4. Comparative Developments: Variants and Alternatives

MXFP4 serves as a baseline for several recent innovations:

  • AMXFP4 (Lee et al., 15 Nov 2024): Introduces asymmetric scaling (distinct positive/negative scales per block) to reduce quantization errors for skewed activation distributions encountered in LLM inference; this method delivers improved perplexity and accuracy compared to symmetric MXFP4.
  • DialectFP4 in BlockDialect (Jang et al., 2 Jan 2025): Employs a “formatbook” allowing each block to select an optimal format among multiple FP4 dialects, thus improving representation for irregular data distributions. This mixed-format system, when combined with per-block fast dialect selection logic, achieves lower perplexity and higher zero-shot accuracy, while retaining the low hardware footprint of integer MAC units.
  • Comparison Table

| Format | Block Scaling | Data Symmetry | Adaptivity | Accuracy (LLM) |
| --- | --- | --- | --- | --- |
| MXFP4 | Power-of-two, per block | Symmetric | Fixed E2M1 | Moderate; drops with outliers |
| AMXFP4 | Asymmetric, per block | Asymmetric | FP8 scaling | +3% vs MXFP4 on VQA |
| DialectFP4 | Per-block, formatbook | Variable | Fine-grained | Up to +11.4% vs MXFP4 |

These alternatives generally deliver accuracy improvements and better resilience to value distribution irregularities, with hardware cost increasing only modestly compared to the base MXFP4 scheme.
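
The dual-scale idea behind AMXFP4 can be sketched as follows: positive and negative values within a block receive separate shared scales, so a skewed distribution does not waste representable range on the side it barely uses. This is an illustration of the concept only; the published format (FP8 shared scales and its exact rounding rules) differs in detail.

```python
# Rough sketch (illustration only) of the dual-scale idea behind asymmetric block
# formats such as AMXFP4: positive and negative values in a block get separate
# shared scales, so a skewed distribution keeps resolution on both sides.
import numpy as np

FP4_MAGS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_grid(m):
    return FP4_MAGS[np.argmin(np.abs(m[:, None] - FP4_MAGS[None, :]), axis=1)]

def asymmetric_roundtrip(block):
    pos, neg = block.clip(min=0.0), (-block).clip(min=0.0)
    s_pos = max(float(pos.max()), 1e-12) / 6.0   # separate real-valued scale per sign
    s_neg = max(float(neg.max()), 1e-12) / 6.0   # (the real AMXFP4 stores FP8 shared scales)
    return round_to_grid(pos / s_pos) * s_pos - round_to_grid(neg / s_neg) * s_neg

block = np.abs(np.random.randn(32)) * 3.0        # block skewed toward positive values
block[:4] *= -0.05                               # a few small negatives
print("asymmetric MSE:", float(np.mean((block - asymmetric_roundtrip(block)) ** 2)))
```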

5. Scientific Data and Standardization Context

In computational materials science, MXFP4 may be integrated with hierarchical metadata standards such as the NOMAD Meta Info and ESCDF frameworks (Ghiringhelli et al., 2016):

  • Data are structured using sections (e.g., section_run, section_method) and standardized, SI-based units, ensuring interoperability across diverse simulation codes.
  • The MXFP4 format could encapsulate scalar field data or restart information, annotated with code-independent metadata to allow for searchability and cross-code data exchange.
  • Its block-structured compactness and capacity for hierarchical storage make it compatible with formats like HDF5 and JSON, which underpin ESCDF.

A plausible implication is that the modular design of MXFP4 supports its straightforward integration into established materials science data infrastructures, promoting unified data exchange across computational domains.
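
As a purely speculative sketch of such an integration, the snippet below stores packed FP4 codes and per-block E8M0 scales in an HDF5 file with code-independent metadata attached as attributes; every group, dataset, and attribute name here is illustrative and not part of any existing NOMAD or ESCDF schema.

```python
# Speculative sketch of how MXFP4-packed data might sit inside an HDF5 layout of
# the kind used by ESCDF/NOMAD-style infrastructures: 4-bit codes packed two per
# byte, E8M0 block scales stored alongside, and metadata attached as attributes.
# All group, dataset, and attribute names are illustrative.
import h5py
import numpy as np

codes = np.random.randint(0, 16, size=1024, dtype=np.uint8)          # 4-bit FP4 codes
packed = (codes[0::2] << 4) | codes[1::2]                            # two codes per byte
scales = np.random.randint(0, 256, size=1024 // 32, dtype=np.uint8)  # one E8M0 scale per block

with h5py.File("scalar_field_mxfp4.h5", "w") as f:
    g = f.create_group("scalar_field")
    g.create_dataset("fp4_codes_packed", data=packed)
    g.create_dataset("block_scales_e8m0", data=scales)
    g.attrs["element_format"] = "E2M1"
    g.attrs["block_size"] = 32
    g.attrs["units"] = "eV"          # SI-derived unit annotation, purely illustrative
```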

6. Current Limitations and Future Directions

While MXFP4 provides a generic, hardware-friendly quantization baseline, it is subject to several technical challenges:

  • Asymmetry and Outliers: Symmetric, block-based scaling may not sufficiently address highly asymmetric or outlier-rich data distributions. Emerging strategies such as AMXFP4's dual-scale quantization and DialectFP4's formatbook approach are currently favored in LLM inference for their empirically superior performance.
  • Training Stability: Naive MXFP4 quantization induces training instability due to weight oscillations; integrating stochastic rounding, double quantization, and oscillation-reduction heuristics is essential for achieving parity with higher-precision methods.
  • Deployment in Scientific Codes: For applications in computational materials science, aligning MXFP4’s semantics (block grouping, scaling conventions, data typing) with community standards ensures successful integration and data FAIRness.

Ongoing research targets broader adaptivity (e.g., real-time quantization parameter selection, mixed-format and dynamically adaptive quantization schemes), improved oscillation suppression techniques, and routine support for MXFP4 in both machine learning and scientific computing software and hardware ecosystems.

7. Impact and Prospects

MXFP4 typifies the convergence between data format efficiency, hardware specialization, and algorithmic flexibility. By compressing numerical representations to 4 bits with blockwise power-of-two scaling, MXFP4 achieves significant reductions in storage and compute demands. Its evolving role, now as a baseline for asymmetric, adaptive, and mixed-format quantization schemes, reflects the field’s trajectory toward ever more efficient and robust low-bit computation. The proliferation of open-source tools (e.g., PyTorch/Brevitas integration, accelerator kernels) and its adoption in both next-generation neural systems and scientific data infrastructures underline its continued relevance and adaptability across application domains.

In summary, MXFP4 is an enabler of low-precision, high-throughput computation in both machine learning and scientific data contexts. Its modular design, hardware-software alignment, and role as a foundation for ongoing quantization method innovation position it as a critical component in the progression toward efficient, scalable AI and scientific computing.