Microscaling Floating-Point Formats

Updated 1 July 2026

Microscaling floating-point formats are block-scaled representations that share a single exponent across groups of tensor elements, reducing storage and compute bandwidth.
They partition tensors into blocks with a common scaling factor, balancing precision and quantization error through adaptable block sizes and specialized element formats.
Hardware implementations and post-training quantization benefit from advanced techniques like outlier-aware rotations and scale selection, enabling efficient AI inference with minimal accuracy loss.

Microscaling floating-point formats (often abbreviated as MX formats) are block-scaled data representations that share a single floating-point scale among small groups of tensor elements, enabling substantial reductions in storage and compute bandwidth while preserving a large effective dynamic range. The key insight of the microscaling paradigm is to amortize expensive exponent representation over a block, rather than dedicating an exponent to each tensor element as in IEEE floating-point, or using strictly global scales as in quantized integer arithmetic. MX formats have become widely supported in AI hardware, with reference implementations and native acceleration for 4- to 8-bit variants such as MXFP4 and MXFP8 on modern NVIDIA Blackwell Tensor Cores and RISC-V architectures.

1. Microscaling Format Structure and Encoding

Microscaling floating-point formats partition tensors into blocks of $B$ elements (typically $B=16$ or $32$), with each block sharing a single scaling factor, usually encoded as an 8-bit floating-point value (E8M0: 8 exponent bits, no mantissa bits). Each element within the block is assigned a narrow sub-word floating-point representation (e.g., S1E2M1 for MXFP4: 1 sign, 2 exponent, 1 mantissa bit). The decoded value for element $i$ in block $j$ is

$\hat{x}_i = s_j \cdot m_i$

where $s_j$ is the shared scale, and $m_i$ is the decoded mini-float for element $i$ (Vasilev, 8 Jun 2026).

Block-scale calculation is performed by selecting $s_j$ so that the largest-magnitude element in the block fits the representable dynamic range:

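$s_j = 2^{\lfloor \log_2 \max_{i \in \text{block } j} |x_i| \rfloor - b}$

where $x_i$ are the unquantized block elements and $b$ is the exponent bias (fixed by the block-scale representation) (Lin et al., 20 Apr 2026).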

Within the block, quantization proceeds by normalizing each value by $s_j$, rounding and clipping to fit the mantissa/exponent grid, and storing the resulting code. For element $i$ with original value $x_i$,

$m_i = \mathrm{clip}\left(\mathrm{round}\left(x_i / s_j\right),\, -q_{\max},\, q_{\max}\right)$

with $q_{\max}$ defined by the element format. Dequantization involves multiplying $m_i$ by $s_j$ (Lin et al., 20 Apr 2026).
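The quantize/dequantize round trip described above can be sketched in NumPy for a single MXFP4 block. This is a minimal illustration, not a reference implementation: the helper names (`quantize_block`, `dequantize_block`) are hypothetical, the offset subtracted in the scale exponent is assumed to be the largest E2M1 exponent (2), and ties in rounding are broken toward the smaller magnitude rather than by any standardized rule.

```python
import numpy as np

# Magnitudes representable by an E2M1 element (the sign is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_EMAX = 2  # assumption: offset equals the exponent of the top E2M1 binade (4.0 = 2**2)

def quantize_block(x):
    """Return (scale, codes) for one block so that x is approximately scale * codes."""
    x = np.asarray(x, dtype=np.float64)
    amax = np.max(np.abs(x))
    if amax == 0.0:
        return 1.0, np.zeros_like(x)
    # Power-of-two shared scale (E8M0-style), derived from the block maximum.
    scale = 2.0 ** (np.floor(np.log2(amax)) - E2M1_EMAX)
    # Normalize, clip to the largest grid magnitude, then snap to the nearest grid point.
    y = np.minimum(np.abs(x) / scale, E2M1_GRID[-1])
    idx = np.argmin(np.abs(y[:, None] - E2M1_GRID[None, :]), axis=1)
    codes = np.sign(x) * E2M1_GRID[idx]
    return scale, codes

def dequantize_block(scale, codes):
    """Decode a block: shared scale times the per-element mini-float values."""
    return scale * codes

scale, codes = quantize_block([12.0, 1.0, -3.0, 0.4])
x_hat = dequantize_block(scale, codes)
```

In this sketch, values larger than 6 times the scale saturate at the top grid point, which is one reason the choice of scale matters for blocks whose maximum falls near the top of a binade.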

2. Trade-offs: Block Size, Precision, and Dynamic Range

The microscaling design introduces explicit trade-offs:

Block size ($B$): Smaller $B$ allows finer adaptation to local tensor statistics and better suppression of outliers, but increases metadata overhead (i.e., more shared scales per tensor) and computational cost for GEMM kernels due to more frequent scaling (Lin et al., 20 Apr 2026).
Element format ($E$, $M$): Fewer exponent ($E$) and mantissa ($M$) bits lower storage/compute cost but increase quantization error, limit representable range, and heighten sensitivity to outliers.
Block-scale format: The block-scale itself is usually quantized (e.g., E8M0, E4M3), introducing quantization noise and range limitations at the block scale as well.

A typical MXFP4 block uses 4 bits/element (S1E2M1) with a single 8-bit E8M0 scale per 32 elements, for an average of 4.25 bits/element, i.e., 0.25 bits/element of scale overhead (Vasilev, 8 Jun 2026). Variants such as NVFP4 (NVIDIA) use 16-element blocks with an FP8 E4M3 scale for slightly finer granularity and somewhat higher per-element cost (4.5 bits/element).
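The per-element storage figures follow from amortizing the scale bits over the block. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def bits_per_element(element_bits, scale_bits, block_size):
    """Average storage cost: element payload plus the amortized shared scale."""
    return element_bits + scale_bits / block_size

mxfp4 = bits_per_element(4, 8, 32)  # 4-bit elements, 8-bit scale per 32 elements
nvfp4 = bits_per_element(4, 8, 16)  # 4-bit elements, 8-bit scale per 16 elements
print(mxfp4, nvfp4)
```

Halving the block size doubles the scale overhead, from 0.25 to 0.5 bits per element in these two cases.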

The total quantization error in microscaling can be decomposed into element quantization error (discretization of normalized values) and scale quantization/range error. Notably, for extremely small block sizes, discretization of the block scale itself begins to dominate and can paradoxically increase mean-squared error, leading to a "perplexity inversion" where further reductions in $B$ degrade, rather than improve, model quality (Fasoli et al., 26 Jan 2026).

3. Outlier Suppression and Format-Aware Transformations

A central challenge of block-scaled formats is the impact of outlier elements within a block. A single outlier elevates the shared block scale, compressing the effective dynamic range allocated to the remaining (usually much smaller) values and causing severe quantization error for these non-outlier elements (Lin et al., 20 Apr 2026).
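The effect of a single outlier can be illustrated with a toy block. The sketch below assumes E2M1 elements and a power-of-two shared scale derived from the block maximum; `fake_quant` is a hypothetical quantize-then-dequantize helper and the input values are made up for illustration.

```python
import numpy as np

GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def fake_quant(x):
    """Quantize and immediately dequantize one block with a shared power-of-two scale."""
    x = np.asarray(x, dtype=np.float64)
    amax = np.max(np.abs(x))
    if amax == 0.0:
        return x.copy()
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)  # assumption: offset 2 for E2M1
    y = np.minimum(np.abs(x) / scale, GRID[-1])
    idx = np.argmin(np.abs(y[:, None] - GRID[None, :]), axis=1)
    return np.sign(x) * GRID[idx] * scale

small = np.full(31, 0.1)
clean = fake_quant(np.append(small, 0.12))   # block without an outlier
spiked = fake_quant(np.append(small, 64.0))  # same block with one large outlier

# With the outlier present, the shared scale is so large that the small
# values fall below half of the smallest nonzero grid step and become zero.
print(clean[:3], spiked[:3], spiked[-1])
```

In the outlier-free block the small values survive with modest error, while in the spiked block only the outlier itself is represented and the other 31 elements are flushed to zero.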

To mitigate this, several strategies have been introduced:

Outlier-aware rotations: DuQuant++ applies a data-dependent block-diagonal rotation to activations, designed to spread outlier energy across dimensions within each microscaling group of size $B$. This reduces the likelihood that a single coordinate will dominate the shared scale, resulting in significantly lower per-block quantization errors. The inverse rotation is pre-absorbed into the weights, guaranteeing a lossless transform aside from quantization (Lin et al., 20 Apr 2026).
Affine preconditioning (SmoothQuant): Shifts dynamic range between activations and weights via diagonal scaling, partially addressing intensity skew.
Fine-grained block rotations (Givens-style): Additional rotation steps further suppress concentrated outliers, but at increased computational cost.
Asymmetric shared scales: AMXFP4 applies separate shared scales for positive and negative parts of each group, compensating for per-block skewness at minimal hardware cost (Lee et al., 2024).
Exponent field repurposing (MX+): MX+ reuses the exponent field of the block-maximum element as extra mantissa bits, enhancing the effective precision of the outlier without additional storage overhead (Lee et al., 16 Oct 2025).
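The rotation idea in the list above can be sketched with a fixed orthonormal Hadamard matrix. This is a data-agnostic stand-in chosen for simplicity, not the data-dependent rotation used by DuQuant++; the block size of 32, the weight shape, and all variable names are assumptions for illustration.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

B = 32
R = hadamard(B)

x = np.full(B, 0.1)
x[5] = 64.0                                   # activation block with one outlier
W = np.random.default_rng(0).standard_normal((4, B))

x_rot = R @ x       # rotated activations: outlier energy is spread over the block
W_rot = W @ R.T     # inverse rotation folded into the weights ahead of time

# Before quantization the product is unchanged, because R.T @ R is the identity.
assert np.allclose(W_rot @ x_rot, W @ x)
print(np.max(np.abs(x)), np.max(np.abs(x_rot)))
```

Because the block maximum sets the shared scale, the lower peak magnitude after rotation leaves more of the element grid available to the remaining values.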

Empirically, as demonstrated on LLaMA-3 and other LLMs, these format-aware and data-aware transforms (e.g., DuQuant++, MX+, asymmetric scaling) yield lower quantization error and perplexity compared to data-agnostic rotations or per-channel quantization regimes (Lin et al., 20 Apr 2026, Lee et al., 2024, Lee et al., 16 Oct 2025).

4. Scale Quantization Effects and Remedies

While decreasing block size typically improves local adaptation, the effectiveness of this strategy is limited by the quantization and dynamic range of the block scale itself. For instance, when using an E4M3 block scale (FP8, 4 exponent + 3 mantissa bits), the smallest nonzero scale is $2^{-9}$. If tensor blocks are very low-variance (many elements near zero), the quantized scale may round down to zero, causing catastrophic underflow and increased error, a phenomenon confirmed both experimentally and theoretically (Fasoli et al., 26 Jan 2026).

A key remedy is to adopt an expanded scale format. The unsigned FP8 E5M3 ("UE5M3") format, which repurposes the sign bit for an additional exponent bit (exponent 5, mantissa 3, bias 15), extends scale dynamic range down to $2^{-17}$ (Fasoli et al., 26 Jan 2026). This extension nearly eliminates scale quantization error for narrow-distribution blocks, preventing the error inversion effect and matching the accuracy of global scale compensation but with lower hardware and compute cost.
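The range extension can be checked with the usual minifloat arithmetic. The sketch below assumes IEEE-style subnormals and a bias of $2^{E-1}-1$, which gives bias 7 for E4M3 and bias 15 for UE5M3; the helper name is hypothetical.

```python
def min_positive(exp_bits, man_bits):
    """Smallest positive (subnormal) value of an IEEE-style minifloat."""
    bias = 2 ** (exp_bits - 1) - 1
    # Subnormals use the minimum exponent (1 - bias) and a single mantissa LSB.
    return 2.0 ** (1 - bias - man_bits)

e4m3_min = min_positive(4, 3)   # bias 7
ue5m3_min = min_positive(5, 3)  # bias 15
print(e4m3_min, ue5m3_min, e4m3_min / ue5m3_min)
```

Under these assumptions the extra exponent bit lowers the smallest representable scale by a factor of $2^{8}$.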

Scale selection algorithms also matter. Whereas traditional MX selects scales based on per-block maxima, recent approaches such as ScaleSearch choose the scale within quantized scale grid points to minimize overall block quantization error, further reducing error compared to the default approach (Gupta et al., 12 May 2026).
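The idea of searching the scale grid, rather than taking the max-based scale directly, can be sketched as follows. This is not the ScaleSearch algorithm itself: it is a brute-force illustration that assumes power-of-two scales, E2M1 elements, and a small search radius, with hypothetical function names.

```python
import numpy as np

GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def fake_quant(x, scale):
    """Quantize and dequantize a block with a given shared scale."""
    y = np.minimum(np.abs(x) / scale, GRID[-1])
    idx = np.argmin(np.abs(y[:, None] - GRID[None, :]), axis=1)
    return np.sign(x) * GRID[idx] * scale

def max_based_exponent(x):
    """Scale exponent from the block maximum (assumed offset 2 for E2M1)."""
    return int(np.floor(np.log2(np.max(np.abs(x))))) - 2

def search_scale(x, radius=1):
    """Try nearby power-of-two scales and keep the one with the lowest block MSE."""
    x = np.asarray(x, dtype=np.float64)  # block must contain a nonzero value
    e0 = max_based_exponent(x)
    best = None
    for e in range(e0 - radius, e0 + radius + 1):
        s = 2.0 ** e
        mse = float(np.mean((x - fake_quant(x, s)) ** 2))
        if best is None or mse < best[1]:
            best = (s, mse)
    return best

block = np.array([7.9, 1.0, 1.0, 1.0])
default_scale = 2.0 ** max_based_exponent(block)
best_scale, best_mse = search_scale(block)
print(default_scale, best_scale, best_mse)
```

For this block the max-based scale saturates the largest element, while the next larger scale represents it almost exactly without hurting the smaller values, so the search prefers the larger scale.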

5. Post-Training Quantization and Compatibility

Microscaling formats pose unique challenges for post-training quantization (PTQ) relative to integer-based and per-channel floating-point quantization. Empirical studies suggest:

MXFP8 (E4M3) achieves near-lossless performance for standard PTQ algorithms across weight, activation, and even vision-language modalities, with recovery rates >99% and PPL within 0.2 of baseline (Zhang et al., 14 Jan 2026).
MXFP4 (E2M1), despite its hardware efficiency, typically incurs significant accuracy loss unless augmented by format-aware PTQ algorithms such as FlatQuant (learnable affine block shifts), MR-GPTQ (block-wise Hadamard rotations fused into weights), or outlier-aware transforms (DuQuant++, MX+) (Lin et al., 20 Apr 2026, Egiazarian et al., 27 Sep 2025, Lee et al., 16 Oct 2025).
Pre-scale optimization (e.g., multiplying tensors by α ≈ 0.75 before quantization) helps mitigate scale quantization error in MXFP4 pipelines (Zhang et al., 14 Jan 2026).

Rotational and affine preconditioning paradigms that are effective for integer PTQ may underperform or degrade when directly applied to MXFP4, underscoring the need for format-specialized adaptation (Zhang et al., 14 Jan 2026, Egiazarian et al., 27 Sep 2025).

6. Hardware Implementation and Ecosystem

Microscaling formats have been integrated into multiple hardware pipelines:

Compute-in-memory (CIM): MXFormer demonstrates weight-stationary Transformer acceleration using analog CIM arrays with MXFP4-native arithmetic, achieving <1% accuracy degradation without retraining (Karfakis et al., 12 Feb 2026).
RISC-V Extensions: MXDOTP and VMXDOTP are ISA extensions for RISC-V supporting vectorized and scalar MXFP4/MXFP8 dot-product acceleration with minimal area overhead and ≥97% unit utilization (İslamoğlu et al., 19 May 2025, Wipfli et al., 5 Mar 2026).
NVIDIA Blackwell Tensor Cores and AMD equivalents: Support block-based MXFP4, NVFP4, MXFP8, and other OCP MX-compliant block floating-point formats natively, exposing these primitives to frameworks (Vasilev, 8 Jun 2026).
Memory-free floating-to-MXFP converter: End-to-end combinational hardware for 32-way conversion to all major MX types (E5M2, E4M3, down to E2M1) in a single pipeline stage, using only LUTs, no RAM or DSP blocks (Gorodecky et al., 2024).

A comprehensive, vendor-neutral registry of MX and related formats (including conformance vectors and bit-exact test packs) is maintained publicly for engineering compatibility and diagnosis (Vasilev, 8 Jun 2026).

7. Limitations, Alternative Schemes, and Future Directions

Key limitations of standard microscaling designs include:

Sensitivity to outliers: Despite outlier-aware transforms, MXFP4 and similar low-bit formats can still suffer from severe local error if outliers are untamed (Lin et al., 20 Apr 2026).
Block-size and scale-inversion tradeoff: Models with primarily low-variance blocks can see accuracy degradation at extremely fine block sizes due to scale quantization effects (Fasoli et al., 26 Jan 2026).
Groupwise asymmetry: Symmetric block scaling cannot accommodate strong per-block mean-shift or skew, motivating asymmetric shared scales (AMXFP4) and adaptive exponent-mantissa allocation (Lee et al., 2024, Lo et al., 2024).
PTQ fragility: Not all quantization algorithms are compatible; rotation-based and naive PTQ can underperform relative to format-aware algorithms (Zhang et al., 14 Jan 2026, Egiazarian et al., 27 Sep 2025).

Recent extensions and alternatives include:

MX+ and MX++: Exponent field repurposing and finer-grained shared scale structures to capture outlier fidelity without increasing overall storage cost (Lee et al., 16 Oct 2025).
Nanoscaling (NxFP): NanoMantissa, adaptive microexponent, and code recycling further compress the footprint while reducing quantization error, especially in sub-6-bit regimes (Lo et al., 2024).
Static-split alternatives: GoldenFloat proposes an analytic split rule for static exponent/fraction allocation across bit-widths, with an integer-backed Lucas accumulator, as a cross-width hardware solution. Co-existence of this and other static-split or adaptive formats (e.g., Takum, posit) remains an open investigation (Vasiliev, 3 Jun 2026).

The consensus in the literature is that aggressive block floating-point scaling with format-aware, data-aware, and hardware-tailored adaptation remains an active research area, with new directions in block-scale-free architectures (e.g., AetherFloat (Morisaki, 26 Feb 2026)), quantization-aware training for sub-8-bit deployment, and joint software-hardware co-design for LLMs and transformer accelerators.

Selected References:

DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization (Lin et al., 20 Apr 2026)
Is Finer Better? The Limits of Microscaling Formats in LLMs (Fasoli et al., 26 Jan 2026)
MX+: Pushing the Limits of Microscaling Formats for Efficient LLM Serving (Lee et al., 16 Oct 2025)
AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference (Lee et al., 2024)
Benchmarking PTQ of LLMs under Microscaling Floating Point Formats (Zhang et al., 14 Jan 2026)
An 84-Format Numeric Catalog with Bit-Exact Conformance Vectors (Vasilev, 8 Jun 2026)
Nanoscaling Floating-Point (NxFP) (Lo et al., 2024)
VMXDOTP: A RISC-V Vector ISA Extension for Efficient Microscaling (MX) Format Acceleration (Wipfli et al., 5 Mar 2026)