Microscaling (MX) Family Overview
- The Microscaling (MX) family comprises block-based quantization formats that encode low-precision tensors using one shared scale per block, reducing memory requirements and computational overhead.
- It supports several bitwidth variants and block sizes to balance dynamic range, precision, and efficiency, which makes it well suited to LLMs and other transformer models.
- The approach integrates quantization with hardware-accelerated dot-products and fused arithmetic operations, achieving significant throughput improvements and energy efficiency with minimal accuracy loss.
Microscaling (MX) Family
The Microscaling (MX) family comprises a suite of block-based quantization formats designed for efficient representation and computation of low-precision data in deep learning models, particularly LLMs and transformer architectures. MX formats amortize the cost of the scale by associating a single shared scale (typically a power-of-two exponent) with a block of elements (usually 16–64), with each element encoded in a compact floating-point or integer mini-format. This design significantly reduces memory footprint and bandwidth, delivering near-FP16 dynamic range with 8 bits or fewer of per-element storage. By tightly coupling quantization, dot-product arithmetic, and hardware support, MX formats form the core of recent industry-standard low-precision inference and training pipelines (Cococcioni et al., 2 Oct 2025).
1. Core Definition and Data Encoding
MX encodes a dense tensor as a partitioned set of blocks, each with a single shared scale and per-element payloads in a low-bitwidth format. Consider a vector $x$ partitioned into blocks of size $B$, with block $k$ containing elements $x_{k,1}, \dots, x_{k,B}$. Each block stores:
- A shared scale $S_k = 2^{w_k}$, where $w_k$ is chosen to match the block's maximum magnitude:
$$w_k = \left\lfloor \log_2 \max_{i} |x_{k,i}| \right\rfloor - \xi,$$
where $\xi$ is the bias parameter derived from the element mini-format's maximum unbiased exponent.
- $B$ per-element quantized mantissas $q_{k,i}$ in an 8-bit (or fewer) float or integer format. Typical element formats: E4M3, E5M2, E3M4 (numbers indicate exponent/mantissa bits).
Quantization and reconstruction proceed via
$$q_{k,i} = \mathrm{round}\!\left(\frac{x_{k,i}}{S_k}\right),$$
with block reconstruction
$$\hat{x}_{k,i} = S_k \, q_{k,i}.$$
Under round-to-nearest and in the absence of saturation, the quantization error per element is at most half the spacing between adjacent representable element values, scaled by $S_k$ (Cococcioni et al., 2 Oct 2025). For integer (MXINT8) and floating-point (MXFP{8,6,4,2}) variants, bit layouts and supported ranges follow the OCP MX standard.
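To make the ExMy naming concrete, the following minimal Python sketch decodes a raw bit pattern with `exp_bits` exponent bits and `man_bits` mantissa bits into a real value. The function name `decode_minifloat` and the IEEE-style bias $2^{E-1}-1$ are assumptions made for this illustration. The sketch ignores Inf/NaN encodings, so as written it applies only to formats without special values, such as E2M1.

```python
def decode_minifloat(code: int, exp_bits: int, man_bits: int) -> float:
    """Decode a sign/exponent/mantissa bit pattern (hypothetical helper).

    Assumes an IEEE-style bias of 2**(exp_bits - 1) - 1 and no Inf/NaN codes.
    """
    bias = 2 ** (exp_bits - 1) - 1
    sign = -1.0 if (code >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (code >> man_bits) & ((1 << exp_bits) - 1)
    man = code & ((1 << man_bits) - 1)
    if exp == 0:
        # Subnormal: no implicit leading one, fixed minimum exponent.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    # Normal: implicit leading one plus the fractional mantissa.
    return sign * (1.0 + man / 2 ** man_bits) * 2.0 ** (exp - bias)


# All non-negative E2M1 codes (4-bit element: 1 sign, 2 exponent, 1 mantissa bit).
e2m1_values = [decode_minifloat(c, exp_bits=2, man_bits=1) for c in range(8)]
print(e2m1_values)
```

Under these assumptions, a 4-bit E2M1 element can represent only eight magnitudes per sign, which is why the shared block scale carries almost all of the dynamic range.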
2. Block Size, Bitwidth Variants, and Precision-Range Trade-offs
Block size $B$ critically determines the balance between scale overhead and adaptivity:
- Large $B$ yields higher memory savings (scale overhead amortized over more elements), but increases the risk of local overflow or underflow if the dynamic range within the block is too broad.
- Small $B$ allows finer local adaptation to value distributions, at the cost of scale-storage overhead.
Concretely:
| Format | Total Bits | Exponent Bits | Mantissa Bits | Max Element Magnitude (before block scale, per OCP MX spec) |
|---|---|---|---|---|
| MXFP8 E5M2 | 8 | 5 | 2 | 57344 |
| MXFP8 E4M3 | 8 | 4 | 3 | 448 |
| MXFP6 E3M2 | 6 | 3 | 2 | 28 |
| MXFP6 E2M3 | 6 | 2 | 3 | 7.5 |
| MXFP4 E2M1 | 4 | 2 | 1 | 6 |
| MXINT8 | 8 | (none) | 8 (signed) | 1.984375 |
The E4M3 and E5M2 formats minimize overflow risk but degrade small-value accuracy because of their limited fractional resolution, while E3M4 improves precision at the cost of increased saturation. Block sizes $B$ in the 16–64 range are common, with $B = 32$ (the block size used by the OCP MX concrete formats) frequently selected as a sweet spot (Cococcioni et al., 2 Oct 2025).
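The scale-overhead side of this trade-off is simple arithmetic: each block pays for one shared scale on top of its element payloads. The sketch below assumes an 8-bit shared scale (as in the OCP MX E8M0 scale) and uses a hypothetical helper name, `bits_per_element`, to show how the effective storage cost per element changes with block size.

```python
def bits_per_element(elem_bits: int, block_size: int, scale_bits: int = 8) -> float:
    """Effective storage per element: payload plus the amortized shared scale.

    scale_bits=8 is an assumption (one 8-bit power-of-two scale per block).
    """
    return elem_bits + scale_bits / block_size


for block_size in (16, 32, 64):
    cost = bits_per_element(elem_bits=4, block_size=block_size)
    # Compression ratio relative to a 16-bit baseline such as BF16.
    print(f"B={block_size:3d}: {cost:.3f} bits/element, {16 / cost:.2f}x smaller than 16-bit")
```

Doubling the block size halves the per-element scale overhead, but the absolute saving shrinks quickly, which is one reason moderate block sizes are attractive.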
3. Algorithmic Workflows and Integration
Encoding and matrix multiply workflows:
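Block Quantization (pseudocode):
```
Input: real X[1..B], exponent limit ξ
p = max_i |X[i]|                        # block maximum magnitude
if p == 0: return (S = 1, q = 0)        # all-zero block: log2(0) is undefined
w = floor(log2(p)) - ξ                  # shared power-of-two exponent
S = 2^w
for i = 1..B:
    q[i] = round_to_nearest(X[i] / S)   # rounded to the element mini-format
return (S, q)
```
Block Reconstruction (pseudocode):
```
Input: S, q[1..B]
for i = 1..B:
    Xhat[i] = S * q[i]
return Xhat
```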
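The two routines above can be sketched as runnable Python. This is a minimal illustration, not the reference implementation: it assumes signed-integer elements of `elem_bits` bits (an MXINT-like stand-in for the floating-point mini-formats), sets the exponent limit to `elem_bits - 2` so the block maximum lands in the top half of the integer range, and clamps on saturation. The function names are hypothetical.

```python
import math


def mx_quantize_block(x, elem_bits=8):
    """Quantize one block to (shared exponent w, integer elements q)."""
    xi = elem_bits - 2                    # assumed exponent limit for integer elements
    qmax = 2 ** (elem_bits - 1) - 1       # symmetric clamp, e.g. 127 for 8 bits
    p = max(abs(v) for v in x)            # block maximum magnitude
    if p == 0.0:
        return 0, [0] * len(x)            # all-zero block: any scale works
    # frexp gives p = m * 2**e with 0.5 <= m < 1, so floor(log2(p)) == e - 1.
    w = math.frexp(p)[1] - 1 - xi
    scale = 2.0 ** w
    # Python's round() rounds to nearest, with ties going to the even integer.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return w, q


def mx_dequantize_block(w, q):
    """Reconstruct real values from the shared exponent and integer elements."""
    scale = 2.0 ** w
    return [scale * v for v in q]


block = [0.5, -1.0, 0.25, 3.0]
w, q = mx_quantize_block(block)
print(w, q, mx_dequantize_block(w, q))
```

Because the scale is a power of two, dividing and multiplying by it only shifts exponents, so values that already sit on the scaled integer grid (as in this block) round-trip exactly.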
MX-enabled matrix multiply follows a blockwise tiling of weights and activations, with fused integer/floating MACs and shared-scale products; see pseudocode section in (Cococcioni et al., 2 Oct 2025).
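The shared-scale product at the heart of such a kernel can be illustrated with a small Python sketch. It assumes each operand is already stored as a list of `(w, q)` blocks, where `w` is the shared power-of-two exponent and `q` the integer elements; the function name `mx_dot` and this data layout are assumptions for illustration, not the cited paper's kernel.

```python
def mx_dot(blocks_a, blocks_b):
    """Dot product of two MX-encoded vectors given as lists of (w, q) blocks."""
    total = 0.0
    for (wa, qa), (wb, qb) in zip(blocks_a, blocks_b):
        # Narrow integer multiply-accumulate inside the block.
        acc = sum(a * b for a, b in zip(qa, qb))
        # One scale product per block pair: 2**wa * 2**wb == 2**(wa + wb).
        total += acc * 2.0 ** (wa + wb)
    return total


# Two vectors of two blocks each (block size 2), written directly in MX form.
a = [(-5, [16, -32]), (0, [1, 2])]    # decodes to [0.5, -1.0, 1.0, 2.0]
b = [(-1, [2, 4]), (1, [3, -1])]      # decodes to [1.0, 2.0, 6.0, -2.0]
print(mx_dot(a, b))
```

The scale is applied once per block pair rather than once per element, which is what lets hardware keep the inner loop in narrow arithmetic.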
In integrated transformer models (e.g., GPT-2), all matrix-multiply (GEMM) paths are replaced by MX-quantized blocks. A master FP32/BF16 copy is retained for optimizer updates, while inference proceeds in pure MX or MX+BF16 as stability allows. Non-matrix-multiply units (token embedding, softmax, loss) remain in BF16/FP32 to avoid underflow/overflow.
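One reason to retain a higher-precision master copy is that small optimizer updates can vanish when written directly into a low-precision grid. The sketch below illustrates this with a quantize-then-dequantize ("fake quantization") helper over integer elements and a plain SGD step. The helper names, the integer element format, and the toy values are all assumptions for illustration.

```python
import math


def fake_quant_block(x, elem_bits=8):
    """Quantize one block to a shared power-of-two scale and decode it again."""
    qmax = 2 ** (elem_bits - 1) - 1
    p = max(abs(v) for v in x)
    if p == 0.0:
        return list(x)
    w = math.frexp(p)[1] - 1 - (elem_bits - 2)   # floor(log2(p)) minus exponent limit
    scale = 2.0 ** w
    return [scale * max(-qmax, min(qmax, round(v / scale))) for v in x]


def sgd_step(master, grads, lr):
    """Apply the update to the full-precision master copy only."""
    return [m - lr * g for m, g in zip(master, grads)]


master = [1.5, 0.5]
forward_before = fake_quant_block(master)        # weights seen by the MX matmul
master = sgd_step(master, grads=[0.001, 0.001], lr=1.0)
forward_after = fake_quant_block(master)

print(master)                          # master copy has moved
print(forward_before, forward_after)   # quantized view is unchanged for now
```

Here the update is smaller than half a quantization step, so the quantized weights do not change after one step, while the master copy accumulates the update and will eventually cross a rounding boundary.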
4. Empirical Results, Performance, and Hardware Acceleration
Model accuracy: On language modeling (GPT-2-124M, Tiny Shakespeare), fine-tuning with MX using float16 mantissas increased loss by only 0.48% relative to FP32. Pure E4M3 mantissas gave a much larger increase (+25.66%), especially when applied universally, owing to insufficient activation precision and softmax instability. Keeping activations and softmax in BF16 and applying MX only to weights and matmuls yields single-digit relative error increases and preserves text generation quality.
Compute and memory: MX reduces parameter and state memory relative to BF16, and MX dot-product routines are amenable to hardware-accelerated, shared-scale, narrow-integer arithmetic that raises throughput, provided the hardware supports fused scale-and-dot-product paths.
Hardware units: MX-specific dot-product (MAC) units in RISC-V vector/streaming architectures (MXDOTP/VMXDOTP) natively support MXFP8/MXFP4 dot-accumulates, improving both speed and energy efficiency over software implementations. Physical realizations achieve near-peak (>95%) functional unit utilization at modest area/power cost (e.g., 356 GFLOPS/W at 1 GHz in 12nm FinFET, <8% area premium) (Wipfli et al., 5 Mar 2026, İslamoğlu et al., 19 May 2025).
Block size and mantissa width: Empirical MRI FFT studies show that moderate block sizes $B$ and mantissa widths capture most of the accuracy benefit, with diminishing returns beyond that (Deveshwar et al., 3 Dec 2025).
5. Known Limitations and Stability Considerations
- Outlier block inflation: In low-bit MX (e.g., MXFP4/6), a single large-magnitude outlier can "inflate" the block scale, reducing effective precision for inliers and triggering clamping/overflow. This produces accuracy degradation in LLMs, particularly for activations (Lee et al., 16 Oct 2025).
- Fractional precision for activations: E4M3/E5M2 may lack sufficient fractional bits for accurate activation representation; custom formats (E3M4, extended mantissas) or hybrid MX+BF16 schemes can ameliorate these issues.
- Full-network low-bit quantization: Quantizing all network components (including nonlinearities such as softmax/loss) to tiny MX (E4M3/E5M2 only) leads to catastrophic divergence and model failure.
- Random access and blockwise alignment: MX formats are optimized for streaming/blockwise access; random access (e.g., for model backward passes) requires block-aware transposition/construction to maintain efficiency.
- Stability in training: Stochastic instabilities in loss can occur during large-scale training when layer-norm affine or activation blocks become tightly clustered. Practical solutions: restrict MX quantization to weights, keep activations in BF16, or apply MX only to the forward path, not backward gradients (Su et al., 25 Jun 2025).
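The outlier block inflation effect listed above is easy to reproduce in a toy setting. The sketch below uses 4-bit signed-integer elements as a simplified stand-in for MXFP4 (an assumption: the real format uses E2M1 elements), quantizes a block of small values with and without one large outlier, and compares the worst inlier error. All names and values are hypothetical.

```python
import math


def fake_quant_block(x, elem_bits=4):
    """Quantize one block to a shared power-of-two scale and decode it again."""
    qmax = 2 ** (elem_bits - 1) - 1
    p = max(abs(v) for v in x)
    if p == 0.0:
        return list(x)
    w = math.frexp(p)[1] - 1 - (elem_bits - 2)   # floor(log2(p)) minus exponent limit
    scale = 2.0 ** w
    return [scale * max(-qmax, min(qmax, round(v / scale))) for v in x]


def max_inlier_error(inliers, outlier=None):
    """Worst reconstruction error on the inliers, optionally with an outlier in the block."""
    block = list(inliers) + ([outlier] if outlier is not None else [])
    xhat = fake_quant_block(block)
    # zip stops at the inliers, so the outlier's own error is ignored.
    return max(abs(a - b) for a, b in zip(inliers, xhat))


inliers = [0.10, -0.12, 0.08, 0.11]
clean = max_inlier_error(inliers)
inflated = max_inlier_error(inliers, outlier=8.0)
print(f"max inlier error without outlier: {clean:.4f}")
print(f"max inlier error with outlier:    {inflated:.4f}")
```

With the outlier present the shared scale grows so much that every inlier rounds to zero, so the inlier error becomes as large as the inliers themselves. Splitting the outlier into its own block, or tracking it separately, avoids this.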
6. Extensions, Future Work, and Public Resources
Advanced MX variants introduce metadata-augmented formats (e.g., MX+ for outlier tracking, M²XFP for subgroup-level scale refinement) that selectively increase precision for the block maximum or best-fit elements, addressing block inflation and improving LLM perplexity/memory tradeoffs at negligible storage penalty (Lee et al., 16 Oct 2025, Hu et al., 27 Jan 2026).
Hardware and compiler support for MX is expanding rapidly, including open-source reference designs for conversion and arithmetic datapaths, FPGA libraries supporting all major MX formats, and Brevitas-integrated quantizers for PyTorch (Samson et al., 2024, Cococcioni et al., 2 Oct 2025). The public codebase for the reference MX GPT-2 integration is available at https://github.com/unipi-dii-compressedarith/LLM.c-sve, with full support for format and block-size selection.
Ongoing research avenues include dedicated hardware block-scale loaders and fused dot-product kernels, adaptive or learned per-block sizes, alternative shared-scale regimes (e.g., log-domain), and improved rounding schemes to reduce quantization bias. Further development targets robust low-bit MX quantization compatible with sophisticated training dynamics and nonlinearities, with generalizable solutions for activation/weight/tensor block distributions (Cococcioni et al., 2 Oct 2025).