AMXFP4 Asymmetric Scaling

Updated 22 December 2025
  • AMXFP4 is a low-bit block floating-point quantization extension that reallocates outlier exponent bits for improved mantissa precision.
  • It mitigates the outlier quantization bottleneck in LLMs by repurposing a dominant block element’s bits, reducing error in 4-bit representations.
  • Empirical results show MX+ nearly matches 6-bit accuracy while incurring minimal storage and computational overhead in both software and hardware deployments.

Asymmetric Scaling (AMXFP4), commonly referred to as MXFP4+ or simply MX+, is a low-bit block floating-point (BFP) quantization extension that addresses the outlier quantization bottleneck in microscaling (MX) family formats for efficient LLM inference. The MX+ approach leverages the observation that within each BFP block, there is typically a single dominant outlier (Block Max, BM) whose exponent bits can be repurposed to enhance its mantissa precision without altering the encoded information for the remaining elements. This methodology enables 4-bit BFP quantization to reach accuracy levels previously restricted to 6- or 8-bit settings, with negligible storage or computational overhead and without intrusive changes to hardware or software stacks (Lee et al., 16 Oct 2025).

1. Block Floating-Point Quantization and MX Family Formats

The BFP framework organizes tensors into blocks (commonly 32 elements) and encodes each block with a shared base exponent (“shared_exp”) plus per-element mini-floats that capture fine-grained differences. In MXFP4 (E2M1), each element is represented by 1 sign bit, 2 exponent bits, and 1 mantissa bit, while the shared exponent is stored with higher precision (E8M0: 8 exponent bits, no mantissa). MXFP6 and MXFP8 are higher-precision variants (e.g., E2M3 or E4M3 encodings).

| Format | Per-Element Bits | Mini-FP Structure | Shared Exponent |
|--------|------------------|-------------------|-----------------|
| MXFP4  | 4                | E2M1 (1 + 2 + 1)  | E8M0 (8 exponent bits) |
| MXFP6  | 6                | E2M3 / E3M2       | E8M0 |
| MXFP8  | 8                | E4M3 / E5M2       | E8M0 |

BFP quantization enables very low precision at high throughput, but suffers when one element in a block is an outlier, as the shared exponent is dictated by the largest value. This leads to a lossy representation where the outlier is quantized coarsely (due to low mantissa precision), and non-outlier values (NBMs) are aggressively downscaled, often becoming zero.
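
To make the failure mode concrete, the following minimal Python sketch quantizes one block with an MXFP4-style round-to-nearest rule (the function name quantize_mxfp4_block and the rounding details are simplifying assumptions, not the OCP reference algorithm):

```python
import numpy as np

# Magnitudes representable by an E2M1 element (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(block):
    """Round-to-nearest MXFP4 quantization of one block (simplified sketch)."""
    amax = float(np.abs(block).max())
    # Shared exponent chosen so the block max lands in E2M1's range
    # (largest effective element exponent is 2, i.e. magnitudes up to 6).
    shared_exp = int(np.floor(np.log2(amax))) - 2 if amax > 0 else 0
    scale = 2.0 ** shared_exp
    scaled = block / scale
    # Snap each scaled magnitude to the nearest E2M1 grid point.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    dequant = np.sign(scaled) * E2M1_GRID[idx] * scale
    return shared_exp, dequant

# One dominant outlier forces a large shared exponent; the remaining (NBM)
# elements land on the coarse low end of the grid and mostly collapse to zero.
rng = np.random.default_rng(0)
block = rng.normal(0.0, 0.05, size=32)
block[7] = 3.0  # Block Max (BM)
_, dq = quantize_mxfp4_block(block)
print(int(np.count_nonzero(dq == 0)), "of 32 elements quantize to zero")
```

With a single outlier of 3.0 among otherwise small values, nearly all NBMs in this sketch snap to zero, which is precisely the bottleneck described next.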

2. Outlier Bottleneck in MXFP4 and Motivation for Asymmetric Scaling

When quantizing LLM weights or activations using MXFP4, block-level outliers dictate a high shared_exp, resulting in:

  • The BM being quantized with only 1 mantissa bit (high approximation error).
  • NBMs in the same block mapped to zero or high error due to aggressive scaling.
  • For language modeling, such blocks can dominate the mean squared error (≈80% of the total error), leading to catastrophic perplexity loss (Lee et al., 16 Oct 2025).

Blocks sampled from models such as Llama-3.1-8B empirically exhibit this behavior, further motivating a fine-grained treatment of outliers in BFP block formats.

3. The MX+ (AMXFP4) Outlier-Repurposing Scheme

MX+ exploits the saturation property of the BM's exponent field in every block:

  • The BM's exponent field in E2M1 is always saturated at its maximum value (yielding the largest effective exponent, e_max − bias = 2), so its bits are redundant.
  • MX+ reclaims the 2 exponent bits from the BM and reallocates them as extra mantissa bits (E0M3: 1 sign, 0 exponent, 3 mantissa bits for the BM).
  • All NBMs retain their original E2M1 encoding.

Implementation requires only a single additional metadata byte per block to encode the BM index (5 bits + 3 reserved bits), with an average overhead of 0.25 bits per element.
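
The sketch below illustrates this bookkeeping for one block, assuming the same shared-exponent rule as plain MXFP4; the function name mxplus_bm_fields and the exact packing of the metadata byte are illustrative assumptions, not the paper's bitstream definition:

```python
import numpy as np

def mxplus_bm_fields(block):
    """Locate the Block Max, derive the shared exponent, and re-encode the BM
    as E0M3 plus a one-byte metadata field (illustrative bookkeeping only)."""
    amax = float(np.abs(block).max())
    bm_index = int(np.abs(block).argmax())
    # Same shared-exponent rule as plain MXFP4 (largest effective E2M1 exponent = 2).
    shared_exp = int(np.floor(np.log2(amax))) - 2
    # Rescaled by the shared exponent, |BM| lies in [4, 8), so its E2M1 exponent
    # field would saturate; those 2 bits become extra mantissa bits (E0M3):
    #   |BM| ≈ (1 + m'/8) * 2**2 * 2**shared_exp,  with m' in 0..7.
    bm_scaled = amax / 2.0 ** shared_exp
    m_prime = int(np.clip(round(bm_scaled * 2 - 8), 0, 7))
    sign_bit = 1 if block[bm_index] < 0 else 0
    # Per-block metadata byte: 5-bit BM index + 3 reserved bits (0.25 bits/element).
    metadata_byte = bm_index & 0x1F
    return shared_exp, bm_index, sign_bit, m_prime, metadata_byte

blk = np.full(32, 0.1)
blk[5] = 2.7
print(mxplus_bm_fields(blk))  # -> (-1, 5, 0, 3, 5)
```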

Encoding/Decoding Formulas:

  • Standard: $\hat{x}_i = (-1)^{s_i}\cdot(1 + m_i/2^{1})\cdot 2^{\mathrm{shared\_exp} + (e_i - \mathrm{bias})}$ for $i \neq \mathrm{BM}$.
  • MX+ for BM: $\hat{x}_{\mathrm{BM}} = (-1)^{s_{\mathrm{BM}}}\cdot(1 + m'_{\mathrm{BM}}/2^{3})\cdot 2^{\mathrm{shared\_exp} + (e_{\max} - \mathrm{bias})}$, where $m'_{\mathrm{BM}}$ is the 3-bit extended mantissa.

This approach maintains 32 × 4 bit alignment, introducing no intrusive changes to the memory layout or compute pipeline.
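As a concrete sanity check of the two formulas (a hypothetical block, not an example from the paper): take $\mathrm{shared\_exp} = -1$ and a BM of 2.7. After rescaling by $2^{1}$ the BM becomes 5.4, but at the saturated exponent E2M1 offers only the codes 4 and 6, so plain MXFP4 reconstructs 2.0 or 3.0 (error at least 0.3). MX+ instead stores $m'_{\mathrm{BM}} = 3$ and reconstructs $(1 + 3/8)\cdot 2^{-1 + 2} = 2.75$, cutting the error to 0.05.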

4. Algorithmic Integration and Software/Hardware Support

MX+ maintains MXFP4 compatibility and enables seamless integration by extending only the local per-block decode kernel—no additional tensor-level or batch-level API changes are needed. In practice, the only modification to MXFP4 decoding is the addition of a branch for the BM, efficiently realized in CUDA or Triton with a ~20 line patch.
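
In plain Python, that per-block branch looks roughly like the sketch below (decode_block_mxplus and its argument layout are illustrative assumptions, not the actual CUDA/Triton patch):

```python
import numpy as np

# E2M1 magnitude for each 3-bit (exponent, mantissa) code: 0..7 -> {0, 0.5, ..., 6}.
E2M1_MAG = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_block_mxplus(codes, signs, shared_exp, bm_index, bm_mantissa):
    """Per-block dequantization: identical to MXFP4 except for one branch on the
    BM element. (In the real format the BM's 4-bit field carries sign + m';
    here m' is passed separately for readability.)"""
    scale = 2.0 ** shared_exp
    out = np.empty(len(codes))
    for i, (code, sign) in enumerate(zip(codes, signs)):
        if i == bm_index:
            # BM branch: exponent pinned at its maximum (effective exponent 2),
            # 3-bit extended mantissa (E0M3).
            magnitude = (1.0 + bm_mantissa / 8.0) * 4.0
        else:
            magnitude = E2M1_MAG[code]   # ordinary E2M1 element
        out[i] = (-1.0) ** sign * magnitude * scale
    return out

codes = np.zeros(32, dtype=np.uint8)
signs = np.zeros(32, dtype=np.uint8)
print(decode_block_mxplus(codes, signs, shared_exp=-1, bm_index=5, bm_mantissa=3)[5])  # 2.75
```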

In hardware, Blackwell-class NVIDIA Tensor Cores can support the extension by augmenting the combined dot-product datapath with a small “forward and swap” unit (FSU) and a BM compute unit (BCU). These units sit outside the high-throughput pipeline, adding only 0.020 mm² of area and 12 mW of power per Tensor Core. Only minimal SASS/PTX extensions are required: one extra bit in the MMA opcode and two registers for BM indices.

5. Quantitative Impact and Empirical Results

MX+ restores the representational accuracy of 4-bit MX quantization nearly to 6-bit levels, without notable storage or runtime penalties:

  • For language modeling on WikiText-2 and C4, MXFP4 yields perplexity above 200 in the 2048-token setting, while MX+ reduces perplexity by an order of magnitude, recovering over 90% of the gap to BF16.
  • On downstream tasks (ARC, Lambada), MX+ improves accuracy by up to 42% relative to MXFP4 and comes within 1% absolute of MXFP6 performance.
  • Storage: MXFP4+ uses 136 bits per 32-element block (4.25 bits/element, including the metadata), compared to 128 bits for MXFP4.
  • Inference: Software-only decode on an A6000 GPU incurs just 1–8% additional kernel time; on a 5090 GPU (CUTLASS + vLLM), the end-to-end slowdown is up to 1.13× in memory-bound regimes.
  • Hardware-accelerated prefill gives up at most 0.4% of throughput, a negligible penalty compared to the gain in accuracy.

6. Deployment Recommendations and Limitations

  • For users employing BF16 compute with MX decode, MX+ is a drop-in upgrade over MXFP4, achieving much of the MXFP6 accuracy at almost identical bit rates.
  • For hardware-native MX systems, updating to MXFP4+ results in <0.5% throughput cost.
  • For highly outlier-rich regimes, an MXFP4++ variant can repurpose the 3 reserved metadata bits to track a second outlier, but in practice tracking the top-2 outliers yields minimal further gains.
  • MX+ fully conforms to the OCP MX specification aside from the optional metadata byte, and requires no tensor or channel pre-scaling.

7. Broader Context and Implications

MX+ exemplifies the broader class of “outlier-repurposing” approaches in quantization, where precision lost to outlier dominance (asymmetry in value distribution) is reclaimed in a non-intrusive, block-local manner. This enables true ultra-low-bit quantization (4 bits) of both weights and activations in LLMs, with engineering complexity strictly localized to the quantization and dequantization steps. MX+ does not interfere with quantizer calibration, supports both CPU and GPU deployment, and can be mapped into any architecture with MX support (Lee et al., 16 Oct 2025). A plausible implication is that future quantization techniques for deep learning accelerators may generalize this form of asymmetric scaling to further minimize the impact of blockwise statistical extremes, especially as model and tensor sizes continue to grow.

References (1)
