FP4 Microscaling Attention
- FP4 Microscaling Attention is a technique that uses block-wise low-precision (4-bit) quantization with per-block shared scaling to efficiently implement Transformer attention mechanisms.
- It combines custom FP4 data formats like MXFP4 and NVFP4 with quantization methods that control errors and dynamic range, enabling high throughput on specialized hardware.
- The approach delivers significant speed, memory, and energy benefits while recovering over 95% of full-precision accuracy in both inference and training setups.
Floating-Point 4-bit (FP4) Microscaling Attention refers to a collection of quantization and computation techniques for efficiently implementing the attention mechanisms in Transformer-based neural networks using block-wise low-precision floating-point formats. These methods capitalize on recently introduced hardware support for FP4 (notably on NVIDIA Blackwell GPUs) and leverage per-block “microscaling” for both weights and activations, while tightly controlling quantization error and dynamic range limitations inherent to ultra-low-bit arithmetic. The principal design goal is to achieve near full-precision accuracy for attention while gaining substantial throughput, energy, and memory advantages.
1. Microscaling FP4 Data Formats and Quantization Methods
Microscaling formats—exemplified by MXFP4 and NVFP4—combine a compact per-element floating-point core (1 sign bit, 2 exponent bits, 1 mantissa bit; 4 bits total, type E2M1) with a per-block shared scale. A block contains 32 (MXFP4) or 16 (NVFP4) contiguous values, each quantized relative to the block's absolute maximum magnitude:

$$s = \frac{\max_i |x_i|}{6},$$

where 6 is the largest representable E2M1 magnitude. Each value is then mapped to the nearest representable FP4 number:

$$\hat{x}_i = Q_{\mathrm{FP4}}\!\left(\frac{x_i}{s}\right).$$

The group scale $s$ is itself quantized (to a power-of-two E8M0 value for MXFP4, or to FP8 E4M3 for NVFP4) and stored per block. At computation time, blockwise dequantization reconstructs activations and weights as $\tilde{x}_i = s \cdot \hat{x}_i$.
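To make the mapping concrete, the following NumPy sketch quantizes and dequantizes a single block against the E2M1 magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}. It is a minimal simulation: function names are illustrative, tie-breaking in round-to-nearest is arbitrary, and quantization of the scale itself (E8M0 or E4M3) is omitted.

```python
import numpy as np

# All non-negative magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_MAX = 6.0

def quantize_block(x):
    """Quantize one block of values to the E2M1 grid plus a shared scale."""
    s = max(np.abs(x).max() / E2M1_MAX, 1e-12)      # shared block scale (guarded against zero)
    # Map each scaled magnitude to the nearest E2M1 grid point.
    idx = np.abs(np.abs(x)[:, None] / s - E2M1_GRID).argmin(axis=1)
    x_hat = np.sign(x) * E2M1_GRID[idx]             # the 4-bit codes, kept as floats here
    return x_hat, s

def dequantize_block(x_hat, s):
    return s * x_hat                                 # reconstruct x_tilde = s * x_hat

# MXFP4 uses 32-element blocks; NVFP4 uses 16-element blocks.
x = np.random.randn(32).astype(np.float32)
x_hat, s = quantize_block(x)
print("max abs error:", np.abs(x - dequantize_block(x_hat, s)).max())
```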
Table: Format comparison (all E2M1 elements)
| Format | Block Size | Scale Type | Scale Bits | FP4 Element Bits | Scale Overhead (bits/elem) |
|---|---|---|---|---|---|
| MXFP4 | 32 | E8M0 (power of two) | 8 | 4 | 0.25 |
| NVFP4 | 16 | FP8 (E4M3) | 8 | 4 | 0.5 |
| AMXFP4 | 32 | FP8 (E5M2) ×2 | 16 | 4 | 0.5 (dual scales) |
AMXFP4 introduces separate positive and negative FP8 scales to mitigate group-wise asymmetry in the block’s value distribution (Lee et al., 2024).
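The following sketch illustrates one plausible reading of this dual-scale idea: each block keeps one shared scale for its positive values and one for its negative values. Helper names are illustrative and FP8 quantization of the scales is again omitted.

```python
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_asymmetric(x):
    """One shared scale per sign subgroup within a block (AMXFP4-style sketch)."""
    pos, neg = x.clip(min=0), (-x).clip(min=0)       # split into sign subgroups
    s_pos = max(pos.max() / 6.0, 1e-12)              # shared scale, positive values
    s_neg = max(neg.max() / 6.0, 1e-12)              # shared scale, negative values
    q = lambda v, s: E2M1_GRID[np.abs(v[:, None] / s - E2M1_GRID).argmin(axis=1)]
    return q(pos, s_pos), q(neg, s_neg), s_pos, s_neg

def dequantize_asymmetric(q_pos, q_neg, s_pos, s_neg):
    # Each element is nonzero in exactly one subgroup, so subtraction recombines them.
    return s_pos * q_pos - s_neg * q_neg
```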
2. Application in Transformer Attention and Integration Patterns
FP4 microscaling is employed to quantize all linear projections and matrix products in the multi-head attention block:
- $Q = XW_Q$,
- $K = XW_K$,
- $V = XW_V$,
- $S = QK^{\top}/\sqrt{d_k}$, with $P = \mathrm{softmax}(S)$,
- Output: $O = PV$, followed by the output projection $Y = OW_O$.
Each multiplication takes two tensor operands (e.g., $Q$ and $K^{\top}$), each partitioned and quantized in blocks; hardware units (e.g., Blackwell Tensor Cores) consume these blocks and scales directly, multiplying the FP4 elements $\hat{A}$ and $\hat{B}$ and applying the product of their FP8 scales $s_A s_B$ in fused multiply-accumulate (GEMM) instructions (Liu et al., 4 Aug 2025, Castro et al., 20 May 2025, Zhang et al., 16 May 2025).
Softmax is usually computed in higher precision (BF16/FP16/FP32) to avoid dynamic-range-related underflow and overflow; the post-softmax result may be further quantized with a secondary per-group scale, especially where the output will be immediately consumed by subsequent FP4 GEMMs (Zhang et al., 16 May 2025). A minimal numerical simulation of this flow is sketched below.
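The sketch below simulates the numerics of this pipeline end to end: Q, K, and V are fake-quantized blockwise, softmax runs in FP32, and the probability matrix P is re-quantized before the PV product. Shapes, block size, and names are illustrative; real kernels operate on packed 4-bit codes with FP8 scales fused into the MMA instructions, not on dequantized floats.

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant(x, block=16):                          # NVFP4-style 16-element blocks
    """Quantize-then-dequantize to simulate FP4 microscaling numerics."""
    xb = x.reshape(-1, block)
    s = np.maximum(np.abs(xb).max(axis=1, keepdims=True) / 6.0, 1e-12)
    idx = np.abs(np.abs(xb)[..., None] / s[..., None] - E2M1).argmin(-1)
    q = np.sign(xb) * E2M1[idx]
    return (q * s).reshape(x.shape)                   # dequantized simulation of FP4

T, d = 8, 64
Q, K, V = (np.random.randn(T, d).astype(np.float32) for _ in range(3))

S = fake_quant(Q) @ fake_quant(K).T / np.sqrt(d)      # first FP4xFP4 GEMM (simulated)
P = np.exp(S - S.max(-1, keepdims=True))
P /= P.sum(-1, keepdims=True)                         # softmax kept in FP32
O = fake_quant(P) @ fake_quant(V)                     # P re-quantized for the PV GEMM
```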
Plug-and-play software/kernels for attention with FP4 microscaling are designed as drop-in replacements for existing FlashAttention or PyTorch/xFormers kernels, with only minor model or config changes (Zhang et al., 16 May 2025).
3. Outlier and Dynamic Range Mitigation
Outlier elements in neural activations or weights pose a significant challenge with aggressive (4-bit) quantization since the shared group scale can be dominated by one large value, reducing representational fidelity for all other elements. Recent developments target these effects:
- Block Max Precision Boost (MX+): For each block, the largest-magnitude value (the block max, BM) is given additional mantissa bits by repurposing its exponent field, dramatically reducing quantization error for the outlier while leaving the other elements unchanged. MXFP4+ retains FP4 throughput while recovering most of the perplexity and accuracy lost by plain MXFP4, with empirically negligible overhead (Lee et al., 16 Oct 2025).
- Asymmetric Scaling (AMXFP4): By having separate shared FP8 scales for positive and negative subgroups within each block, AMXFP4 adapts to asymmetric distributions, which frequently occur in post-activation blocks (Lee et al., 2024).
- Hierarchical Scaling (NVFP4): Employs both global (per-tensor) and local (per-block) scaling, with blockwise scales stored in FP8 and a global scale to preserve dynamic range. Outlier-sensitive “hot” channels are identified and compensated (HCP) via online patching for further robustness (Dong et al., 2 Feb 2026).
- Blockwise Hadamard Transforms and MR-GPTQ: For post-training quantization, blockwise Hadamard rotations statistically "flatten" heavy tails, and MR-GPTQ (Micro-Rotated-GPTQ) exploits block alignment to minimize groupwise quantization error; this approach closes most of the accuracy gap between INT4 and MXFP4/NVFP4 in attention (Egiazarian et al., 27 Sep 2025). A minimal rotation sketch follows this list.
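As a concrete illustration of the rotation step, the sketch below builds an orthonormal Hadamard matrix by the Sylvester construction and applies it per 32-element block; the quantizer itself is elided (any block quantizer can be dropped in), and the heavy-tailed input is synthetic.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)                     # normalized: an orthonormal rotation

def rotate_blocks(x, block=32):
    """Rotate each contiguous block; H.T applied after dequantization undoes it."""
    H = hadamard(block)
    return (x.reshape(-1, block) @ H).reshape(x.shape)

# Because H is orthogonal, rotate -> quantize -> dequantize -> rotate back
# changes only the quantization error, never the exact GEMM result.
x = np.random.standard_cauchy(128).astype(np.float32)   # heavy-tailed input
y = rotate_blocks(x)
print(np.abs(x).max(), np.abs(y).max())       # rotation typically shrinks outliers
```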
Sample table: Empirical WikiText-2 PPL / Zero-shot accuracy (Llama-3.1-8B, selected formats)
| Format | PPL | Zero-shot Acc (%) | Notes |
|---|---|---|---|
| BF16 | 10.15 | 86.49 | Baseline |
| MXFP4 | 276.8 | 49.41 | Unusable |
| MXFP4+ | 12.01 | 70.29 | Block-max fixed |
| AMXFP4 | 5.47–6.22 | 62.0 | Asym. scale |
| NVFP4+HCP | -- | ≈95% of BF16 | HCP patch used |
4. Kernel, Circuit, and System-Level Advances
FP4 microscaling’s practical adoption has been enabled by several hardware and software innovations:
- Custom Mixed-Precision GEMM Kernels: Blackwell Tensor Cores support new MMA instructions that consume mixed MXFP4/6/8 operands, fusing dequantization and scaling for every 32-element block; support for per-channel or per-block mixed precision is directly integrated (Liu et al., 4 Aug 2025).
- Analog Compute-in-Memory (CIM): MXFormer demonstrates hybrid analog/digital Transformer execution, using MXFP4-coded weights in dense CTT arrays with precise per-block exponent alignment for static layers; dynamic attention is computed in digital systolic arrays with on-the-fly MXFP4→BF16 conversion (Karfakis et al., 12 Feb 2026).
- FP8 Scale Format Innovations: To avoid quantization-error "inversion" at smaller block sizes, the unsigned E5M3 (UE5M3) FP8 format (no sign bit, 5 exponent bits, 3 mantissa bits) provides a substantially wider dynamic range for block scales, eliminating the need for global scaling multipliers (Fasoli et al., 26 Jan 2026). A generic range comparison under stated assumptions is sketched below.
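The calculator below compares IEEE-style dynamic ranges of candidate scale formats. Two caveats: the UE5M3 bias of 15 is an assumption by analogy with other 5-exponent-bit formats, not a published constant, and OCP E4M3 reclaims part of its top exponent code (reaching 448), so the generic formula slightly understates its true range.

```python
def minifloat_range(exp_bits, man_bits, bias):
    """(smallest positive subnormal, largest finite) under IEEE-style rules."""
    max_exp_code = (1 << exp_bits) - 2            # top exponent code reserved for Inf/NaN
    largest = 2.0 ** (max_exp_code - bias) * (2.0 - 2.0 ** (-man_bits))
    smallest = 2.0 ** (1 - bias - man_bits)       # smallest subnormal magnitude
    return smallest, largest

print("E4M3 :", minifloat_range(4, 3, bias=7))    # IEEE-style ~(2**-9, 240); OCP spec reaches 448
print("UE5M3:", minifloat_range(5, 3, bias=15))   # far wider exponent range (bias assumed)
```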
5. Accuracy–Throughput–Efficiency Trade-offs
The accuracy, speed, memory, and energy implications of FP4 microscaling attention are empirical focal points:
- Inference speedups: Up to 5–6× single-layer and 2–4× end-to-end versus FP16 on Blackwell-class GPUs, with memory down by ≈20% (Liu et al., 4 Aug 2025, Zhang et al., 16 May 2025, Egiazarian et al., 27 Sep 2025).
- Training throughput: End-to-end speedups of 1.6–1.8× versus FP8 and 2.3–2.6× versus FP16/BF16 (Castro et al., 20 May 2025).
- Accuracy: With outlier control (MX+ or HCP), >95% of FP16 validation accuracy is routinely recovered on Llama and Qwen LLMs. On C4 language modeling, FP4 incurs a loss penalty of ≤0.03 at a data-to-parameter ratio (D/N) of 100× (Castro et al., 20 May 2025). For vision transformers, MXFP4+ and AMXFP4 restore performance from unusable (PPL > 200) to near full precision (Lee et al., 16 Oct 2025, Lee et al., 2024).
- Plug-and-play and retraining: For inference, FP4 microscaling kernels can be adopted without retraining in most scenarios; training stability under pure FP4 is achieved using stochastic rounding, blockwise randomized transforms, and scale-schedule optimizations (Castro et al., 20 May 2025, Chen et al., 28 Feb 2025); a stochastic-rounding sketch follows this list.
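As a concrete example of the stochastic-rounding ingredient, the sketch below rounds magnitudes onto the E2M1 grid with probability proportional to the distance to each neighbouring grid point, making the rounding unbiased in expectation; names are illustrative, and a real kernel emits packed 4-bit codes rather than floats.

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round_e2m1(x, rng=np.random.default_rng()):
    """Unbiased stochastic rounding of magnitudes onto the E2M1 grid."""
    mag = np.clip(np.abs(x), 0.0, 6.0)
    hi = np.clip(np.searchsorted(E2M1, mag), 1, len(E2M1) - 1)   # upper neighbour
    lo = hi - 1                                                  # lower neighbour
    gap = E2M1[hi] - E2M1[lo]
    p_up = (mag - E2M1[lo]) / gap               # chosen so that E[round(x)] = x
    rounded = np.where(rng.random(x.shape) < p_up, E2M1[hi], E2M1[lo])
    return np.sign(x) * rounded
```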
6. Stability, Oscillation Reduction, and Implementation Best Practices
Pure FP4 training is prone to "quantization oscillation," in which weights sitting near quantization thresholds flip between adjacent codes from step to step, destabilizing training:
- TetraJet and Q-EMA/Q-Ramping: Moving-average quantization (Q-EMA) and adaptive update schedules (Q-Ramping) suppress jitter, reduce training oscillations, and enable stable FP4 convergence with minor hyperparameter tuning (Chen et al., 28 Feb 2025); a minimal Q-EMA sketch follows this list.
- FP4 Scaling Laws: Loss in LLM pretraining can be accurately predicted as a function of model/data size and bit-width-induced effective compute (eff_N, eff_D factors), enabling principled selection of optimal FP4 vs FP8 vs FP16 for any compute/accuracy constraint (Castro et al., 20 May 2025).
- Practical kernel guidelines: Use unbiased stochastic rounding for the backward pass and deterministic round-to-nearest for the forward pass; for NVFP4, the small group size (k=16) precludes further outlier clipping; dual-scale (AMXFP4) and block-max mantissa boost (MX+) are preferred in high-outlier regimes (Lee et al., 2024, Lee et al., 16 Oct 2025).
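The sketch below illustrates the Q-EMA idea in isolation, under the assumption that a plain SGD master-weight update feeds an exponential moving average whose quantization drives the forward pass; the decay value and quantizer interface are illustrative, not the paper's exact formulation.

```python
import numpy as np

class QEMAWeight:
    """Quantize a moving average of the weights instead of the raw weights."""

    def __init__(self, w, decay=0.99):
        self.w = w                       # full-precision master weights
        self.ema = w.copy()              # moving average used for quantization
        self.decay = decay

    def step(self, grad, lr):
        self.w -= lr * grad              # ordinary SGD update on master weights
        self.ema = self.decay * self.ema + (1 - self.decay) * self.w

    def quantized(self, quantizer):
        # The average moves slowly, so a weight hovering near a quantization
        # threshold no longer flips between adjacent codes every step.
        return quantizer(self.ema)
```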
7. Open Challenges and Future Directions
- FP4 end-to-end pretraining: While several frameworks achieve robust inference and stable training at 4 bits, high-quality pretraining with FP4 remains challenging; fine-grained outlier dynamics and post-QK operation protection (e.g., partial BF16 fallback) are sometimes necessary (Dong et al., 2 Feb 2026).
- Per-block vs per-tensor scaling limits: Recent results show anomalous degradation as block size drops below critical values, highlighting the interplay of groupwise quantization and scale dynamic range (Fasoli et al., 26 Jan 2026). Theoretical error decomposition now guides selection of block sizes and scale types.
- Extension to other Transformer primitives: Researchers are expanding FP4 microscaling beyond attention to cover FFN, LayerNorm, and potentially the full Transformer pipeline (FlashInfer, softmax, residuals) (Zhang et al., 16 May 2025).
- CIM and hardware-adaptive design: As hybrid analog/digital architectures (e.g., MXFormer) mature, mixed-mode FP4 microscaling will be further optimized for energy/area density, and blockwise exponent alignment will become a standard primitive (Karfakis et al., 12 Feb 2026).
FP4 microscaling attention now constitutes a mature quantization and computation suite for accelerating Transformer inference and, increasingly, training—offering a well-understood trade space of accuracy, speed, and hardware compatibility (Liu et al., 4 Aug 2025, Lee et al., 16 Oct 2025, Dong et al., 2 Feb 2026, Castro et al., 20 May 2025, Zhang et al., 16 May 2025, Lee et al., 2024, Fasoli et al., 26 Jan 2026, Egiazarian et al., 27 Sep 2025, Karfakis et al., 12 Feb 2026, Chen et al., 28 Feb 2025).