Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Published 18 Jun 2026 in cs.AI | (2606.20381v1)

Abstract: FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.

Summary

  • The paper introduces the concept of Shrinkage Bias in non-uniform FP4 grids, revealing its geometric origin and its role in systematic signal attenuation.
  • The paper employs analytical derivations and empirical diagnostics on various LLM scales to quantify the bias, showing that uniform grids like E1M2/INT4 eliminate this issue.
  • The paper presents the UFP4 training recipe, integrating full RHT coverage and limited stochastic rounding to achieve reduced BF16-relative loss degradation across scales.

Summary of "Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe" (2606.20381)

FP4 Grid Geometry: Shrinkage Bias and Its Accumulation

The paper investigates the implications of 4-bit floating-point (FP4) data formats in LLM pretraining, critically analyzing the prevailing use of non-uniform E2M1 grids in hardware platforms (e.g., NVIDIA Blackwell/Rubin, AMD MI350) and training recipes. The authors formalize Shrinkage Bias: a systematic negative rounding error inherent to non-uniform FP4 formats like E2M1, arising from geometric asymmetry in their RTNE (Round-to-Nearest-Even) rounding bins. This bias leads to a multiplicative signal attenuation across network layers, distinct from zero-mean stochastic quantization error.

Through both analytic derivations and empirical diagnostics on MLP/attention layers, the bias is quantified and shown to propagate, especially when tensors are subjected to Random Hadamard Transform (RHT) for outlier mitigation. RHT, while intended to improve codebook utilization, transfers tensor mass into the most asymmetrical — and hence most biased — bins of E2M1, correspondingly degrading signal fidelity and training stability.
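
To make the outlier-dispersal step concrete, here is a minimal sketch of a randomized block-Hadamard rotation on one 16-element block. It is an illustration only, not the paper's kernel: the function names, the random sign vector, and the sample block with a single outlier are all assumptions chosen for the example.

```python
import math
import random


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def random_hadamard_transform(x, signs):
    """Rotate one block: y = (1 / sqrt(n)) * H @ (signs * x). The map is orthogonal."""
    n = len(x)
    h = hadamard(n)
    flipped = [s * v for s, v in zip(signs, x)]
    return [sum(h[i][j] * flipped[j] for j in range(n)) / math.sqrt(n) for i in range(n)]


rng = random.Random(0)
signs = [rng.choice((-1, 1)) for _ in range(16)]  # hypothetical random sign flips

block = [0.1] * 16
block[3] = 8.0  # hypothetical outlier that would dominate the block scale

rotated = random_hadamard_transform(block, signs)

norm_before = math.sqrt(sum(v * v for v in block))
norm_after = math.sqrt(sum(v * v for v in rotated))
peak_before = max(abs(v) for v in block)
peak_after = max(abs(v) for v in rotated)
print(norm_before, norm_after, peak_before, peak_after)
```

The norm is unchanged while the peak magnitude drops, so after scaling to the block maximum more coordinates land in the middle of the grid. Whether that helps or hurts then depends on how the grid's bins are shaped there.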

Uniform FP4 grids such as E1M2 or INT4, by contrast, do not exhibit this geometric bias. Because their levels are evenly spaced, each rounding bin is symmetric about its level, so rounding is unbiased and magnitude is preserved after quantization regardless of tensor rotation or outlier dispersal.
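
The geometric argument can be checked with a few lines of arithmetic. The sketch below assumes round-to-nearest and a locally uniform density inside each bin (the same simplification the theory is said to rely on), and computes, for every interior level, the level minus the center of its rounding bin. A negative value means values in that bin are pulled toward zero on average. The E2M1 magnitudes are the standard ones; the uniform grid's normalization (steps of 0.5) and the helper name are assumptions for illustration. The first and last levels are skipped because their bins depend on clipping and scaling.

```python
# Non-negative magnitudes representable in E2M1.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# An E1M2/INT4-style uniform grid; the 0.5 step is an assumed normalization.
UNIFORM = [0.5 * i for i in range(8)]


def interior_bin_bias(grid):
    """Map each interior level to (level - bin center) under round-to-nearest.

    The bin of a level spans from the midpoint with its lower neighbor to the
    midpoint with its upper neighbor. With a locally uniform density, the mean
    rounding error (quantized minus original) in that bin is level - center.
    """
    bias = {}
    for i in range(1, len(grid) - 1):
        lo = (grid[i - 1] + grid[i]) / 2
        hi = (grid[i] + grid[i + 1]) / 2
        bias[grid[i]] = grid[i] - (lo + hi) / 2
    return bias


e2m1_bias = interior_bin_bias(E2M1)
uniform_bias = interior_bin_bias(UNIFORM)

for level, b in e2m1_bias.items():
    print(f"E2M1 level {level}: mean error {b}")
for level, b in uniform_bias.items():
    print(f"uniform level {level}: mean error {b}")
```

In this simplified model the only biased E2M1 bins are those at levels 2 and 4, where the spacing doubles, and both biases are negative. Every interior bin of the uniform grid has zero mean error.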

UFP4 Training Recipe: Design and Evaluation

Informed by the geometric analysis, the paper introduces UFP4, an FP4 training recipe with the following defining characteristics:

  • Uniform Grid Quantization: E1M2/INT4-style grids supplant non-uniform E2M1, eliminating Shrinkage Bias.
  • Full RHT Coverage: RHT is applied to operands for all linear-layer GEMMs (forward, data-gradient, weight-gradient), not just weight-gradient as in standard E2M1/NVFP4 recipes.
  • Limited Stochastic Rounding: SR is employed only on dY (upstream gradients), not other GEMM operands.
  • Matched Auxiliary Settings: Block size, scale hierarchy, and SR scope are matched in comparison experiments, isolating the impact of grid selection and RHT scope.
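
As a rough illustration of why stochastic rounding is attractive for dY, the sketch below contrasts it with round-to-nearest on a uniform grid. The step size, the sample value, and the function names are assumptions for the example; this is not the paper's quantizer, which also involves block scaling.

```python
import math
import random


def stochastic_round(x, step, rng):
    """Round x to a multiple of step, rounding up with probability equal to
    its fractional position between the two neighbors (unbiased in expectation)."""
    lo = math.floor(x / step) * step
    p_up = (x - lo) / step
    return lo + step if rng.random() < p_up else lo


def round_nearest(x, step):
    """Deterministic round-to-nearest; Python's round sends ties to the even multiple."""
    return round(x / step) * step


rng = random.Random(0)
x, step = 0.7, 0.5  # hypothetical gradient value and grid spacing

samples = [stochastic_round(x, step, rng) for _ in range(20000)]
sr_mean = sum(samples) / len(samples)
rtn_value = round_nearest(x, step)

print("SR mean:", sr_mean)      # close to 0.7 on average
print("RTN value:", rtn_value)  # always the same neighbor
```

Stochastic rounding trades a deterministic error for extra variance. That trade-off is consistent with the recipe's choice to confine it to a single operand rather than apply it everywhere.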

Extensive empirical validation is conducted on Dense 1.5B, MoE 7.9B, and MoE 124B configurations. At every scale, UFP4 shows lower BF16-relative loss degradation than the best-tuned E2M1 baselines (e.g., 1.2570% → 0.9673% for Dense 1.5B). Scaling-law analyses indicate that the UFP4 advantage persists at larger compute budgets, with the loss penalty relative to BF16 narrowing as compute increases.
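
For readers who want to put the reported gap in proportion, the following sketch works through the arithmetic. It assumes that BF16-relative loss degradation means (FP4 loss − BF16 loss) / BF16 loss; the summary does not spell out the exact formula, so treat that definition and the function names as assumptions. Only the two percentages come from the text above.

```python
def bf16_relative_degradation(loss_fp4, loss_bf16):
    """Assumed definition: extra loss of the FP4 run as a fraction of the BF16 loss."""
    return (loss_fp4 - loss_bf16) / loss_bf16


def relative_gap_reduction(gap_before_pct, gap_after_pct):
    """Fraction of the original gap that was removed."""
    return (gap_before_pct - gap_after_pct) / gap_before_pct


# Dense 1.5B gaps quoted above: 1.2570% (E2M1 baseline) and 0.9673% (UFP4).
reduction = relative_gap_reduction(1.2570, 0.9673)
print(f"{reduction:.1%}")  # prints 23.0%
```

So on the Dense 1.5B run the gap to BF16 shrinks by a little under a quarter; it is narrowed, not closed.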

Ablation studies further show that full RHT coverage is beneficial for UFP4 but harmful for E2M1, where RHT amplifies the geometric bias. Attempts to emulate uniform-grid behavior by restricting E2M1’s dynamic range fail to overcome its underlying bias and lead to suboptimal bucket utilization and increased training loss.

Hardware and Software Implications

The results and analysis advocate for a shift in hardware support. While E2M1 remains suited for range-limited inference scenarios and raw outlier-heavy tensors, current training pipelines should not default to it as the sole FP4 primitive. The paper recommends future ML accelerators support E1M2/INT4-style uniform grids as first-class training formats.

Furthermore, the study establishes that techniques for quantization/estimator stability — including block scaling, scale hierarchy, adaptive rounding, tensor-side preprocessing (RHT, QuaRot, SpinQuant, FlatQuant, SVD decomposition) — are orthogonal and complementary to uniform grid selection. When coupled with unbiased uniform grids, these approaches can more effectively translate improved tensor distributions into quantization fidelity.

Kernel fusion of RHT and quantization is shown experimentally to be efficient, incurring only minor overhead (1.06–1.07× the cost of standalone quantization).

Practical and Theoretical Impact, Future Directions

The theoretical formalization of Shrinkage Bias and its empirical validation directly explain why training instability and loss degradation persist in current E2M1-based FP4 recipes, despite various stabilization techniques. The UFP4 recipe provides a practical solution applicable at industrial scale, indicating uniform grid-based FP4 quantization delivers superior training stability, especially as RHT and other preprocessing become standard.

Practically, this work motivates hardware manufacturers to adapt FP4 training interfaces, allowing LLM practitioners to leverage both memory/computation efficiencies and quantization stability. Theoretically, the results sharpen understanding of quantizer-induced bias in deep network training and establish guidelines for numerical format selection in low-bit regimes.

Future research questions include: optimal scheduling and further fusion of tensor preprocessing and quantization, the interplay between uniform grids and advanced gradient estimators, and validation of UFP4 on even larger models and across additional hardware platforms (e.g., Ascend NPUs adopting HiFloat4).

Conclusion

This paper provides a rigorous examination of Shrinkage Bias as a systemic issue in FP4 pretraining for LLMs, traced to geometric asymmetry in non-uniform grids and exacerbated by standard outlier-mitigation strategies. The proposed UFP4 recipe — using E1M2/INT4-style uniform grids and full RHT coverage — delivers consistently improved training stability and reduced BF16-relative loss across scales. The findings call for a reconsideration of FP4 training primitives in hardware-software stacks, recommending uniform grids as first-class formats for efficient and stable LLM pretraining (2606.20381).

Explain it Like I'm 14

What this paper is about (in simple terms)

Training huge LLMs is expensive because they use lots of memory and math. One way to make training cheaper is to use tiny numbers (just 4 bits) instead of bigger ones. This paper asks: are we using the right kind of 4‑bit numbers? The authors show that the most common 4‑bit format (called E2M1) has a hidden problem that quietly weakens signals during training. They explain why this happens, show how the problem builds up across layers, and introduce a new recipe called UFP4 that avoids the issue and makes training steadier and more accurate.

The main questions the paper asks

  • Does the exact “shape” of the 4‑bit number line matter for training?
  • Why do some current 4‑bit setups feel unstable or lose accuracy compared to higher‑precision training (like BF16)?
  • Can a simple change—switching to a uniform 4‑bit grid and adjusting where we apply a “mixing” step—fix that?
  • Will this work on both small and giant models?
  • Is the fix practical on real hardware (fast enough and easy to run)?

How they approach it (explained with everyday analogies)

  • 4‑bit numbers as marks on a ruler:
    • Imagine you’re rounding measurements to marks on a ruler. If the marks are evenly spaced (uniform), rounding errors cancel out. If the marks are uneven (non‑uniform), rounding tends to pull numbers in one direction.
    • E2M1 is like a ruler with uneven spacing between marks. E1M2/INT4 are like rulers with even spacing.
  • The hidden problem: “Shrinkage Bias”
    • With uneven marks, rounding tends to push values slightly toward zero on average. Think of it like turning down the volume just a tiny bit every time you pass sound through a filter.
    • In deep networks, you pass signals through many layers. If each layer shrinks the signal a little, the total shrinkage multiplies across layers—like a quiet whisper getting even quieter as it’s passed along.
  • The mixing step (RHT = Random Hadamard Transform)
    • Before rounding to 4‑bit, many recipes “mix” or spread out large spikes (outliers) across many positions so everything fits better. Picture spreading a pile of sand evenly across a tray so no spot sticks up too high.
    • This mixing helps use the 4‑bit levels more evenly. But if your ruler is uneven (E2M1), the mixing can actually push more values into the worst parts of that ruler, making the shrinkage worse.
    • If your ruler is even (E1M2/INT4), the mixing works as intended: it spreads values and rounding stays fair.
  • Measuring success
    • They test at three levels: tiny parts (individual tensors), medium parts (single matrix multiplications, called GEMMs, the “math engines” of neural nets), and full end‑to‑end training on models ranging from 1.5B to 124B parameters.
    • They compare against BF16 (a common higher‑precision format treated as the “gold standard”) and track how much extra loss (error) 4‑bit training adds.
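
The "volume turned down a little at every layer" picture can be tried out with a toy calculation. The numbers here are made up for illustration (a 1% loss per layer and 48 layers are not from the paper); the point is only that small per-layer shrinkage multiplies.

```python
def compounded_gain(per_layer_gain, num_layers):
    """Multiply the same gain factor once per layer."""
    gain = 1.0
    for _ in range(num_layers):
        gain *= per_layer_gain
    return gain


# Hypothetical numbers: each layer keeps 99% of the signal, over 48 layers.
biased = compounded_gain(0.99, 48)
# An unbiased grid keeps the signal's average size the same at each layer.
unbiased = compounded_gain(1.0, 48)

print(f"biased grid keeps {biased:.2f} of the signal")
print(f"unbiased grid keeps {unbiased:.2f} of the signal")
```

A 1% loss sounds harmless, but after 48 layers only about 62% of the signal is left in this toy setup.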

What they found (and why it matters)

  • Uniform grids avoid the bias:
    • E2M1’s uneven spacing causes a systematic “toward‑zero” rounding error (Shrinkage Bias).
    • This bias accumulates across layers and quietly dims signals.
    • E1M2/INT4 use even spacing, so this bias disappears.
  • Mixing (RHT) helps only with the right grid:
    • With E2M1 (uneven ruler), RHT often makes things worse by sending values into the most biased zones.
    • With E1M2/INT4 (even ruler), RHT boosts quality because spreading values plus fair rounding works well together.
  • The UFP4 recipe:
    • Switch to a uniform 4‑bit grid (E1M2/INT4 style).
    • Apply the mixing step (RHT) everywhere it matters—on all three big GEMMs used in training: forward pass, data‑gradient, and weight‑gradient.
    • Use stochastic rounding (a fair, coin‑flip style rounding) only for one gradient (called dY) to keep gradients unbiased.
    • Result: steadier training and smaller accuracy gaps to BF16, not just on small models but also on very large ones (including a 124B Mixture‑of‑Experts model).
  • Concrete improvements:
    • Across multiple long training runs, the uniform‑grid UFP4 recipe consistently had lower extra loss than strong E2M1‑based baselines. For example, the BF16‑relative loss gap dropped from about 1.26% to about 0.97% on the 1.5B dense model, with similar improvements on the large MoE models.
    • Trying to “imitate” a uniform grid by restricting E2M1’s range didn’t work well—it reduced range too much and still didn’t match UFP4’s stability or accuracy.
    • Performance overhead is small: the mixing step can be fused with quantization, adding only about 6–7% more time for that part, which is practical.

Why this is important

  • Cheaper, greener training:
    • Using 4‑bit numbers can cut memory and compute costs a lot, making large‑scale training more affordable and energy‑efficient.
  • Stability and accuracy:
    • The paper shows that the problem wasn’t just “4‑bit is too small,” but “we picked the wrong 4‑bit grid for the way we process tensors.” With a better grid and a better choice of where to apply mixing, 4‑bit training can be both stable and accurate.
  • Hardware guidance:
    • Today’s hardware mostly focuses on E2M1. The results suggest future chips should also support uniform 4‑bit formats (E1M2/INT4) as a first‑class option, so training systems can use recipes like UFP4 reliably.

Simple takeaway

If you must round everything to a tiny set of numbers, make sure the “ruler” you’re rounding to has evenly spaced marks. That small design choice removes a quiet but powerful source of error that otherwise builds up across layers. By pairing an even 4‑bit grid with a smart mixing step and careful rounding, training huge LLMs can be both cheaper and more stable—bringing us closer to faster, greener AI without sacrificing learning quality.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains uncertain or unexplored, aimed to guide future research:

  • Formal assumptions in the shrinkage-bias theory: The propagation analysis relies on locally uniform bin density, no clipping (t ≤ max(G)), and incoherent residuals in GEMMs. It lacks formal bounds under realistic distributions, clipping/saturation, correlated errors, residual connections, LayerNorm, and nonlinearity; rigorous end-to-end guarantees are missing.
  • Role of stochastic rounding (SR) beyond dY: The core bias argument targets RTNE; SR is only applied to dY in UFP4. It remains unclear how applying SR on fwd_y and bwd_dx (or on all operands) would affect bias/variance trade-offs, convergence, and final quality under a uniform grid.
  • Alternative rounding schemes: The paper does not evaluate bias-corrected RTN, dithered quantization, Kahan-compensated accumulation, or learned rounding (e.g., FAAR-like) on uniform grids; it is unknown whether these could further close the BF16 gap.
  • Interaction with scale hierarchy: Results are reported with matched FP32 single-level scales and no 2D weight scaling. The impact of two-level/hierarchical scaling, per-channel scales, or adaptive block scales combined with E1M2/INT4 and full-RHT remains unquantified.
  • Dynamic-range vs. local-resolution trade-offs: Uniform grids reduce geometric bias but may increase underflow/overflow risk. The paper does not report clipping/overflow rates, scale distributions, or guardband requirements across layers, models, and training phases.
  • Sensitivity to RHT configuration: Only block-Hadamard (H16) aligned with 1×16 quant blocks is studied. The effects of larger/smaller blocks, multi-stage/learned rotations, per-channel or per-head rotations, or hybrid rotations on utilization, SQNR, and stability are open.
  • Alternatives to Hadamard: Other orthogonal transforms (e.g., randomized DCT, Butterfly, Householder stacks, or learned/quasi-orthogonal rotations) might improve bucket utilization or reduce bias; no comparison is provided.
  • Format variants and mixed formats: While E1M2 and INT4-style uniform grids are advocated, the paper does not directly compare E1M2 vs INT4 (with identical scaling/blocks) or evaluate mixed per-block format selection (e.g., MixFP4) under full-RHT.
  • Range-restricted E2M1 beyond simple max cuts: The negative result for crude range restriction (max_fpx ∈ {2,3,4}) leaves open whether smarter, data-driven codebook pruning, non-uniform-to-uniform remapping, or per-block codebook warping could emulate uniform-grid behavior.
  • Generality across architectures and components: Experiments focus on transformer LLMs (Dense and MoE) with SwiGLU MLPs. Effects on other activations, attention variants, normalization schemes, residual scaling strategies, and non-LLM modalities (CV, speech, RL) are unknown.
  • Training regimes beyond pretraining: The impact of UFP4 on instruction tuning, RLHF/DPO, domain adaptation, and low-data finetuning (where optimization noise and curvature differ) is not evaluated.
  • Downstream/task-level metrics: Results are reported as LM loss/perplexity deltas; there is no assessment of downstream benchmarks (reasoning, coding, safety), calibration, or sample quality, limiting conclusions about real-world utility.
  • Robustness and variance: The paper does not report seed variance, instability/divergence rates, or sensitivity to batch size, gradient clipping, learning-rate schedules, or optimizer settings under UFP4 vs E2M1.
  • Optimizer state precision: Master weights remain FP32; the effect of quantizing optimizer states (e.g., FP8/FP4 moments) alongside UFP4 on convergence, memory, and speed is unknown.
  • Communication and distributed systems: Full-RHT on three GEMMs may interact with tensor/model parallelism and MoE all-to-all patterns. The impact on communication volume, overlap, and throughput in large-scale clusters is not measured.
  • End-to-end performance/efficiency: Kernel-level fused RHT+quantization overhead is small, but full-system throughput, energy efficiency, and cost-per-token comparisons (including memory traffic, pipeline bubbles, and comms) are not reported.
  • Long-context and curriculum effects: Whether UFP4 behavior changes with very long sequences, curriculum schedules, or dynamic token mixing (where activation scales shift) remains untested.
  • Clipping-aware analysis: The theoretical and empirical sections largely set aside clipping; quantifying how often E1M2/INT4 incurs saturation/underflow (vs E2M1), and its effect on shrinkage, gradient flow, and convergence, is an open need.
  • Coherent attenuation measurement: The “coherent attenuation” factor κ (or α) is posited but not directly measured per layer/path; methods to estimate it online and use it for bias correction or adaptive scaling are unexplored.
  • Safety margins for deployment: Guidance on selecting block sizes, transforms, and scales to meet target failure rates (e.g., no divergence at trillion-token runs) is absent; reliability curves and failure-mode taxonomies are needed.
  • Hardware design questions: Concrete microarchitectural implications of first-class E1M2/INT4 support (subnormals, denorm handling, scale packing, SR units, fused RHT pipelines), area/power trade-offs, and backward compatibility with E2M1 are not addressed.
  • Compatibility with other stabilizers: The paper argues many E2M1-centric methods “treat symptoms.” It remains to be tested whether combining UFP4 with Quartet II, TetraJet-v2, FAAR, or outlier-channel separation yields additive gains or redundancy.
  • Formal closing of the FP4-to-BF16 gap: UFP4 narrows but does not remove the BF16 gap. What additional ingredients (format, rounding, scaling, training schedules) are required to reach parity on large-scale pretraining remains an open target.
  • Public reproducibility: Detailed configs, seeds, code, and logs for large runs are not provided; reproducibility and cross-lab validation on different hardware stacks are still needed.
  • Broader metrics of quality and risk: Effects on calibration, uncertainty, bias/fairness, toxicity, and robustness to distribution shift under UFP4 vs E2M1 are unexamined.

Practical Applications

Immediate Applications

The paper’s findings and UFP4 recipe enable several concrete actions that teams can deploy now, especially where uniform 4‑bit grids (E1M2/INT4‑style) and RHT can be implemented in software.

  • Deploy UFP4 in LLM pretraining to reduce BF16-relative degradation while using 4-bit precision
    • Sector: Software (AI/ML platforms), Cloud/AI training services, Enterprise AI teams
    • Tools/Products/Workflows: Integrate UFP4 recipe into PyTorch/JAX training stacks; enable RHT on all three GEMMs (FPROP/DGRAD/WGRAD) with block size 16; restrict stochastic rounding (SR) to dY; keep 1×16 quant blocks and FP32 single-level scaling as in the paper
    • Assumptions/Dependencies: Availability of uniform-grid quantization operators (E1M2/INT4-style) and fused RHT+quant kernels; ability to apply RHT along the shared reduction dimension (tensor shapes divisible by Hadamard block size); SR availability on dY (software SR if hardware SR is absent)
  • Retune existing FP4 pipelines: keep E2M1 for outlier-heavy tensors but switch to uniform grids post-RHT
    • Sector: Software (training systems), Cloud providers
    • Tools/Products/Workflows: Maintain E2M1/NVFP4 where RHT is off or limited; enable UFP4 when RHT is applied; encode a policy: “RHT on ⇒ use uniform grid; RHT off ⇒ E2M1 acceptable”
    • Assumptions/Dependencies: Mixed-format support within quantization libraries and kernel dispatchers; runtime or offline heuristics to detect when RHT pushes tensors into a local‑resolution‑limited regime
  • Adopt fused RHT+quantization kernels to minimize overhead
    • Sector: Software/Systems, GPU kernel libraries
    • Tools/Products/Workflows: Implement or integrate fused block-Hadamard + blockwise quantization kernels (block size 16) in CUTLASS/cuBLASLt/Triton; derive per-block scales after RHT without materializing rotated tensors
    • Assumptions/Dependencies: Kernel engineering resources; benchmarks on target GPUs (e.g., SM90/SM100) confirm ~1.06–1.07× overhead vs. standalone quant; end-to-end training integration costs remain manageable
  • Add Shrinkage Bias diagnostics to training telemetry to catch format-induced instability
    • Sector: Software tooling, MLOps
    • Tools/Products/Workflows: Compute effective bucket ratio (entropy-based), ΔSQNR pre/post-RHT, and per-GEMM attenuation factors (αA, αB) to detect multiplicative signal decay; alert when E2M1+RHT pushes mass into asymmetric bins
    • Assumptions/Dependencies: Low-overhead probes during forward/backward passes; log aggregation/alerting in training dashboards
  • Use UFP4 for MoE and dense LLMs to cut cost/energy while preserving quality
    • Sector: Cloud/Datacenters, Energy/Sustainability, Finance/Healthcare/Enterprise AI (cost-sensitive training)
    • Tools/Products/Workflows: Swap BF16 or E2M1‑based FP4 pretraining with UFP4 on dense 1.5B and MoE 7.9B–124B class models; adopt the paper’s RHT/SR scopes and block configs
    • Assumptions/Dependencies: Uniform-grid path deliverable on target accelerators; training hyperparameters otherwise unchanged; evaluation confirms loss gap reduction in the target domain
  • Establish engineering guidelines for RHT scope and SR placement
    • Sector: Software/Training Ops
    • Tools/Products/Workflows: Codify recipe principles: (1) full RHT across FPROP/DGRAD/WGRAD is beneficial with uniform grids, (2) apply SR only to dY, (3) avoid RHT on non‑leaf paths when stuck on E2M1
    • Assumptions/Dependencies: Team processes to standardize recipes across projects; CI pipelines to validate adherence and quality impacts
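
The telemetry idea above could start from something as small as the sketch below, which computes an entropy-based bucket-utilization score and an SQNR for one quantized block. The exact definitions used in the paper are not given in this summary, so the formula exp(entropy) / number of levels, the absmax scaling, the tie-breaking rule, and all function names are assumptions for illustration.

```python
import math

# Non-negative magnitudes representable in E2M1.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block, grid):
    """Absmax-scale a block onto the grid and round each magnitude to the
    nearest level (ties go to the lower level here, for simplicity)."""
    scale = max(abs(v) for v in block) / grid[-1] or 1.0  # all-zero block: scale 1
    indices, dequant = [], []
    for v in block:
        mag = abs(v) / scale
        idx = min(range(len(grid)), key=lambda i: abs(grid[i] - mag))
        indices.append(idx)
        dequant.append(math.copysign(grid[idx] * scale, v))
    return indices, dequant


def effective_bucket_ratio(indices, num_levels):
    """Assumed definition: exp(entropy of bucket usage) / number of buckets."""
    counts = [indices.count(i) for i in range(num_levels)]
    probs = [c / len(indices) for c in counts if c]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy) / num_levels


def sqnr_db(x, xq):
    """Signal-to-quantization-noise ratio in decibels."""
    signal = sum(v * v for v in x)
    noise = sum((a - b) ** 2 for a, b in zip(x, xq))
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)


outlier_block = [0.1] * 15 + [6.0]                          # one outlier dominates
spread_block = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0] * 2  # every level used twice

outlier_idx, outlier_q = quantize_block(outlier_block, E2M1)
spread_idx, spread_q = quantize_block(spread_block, E2M1)

outlier_ratio = effective_bucket_ratio(outlier_idx, len(E2M1))
spread_ratio = effective_bucket_ratio(spread_idx, len(E2M1))
outlier_sqnr = sqnr_db(outlier_block, outlier_q)

print(outlier_ratio, spread_ratio, outlier_sqnr)
```

Logging these two numbers before and after RHT for a sample of tensors would give a ΔSQNR-style signal. High utilization paired with falling SQNR on E2M1 is the pattern the paper attributes to mass moving into asymmetric bins.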

Long-Term Applications

The paper’s systemic analysis and results motivate hardware, software, and standards evolution toward uniform 4‑bit grids as first-class training primitives.

  • Add first-class uniform FP4 training data elements (E1M2/INT4-style) in future accelerators
    • Sector: Semiconductors/Hardware, Cloud infrastructure
    • Tools/Products/Workflows: Design tensor cores and memory formats that natively support uniform FP4 grids alongside E2M1; expose fused RHT-friendly quantization paths; provide hardware SR on dY
    • Assumptions/Dependencies: ASIC area/power trade-offs; vendor toolchain updates; verification that uniform grids meet inference/training needs across workloads
  • Introduce framework-level APIs for grid selection, RHT fusion, and SR control
    • Sector: Software frameworks (PyTorch, JAX), Compilers (XLA, TorchInductor), Graph optimizers
    • Tools/Products/Workflows: High-level API to declare grid geometry per tensor/GEMM; automatic block-Hadamard insertion along reduction dims; kernel fusion passes; runtime selection policies based on bucket utilization and ΔSQNR
    • Assumptions/Dependencies: Stable operator specs; kernel library support; minimal graph-breaks with fused ops
  • Develop bias-aware quantization planners that optimize grid + RHT jointly
    • Sector: Software/Algorithms research, AutoML
    • Tools/Products/Workflows: Planners that analyze tensor stats to pick E1M2 vs. E2M1 (or mixed) per block/layer; adjust RHT scope and block sizes; integrate with adaptive rounding (e.g., FAAR) for uniform grids
    • Assumptions/Dependencies: Low runtime overhead for statistics collection; robust heuristics generalizing across models and training phases
  • Standardize evaluation and benchmarks for low-precision training that include Shrinkage Bias metrics
    • Sector: Benchmarking/Standards (MLPerf, ONNX/Khronos, IEEE FP)
    • Tools/Products/Workflows: Extend benchmarks with effective-bucket ratio, ΔSQNR, αAαB attenuation, BF16-relative loss across long-run pretraining; define ONNX ops for E1M2 FP4 quantization and fused RHT
    • Assumptions/Dependencies: Community consensus; vendor participation; reproducible measurement protocols
  • Enable industry-specific pretraining (privacy/compliance settings) with lower cost and energy
    • Sector: Healthcare, Finance, Government, Telco
    • Tools/Products/Workflows: On-prem or sovereign-cloud pretraining using UFP4 to meet budget and sustainability goals; MoE architectures at 4-bit to fit constrained clusters
    • Assumptions/Dependencies: Security and compliance reviews of low-precision training; availability of uniform-grid hardware or efficient software emulation
  • Advance theory and pedagogy around geometric bias in quantization
    • Sector: Academia/Education
    • Tools/Products/Workflows: Courses, labs, and research projects on grid geometry, bin asymmetry, and multiplicative error accumulation; formal analyses of RHT interactions with non‑uniform vs. uniform grids
    • Assumptions/Dependencies: Access to open-source kernels/datasets; reproducible baselines
  • Create reliability and safety checks for low-precision training pipelines
    • Sector: MLOps/SRE, Policy/Governance
    • Tools/Products/Workflows: Conformance tests that detect shrinkage-induced underfitting; guardrails to disable harmful RHT scopes on non‑uniform grids; reporting of low-precision compliance in model cards
    • Assumptions/Dependencies: Organizational processes for model governance; alignment with emerging AI assurance frameworks
  • Hybrid format ecosystems for training and inference
    • Sector: Hardware/Software co-design, Inference platforms
    • Tools/Products/Workflows: Train with uniform FP4 (UFP4) and deploy with mixed formats (e.g., non-uniform FP4 or INT4/FP8) tuned for inference constraints; compiler passes to convert checkpoints across grids with minimal loss
    • Assumptions/Dependencies: Cross-format checkpoint converters; calibration procedures for inference quantization

Notes on feasibility and dependencies:

  • Availability of uniform-grid 4‑bit GEMM on current hardware varies; where absent, software emulation may reduce the throughput gains of FP4 training.
  • RHT requires power-of-two block sizes on the reduction dimension; mismatched shapes may need padding or alternate orthogonal transforms.
  • SR limited to dY presumes either hardware SR support or performant software SR; determinism and reproducibility controls may be needed.
  • While UFP4 consistently reduced BF16-relative loss in the paper’s settings, some architectures/datasets may require tuning (block size, scaling hierarchy, RHT scope).
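
The shape constraint in the second note can be illustrated with a small sketch. It zero-pads a reduction dimension of length 20 to a multiple of 16 and applies the same block rotation to both operands, then checks that the dot product (one entry of a GEMM output) is unchanged before any quantization. The helper names, the sample vectors, and the choice of zero-padding are assumptions for illustration, not the paper's implementation.

```python
import math
import random


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def pad_to_multiple(vec, block):
    """Zero-pad so the length is a multiple of the Hadamard block size."""
    return vec + [0.0] * ((-len(vec)) % block)


def rotate_blocks(vec, signs, block=16):
    """Apply the same randomized Hadamard rotation to each contiguous block."""
    h = hadamard(block)
    out = []
    for start in range(0, len(vec), block):
        chunk = [s * v for s, v in zip(signs, vec[start:start + block])]
        out.extend(
            sum(h[i][j] * chunk[j] for j in range(block)) / math.sqrt(block)
            for i in range(block)
        )
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical operands sharing a reduction dimension of length 20.
x = [0.1 * i for i in range(20)]
w = [0.5 if i % 2 == 0 else -0.5 for i in range(20)]

rng = random.Random(1)
signs = [rng.choice((-1, 1)) for _ in range(16)]

x_pad, w_pad = pad_to_multiple(x, 16), pad_to_multiple(w, 16)
x_rot, w_rot = rotate_blocks(x_pad, signs), rotate_blocks(w_pad, signs)

exact = dot(x, w)
rotated = dot(x_rot, w_rot)
print(exact, rotated)
```

Because the rotation is orthogonal and shared by both operands, it cancels in the product. In an actual recipe quantization happens after the rotation, so the match becomes approximate, and the padded zeros add work without adding signal.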

Glossary

  • Ablation studies: Controlled experiments removing or altering components to assess their effect on performance. "supported by scaling-law analysis and ablation studies."
  • Attenuation factor: A multiplicative factor that reduces the coherent signal through quantized operations. "let $\eta_k \approx \alpha_{A,k}\alpha_{B,k}$ denote the coherent attenuation factor of the $k$-th operation."
  • BF16: A 16-bit floating-point format (bfloat16) used as a high-precision baseline in training comparisons. "suffer from severe convergence issues and loss degradation relative to BF16"
  • BF16-relative loss degradation: The increase in training loss measured relative to a BF16 baseline. "UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines,"
  • Block Hadamard transform: Applying a Hadamard transform on small tensor blocks before quantization to disperse outliers. "Full RHT adds only a small block Hadamard transform before FP4 quantization."
  • Blockwise quantization: Quantizing tensors in small contiguous blocks that share a scale to improve precision. "In practice, blockwise quantization is widely adopted to improve precision by partitioning a tensor $\mathbf{T}$ into contiguous blocks"
  • Bucket entropy: The entropy of the empirical distribution over quantization magnitude buckets, used to assess utilization. "we define the bucket entropy $\mathcal{E}(G,T)$"
  • Bucket utilization: How evenly quantization buckets are used by a tensor’s values. "better convert the improved bucket utilization from RHT into higher quantization quality."
  • Codebook: The discrete set of representable quantization levels for a given format. "Let $G=\{g\}$ denote the normalized codebook of a chosen format"
  • DGRAD: The backward-path GEMM computing data gradients. "data-gradient (DGRAD, bwd)"
  • dY: The upstream gradient tensor in backpropagation. "stochastic rounding only when quantizing the upstream gradient $dY$"
  • Dynamic-range-limited: A regime where representation is constrained by extreme values rather than local precision. "from being dynamic-range-limited to local-resolution-limited."
  • E1M2: A 4-bit floating-point format with 1 exponent bit and 2 mantissa bits, forming a uniform grid. "uniform grids (E1M2/INT4) bypass this grid-geometry error"
  • E2M1: A 4-bit floating-point format with 2 exponent bits and 1 mantissa bit, forming a non-uniform grid. "Non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias"
  • E8M0: A floating-point scale format (8 exponent bits, 0 mantissa bits) used for per-block scaling. "MXFP4 uses $1{\times}32$ blocks with an E8M0 scale;"
  • Effective bucket ratio: An entropy-derived measure of how many quantization buckets are effectively utilized. "report the effective bucket ratio $B_{\mathrm{eff}}(G,T)$"
  • FP4: 4-bit floating-point representation used to reduce memory and computation in training. "FP4 training promises substantial reductions in memory and computation cost for LLM pretraining"
  • FPROP: The forward-propagation GEMM path. "forward (FPROP, fwd)"
  • Fused-kernel: An implementation that combines multiple operations into one GPU kernel for efficiency. "supported by scaling-law analysis, ablation studies, and fused-kernel benchmarks,"
  • GEMM: General Matrix–Matrix Multiply, the core linear algebra operation in deep learning layers. "the three training GEMMs"
  • INT4: 4-bit integer quantization format or codebook. "and an INT4 codebook (\Cref{fig:fp4-int4-format-codebooks})."
  • Leaf-gradients: Gradients that are directly consumed by the optimizer and do not propagate further. "while quantization errors in bwd are leaf-gradients directly consumed by the optimizer,"
  • MoE: Mixture-of-Experts, a model architecture with multiple expert subnetworks. "MoE 124B long-run pretraining"
  • MXFP4: A microscaling FP4 scheme that uses per-block scaling for improved accuracy. "For example, MXFP4~\citep{rouhani2023microscalingdataformatsdeep} and NVFP4 \citep{nvidia2026pretraininglargelanguagemodels}"
  • NMSE: Normalized Mean Squared Error, used to quantify quantization error relative to signal energy. "Normalized MSE, $\mathrm{NMSE}_{A}(G,T)$"
  • Norm-preserving rotation: An orthogonal transform that preserves vector norms while redistributing values. "applying a norm-preserving rotation that disperses outlier energy across all coordinates before quantization."
  • NVFP4: NVIDIA’s FP4 training recipe and data path with fine-grained scaling. "NVFP4 typically provides better training accuracy"
  • Orthogonal: A matrix property ensuring norm preservation and invertibility via its transpose. "is orthogonal"
  • Outliers: Values with unusually large magnitude that can dominate quantization scales. "Real training tensors often contain outlier coordinates,"
  • Random Hadamard Transform (RHT): A randomized Hadamard rotation used to spread outlier energy and improve bucket utilization. "Random Hadamard Transforms (RHT) address this by applying a norm-preserving rotation"
  • Round-To-Nearest-Even (RTNE): A deterministic rounding rule that rounds to the nearest representable value with ties to even. "The rounding rule $\rho_G$ is either Round-To-Nearest-Even (RTNE) or Stochastic Rounding (SR)."
  • RTNE rounding bin: The interval around a representable level assigned under RTNE. "the interior of its RTNE rounding bin is"
  • Scale hierarchy: Multi-level scaling design for quantization that balances range, precision, and efficiency. "two-level scale hierarchy"
  • Scaling-law analysis: Methodology fitting performance as a function of compute/model size to compare regimes. "supported by scaling-law analysis"
  • Shrinkage Bias: A systematic negative (toward-zero) rounding error from asymmetric bins in non-uniform grids. "Non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias"
  • Signal-to-Quantization-Noise Ratio: A metric comparing signal fidelity to quantization noise power. "Signal-to-Quantization-Noise Ratio:"
  • Stochastic rounding (SR): A probabilistic rounding scheme that preserves the value in expectation. "The rounding rule $\rho_G$ is either Round-To-Nearest-Even (RTNE) or Stochastic Rounding (SR)."
  • Sylvester Hadamard matrices: Recursively defined orthogonal ±1 matrices used for Hadamard transforms. "we use the Sylvester Hadamard matrices defined recursively as"
  • SwiGLU: An activation variant whose outputs can exhibit outlier-heavy behavior. "consistent with the outlier-amplifying behavior of SwiGLU"
  • UFP4: The proposed uniform 4-bit training recipe based on an E1M2/INT4-style grid and full RHT. "we propose UFP4, a uniform 4-bit training recipe"
  • Uniform grid: A quantization grid with evenly spaced representable levels, eliminating bin asymmetry. "uniform grids (e.g., E1M2)"
  • WGRAD: The backward-path GEMM computing weight gradients. "weight-gradient (WGRAD, bwd)"
