FP4 Tensor Cores in Modern GPUs

Updated 6 August 2025
  • FP4 Tensor Cores are 4-bit floating-point accelerators integrated into GPUs to perform dense matrix multiplications with reduced memory usage and energy consumption.
  • They employ advanced quantization schemes like block and group scaling to balance throughput gains with the trade-offs of quantization error and dynamic range limitations.
  • Optimized kernel programming and hardware support in NVIDIA Blackwell GPUs enable significant speedups in large-scale training and inference for LLMs, vision transformers, and more.

FP4 Tensor Cores are ultra-low precision matrix-multiplication accelerator units integrated into modern GPUs, specifically designed to perform dense linear algebra operations using 4-bit floating-point (FP4) formats. These Tensor Cores extend earlier hardware support for FP16/FP8 to FP4 arithmetic, offering substantially higher throughput, reduced memory footprint, and increased energy efficiency at the cost of quantization error and a narrower dynamic range. The proliferation of FP4 Tensor Cores has catalyzed innovation in training and inference methods for large neural networks, especially for LLMs, vision transformers, and other workloads where matrix multiplication is the throughput bottleneck.

1. FP4 Data Formats and Quantization Schemes

FP4 refers to 4-bit floating-point representations, often instantiated as E2M1 (2 exponent bits, 1 mantissa bit, 1 sign bit). Hardware and software systems may further differentiate between block scaling and group scaling schemes. For example, MXFP4 and NVFP4 both use E2M1 for data values but differ in block/group size and scaling format: MXFP4 encodes a group of 32 with an 8-bit exponent-only (E8M0) scale, while NVFP4 operates on blocks of 16 with an E4M3 scale for finer granularity and improved accuracy (Chmiel et al., 25 May 2025).
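To make the coarseness of the E2M1 grid concrete, the following minimal Python sketch enumerates its 16 code points (the decoder is an illustrative reconstruction from the format definition above, not a vendor API):

```python
# Illustrative decoder for the 16 E2M1 code points (1 sign, 2 exponent, 1 mantissa bit,
# exponent bias 1; exponent 00 encodes subnormals).

def e2m1_decode(code: int) -> float:
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:                              # subnormal: 0.M * 2^(1 - bias)
        mag = 0.5 * man
    else:                                     # normal: 1.M * 2^(exp - bias)
        mag = (1.0 + 0.5 * man) * 2.0 ** (exp - 1)
    return sign * mag

grid = sorted({e2m1_decode(c) for c in range(16)})
print(grid)  # [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Because only eight magnitudes are representable per sign, the per-block or per-group scale carries most of the burden of matching each group's dynamic range.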

During quantization, tensors are typically mapped to FP4 using a scaling function tuned to maximize the dynamic range utilization per group. In practice, quantization proceeds as follows:

$$\text{Q}(x) = s \cdot \text{round}(x / s)$$

where $s$ is a block-wise or vector-wise scale computed with strategies such as absmax or other calibration metrics (Wang et al., 28 Jan 2025, Castro et al., 20 May 2025, Zhou et al., 17 Feb 2025).
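As a sketch of this quantization step, the numpy snippet below applies absmax scaling per block of 16 values (NVFP4-style granularity; a block of 32 would mirror MXFP4) and rounds onto the E2M1 grid. Function names are illustrative, and scales are kept in float here, whereas hardware stores them as E4M3 or E8M0:

```python
import numpy as np

E2M1_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # positive E2M1 magnitudes
E2M1_GRID = np.concatenate([-E2M1_POS[:0:-1], E2M1_POS])         # full signed grid
FP4_MAX = 6.0

def quantize_block_fp4(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Q(x) = s * round(x / s), with an absmax scale s per block of `block` values."""
    xb = x.reshape(-1, block)
    s = np.abs(xb).max(axis=1, keepdims=True) / FP4_MAX           # per-block absmax scale
    s = np.where(s == 0, 1.0, s)                                  # guard all-zero blocks
    scaled = xb / s
    idx = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1)   # nearest grid point
    return (E2M1_GRID[idx] * s).reshape(x.shape)                  # dequantized view

x = np.random.randn(4, 1024).astype(np.float32)
print("quantization MSE:", float(np.mean((x - quantize_block_fp4(x)) ** 2)))
```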

Split or mixed quantization–rounding strategies are essential for stable training: round-to-nearest is often employed for the forward pass to stabilize predictions, while stochastic rounding is applied to backward and update GEMMs to mitigate bias accumulation in gradients (Chmiel et al., 25 May 2025). Vector-wise scaling is generally favored over tensor-wise scaling to reduce quantization error.
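A host-side emulation of the stochastic-rounding half of this split might look as follows; it is a sketch of the general technique, not the hardware rounding path:

```python
import numpy as np

E2M1_SIGNED = np.array([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                        0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def stochastic_round_to_grid(x, grid=E2M1_SIGNED, rng=np.random.default_rng()):
    """Round each (already scaled) value to one of its two neighbouring grid points,
    with probability proportional to proximity, so that E[round(x)] == x for
    in-range values (unbiased, unlike round-to-nearest)."""
    x = np.clip(x, grid[0], grid[-1])
    hi_idx = np.clip(np.searchsorted(grid, x, side="left"), 1, len(grid) - 1)
    lo, hi = grid[hi_idx - 1], grid[hi_idx]
    p_hi = np.where(hi > lo, (x - lo) / (hi - lo), 0.0)
    return np.where(rng.random(x.shape) < p_hi, hi, lo)

# Backward/update GEMM inputs would pass through this path; the forward pass keeps
# deterministic round-to-nearest as in the block-quantization sketch above.
g = stochastic_round_to_grid(np.random.randn(1 << 16))
print("mean after rounding:", float(g.mean()))
```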

2. Architectural Support and Kernel Programming

Four-bit Tensor Core arithmetic first appeared in Turing as integer (INT4) support, and FP4 has become a native feature in NVIDIA Blackwell GPUs (e.g., the RTX 5090), which provide dedicated matrix-multiplication instructions for FP4 operands (e.g., FP4MM or tcgen05.mma, supporting group-scaled FP4 GEMM operations) (Castro et al., 20 May 2025, Zhang et al., 16 May 2025). These units are typically accessible through high-level libraries such as CUTLASS, as well as custom CUDA kernels and direct PTX assembly for performance-critical workloads.

To maximize FP4 Tensor Core utilization, input tensors are arranged in tile/block formats that match the fixed matrix sizes handled by the hardware (e.g., 16×16 or 32×8). Kernel designs must align block layouts, manage group scales, and implement efficient interleaved data packing for matrix operands. For inference, attention and GEMM operations are fully quantized and performed natively in FP4; for training, forward, backward, and update passes may all utilize FP4 cores, subject to appropriate mixed-precision handling for numerically sensitive submodules (Castro et al., 20 May 2025, Chmiel et al., 25 May 2025).
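The packing step can be illustrated in simplified form: the snippet below packs two 4-bit codes per byte and tracks one scale per group of 16, omitting the swizzling/interleaving that real kernels apply to match the Tensor Core tile layout (names are illustrative):

```python
import numpy as np

def pack_fp4_operand(codes: np.ndarray, scales: np.ndarray, block: int = 16):
    """Pack 4-bit code points two-per-byte, with one scale per `block` values.

    `codes` holds integer FP4 code points in [0, 15]. Scales are kept in float32
    here, whereas hardware stores them as E4M3 or E8M0.
    """
    assert codes.size % block == 0 and codes.size // block == scales.size
    lo = codes[0::2] & 0xF
    hi = codes[1::2] & 0xF
    packed = ((hi << 4) | lo).astype(np.uint8)        # two FP4 values per byte
    return packed, scales.astype(np.float32)

codes = np.random.randint(0, 16, size=256)            # 256 already-quantized values
packed, s = pack_fp4_operand(codes, np.ones(256 // 16))
print(packed.nbytes, "payload bytes +", s.nbytes, "scale bytes")   # 128 + 64
```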

3. Algorithmic Innovations for FP4 Stability

Direct training and inference using FP4 precision is challenging due to coarse quantization levels and the risk of representational collapse. State-of-the-art approaches incorporate several innovations:

  • Differentiable Quantization Estimators: By analytically deriving gradient estimators that better approximate the (nondifferentiable) quantization function, training stability and convergence can be maintained even at low bit width. For instance, custom surrogate functions with tunable smoothness and their derivatives replace the crude straight-through estimator (Wang et al., 28 Jan 2025).
  • Outlier Clamping and Compensation: To handle outliers in activations, values are clamped at a high quantile (e.g., 0.99), and the sparse residual on the clamped elements is then computed exactly, typically in higher precision, preserving the bulk structure without losing critical signal in rare extreme values (Wang et al., 28 Jan 2025); a sketch follows this list.
  • Early Precision Transition: Both theoretical and empirical evidence suggest that once the gradient norm falls below roughly $\sqrt{3}$ times the FP4 quantization noise, descent progress stalls. Switching back to higher precision (e.g., BF16) in this regime restores effective convergence (Chmiel et al., 25 May 2025).
  • Stochastic Rounding and Misalignment Control: For unbiased gradient flow and reduction of training bias due to quantization, stochastic rounding is mandated on update and backward passes. Innovations such as misalignment metrics quantify and bound long-range error accumulation (Castro et al., 20 May 2025).
  • Advanced Quantization (e.g., QuEST): Forward quantization benefits from schemes based on the Hadamard transform and root-mean-squared error (RMSE) clipping, reducing mean squared error across quantized blocks (Castro et al., 20 May 2025).
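As a sketch of the outlier clamping-and-compensation idea from the list above (the 0.99 quantile follows the description; the decomposition x = clamped + residual is exact, with function names chosen only for illustration):

```python
import numpy as np

def clamp_with_residual(x: np.ndarray, q: float = 0.99):
    """Clamp |x| at its q-quantile and return (clamped, residual): the clamped tensor
    goes through FP4 quantization, while the sparse residual on the few clamped
    outliers is kept in higher precision; x == clamped + residual exactly."""
    t = np.quantile(np.abs(x), q)                  # clamping threshold
    clamped = np.clip(x, -t, t)
    residual = np.where(np.abs(x) > t, x - clamped, 0.0)
    return clamped, residual

x = np.random.randn(4096)
x[:32] *= 50.0                                     # inject rare extreme activations
clamped, residual = clamp_with_residual(x)
print("compensated outliers:", int(np.count_nonzero(residual)))
```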

4. Performance, Efficiency, and Scaling Laws

FP4 Tensor Cores provide a dramatic increase in arithmetic throughput and corresponding decreases in memory footprint and energy consumption. Performance measurements indicate (Zhang et al., 16 May 2025, Castro et al., 20 May 2025):

  • Per-kernel throughput exceeding 1000 TOPS in FP4 on the RTX 5090, versus roughly 200 TOPS for FP16.
  • Attention and GEMM operations using FP4 achieve 4–8× speedup over FP16-optimized baselines.
  • End-to-end inference latency in real-world workloads (e.g., video/text generation) reduced by as much as 3×.
  • Full-training pipelines using Quartet, FP4 All the Way, or similar techniques demonstrate up to 1.8×–2.6× training speedup, with no significant loss in accuracy compared to FP16/BF16/FP8 when proper rounding, group scaling, and regularization are used (Castro et al., 20 May 2025, Chmiel et al., 25 May 2025, Wang et al., 28 Jan 2025).

A formal scaling law is articulated (Castro et al., 20 May 2025), quantifying the relationship between final loss, parameter size, data size, and precision:

$$L(N, D, P_{\text{fwd}}, P_{\text{bwd}}) = \left( \frac{A}{\left(N \cdot \mathrm{eff}_N(P_{\text{fwd}})\right)^{\alpha}} + \frac{B}{\left(D \cdot \mathrm{eff}_D(P_{\text{bwd}})\right)^{\beta}} \right)^{\gamma} + E$$

where $\mathrm{eff}_N$ and $\mathrm{eff}_D$ account for the impact of quantization noise and rounding bias on the forward and backward passes, respectively.
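The law is straightforward to evaluate once its constants are known; the snippet below does so with placeholder constants chosen purely for illustration (they are not fitted values from the cited work):

```python
def scaling_law_loss(N, D, eff_N_fwd, eff_D_bwd, A, B, E, alpha, beta, gamma):
    """L(N, D, P_fwd, P_bwd) with efficiency factors eff in (0, 1] that discount the
    effective parameter and data counts for forward/backward quantization noise."""
    return (A / (N * eff_N_fwd) ** alpha + B / (D * eff_D_bwd) ** beta) ** gamma + E

# Placeholder constants for illustration only (not fitted values).
const = dict(A=400.0, B=2000.0, E=1.7, alpha=0.34, beta=0.28, gamma=1.0)
print(scaling_law_loss(7e9, 1e12, 1.0, 1.0, **const))    # full-precision reference
print(scaling_law_loss(7e9, 1e12, 0.9, 0.85, **const))   # hypothetical FP4 fwd/bwd efficiencies
```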

5. Applications in Large-Scale AI Workloads

FP4 Tensor Cores have been rapidly adopted in large model training and inference, including but not limited to:

  • LLM Training: Quartet and FP4 All the Way demonstrate end-to-end, full-FP4 training for LLMs, including weight, activation, and gradient quantization, scaling to 13B+ parameters and massive token datasets (Castro et al., 20 May 2025, Chmiel et al., 25 May 2025, Zhou et al., 17 Feb 2025, Wang et al., 28 Jan 2025).
  • Transformer Attention Acceleration: SageAttention3 exploits FP4 Tensor Cores and block-wise microscaling quantization to massively accelerate inference for attention-heavy workloads, achieving >5× kernel speedups and minimal quality loss for multi-modal tasks (Zhang et al., 16 May 2025).
  • Sparse and Hybrid Operator Acceleration: By synergizing CUDA and Tensor Cores, frameworks such as Libra target structured and unstructured sparse operations, allocating FP4-enabled TCUs to dense submatrices to maximize throughput without excessive padding redundancy (Shi et al., 28 Jun 2025).

6. Engineering Trade-offs and Limitations

Key limitations and open challenges:

  • Numerical Precision: The dynamic range of FP4 (E2M1) limits representable values and increases quantization error compared to FP8/FP16. Model stability is dependent on sophisticated calibration, block scaling, and clamping.
  • Granularity of Scaling: Finer granularity (e.g., group-of-16 scaling in NVFP4) improves accuracy but adds scale-storage and handling overhead in hardware and software. The optimal block size balances memory-access patterns, scale-register usage, and quantization error (Chmiel et al., 25 May 2025); a back-of-the-envelope storage comparison follows this list.
  • Training Convergence: Precise rounding strategies and early switch mechanisms are required to avoid convergence stalling when the gradient norm becomes too small relative to quantization noise.
  • Software Overhead: Group scaling, custom quantization, and the fusion of quantization with GEMM/attention kernels must be designed carefully so that kernel and memory-handling overhead does not offset the hardware throughput advantage.
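As a back-of-the-envelope illustration of the scale-granularity trade-off mentioned above, the per-element storage overhead of the group scale follows directly from the block size (accuracy impact is not modeled here):

```python
def scale_overhead_bits_per_element(block_size: int, scale_bits: int = 8) -> float:
    """Extra storage per element contributed by one per-block scale."""
    return scale_bits / block_size

# NVFP4-style groups of 16 (E4M3 scale) vs. MXFP4-style groups of 32 (E8M0 scale):
print(scale_overhead_bits_per_element(16))   # 0.5 bits/element, i.e. +12.5% over 4-bit data
print(scale_overhead_bits_per_element(32))   # 0.25 bits/element, i.e. +6.25%
```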

7. Future Directions and Research Opportunities

The advent of native FP4 hardware in Blackwell and forthcoming accelerator generations is expected to solidify FP4 as a de facto standard for both inference and training in high-capacity models. Research frontiers include:

  • Advanced quantizer designs and learning-based scale selection to further compress quantization error.
  • Extension of FP4 to more complex numerical workloads beyond GEMM and attention, such as FFT, reduction/scan, and scientific computing kernels, building atop earlier work with FP16/FP8 TCUs.
  • Deeper algorithm–hardware co-design for distributed training, efficient gradient synchronization in FP4, and end-to-end pipelines that maintain numerical fidelity under extreme compression.

Much of the required methodology—block-wise scaling, stochastic rounding, backward misalignment control, and fusion of quantization with optimized CUDA kernels—will generalize as FP4 support becomes universal across new accelerator architectures.