
SageAttention3: Low-Bit Quantized Attention

Updated 17 October 2025
  • SageAttention3 is a low-bit quantization framework that boosts transformer performance by leveraging FP4 Tensor Cores for inference and INT8 for training.
  • Its FP4 microscaling kernel delivers up to 11× speedup over conventional attention kernels, while its 8-bit training kernel maintains lossless accuracy on fine-tuning tasks.
  • The framework pioneers 8-bit quantized training, offering competitive results and identifying pretraining convergence challenges due to backward quantization sensitivity.

SageAttention3 is a low-bit attention computation and quantization framework designed to dramatically accelerate transformer inference and, for the first time, extend efficient, quantized attention mechanisms into the training process. By exploiting FP4 Tensor Cores in Blackwell GPUs and introducing an 8-bit attention kernel that supports both forward and backward passes, SageAttention3 advances the state of the art in both speed and resource efficiency, achieving lossless inference performance and highly competitive fine-tuning outcomes for modern large-scale models (Zhang et al., 16 May 2025).

1. FP4 Microscaling Attention for Inference Acceleration

SageAttention3 utilizes FP4 Tensor Cores on Blackwell GPUs, applying "microscaling" quantization to the Q, K, and V matrices fundamental to transformer attention. Each $1 \times n$ matrix block is paired with a scale factor $s_{ij}$, computed as $s_{ij} = \max(|X|)/6$, and each entry is quantized with FP4 rounding: $\hat{X}_{ij} = \lceil X_{ij}/s_{ij} \rfloor$. FP4 quantized matrix multiplications are realized via an FP4MM instruction, which multiplies quantized blocks alongside their scale factors:

$$C = \text{FP4MM}(\hat{A}, s_A, \hat{B}, s_B)$$

FP4 arithmetic achieves theoretical kernel throughput near 1600 TOPS, with SageAttention3 empirically reaching 1038 TOPS on an RTX 5090, yielding at least a 5× speedup over the fastest FlashAttention kernel on the same hardware. The design enables a plug-and-play replacement for full-precision attention modules in existing inference pipelines, supporting a wide range of models (text-to-video, generative LLMs, text-to-image, etc.) with minimal (<0.2%) end-to-end accuracy degradation.
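To make the microscaling step concrete, the following sketch emulates block-wise FP4 quantization in NumPy. It is an illustrative simulation rather than the paper's CUDA kernel: the 1×16 block size, the E2M1 value grid, and the rounding rule are assumptions chosen to mirror the $s_{ij} = \max(|X|)/6$ scaling described above.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format (assumed grid for illustration).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_microscale_quantize(x: np.ndarray, block: int = 16):
    """Simulate micro-scaled FP4 quantization of a 2-D matrix.

    Each 1 x `block` slice of a row gets its own scale s = max(|x|) / 6,
    so the scaled values fit the FP4 dynamic range [-6, 6].
    Returns the quantized values and per-block scale factors.
    """
    rows, cols = x.shape
    assert cols % block == 0, "columns must be a multiple of the block size"
    blocks = x.reshape(rows, cols // block, block)

    scales = np.abs(blocks).max(axis=-1, keepdims=True) / 6.0
    scales = np.where(scales == 0, 1.0, scales)          # avoid divide-by-zero
    scaled = blocks / scales

    # Round each scaled entry to the nearest FP4-representable magnitude.
    signs = np.sign(scaled)
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    quantized = signs * FP4_GRID[idx]
    return quantized, scales

def fp4_dequantize(quantized, scales):
    """Reconstruct an approximation of the original matrix."""
    q = quantized * scales
    return q.reshape(q.shape[0], -1)

if __name__ == "__main__":
    x = np.random.randn(4, 64).astype(np.float32)
    q, s = fp4_microscale_quantize(x)
    err = np.abs(fp4_dequantize(q, s) - x).mean()
    print(f"mean absolute quantization error: {err:.4f}")
```

On real hardware the quantized blocks and their scale factors are consumed directly by the FP4MM instruction; the dequantization step here exists only to make the approximation error visible.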

2. 8-Bit Quantized Attention for Training and Fine-Tuning

SageAttention3 pioneers the use of low-bit attention in the training regime, unlike FlashAttention3 and SageAttention, which focus exclusively on inference. The SageBwd kernel quantizes six of the seven internal matrix multiplications in the forward and backward passes (e.g., $QK^\top$, $PV$, and related backward-pass products) to INT8, using per-matrix scale factors $s_X = \max(|X|)/127$ and per-token quantization for the softmax output $P$. For $P$, a two-level scaling is applied to fit the row-wise dynamic range into INT8:

$$s_{P1} = \text{rowmax}(\tilde{P})/(448 \times 6), \qquad \tilde{P}_2 = \tilde{P}/s_{P1}, \qquad (\hat{P}_2, s_{P2}) = \phi(\tilde{P}_2)$$

where $\phi(\cdot)$ denotes the quantization step producing the quantized matrix $\hat{P}_2$ and its scale $s_{P2}$.

From these quantities, the final approximation reconstructs the softmax output as $\tilde{P} \approx \hat{P}_2 \times s_{P2} \times s_{P1}$.
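The sketch below illustrates this two-level scaling scheme in NumPy. It is an assumption-laden simulation rather than the SageBwd kernel: a plain symmetric INT8 quantizer stands in for the quantization step $\phi$, while the per-row scale $s_{P1} = \text{rowmax}(\tilde{P})/(448 \times 6)$ and the reconstruction $\tilde{P} \approx \hat{P}_2 \times s_{P2} \times s_{P1}$ follow the formulas above.

```python
import numpy as np

def int8_quantize(x: np.ndarray):
    """Stand-in for the quantizer phi: symmetric INT8 with a single scale."""
    s = float(np.abs(x).max()) / 127.0
    s = s if s > 0 else 1.0
    q = np.clip(np.rint(x / s), -127, 127).astype(np.int8)
    return q, s

def two_level_quantize_p(p_tilde: np.ndarray):
    """Two-level scaling of the (unnormalized) softmax output P~.

    Level 1: a per-row scale s_P1 = rowmax(P~) / (448 * 6) compresses the
             row-wise dynamic range.
    Level 2: the rescaled matrix is quantized by phi (here: plain INT8),
             yielding P^_2 and s_P2.
    """
    s_p1 = p_tilde.max(axis=-1, keepdims=True) / (448.0 * 6.0)
    s_p1 = np.where(s_p1 == 0, 1.0, s_p1)
    p2 = p_tilde / s_p1
    p2_hat, s_p2 = int8_quantize(p2)
    return p2_hat, s_p2, s_p1

def dequantize_p(p2_hat, s_p2, s_p1):
    """Reconstruct P~ ≈ P^_2 * s_P2 * s_P1."""
    return p2_hat.astype(np.float32) * s_p2 * s_p1

if __name__ == "__main__":
    scores = np.random.randn(4, 128).astype(np.float32)
    p_tilde = np.exp(scores - scores.max(axis=-1, keepdims=True))  # softmax numerator
    p_hat, s2, s1 = two_level_quantize_p(p_tilde)
    err = np.abs(dequantize_p(p_hat, s2, s1) - p_tilde).max()
    print(f"max reconstruction error: {err:.5f}")
```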

Experimental results indicate that 8-bit attention achieves lossless performance on fine-tuning tasks (accuracy, F1, and loss curves remain indistinguishable from BF16), but pretraining sees slower convergence—most notably in the dO·Vᵀ pass, where backward gradients are sensitive to cumulative quantization error. This suggests that while low-bit attention is mature for efficient adaptation, further kernel and optimization advances are needed to close the gap in pretraining regimes.

3. Technical Implementation and Quantization Details

SageAttention3 relies on hardware-optimized quantization:

  • FP4 Quantization: Applied via block-wise scaling and rounding to Q/K/V for inference kernels.
  • INT8 Quantization: Used in both forward and backward passes, with dynamic scaling for each attention matrix and adaptive two-level quantization for softmax probabilities.
  • Sensitive Operations: The dO·Vᵀ pass in backward propagation is retained in FP16 to avoid major gradient distortion, based on empirical findings that INT8 quantization here can impede learning dynamics.

Implementations exploit Triton and CUDA kernel fusion, minimize intermediate memory movement, and integrate cleanly with leading transformer libraries and model pipelines (e.g., Hugging Face, TIMM, CogVideoX). This design keeps I/O overhead low and hardware utilization high.
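As a sketch of the plug-and-play integration pattern mentioned above, the snippet below reroutes PyTorch's scaled_dot_product_attention to a quantized attention call during inference. The entry point sageattn(q, k, v, tensor_layout=..., is_causal=...) mirrors the public SageAttention package; whether SageAttention3's FP4 kernel is exposed under the same interface is an assumption here, so this should be read as an illustrative wrapper rather than the project's documented API.

```python
# Illustrative monkey-patch: route PyTorch SDPA calls to a low-bit attention kernel.
# Assumption: the `sageattention` package exposes `sageattn(q, k, v, ...)`; whether
# the SageAttention3 FP4 kernel shares this entry point is not confirmed here.
import torch
import torch.nn.functional as F
from sageattention import sageattn

_original_sdpa = F.scaled_dot_product_attention

def quantized_sdpa(query, key, value, attn_mask=None, dropout_p=0.0,
                   is_causal=False, scale=None, **kwargs):
    # Fall back to the full-precision kernel for cases the low-bit path does not cover.
    if attn_mask is not None or dropout_p > 0.0 or scale is not None:
        return _original_sdpa(query, key, value, attn_mask=attn_mask,
                              dropout_p=dropout_p, is_causal=is_causal,
                              scale=scale, **kwargs)
    # Tensors are expected in (batch, heads, seq_len, head_dim) layout ("HND").
    return sageattn(query, key, value, tensor_layout="HND", is_causal=is_causal)

F.scaled_dot_product_attention = quantized_sdpa  # drop-in replacement for inference
```

Restoring F.scaled_dot_product_attention = _original_sdpa undoes the patch; a production integration would more likely swap the attention call inside the model definition itself.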

4. Benchmarking, Performance, and Comparative Analysis

SageAttention3 delivers:

  • Inference: 2.4–3× end-to-end speedup on RTX 5090 (CogVideoX, HunyuanVideo); FP4 kernel accelerations of 4–5× over FlashAttention2 and 8–11× over xformers.
  • Fine-Tuning: Maintains baseline accuracy and convergence.
  • Memory and Energy Savings: Substantial reductions in memory footprint due to low-bit representation and faster kernel compute, with plug-and-play integration into established models.

The FP4 approach surpasses SageAttention (INT8) and FlashAttention3 (FP8) on new hardware, especially in mixed-modality and long-sequence model families. The 8-bit training kernel sets a precedent for future deep learning training on specialized quantized hardware.

| Attention Type | Precision | Inference Speedup | Fine-Tuning Performance | Pretraining Convergence |
|---|---|---|---|---|
| SageAttention3 | FP4 / INT8 | 4–11× | Lossless | Slower |
| SageAttention | INT8 | 2–6× | N/A | N/A |
| FlashAttention3 | FP8 | – | N/A | N/A |

5. Impact, Applications, and Future Directions

SageAttention3 advances hardware-aware quantized attention and is directly applicable to production settings requiring rapid response time, large context windows, and complex multimodal generation. Key domains include:

  • Large-scale models for text, image, and video synthesis
  • Real-time inference tasks (dialogue, search, streaming synthesis)
  • Efficient fine-tuning in environments with moderate data budgets
  • Future low-bit hardware architectures

A plausible implication is that further kernel innovation and adaptive quantization could make low-bit pretraining competitive with full-precision learning. This would reduce energy usage and deployment barriers for very large models.

6. Limitations and Open Problems

While SageAttention3 offers demonstrably lossless fine-tuning and efficient inference, slower pretraining convergence highlights the ongoing challenge of backward gradient quantization. The work identifies dO·Vᵀ sensitivity as a key limiting factor. Further research is needed to:

  • Develop stable 8-bit backward passes and address cumulative error in gradient flows
  • Optimize for emerging GPU architectures with ever-lower precision arithmetic
  • Explore quantized attention in unsupervised and self-supervised learning dynamics

Such directions are likely to influence both hardware design and algorithm development for next-generation deep learning systems.

References

  • Zhang et al. "SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training." arXiv preprint, 16 May 2025.