SageAttention2: Fast Quantized Attention
- SageAttention2 is an efficient attention mechanism that employs low-bit quantization to accelerate QK^T and PV computations in standard attention models.
- It utilizes INT4/INT8 for Q/K and FP8 for P/V with FP16 accumulation, achieving up to 3.9× faster throughput compared to FlashAttention2.
- Empirical benchmarks across language, image, and video tasks demonstrate significant speedups while maintaining nearly identical accuracy to full-precision models.
SageAttention2 is an efficient attention mechanism designed to address the quadratic time complexity of standard attention operations, specifically by leveraging low-bit quantization to accelerate the critical matrix multiplications within attention kernels. The method targets both the $QK^\top$ computation for similarity scores and the $PV$ computation for weighted value aggregation, quantizing on-the-fly and optimizing for NVIDIA TensorCore architecture. SageAttention2++ introduces further optimizations, achieving substantial kernel throughput improvements by utilizing FP8 multiplication with FP16 accumulation, minimizing range overflow, and reducing data conversion overhead. Empirical results demonstrate that SageAttention2++ attains up to a 3.9× speedup over FlashAttention2, with negligible degradation in end-to-end metrics for language, image, and video generation tasks (Zhang et al., 27 May 2025).
1. Core Algorithmic Modifications
SageAttention2++ modifies the matrix multiplication paradigm used in SageAttention2. In typical attention implementations (e.g., FlashAttention-style), computational expense per head is dominated by two operations:
- $QK^\top$: Shapes $(N \times d) \cdot (d \times N)$
- $PV$: Shapes $(N \times N) \cdot (N \times d_v)$ (where $d_v$ is the head-value dimension)
SageAttention2 quantizes $Q$ and $K$ as INT4/INT8 and $\tilde{P}$ and $V$ as FP8 (E4M3), performing the $PV$ multiplication with accumulation in FP32 registers using the mma.f32.f8.f8.f32 TensorCore instruction. SageAttention2++ replaces this with FP8 × FP8 → FP16 accumulation via mma.f16.f8.f8.f16, supported by GPUs such as the RTX 4090 (Ada Lovelace) and RTX 5090 (Blackwell). This change yields approximately 4× faster throughput compared to native FP16 MatMul and 2× faster than FP8 MatMul with FP32 accumulation.
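Range narrowing is required because FP16 accumulators can overflow (the largest finite FP16 value is 65504). To control this, the quantization ranges of $\tilde{P}$ and $V$ are narrowed so that the accumulated products stay below this limit, with the bounds chosen so that delayed FP32 buffering also remains in range. Delayed FP32 buffering reduces PTX conversion cost by accumulating two FP16 results before converting to FP32.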
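The overflow constraint can be illustrated with a small emulation. The sketch below is not the kernel itself: it rounds every partial sum to FP16 using Python's `struct` half-precision format, and it assumes (as an illustration, not a figure taken from the text) that one TensorCore instruction accumulates `k = 32` products. The helper names and the example bounds `128.0` and `15.0` are hypothetical.

```python
import struct

FP16_MAX = 65504.0  # largest finite FP16 value


def to_fp16(x: float) -> float:
    """Round a Python float to FP16; values beyond the FP16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def fp16_dot(p_vals, v_vals) -> float:
    """Dot product whose running sum is rounded to FP16 after every addition."""
    acc = 0.0
    for p, v in zip(p_vals, v_vals):
        acc = to_fp16(acc + p * v)
    return acc


def max_safe_product(k: int) -> float:
    """Largest |p * v| such that k accumulated products cannot exceed FP16_MAX."""
    return FP16_MAX / k


K = 32  # assumed number of products accumulated per instruction (hypothetical)

# Full E4M3 range (448) on both operands: the FP16 accumulator overflows.
overflowed = fp16_dot([448.0] * K, [448.0] * K)

# Hypothetical narrowed bounds whose product (1920) is below max_safe_product(32).
narrowed = fp16_dot([128.0] * K, [15.0] * K)

print(max_safe_product(K), max_safe_product(2 * K), overflowed, narrowed)
```

Accumulating two such results before conversion doubles the number of summed products, which is why the bound returned for `2 * K` is half the bound for `K`.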
2. Quantization Methodology and Formulas
SageAttention2++ quantizes $Q$, $K$, $\tilde{P}$, and $V$ with the following schemes:
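- $Q$, $K$ quantization: INT4/INT8 block-wise. Given a block $X \in \{Q, K\}$ and the integer bound $q_{\max}$ (7 for INT4, 127 for INT8),
$$\delta_X = \frac{\max|X|}{q_{\max}}, \qquad \hat{X} = \mathrm{round}\!\left(\frac{X}{\delta_X}\right)$$
Reconstruction:
$$X \approx \delta_X \hat{X}, \qquad S = QK^\top \approx \delta_Q\,\delta_K\,\hat{Q}\hat{K}^\top$$
- $\tilde{P}$ quantization: FP8 (E4M3) per block. Softmax-attention $\tilde{P} = \exp\!\left(S - \mathrm{rowmax}(S)\right)$,
$$\hat{P} = \mathrm{FP8}\!\left(\frac{\tilde{P}}{\delta_P}\right)$$
with $\delta_P = \max|\tilde{P}| / P_r$.
- $V$ quantization: FP8 (E4M3) per-channel,
$$\hat{V} = \mathrm{FP8}\!\left(\frac{V}{\delta_V}\right)$$
with $\delta_V = \max|V| / V_r$, computed per channel.
- $PV$ accumulation: FP16 via TensorCore,
$$\hat{O} = \hat{P}\hat{V} \quad \text{(mma.f16.f8.f8.f16)}$$
Together:
$$O \approx \mathrm{diag}\!\left(\mathrm{rowsum}(\tilde{P})\right)^{-1} \delta_P\,\delta_V\,\hat{P}\hat{V}$$
Here $P_r$ and $V_r$ denote the narrowed range bounds for $\tilde{P}$ and $V$.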
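The quantization schemes above can be sketched end to end in plain Python. This is a minimal numerical emulation under stated assumptions, not the CUDA kernel: the whole matrix is treated as a single block, FP8 E4M3 rounding is emulated in software with saturation at 448, accumulation is done in ordinary floats, and the conventional $1/\sqrt{d}$ softmax scaling is applied. All function names, and the `p_range` and `v_range` parameters standing in for the narrowed range bounds, are hypothetical.

```python
import math
import random


def quantize_int_block(block, bits=8):
    """Symmetric integer quantization of one block with a single scale factor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    amax = max(abs(x) for row in block for x in row)
    scale = amax / qmax if amax > 0 else 1.0
    return [[round(x / scale) for x in row] for row in block], scale


def round_e4m3(x):
    """Round to the nearest FP8 E4M3 value, saturating at +/-448 (emulation)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), 448.0)
    _, e = math.frexp(mag)       # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)         # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)      # 3 mantissa bits
    return math.copysign(round(mag / step) * step, x)


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def reference_attention(Q, K, V):
    """Full-precision softmax attention for comparison."""
    d = len(Q[0])
    S = [[s / math.sqrt(d) for s in row] for row in matmul(Q, transpose(K))]
    P = [[math.exp(s - max(row)) for s in row] for row in S]
    PV = matmul(P, V)
    return [[x / sum(P[i]) for x in PV[i]] for i in range(len(P))]


def quantized_attention(Q, K, V, qk_bits=8, p_range=448.0, v_range=448.0):
    """INT quantization for Q/K, emulated FP8 for P~/V, then dequantize."""
    d = len(Q[0])
    Qh, dq = quantize_int_block(Q, qk_bits)
    Kh, dk = quantize_int_block(K, qk_bits)
    S = [[dq * dk * s / math.sqrt(d) for s in row]
         for row in matmul(Qh, transpose(Kh))]
    P = [[math.exp(s - max(row)) for s in row] for row in S]
    dp = 1.0 / p_range           # P~ lies in (0, 1], so max|P~| = 1
    Ph = [[round_e4m3(p / dp) for p in row] for row in P]
    # One scale per channel (column) of V.
    dv = [(max(abs(x) for x in col) / v_range) or 1.0 for col in transpose(V)]
    Vh = [[round_e4m3(x / dv[j]) for j, x in enumerate(row)] for row in V]
    PV = matmul(Ph, Vh)
    return [[dp * dv[j] * PV[i][j] / sum(P[i]) for j in range(len(dv))]
            for i in range(len(P))]


rng = random.Random(0)
N, d = 8, 4
Q = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
K = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(N)]
V = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(N)]

ref = reference_attention(Q, K, V)
out = quantized_attention(Q, K, V)
max_err = max(abs(a - b) for ra, rb in zip(ref, out) for a, b in zip(ra, rb))
print(f"max abs error vs full precision: {max_err:.4f}")
```

Passing smaller values for `p_range` and `v_range` emulates range narrowing: the quantized operands then occupy a smaller part of the E4M3 range, trading some resolution for accumulator headroom.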
3. Computational Complexity and Throughput
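Full-precision attention costs $O(N^2 d)$ for $QK^\top$ and $O(N^2 d_v)$ for $PV$ per head, where $N$ is the sequence length. SageAttention2++ maintains these asymptotic bounds but improves the constant factor. Defining $T_{\mathrm{FP16}}$ as FP16 MatMul throughput and $T_{\mathrm{FP8}}$ as FP8 × FP8 → FP16 throughput:
$$T_{\mathrm{FP8}} \approx 4\,T_{\mathrm{FP16}}$$
for $PV$. The overall end-to-end kernel time also includes the $QK^\top$ product, the softmax, and quantization overhead, so the end-to-end gain is smaller than this peak ratio,
leading to observed kernel speedups of 3–3.9× over FlashAttention2.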
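The gap between the peak MatMul ratio and the observed kernel speedup can be reasoned about with an Amdahl-style estimate. The sketch below is only a back-of-the-envelope model: the function name is hypothetical, and the time fractions in the example are made-up placeholders, not measurements from the benchmarks.

```python
def kernel_speedup(frac_qk, frac_pv, qk_gain, pv_gain):
    """Estimate the whole-kernel speedup when only the two MatMuls get faster.

    frac_qk, frac_pv: fractions of baseline kernel time spent in QK^T and PV.
    qk_gain, pv_gain: throughput gains applied to each MatMul.
    The remaining time (softmax, quantization, memory traffic) is unchanged.
    """
    frac_other = 1.0 - frac_qk - frac_pv
    if frac_other < 0:
        raise ValueError("time fractions must sum to at most 1")
    new_time = frac_qk / qk_gain + frac_pv / pv_gain + frac_other
    return 1.0 / new_time


# Placeholder split: 45% QK^T, 45% PV, 10% everything else.
estimate = kernel_speedup(0.45, 0.45, qk_gain=4.0, pv_gain=4.0)
print(f"estimated kernel speedup: {estimate:.2f}x")
```

Under this model, any time not spent in the accelerated MatMuls caps the achievable speedup below the per-instruction ratio.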
4. Empirical Benchmarks
Tests on NVIDIA RTX 4090 (Ada Lovelace) and RTX 5090 (Blackwell) with head dimensions 64 and 128 and sequence lengths up to 8k revealed peak kernel speedups:
- SageAttention2++(4+8) (INT4 $Q$, $K$; FP8 $\tilde{P}$, $V$): 3.9× vs. FlashAttention2
- SageAttention2++(8+8) (INT8 $Q$, $K$; FP8 $\tilde{P}$, $V$): 3.0×
Consistent gains were observed for both causal and non-causal masks. End-to-end metrics for representative models:
| Model | Attention Variant | Perplexity or Metric (Delta vs FP32) |
|---|---|---|
| Llama3.1 (8B), language | Full-prec | Ppl 6.013 |
| | SageAttn2 | Ppl 6.019 |
| | SageAttn2++(8+8) | Ppl 6.020 |
| CogVideoX (2B), text-to-video | Full-prec | CLIPSim 0.179 / FScore 4.974 |
| | SageAttn2(8+8) | CLIPSim 0.178 / FScore 4.899 |
| | SageAttn2++(8+8) | CLIPSim 0.179 / FScore 4.386 |
| Flux/StableDiffusion3.5, text-to-image | Full-prec vs SageAttn2++ | 0.5 FID, 0.02 sFID |
Across language, image, and video models, SageAttention2++(8+8) matches SageAttention2's metrics, while the (4+8) variant incurs only slight degradation.
5. Architectural Integration and Implementation
SageAttention2++ functions as a drop-in replacement for the attention kernel in FlashAttention-style fused kernels. Two new CUDA kernels (one per quantization mode, (4+8) and (8+8)) are required, invoked in place of the existing FlashAttention call (for example, `flash_attn_func` from the flash-attn package). Hardware compatibility mandates support for mma.f16.f8.f8.f16 instructions (Ada Lovelace or later GPUs) and sufficient shared memory for quantized blocks; tiling follows FlashAttention conventions.
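A dispatch layer of the kind this implies might look like the sketch below. It is illustrative only: the backend names and the function are hypothetical, and the minimum compute capability of (8, 9), which corresponds to Ada Lovelace, is an assumption based on the hardware requirement stated above. In a PyTorch setting the capability tuple could be obtained from `torch.cuda.get_device_capability()`.

```python
# Assumed minimum compute capability for the FP8 instruction path:
# (8, 9) is Ada Lovelace (sm_89). This threshold is an assumption.
FP8_MMA_MIN_CAPABILITY = (8, 9)


def pick_attention_backend(capability, sage_kernels_installed, qk_bits=8):
    """Choose a kernel name given the GPU capability and the Q/K bit width.

    capability: (major, minor) compute capability of the device.
    sage_kernels_installed: whether the quantized kernels are available.
    qk_bits: 4 for the (4+8) mode, 8 for the (8+8) mode.
    Returned names are hypothetical labels, not real library identifiers.
    """
    if qk_bits not in (4, 8):
        raise ValueError("qk_bits must be 4 or 8")
    if sage_kernels_installed and tuple(capability) >= FP8_MMA_MIN_CAPABILITY:
        return f"sage2pp_int{qk_bits}_fp8"
    return "flash_attention"  # fall back to the unquantized kernel


print(pick_attention_backend((8, 9), True))      # Ada Lovelace, (8+8) mode
print(pick_attention_backend((8, 0), True))      # older GPU: fallback
print(pick_attention_backend((12, 0), True, 4))  # newer GPU, (4+8) mode
```

Keeping the fallback explicit means unsupported devices continue to run the existing attention path unchanged.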
FP8 matmuls introduce two implementation caveats:
- FP16 accumulation range (largest finite value 65504) requires range-narrowing and block-wise scale factors.
- Delayed FP32 buffering maintains efficiency but needs careful PTX scheduling.
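Both caveats can be seen together in a small emulation of delayed FP32 buffering. The sketch assumes each element of `partials` is the FP16 result of one TensorCore instruction; FP16 and FP32 rounding are emulated with Python's `struct` formats, and the function names and example values are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round to FP16; values beyond the FP16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_fp32(x: float) -> float:
    """Round to FP32 (the example values stay well inside the FP32 range)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate_immediate(partials):
    """Convert every FP16 partial result to FP32 before adding it."""
    acc, conversions = 0.0, 0
    for r in partials:
        acc = to_fp32(acc + to_fp32(r))
        conversions += 1
    return acc, conversions


def accumulate_delayed(partials):
    """Add two FP16 partial results in FP16, then convert the pair once."""
    acc, conversions = 0.0, 0
    for i in range(0, len(partials), 2):
        pair = partials[i]
        if i + 1 < len(partials):
            pair = to_fp16(pair + partials[i + 1])  # may overflow FP16
        acc = to_fp32(acc + to_fp32(pair))
        conversions += 1
    return acc, conversions


safe = [30720.0] * 4    # each pair sums to 61440, inside the FP16 range
risky = [61440.0] * 4   # each pair sums to 122880, outside the FP16 range

print(accumulate_immediate(safe))   # four conversions
print(accumulate_delayed(safe))     # same sum, two conversions
print(accumulate_delayed(risky))    # pairwise FP16 sum overflows
```

The second case shows why the range bounds must leave room for two accumulated results: partial sums that are individually safe can still overflow once they are paired in FP16.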
6. Conclusions and Future Directions
SageAttention2++ demonstrates that employing FP16-accumulating FP8 MatMul, together with well-designed quantization bounds and delayed FP32 buffering, can yield up to roughly 4× MatMul throughput improvements with negligible accuracy loss. Evaluations on tasks spanning language modeling, image generation, and video synthesis reveal up to a 3.9× reduction in kernel latency relative to FlashAttention2 while maintaining performance metrics nearly identical to those of SageAttention2.
Areas proposed for future investigation include:
- Lower-bit accumulation, such as FP4 × FP8 → FP16 chains
- Dynamic range adaptation per token/block for improved quantization efficiency
- Integration with sparse or linear attention paradigms
- Exploiting new hardware instructions (e.g., Hopper asynchronous FP8)
The reference implementation is slated for release at https://github.com/thu-ml/SageAttention (Zhang et al., 27 May 2025).