SageAttention2: Fast Quantized Attention

Updated 16 January 2026

SageAttention2 is an efficient attention mechanism that employs low-bit quantization to accelerate QK^T and PV computations in standard attention models.
It utilizes INT4/INT8 for Q/K and FP8 for P/V with FP16 accumulation, achieving up to 3.9× faster throughput compared to FlashAttention2.
Empirical benchmarks across language, image, and video tasks demonstrate significant speedups while maintaining nearly identical accuracy to full-precision models.

SageAttention2 is an efficient attention mechanism that addresses the quadratic time complexity of standard attention by using low-bit quantization to accelerate the matrix multiplications inside attention kernels. It targets both the $QK^T$ computation of similarity scores and the $PV$ computation of the weighted value aggregation, quantizing on the fly and optimizing for NVIDIA Tensor Cores. SageAttention2++, a more efficient implementation of SageAttention2, raises kernel throughput further by using FP8 multiplication with FP16 accumulation, narrowing the quantization range to avoid overflow, and reducing data conversion overhead. Empirical results show that SageAttention2++ attains up to a 3.9× speedup over FlashAttention2 with negligible degradation in end-to-end metrics for language, image, and video generation tasks (Zhang et al., 27 May 2025).

1. Core Algorithmic Modifications

SageAttention2++ modifies the matrix multiplication paradigm used in SageAttention2. In typical attention implementations (e.g., FlashAttention-style), computational expense per head is dominated by two operations:

$QK^T$ : Shapes $(N \times d)\,(d \times N) \to N \times N$
$PV$ : Shapes $(N \times N)\,(N \times D) \to N \times D$ (where $D$ is the value head dimension)

SageAttention2 quantizes $Q$ and $K$ to INT4/INT8 and $P$ and $V$ to FP8 (E4M3), performing the $PV$ multiplication with accumulation in FP32 registers using the mma.f32.f8.f8.f32 TensorCore instruction. SageAttention2++ replaces this with FP8 × FP8 $\rightarrow$ FP16 accumulation via mma.f16.f8.f8.f16, which is available on GPUs such as the RTX 4090 (Ada Lovelace) and RTX 5090 (Blackwell). This instruction offers approximately 4× the throughput of native FP16 MatMul and 2× the throughput of FP8 MatMul with FP32 accumulation.

Range narrowing is required because FP16 accumulators overflow beyond $|\text{FP16}_{\mathrm{max}}| = 65504$. The factor 32 below corresponds to the number of products $p \cdot v$ summed into one accumulator per MMA instruction, so $p$ and $v$ are bounded such that $|32 \cdot p \cdot v| \leq 65504$, i.e. $|p \cdot v| \leq 2047$. The ranges $P_r = 112$ and $V_r = 4.5$ are chosen so that delayed FP32 buffering also remains in range: $P_r \cdot V_r = 504 \leq 2047/2$. Delayed FP32 buffering reduces the cost of FP16-to-FP32 conversion by accumulating two FP16 MMA results before each conversion.
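The bounds above can be checked with a few lines of arithmetic. This is only a sanity check of the stated inequalities; the constant name PRODUCTS_PER_MMA is introduced here for illustration and reflects the assumption that the factor 32 counts the products summed per accumulator in one MMA instruction.

```python
FP16_MAX = 65504.0        # largest finite FP16 value
PRODUCTS_PER_MMA = 32     # assumption: products summed per accumulator per MMA
P_R, V_R = 112.0, 4.5     # narrowed quantization ranges for P and V

# Largest |p * v| that keeps a single MMA result inside the FP16 range.
per_product_limit = FP16_MAX / PRODUCTS_PER_MMA

# Worst-case product after range narrowing.
worst_product = P_R * V_R

# Worst-case FP16 accumulator value after two MMA results are summed.
worst_sum_two_mmas = 2 * PRODUCTS_PER_MMA * worst_product

fits_one_mma = worst_product <= per_product_limit
fits_two_mmas = worst_product <= per_product_limit / 2

print(per_product_limit, worst_product, worst_sum_two_mmas)
print(fits_one_mma, fits_two_mmas)
```

Without narrowing, two values at the E4M3 maximum of 448 would already give a single product of 200704, which exceeds the FP16 range on its own.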

2. Quantization Methodology and Formulas

SageAttention2++ employs block-wise quantization for $Q, K, P, V$ with specific schemes:

$Q, K$ quantization: INT4/INT8 block-wise. Given $Q_i \in \mathbb{R}^{B \times d}$ ,

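$\delta_Q = \max |Q_i| / S_Q, \quad \hat{Q}_i = \operatorname{round}(Q_i / \delta_Q), \quad S_Q = 7 \text{ for INT4},\ 127 \text{ for INT8}$

where the maximum is taken over all entries of the block $Q_i$, and $S_Q$ is the largest magnitude of the symmetric signed integer range.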

Reconstruction: $Q_i \approx \hat{Q}_i \cdot \delta_Q$
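A minimal sketch of this block-wise scheme in plain Python may make the scale and reconstruction concrete. The function names quantize_block and dequantize_block are hypothetical, the block is a small nested list rather than a GPU tile, and the example uses the INT8 setting $S_Q = 127$.

```python
def quantize_block(block, s_q=127):
    """Symmetric quantization of one block: returns integer codes and the scale."""
    max_abs = max(abs(x) for row in block for x in row)
    # Guard against an all-zero block (an assumption; any positive scale works).
    delta = max_abs / s_q if max_abs > 0 else 1.0
    q_hat = [
        [max(-s_q, min(s_q, int(round(x / delta)))) for x in row]
        for row in block
    ]
    return q_hat, delta


def dequantize_block(q_hat, delta):
    """Reconstruction: Q_i is approximated by q_hat * delta."""
    return [[q * delta for q in row] for row in q_hat]


q_block = [[0.5, -1.0, 0.25], [2.0, -0.75, 1.5]]
q_hat, delta_q = quantize_block(q_block)
q_rec = dequantize_block(q_hat, delta_q)

# The per-element reconstruction error is at most half a quantization step.
max_err = max(
    abs(a - b) for row_a, row_b in zip(q_block, q_rec) for a, b in zip(row_a, row_b)
)
print(q_hat, delta_q, max_err)
```

Because the scale is derived from the largest entry of the block, a single outlier coarsens the step for every other entry in that block, which is why the block size matters for accuracy.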

$P$ quantization: FP8 (E4M3) per block. Softmax-attention $P = \mathrm{softmax}(QK^T)$ ,

$\delta_P = \max |P| / P_r, \quad \hat{P} = \operatorname{round}(P / \delta_P)$

with $P_r = 112$ .

$V$ quantization: FP8 (E4M3) per-channel,

$\delta_V(j) = \max_{\text{row}} |V(:, j)| / V_r,\quad \hat{V}(:, j) = \operatorname{round}(V(:, j) / \delta_V(j))$

with $V_r = 4.5$ .

$PV$ accumulation: FP16 via TensorCore,

$\text{out}_{\text{FP16}} = \text{mma.f16.f8.f8.f16}(\hat{P}, \hat{V}),\quad O = \text{out}_{\text{FP16}} \cdot \delta_P \cdot \delta_V$

Together:

$\hat{P} = \lfloor P / \delta_P \rceil, \quad \hat{V} = \lfloor V / \delta_V \rceil, \quad O = (\langle \hat{P}, \hat{V} \rangle_{\mathrm{FP16}\ \text{acc.}})\cdot\delta_P\delta_V$
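The following sketch emulates the $P$ and $V$ path in software, assuming non-zero inputs. It is not the CUDA kernel: round_to_e4m3 is a hypothetical helper that rounds to the nearest FP8 E4M3 value (3 mantissa bits, minimum normal exponent $-6$, maximum 448), FP16 accumulation is emulated with the half-precision format of Python's struct module, and the summation order of real TensorCore hardware is not modeled.

```python
import math
import struct

E4M3_MAX = 448.0


def round_to_e4m3(x):
    """Round x to the nearest FP8 E4M3 value (software emulation, saturating)."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)        # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)        # clamp to the minimum normal exponent
    step = 2.0 ** (exp - 3)     # value spacing with 3 mantissa bits
    return math.copysign(round(a / step) * step, x)


def to_fp16(x):
    """Round x to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def quantized_pv(P, V, p_r=112.0, v_r=4.5):
    """Compute O = (P_hat @ V_hat) * delta_P * delta_V with FP16 accumulation."""
    n_k, n_cols = len(V), len(V[0])
    delta_p = max(abs(p) for row in P for p in row) / p_r          # per block
    delta_v = [max(abs(V[k][j]) for k in range(n_k)) / v_r         # per channel
               for j in range(n_cols)]
    P_hat = [[round_to_e4m3(p / delta_p) for p in row] for row in P]
    V_hat = [[round_to_e4m3(V[k][j] / delta_v[j]) for j in range(n_cols)]
             for k in range(n_k)]
    O = []
    for p_row in P_hat:
        out_row = []
        for j in range(n_cols):
            acc = 0.0
            for k in range(n_k):
                acc = to_fp16(acc + p_row[k] * V_hat[k][j])
            out_row.append(acc * delta_p * delta_v[j])             # dequantize
        O.append(out_row)
    return O


P = [[0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.25, 0.25]]
V = [[1.0, -2.0], [0.5, 4.0], [-1.5, 0.25], [2.0, 1.0]]
O = quantized_pv(P, V)
O_ref = [[sum(P[i][k] * V[k][j] for k in range(4)) for j in range(2)] for i in range(2)]
print(O)
print(O_ref)
```

Each E4M3 value carries a relative rounding error of at most $2^{-4}$, so the quantized output should stay within roughly 13% of $\sum_k |P_{ik}|\,|V_{kj}|$ of the reference in the worst case, and is typically much closer.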

3. Computational Complexity and Throughput

Full-precision attention costs $O(N^2 d)$ for $QK^T$ and $O(N^2 D)$ for $PV$ per head, since $P$ is $N \times N$ and $V$ is $N \times D$. SageAttention2++ maintains these asymptotic bounds but improves the constant factor. Defining $F_{\mathrm{FP16}}$ as FP16 MatMul throughput and $F_{\mathrm{F8F16}}$ as FP8 $\times$ FP8 $\rightarrow$ FP16 throughput:

$\text{Speedup} \approx F_{\mathrm{F8F16}} / F_{\mathrm{FP16}} \approx 4\times$

for $PV$ . For the overall kernel time $T$ , with constant factors $C_{QK}$ and $C_{PV}$ :

$T_{\mathrm{full}} \approx \frac{C_{QK} N^2 d}{F_{\mathrm{FP16}}} + \frac{C_{PV} N^2 D}{F_{\mathrm{FP16}}}$

$T_{\mathrm{SA2++}} \approx \frac{C_{QK} N^2 d}{F_{\mathrm{INT4}}} + \frac{C_{PV} N^2 D}{F_{\mathrm{F8F16}}}$

with $F_{\mathrm{INT4}} \gtrsim 2\times F_{\mathrm{FP16}}$ , leading to observed kernel speedups of 3–3.9× over FlashAttention2.
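The cost model can be evaluated numerically to see how the two throughput ratios combine. This is an illustrative sketch only: the function name attention_kernel_time is hypothetical, the $PV$ cost is taken as $N^2 D$ because $P$ is $N \times N$ and $V$ is $N \times D$, throughputs are expressed relative to FP16, and the ratios plugged in are the nominal figures from the text rather than measurements.

```python
def attention_kernel_time(n, d, dv, f_qk, f_pv, c_qk=2.0, c_pv=2.0):
    """Modelled kernel time: QK^T cost over its throughput plus PV cost over its throughput."""
    return c_qk * n * n * d / f_qk + c_pv * n * n * dv / f_pv


N, D_HEAD, D_VALUE = 8192, 128, 128

# Baseline: both MatMuls at FP16 throughput (normalised to 1).
t_full = attention_kernel_time(N, D_HEAD, D_VALUE, f_qk=1.0, f_pv=1.0)

# Quantized: QK^T at the lower bound of 2x, PV at 4x.
t_lower = attention_kernel_time(N, D_HEAD, D_VALUE, f_qk=2.0, f_pv=4.0)

# Hypothetical case in which the integer MatMul also reaches 4x.
t_upper = attention_kernel_time(N, D_HEAD, D_VALUE, f_qk=4.0, f_pv=4.0)

speedup_lower = t_full / t_lower
speedup_upper = t_full / t_upper
print(speedup_lower, speedup_upper)
```

The model ignores softmax, quantization, and memory traffic, so it only brackets the MatMul-bound part of the kernel; measured speedups depend on how much of the runtime those other stages occupy.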

4. Empirical Benchmarks

Tests on NVIDIA RTX 4090 (Ada Lovelace) and RTX 5090 (Blackwell) with head dimensions 64 and 128 and sequence lengths up to 8k showed the following peak kernel speedups:

SageAttention2++(4+8) (INT4 $QK$ , FP8 $PV$ ): $\sim$ 3.9× vs. FlashAttention2
SageAttention2++(8+8) (INT8 $QK$ , FP8 $PV$ ): $\sim$ 3.0×

Consistent gains were observed for both causal and non-causal masks. End-to-end metrics for representative models:

Model	Attention Variant	Metric (last row: difference vs. full precision)
Llama3.1 (8B), language	Full-prec	Ppl 6.013
	SageAttn2	Ppl 6.019
	SageAttn2++(8+8)	Ppl 6.020
CogVideoX (2B), text $\to$ video	Full-prec	CLIPSim 0.179 / FScore 4.974
	SageAttn2(8+8)	CLIPSim 0.178 / FScore 4.899
	SageAttn2++(8+8)	CLIPSim 0.179 / FScore 4.386
Flux / StableDiffusion3.5, text $\to$ image	Full-prec vs SageAttn2++	$\lesssim$ 0.5 FID, $\lesssim$ 0.02 sFID

Across language, image, and video models, SageAttention2++(8+8) matches SageAttention2's metrics, while the (4+8) variant incurs only slight degradation.

5. Architectural Integration and Implementation

SageAttention2++ functions as a drop-in replacement for the $PV$ kernel in FlashAttention-style fused kernels. Two new CUDA kernels (one per quantization mode) are required, invoked in place of the existing FlashAttention call (for example flash_attn_func from the flash-attn package, or PyTorch's torch.nn.functional.scaled_dot_product_attention). The hardware must support the mma.f16.f8.f8.f16 instruction (Ada Lovelace or later GPUs) and provide sufficient shared memory for the quantized $P, V$ blocks; tiling follows FlashAttention conventions.

FP8 matmuls introduce two implementation caveats:

FP16 accumulation range ( $\sim \pm 6.55\times 10^4$ ) requires range-narrowing and block-wise scale factors.
Delayed FP32 buffering maintains efficiency but needs careful PTX scheduling.
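Both caveats can be illustrated with a small software emulation. This is a sketch under stated assumptions, not the kernel's PTX: the helper names are hypothetical, each emulated MMA step sums 32 products sequentially in FP16 (real hardware ordering differs), a Python float stands in for the FP32 buffer, and the FP16 accumulator is flushed into that buffer after every two MMA steps.

```python
import math
import struct


def to_fp16(x):
    """Round x to the nearest IEEE half-precision value; overflow becomes infinity."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def mma_fp16(p_vals, v_vals, acc=0.0):
    """Emulate one MMA step: add the products to an FP16 accumulator."""
    for p, v in zip(p_vals, v_vals):
        acc = to_fp16(acc + p * v)
    return acc


def dot_delayed_fp32(p_vals, v_vals, k=32, flush_every=2):
    """Dot product with FP16 accumulation and delayed flushes to a wider buffer."""
    total = 0.0   # stands in for the FP32 buffer
    acc16 = 0.0   # FP16 accumulator
    for step, start in enumerate(range(0, len(p_vals), k), start=1):
        acc16 = mma_fp16(p_vals[start:start + k], v_vals[start:start + k], acc16)
        if step % flush_every == 0:
            total += acc16   # one conversion per two MMA results
            acc16 = 0.0
    return total + acc16


# Worst case with narrowed ranges (P_r = 112, V_r = 4.5): stays finite.
narrow = dot_delayed_fp32([112.0] * 128, [4.5] * 128)

# Worst case with the full E4M3 range (448): the FP16 accumulator overflows.
wide = dot_delayed_fp32([448.0] * 128, [448.0] * 128)

print(narrow, wide)
```

Flushing every two steps halves the number of conversions relative to flushing after every MMA, at the price of the tighter bound on $P_r \cdot V_r$; the emulation also shows the small rounding drift that FP16 accumulation introduces near the top of its range.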

6. Conclusions and Future Directions

SageAttention2++ demonstrates that FP16-accumulating FP8 MatMul, together with well-designed quantization bounds and delayed FP32 buffering, can yield up to 4× higher $PV$ MatMul throughput than FP16 with negligible accuracy loss. Evaluations on tasks spanning language modeling, image generation, and video synthesis show kernel speedups of up to 3.9× relative to FlashAttention2 while maintaining metrics nearly identical to those of SageAttention2.

Areas proposed for future investigation include:

Lower-bit accumulation, such as FP4 $\to$ FP8 $\to$ FP16 chains
Dynamic range adaptation per token/block for improved quantization efficiency
Integration with sparse or linear attention paradigms
Exploiting new hardware instructions (e.g., Hopper asynchronous FP8)

The reference implementation is slated for release at https://github.com/thu-ml/SageAttention (Zhang et al., 27 May 2025).

References (1)

SageAttention2++: A More Efficient Implementation of SageAttention2 (2025)
