
SageAttention2: Fast Quantized Attention

Updated 16 January 2026
  • SageAttention2 is an efficient attention mechanism that employs low-bit quantization to accelerate QK^T and PV computations in standard attention models.
  • It utilizes INT4/INT8 for Q/K and FP8 for P/V with FP16 accumulation, achieving up to 3.9× faster throughput compared to FlashAttention2.
  • Empirical benchmarks across language, image, and video tasks demonstrate significant speedups while maintaining nearly identical accuracy to full-precision models.

SageAttention2 is an efficient attention mechanism designed to address the quadratic time complexity of standard attention operations, specifically by leveraging low-bit quantization to accelerate the critical matrix multiplications within attention kernels. The method targets both the $QK^T$ computation for similarity scores and the $PV$ computation for weighted value aggregation, quantizing on the fly and optimizing for NVIDIA TensorCore architecture. SageAttention2++ introduces further optimizations, achieving substantial kernel throughput improvements by utilizing FP8 multiplication with FP16 accumulation, minimizing range overflow, and reducing data conversion overhead. Empirical results demonstrate that SageAttention2++ attains up to a 3.9× speedup over FlashAttention2, with negligible degradation in end-to-end metrics for language, image, and video generation tasks (Zhang et al., 27 May 2025).

1. Core Algorithmic Modifications

SageAttention2++ modifies the matrix multiplication paradigm used in SageAttention2. In typical attention implementations (e.g., FlashAttention-style), computational expense per head is dominated by two operations:

  • $QK^T$: shapes $(N \times d)(d \times N) \to N \times N$
  • $PV$: shapes $(N \times N)(N \times D) \to N \times D$ (where $D$ is the value head dimension)

SageAttention2 quantizes $Q$ and $K$ as INT4/INT8 and $P$ and $V$ as FP8 (E4M3), performing the $PV$ multiplication with accumulation in FP32 registers using the mma.f32.f8.f8.f32 TensorCore instruction. SageAttention2++ replaces this with FP8 × FP8 → FP16 accumulation via mma.f16.f8.f8.f16, supported by GPUs such as the RTX 4090 (Ada Lovelace) and RTX 5090 (Blackwell). This change yields approximately 4× higher throughput than native FP16 MatMul and 2× higher throughput than FP8 → FP32 accumulation.

Range narrowing is required because FP16 accumulators can overflow ($|\text{FP16}_{\max}| = 65{,}504$). To control this, the quantized values $p$ and $v$ are bounded such that $32 \cdot p \cdot v \leq 65{,}504$, i.e. $p \cdot v \leq 2047$. The ranges $P_r = 112$ and $V_r = 4.5$ are chosen so that delayed FP32 buffering also remains in range: $P_r \cdot V_r \leq 2047/2$, where the factor of 2 accounts for the two FP16 results that are accumulated before each conversion to FP32. Delayed FP32 buffering reduces PTX conversion cost by halving the number of FP16-to-FP32 conversions.
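The bound can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the worst case in which every product reaches $P_r \cdot V_r$, 32 products are summed per accumulation step, and two such results are buffered in FP16 before the move to FP32.

```python
FP16_MAX = 65504.0  # largest finite FP16 value


def worst_case_accumulator(p_r, v_r, products_per_mma=32, mma_results_buffered=2):
    """Upper bound on the magnitude an FP16 accumulator may hold.

    Assumes every product hits the extreme p_r * v_r, that each
    accumulation step sums `products_per_mma` products, and that
    `mma_results_buffered` such results are added together in FP16
    before the partial sum is converted to FP32.
    """
    return products_per_mma * mma_results_buffered * p_r * v_r


def fits_in_fp16(p_r, v_r, **kwargs):
    """True if the worst-case partial sum stays within the FP16 range."""
    return worst_case_accumulator(p_r, v_r, **kwargs) <= FP16_MAX


if __name__ == "__main__":
    print(worst_case_accumulator(112, 4.5))  # 32256.0
    print(fits_in_fp16(112, 4.5))            # True
    print(fits_in_fp16(448, 4.5))            # False: 129024 exceeds 65504
```

With $P_r = 112$ and $V_r = 4.5$ the product is 504, comfortably under the $2047/2 = 1023.5$ limit, whereas using the full E4M3 magnitude of 448 for $P$ would overflow under the same assumptions.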

2. Quantization Methodology and Formulas

SageAttention2++ employs block-wise quantization for $Q, K, P, V$ with specific schemes:

  • $Q, K$ quantization: INT4/INT8 block-wise. Given $Q_i \in \mathbb{R}^{B \times d}$,

$$\delta_Q = \max |Q_i| / S_Q, \quad \hat{Q}_i = \operatorname{round}(Q_i / \delta_Q), \quad S_Q = 7 \text{ for INT4},\ 127 \text{ for INT8}$$

where $S_Q$ is the largest magnitude representable in the signed integer format. Reconstruction: $Q_i \approx \hat{Q}_i \cdot \delta_Q$

  • $P$ quantization: FP8 (E4M3) per block. Softmax-attention $P = \mathrm{softmax}(QK^T)$,

$$\delta_P = \max |P| / P_r, \quad \hat{P} = \operatorname{round}(P / \delta_P)$$

with $P_r = 112$.

  • $V$ quantization: FP8 (E4M3) per-channel,

$$\delta_V(j) = \max_{\text{row}} |V(:, j)| / V_r, \quad \hat{V}(:, j) = \operatorname{round}(V(:, j) / \delta_V(j))$$

with $V_r = 4.5$.

  • $PV$ accumulation: FP16 via TensorCore,

$$\text{out}_{\text{FP16}} = \text{mma.f16.f8.f8.f16}(\hat{P}, \hat{V}), \quad O = \text{out}_{\text{FP16}} \cdot \delta_P \cdot \delta_V$$

Together:

$$\hat{P} = \lfloor P / \delta_P \rceil, \quad \hat{V} = \lfloor V / \delta_V \rceil, \quad O = \left(\langle \hat{P}, \hat{V} \rangle_{\mathrm{FP16}\ \text{acc.}}\right) \cdot \delta_P \delta_V$$
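The scaling logic in these formulas can be sketched in plain Python. This is a minimal illustration rather than the reference kernels: all function names are hypothetical, blocks are lists of rows, the integer path assumes a symmetric signed range (7 for INT4, 127 for INT8), and FP8 (E4M3) rounding of $\hat{P}$ and $\hat{V}$ is deliberately not simulated, so only the range narrowing and rescaling are shown.

```python
def quantize_int_block(block, s_q):
    """Symmetric integer quantization of one block: returns (q_hat, delta)."""
    max_abs = max(abs(x) for row in block for x in row)
    delta = max_abs / s_q if max_abs > 0 else 1.0  # avoid dividing by zero
    q_hat = [[round(x / delta) for x in row] for row in block]
    return q_hat, delta


def dequantize_block(q_hat, delta):
    """Reconstruction Q ~= Q_hat * delta."""
    return [[q * delta for q in row] for row in q_hat]


def narrow_block(block, target_range):
    """Scale a block so its largest magnitude equals target_range (P_r)."""
    max_abs = max(abs(x) for row in block for x in row)
    delta = max_abs / target_range if max_abs > 0 else 1.0
    return [[x / delta for x in row] for row in block], delta


def narrow_per_channel(v, target_range):
    """Scale each column of v so its largest magnitude equals target_range (V_r)."""
    n_cols = len(v[0])
    deltas = []
    for j in range(n_cols):
        max_abs = max(abs(row[j]) for row in v)
        deltas.append(max_abs / target_range if max_abs > 0 else 1.0)
    v_hat = [[row[j] / deltas[j] for j in range(n_cols)] for row in v]
    return v_hat, deltas


def scaled_pv(p, v, p_r=112.0, v_r=4.5):
    """Compute O = (P_hat @ V_hat) * delta_P * delta_V(j), without FP8 rounding."""
    p_hat, delta_p = narrow_block(p, p_r)
    v_hat, delta_v = narrow_per_channel(v, v_r)
    return [
        [
            sum(p_row[k] * v_hat[k][j] for k in range(len(v_hat))) * delta_p * delta_v[j]
            for j in range(len(delta_v))
        ]
        for p_row in p_hat
    ]


if __name__ == "__main__":
    q_hat, delta_q = quantize_int_block([[0.5, -1.0], [0.25, 0.7]], s_q=7)
    print(q_hat, delta_q)
    print(scaled_pv([[0.7, 0.3], [0.1, 0.9]], [[1.0, -2.0], [3.0, 0.5]]))
```

Because the FP8 cast is omitted, `scaled_pv` reproduces the exact product up to floating-point error; in a real kernel the cast is where the quantization error enters, and the scale factors only control where the values sit within the FP8 and FP16 ranges.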

3. Computational Complexity and Throughput

Full-precision attention costs $O(N^2 d)$ for $QK^T$ and $O(N^2 D)$ for $PV$ per head. SageAttention2++ maintains these asymptotic bounds but improves the constant factor. Defining $F_{\mathrm{FP16}}$ as FP16 MatMul throughput and $F_{\mathrm{F8F16}}$ as FP8 × FP8 → FP16 throughput:

$$\text{Speedup} \approx F_{\mathrm{F8F16}} / F_{\mathrm{FP16}} \approx 4\times$$

for $PV$. For the overall kernel time $T$:

$$T_{\mathrm{full}} \approx \frac{C_{QK} N^2 d}{F_{\mathrm{FP16}}} + \frac{C_{PV} N^2 D}{F_{\mathrm{FP16}}}$$

$$T_{\mathrm{SA2++}} \approx \frac{C_{QK} N^2 d}{F_{\mathrm{INT4}}} + \frac{C_{PV} N^2 D}{F_{\mathrm{F8F16}}}$$

with $F_{\mathrm{INT4}} \gtrsim 2\times F_{\mathrm{FP16}}$, leading to observed kernel speedups of 3–3.9× over FlashAttention2.
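To see how the two throughput ratios combine, the time model above can be evaluated numerically. The following is a rough sketch under stated assumptions: the function names are hypothetical, the work terms are passed in directly as abstract operation counts, and the example assumes the $QK^T$ and $PV$ matmuls carry equal work, which the source does not state.

```python
def kernel_time(work_qk, work_pv, f_qk, f_pv):
    """Modeled kernel time: each matmul's work divided by its throughput."""
    return work_qk / f_qk + work_pv / f_pv


def modeled_speedup(work_qk, work_pv, qk_ratio, pv_ratio, f_fp16=1.0):
    """Ratio T_full / T_quantized.

    qk_ratio and pv_ratio are the quantized throughputs for QK^T and PV,
    expressed as multiples of the FP16 throughput f_fp16.
    """
    t_full = kernel_time(work_qk, work_pv, f_fp16, f_fp16)
    t_quant = kernel_time(work_qk, work_pv, qk_ratio * f_fp16, pv_ratio * f_fp16)
    return t_full / t_quant


if __name__ == "__main__":
    # Equal work in both matmuls (assumption), PV at 4x FP16 throughput.
    print(round(modeled_speedup(1.0, 1.0, qk_ratio=2.0, pv_ratio=4.0), 2))  # 2.67
    print(modeled_speedup(1.0, 1.0, qk_ratio=4.0, pv_ratio=4.0))            # 4.0
```

Within this simplified model, the lower bound of 2× for the $QK^T$ path gives about 2.67×, so speedups in the 3–3.9× range correspond to an effective $QK^T$ throughput noticeably above that lower bound. The model ignores softmax, quantization overhead, and memory traffic.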

4. Empirical Benchmarks

Tests on NVIDIA RTX 4090 (Ada Lovelace) and RTX 5090 (Blackwell) GPUs with head dimensions 64 and 128 and sequence lengths up to 8k revealed peak kernel speedups:

  • SageAttention2++(4+8) (INT4 $QK^T$, FP8 $PV$): ~3.9× vs. FlashAttention2
  • SageAttention2++(8+8) (INT8 $QK^T$, FP8 $PV$): ~3.0× vs. FlashAttention2

Consistent gains were observed for both causal and non-causal masks. End-to-end metrics for representative models:

| Model (task) | Attention variant | Metric |
|---|---|---|
| Llama3.1 (8B), language | Full-precision | Ppl 6.013 |
| | SageAttn2 | Ppl 6.019 |
| | SageAttn2++(8+8) | Ppl 6.020 |
| CogVideoX (2B), text → video | Full-precision | CLIPSim 0.179 / FScore 4.974 |
| | SageAttn2(8+8) | CLIPSim 0.178 / FScore 4.899 |
| | SageAttn2++(8+8) | CLIPSim 0.179 / FScore 4.386 |
| Flux / Stable Diffusion 3.5, text → image | Full-precision vs. SageAttn2++ | Difference ≲ 0.5 FID, ≲ 0.02 sFID |

Across language, image, and video models, SageAttention2++(8+8) matches SageAttention2's metrics, while the (4+8) variant incurs only slight degradation.

5. Architectural Integration and Implementation

SageAttention2++ functions as a drop-in replacement for the $PV$ kernel in FlashAttention-style fused kernels. Two new CUDA kernels (one per quantization mode) are required, invoked in place of the model's existing FlashAttention call (for example, torch.nn.functional.scaled_dot_product_attention in PyTorch). Hardware compatibility mandates support for mma.f16.f8.f8.f16 instructions (Ada Lovelace or later GPUs) and sufficient shared memory for the quantized $P, V$ blocks; tiling follows FlashAttention conventions.

FP8 matmuls introduce two implementation caveats:

  • FP16 accumulation range ($\approx \pm 6.55 \times 10^4$) requires range-narrowing and block-wise scale factors.
  • Delayed FP32 buffering maintains efficiency but needs careful PTX scheduling.

6. Conclusions and Future Directions

SageAttention2++ demonstrates that employing FP16-accumulating FP8 MatMul, together with well-designed quantization bounds and delayed FP32 buffering, can yield up to 4× higher MatMul throughput than FP16 with negligible accuracy loss. Evaluations on tasks spanning language modeling, image generation, and video synthesis show kernels up to 3.9× faster than FlashAttention2 while maintaining performance metrics nearly identical to those of SageAttention2.

Areas proposed for future investigation include:

  • Lower-bit accumulation, such as FP4 → FP8 → FP16 chains
  • Dynamic range adaptation per token/block for improved quantization efficiency
  • Integration with sparse or linear attention paradigms
  • Exploiting new hardware instructions (e.g., Hopper asynchronous FP8)

The reference implementation is slated for release at https://github.com/thu-ml/SageAttention (Zhang et al., 27 May 2025).
