SageAttention2++: A More Efficient Implementation of SageAttention2
(2505.21136v3)
Published 27 May 2025 in cs.LG, cs.AI, cs.AR, and cs.CV
Abstract: The efficiency of attention is critical because its time complexity grows quadratically with sequence length. SageAttention2 addresses this by utilizing quantization to accelerate matrix multiplications (Matmul) in attention. To further accelerate SageAttention2, we propose to utilize the faster instruction of FP8 Matmul accumulated in FP16. The instruction is 2x faster than the FP8 Matmul used in SageAttention2. Our experiments show that SageAttention2++ achieves a 3.9x speedup over FlashAttention while maintaining the same attention accuracy as SageAttention2. This means SageAttention2++ effectively accelerates various models, including those for language, image, and video generation, with negligible end-to-end metrics loss. The code will be available at https://github.com/thu-ml/SageAttention.
The paper introduces SageAttention2++, a novel approach that accelerates attention computation by using a faster FP16-accumulated FP8 matrix multiplication instruction.
The method refines quantization ranges and employs Delayed FP32 Buffering to keep accumulated products within FP16 limits and ensure minimal error.
Experimental results demonstrate up to a 3.9x speedup over FlashAttention2 while preserving key performance metrics across text, image, and video generation tasks.
This paper, "SageAttention2++: A More Efficient Implementation of SageAttention2" (2505.21136), focuses on improving the efficiency of the attention mechanism in deep learning models, which is crucial due to its quadratic time complexity with respect to sequence length. SageAttention2++ builds upon previous work, SageAttention2, by leveraging a faster hardware instruction for matrix multiplication (Matmul).
The core idea behind SageAttention2 and SageAttention2++ is to accelerate attention computation through quantization and hardware-optimized kernels, while maintaining full sequence computation unlike linear or sparse attention methods. SageAttention2 quantizes Query (Q) and Key (K) matrices to INT4/INT8 and the intermediate Probability (P) and Value (V) matrices to FP8 before performing matrix multiplications using Tensor Cores. Specifically, for the PV Matmul, SageAttention2 utilizes the mma.f32.f8.f8.f32 instruction, which uses an FP32 accumulator and offers a 2x speedup over FP16 Matmul.
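As a rough, non-tiled illustration of this quantization scheme (not the authors' CUDA kernel), the following PyTorch sketch quantizes Q and K to INT8 and P and V to FP8 (E4M3) with per-tensor scales and then dequantizes the products; it assumes a recent PyTorch build that provides torch.float8_e4m3fn, and the low-precision Matmuls are emulated in FP32 rather than run on Tensor Cores:

```python
import torch

# Conceptual, non-tiled sketch of the quantization scheme described above
# (per-tensor scales for simplicity; the real kernel tiles the computation,
# uses online softmax, and runs the Matmuls on Tensor Cores).

def quantize_int8(x):
    scale = x.abs().amax() / 127.0
    return torch.clamp((x / scale).round(), -127, 127).to(torch.int8), scale

def quantize_fp8(x, r):
    # r plays the role of the quantization range (P_r or V_r in the paper)
    scale = x.abs().amax() / r
    return (x / scale).to(torch.float8_e4m3fn), scale

q, k, v = (torch.randn(1024, 128) for _ in range(3))

q8, sq = quantize_int8(q)
k8, sk = quantize_int8(k)
# QK^T with dequantization by the product of scales (emulated in FP32 here)
s = (q8.float() @ k8.float().T) * (sq * sk) / 128 ** 0.5
p = torch.softmax(s, dim=-1)

p8, sp = quantize_fp8(p, r=448.0)  # 448 is the E4M3 maximum
v8, sv = quantize_fp8(v, r=448.0)
# PV Matmul with FP32 accumulation, as in SageAttention2's mma.f32.f8.f8.f32
o = (p8.float() @ v8.float()) * (sp * sv)

print((o - p @ v).abs().max())  # quantization error of the PV product
```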
SageAttention2++ identifies that a different FP8 Matmul instruction, mma.f16.f8.f8.f16, which uses an FP16 accumulator, provides a significantly higher 4x speedup over FP16 Matmul on GPUs such as the RTX4090 and RTX5090. The main challenge in using this faster instruction is that the FP16 accumulator has a smaller representable range than the FP32 accumulator: each accumulation of 32 products p×v (where p and v are entries of the quantized P̂ and V̂) must stay within the FP16 range of approximately (−65504, 65504).
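A small NumPy toy (not from the paper) makes the overflow risk concrete: summing 32 moderately sized products in an FP16 accumulator already exceeds 65504, whereas an FP32 accumulator handles the same sum easily:

```python
import numpy as np

# Toy check (not from the paper): the mma.m16n8k32 instruction sums 32 products
# per output element, so unscaled FP8-sized operands overflow the FP16 range.

p = np.full(32, 300.0, dtype=np.float16)  # 32 entries from the quantized P
v = np.full(32, 10.0, dtype=np.float16)   # 32 entries from the quantized V

acc = np.float16(0.0)
for a, b in zip(p, v):
    acc = np.float16(acc + np.float16(a * b))  # emulate FP16 accumulation

print(acc)  # inf: 32 * 300 * 10 = 96000 > 65504
print(np.dot(p.astype(np.float32), v.astype(np.float32)))  # 96000.0 with FP32 accumulation
```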
To address this, SageAttention2++ proposes narrowing the quantization ranges of P and V by adjusting their scale factors, δP and δV. The scale factors are defined as δP=∣max(P)∣/Pr and δV=∣max(V)∣/Vr, where P is the intermediate probability matrix and V is the value matrix. The values Pr and Vr determine the quantization range, and they must satisfy the constraint Pr×Vr≤65504/32=2047 to keep the accumulated products within the FP16 range during the mma.m16n8k32 operation.
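As a hedged sketch of how these scale factors can be applied (using the paper's reported choice Pr=112 and Vr=4.5, discussed below, as a concrete instance), the following PyTorch snippet computes δP and δV, quantizes P and V to FP8, and checks the worst-case magnitude of a 32-term accumulation:

```python
import torch

# Sketch of the narrowed quantization ranges (names follow the paper's
# delta_P, delta_V, P_r, V_r; the FP8 cast assumes torch.float8_e4m3fn).

P_r, V_r = 112.0, 4.5              # the paper's reported choice
assert P_r * V_r <= 65504 / 32     # i.e. P_r * V_r <= 2047

P = torch.rand(64, 32)             # softmax probabilities in [0, 1]
V = torch.randn(32, 128)           # value tile, k = 32

delta_P = P.abs().amax() / P_r     # delta_P = |max(P)| / P_r
delta_V = V.abs().amax() / V_r     # delta_V = |max(V)| / V_r

P_hat = (P / delta_P).to(torch.float8_e4m3fn)  # entries bounded by P_r
V_hat = (V / delta_V).to(torch.float8_e4m3fn)  # entries bounded by V_r

# Worst-case magnitude of one 32-term accumulation in the FP16 accumulator:
print(32 * P_r * V_r)              # 16128.0, safely below 65504

# Dequantized PV product (accumulation emulated in FP32 here for portability;
# the kernel performs it in FP16, which the bound above shows is safe):
O = (P_hat.float() @ V_hat.float()) * (delta_P * delta_V)
print((O - P @ V).abs().max())
```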
Furthermore, SageAttention2++ introduces "Delayed FP32 Buffering". This technique reduces data type conversion overhead by accumulating the results of two consecutive FP16 accumulator Matmul operations in FP16 before converting the final result to FP32. This optimization imposes a stricter constraint on the quantization ranges: Pr×Vr≤2047/2. Through experiments, the authors found that using Pr=112 and Vr=4.5 achieved optimal performance while introducing negligible error.
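The toy Python sketch below (an illustration, not the CUDA kernel) mimics this buffering pattern: two consecutive 32-term accumulations are combined in FP16 and only then added to an FP32 buffer, which is why the bound on Pr×Vr is halved:

```python
import numpy as np

# Toy emulation of Delayed FP32 Buffering: two consecutive k=32 accumulations
# are kept in FP16 before one FP32 conversion, so the worst case doubles and
# P_r * V_r must not exceed 2047 / 2.

P_r, V_r = 112.0, 4.5
assert P_r * V_r <= 2047 / 2       # 504 <= 1023.5

rng = np.random.default_rng(0)
p = rng.uniform(0, P_r, size=(4, 32)).astype(np.float16)     # 4 chunks of quantized P entries
v = rng.uniform(-V_r, V_r, size=(4, 32)).astype(np.float16)  # 4 chunks of quantized V entries

fp32_buffer = np.float32(0.0)
for i in range(0, 4, 2):
    # two consecutive 32-term sums stay in the FP16 accumulator ...
    fp16_acc = np.float16(np.dot(p[i], v[i]) + np.dot(p[i + 1], v[i + 1]))
    # ... and are flushed to FP32 only every second Matmul, halving the conversions
    fp32_buffer += np.float32(fp16_acc)

# In the real kernel the buffered result is then dequantized by delta_P * delta_V.
print(fp32_buffer)
```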
For practical implementation, SageAttention2++ follows the tiled approach of FlashAttention and SageAttention2, utilizing online softmax. The quantization uses per-block granularity for Q, K, and P, and per-channel granularity for V, similar to SageAttention2. The core acceleration comes from implementing the PV Matmul using the faster FP16-accumulated FP8 instruction with the adjusted scale factors.
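A minimal sketch of these granularities (illustrative shapes and block size, not the tiled kernel) might look as follows, with one scale per row-block of P and one scale per head-dimension channel of V:

```python
import torch

# Minimal sketch of the quantization granularities (illustrative block size and
# shapes, not the tiled FlashAttention-style kernel).

P = torch.rand(128, 128)           # one tile of softmax probabilities
V = torch.randn(128, 64)           # (sequence tile, head_dim)

P_r, V_r = 112.0, 4.5
BLK = 64                           # example block size

# Per-block quantization of P: one scale per row-block of the tile
P_blocks = P.view(-1, BLK, P.shape[-1])
delta_P = P_blocks.abs().amax(dim=(1, 2), keepdim=True) / P_r
P_hat = (P_blocks / delta_P).to(torch.float8_e4m3fn)

# Per-channel quantization of V: one scale per head-dimension channel
delta_V = V.abs().amax(dim=0, keepdim=True) / V_r
V_hat = (V / delta_V).to(torch.float8_e4m3fn)

print(delta_P.shape, delta_V.shape)  # torch.Size([2, 1, 1]) torch.Size([1, 64])
```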
The paper presents experimental results demonstrating the practical benefits of SageAttention2++:
Kernel Speed: SageAttention2++ (specifically the INT4+FP8 variant) achieves up to a 3.9x speedup over FlashAttention2 and is faster than SageAttention and SageAttention2 on RTX4090 and RTX5090 GPUs, across various sequence lengths and head dimensions (64 and 128), with and without causal masks.
End-to-end Performance: Evaluations on diverse models for text (Llama 3.1), image (Flux, Stable Diffusion 3.5), and video generation (CogVideoX, HunyuanVideo, Wan) show that SageAttention2++ maintains end-to-end metrics (e.g., perplexity, accuracy, FID, CLIPScore) comparable to, or negligibly different from, SageAttention2 and full-precision models. The INT8+FP8 variant shows almost no metrics loss, while the INT4+FP8 variant can introduce minor losses depending on the task and model.
The implementation is done in CUDA, leveraging specific Tensor Core instructions. The code is stated to be available on a public GitHub repository, indicating its readiness for practical use.
In summary, SageAttention2++ offers a plug-and-play attention implementation that provides substantial speedups for inference by leveraging faster low-precision hardware instructions on modern NVIDIA GPUs. This is achieved by carefully adjusting the quantization ranges to fit the capabilities of the FP16 accumulator while preserving accuracy, making it a viable drop-in replacement for standard attention in various generative models.
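Assuming the sageattn entry point exposed by the thu-ml/SageAttention repository (the exact signature and supported layouts may differ by version), a drop-in usage sketch looks like this:

```python
import torch
import torch.nn.functional as F
from sageattention import sageattn  # entry point provided by thu-ml/SageAttention

# (batch, heads, seq_len, head_dim) in FP16 on a supported GPU (e.g. RTX4090/5090)
q = torch.randn(2, 16, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Drop-in replacement for standard attention at inference time
out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)

# Full-precision reference for comparison
ref = F.scaled_dot_product_attention(q, k, v, is_causal=False)
print((out - ref).abs().max())
```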