SageAttention2++: Efficient Transformer Computation
- SageAttention2++ is an attention mechanism that introduces FP8 matrix multiplication with FP16 accumulation to significantly accelerate transformer models.
- It targets the quadratically scaling, performance-critical PV matrix multiplication, delivering up to 4× faster MatMul throughput than standard FP16 methods with minimal accuracy loss.
- The approach leverages advanced GPU tensor core instructions and quantization techniques, achieving up to 3.9× speedup across language, image, and video generation tasks.
SageAttention2++ is an attention computation mechanism designed for efficient acceleration of transformer models by leveraging quantization and advanced GPU tensor core instructions. Building upon SageAttention2, SageAttention2++ introduces FP8 (E4M3) matrix multiplication with FP16 accumulation, delivering substantial gains in throughput and memory efficiency without significant degradation in model accuracy. The methodology targets the quadratic complexity bottleneck of standard attention by optimizing the performance-critical PV matrix multiplication, making it suitable for large-scale language, image, and video generation models (Zhang et al., 27 May 2025).
1. Motivation and Predecessors
Standard scaled-dot-product attention exhibits $O(N^2)$ time and memory complexity for sequence length $N$, imposing severe performance and feasibility constraints at long context lengths typical in modern foundational models. FlashAttention and its successors reduced memory usage via tiling and online softmax computation but continued to depend on high-precision (FP16/FP32) accumulators for core matrix multiplications (MatMuls).
SageAttention2 advanced this paradigm by quantizing the second attention MatMul (PV) to FP8 (E4M3) and deploying the GPU tensor core instruction mma.f32.f8.f8.f32 (FP8 inputs with FP32 accumulator). This quantization achieved approximately 2× speedup over FP16 MatMul, albeit limited by the throughput ceiling of FP32 accumulators.
2. FP8 MatMul with FP16 Accumulation
The core enhancement in SageAttention2++ is the exploitation of a novel tensor core opcode introduced in NVIDIA Ada and later architectures: mma.f16.f8.f8.f16. This instruction multiplies FP8 (E4M3) operands and accumulates the result in FP16, offering increased computational throughput: approximately 4× that of the traditional FP16×FP16→FP32 tensor core path and 2× that of the mma.f32.f8.f8.f32 kernel used in SageAttention2.
Quantization Range Mapping and Accumulation Constraints
A real tensor $X$ is quantized to an 8-bit representation
$$\hat{X} = \psi\!\left(\frac{X}{\delta} + z\right),$$
where $\psi(\cdot)$ denotes the cast to the 8-bit format, $\delta$ is a positive scale factor, and $z$ is a zero-point (set to 0 for symmetric FP8 quantization). The dequantized value is
$$X \approx \delta\,(\hat{X} - z).$$
In SageAttention2, the quantized $\hat{P}$ and $\hat{V}$ tensors occupy the full E4M3 FP8 representable range $[-448, 448]$, with respective scale factors $\delta_P$ and $\delta_V$.
With FP16 accumulation (maximum representable magnitude $65504$) and $k$ products per accumulation, overflow avoidance requires
$$k \cdot \max|\hat{P}| \cdot \max|\hat{V}| \le 65504.$$
Delayed FP32 buffering, in which several FP16 partial accumulators are merged before promotion to FP32, increases the number of products held in FP16 and therefore further tightens this constraint; SageAttention2++ satisfies it by narrowing the quantization ranges to $|\hat{P}| \le 112$ and $|\hat{V}| \le 4.5$.
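The mapping and the overflow predicate can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the paper's kernel code: the helper names are invented here, and clipping to the target range stands in for an actual E4M3 cast (which also rounds to a 3-bit mantissa).

```python
import numpy as np

E4M3_MAX = 448.0    # largest finite E4M3 magnitude
FP16_MAX = 65504.0  # largest finite FP16 magnitude

def quantize_symmetric(x: np.ndarray, target_max: float = E4M3_MAX):
    """Symmetric quantization: pick delta so that x / delta falls in [-target_max, target_max]."""
    delta = np.abs(x).max() / target_max                   # positive scale factor; zero-point z = 0
    x_hat = np.clip(x / delta, -target_max, target_max)    # stand-in for the E4M3 cast
    return x_hat, delta

def dequantize(x_hat: np.ndarray, delta: float) -> np.ndarray:
    """Approximate reconstruction: x ≈ delta * x_hat (symmetric case, z = 0)."""
    return delta * x_hat

def fp16_accumulation_safe(p_max: float, v_max: float, k: int) -> bool:
    """Worst-case overflow check: k products, each bounded in magnitude by
    p_max * v_max, must sum to at most the largest finite FP16 value."""
    return k * p_max * v_max <= FP16_MAX
```

For example, `fp16_accumulation_safe(448.0, 448.0, k=1)` already returns False, which is why the full E4M3 range cannot be paired directly with an FP16 accumulator.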
3. Attention Pipeline and Implementation Modifications
SageAttention2++ retains the FlashAttention tiling and I/O-aware online softmax, with targeted adaptations to exploit quantized MatMul and advanced instruction-level parallelism:
- Quantization Stages:
- $Q$, $K$ quantized to INT4 or INT8 per tile, using dynamic block-level scale factors.
- $P$ quantized to FP8 (E4M3) per block, with scale $\delta_P$.
- $V$ quantized to FP8 (E4M3) per channel, with scale $\delta_V$.
- PV MatMul Kernel:
- Utilizes mma.m16n8k32 (FP8×FP8→FP16) for operating on quantized FP8 blocks.
- Performs two sequential accumulations, retaining partial sums in FP16 registers. Only after both accumulations are computed does conversion to FP32 occur, after which scaling by $\delta_P \cdot \delta_V$ is applied.
- Fuses accumulation scheduling to reduce FP16→FP32 conversion overhead and global memory store barriers.
- Memory Layout:
- $\hat{P}$ and $\hat{V}$ buffers consist of 8-bit E4M3 values with paired 16-bit partial accumulators, minimizing conversion instructions and optimizing kernel scheduling (each CTA processes two FP16 accumulators before global store).
Pseudocode Sketch: PV Block Multiply
```
for each tile (i, j):
    load P_i (FP8 E4M3) and V_j (FP8 E4M3)
    # two FP8×FP8 → FP16 tensor-core MatMuls; partial sums stay in FP16 registers
    acc1_fp16 = mma.m16n8k32(P_i[0..31],  V_j[0..31])
    acc2_fp16 = mma.m16n8k32(P_i[32..63], V_j[32..63])
    # delayed promotion: merge the FP16 partials, then convert to FP32
    acc_fp32 = to_fp32(acc1_fp16 + acc2_fp16)
    # dequantize with the block-level scale factors
    O_block = acc_fp32 * (δ_P * δ_V)
```
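For intuition, the same block multiply can be emulated on the CPU. The sketch below is illustrative only: float16 storage stands in for E4M3 (real E4M3 has a much coarser mantissa), the function names and the 16×64×128 block shape are chosen for the example, and V's per-channel scaling is simplified to a single per-block scale.

```python
import numpy as np

def fp16_matmul(a16: np.ndarray, b16: np.ndarray) -> np.ndarray:
    """Matrix multiply with an explicit float16 accumulator, mirroring the FP16-accumulating
    mma.f16.f8.f8.f16 semantics. Products are rounded to float16 before the add, which is a
    simplification of the hardware behavior."""
    m, k = a16.shape
    _, n = b16.shape
    acc = np.zeros((m, n), dtype=np.float16)
    for t in range(k):
        acc += a16[:, t:t + 1] * b16[t:t + 1, :]   # each rank-1 update added in float16
    return acc

def pv_block_multiply(P: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Illustrative CPU emulation of the PV block MatMul sketched above."""
    # Per-block symmetric scales targeting the narrowed ranges; float16 storage stands in for E4M3.
    delta_P = np.abs(P).max() / 112.0
    delta_V = np.abs(V).max() / 4.5
    P_hat = np.clip(P / delta_P, -112.0, 112.0).astype(np.float16)
    V_hat = np.clip(V / delta_V, -4.5, 4.5).astype(np.float16)

    # Two partial accumulations in FP16 (as with two mma.m16n8k32 calls), then a delayed
    # merge in FP32 and dequantization by the product of the block scales.
    acc1 = fp16_matmul(P_hat[:, :32], V_hat[:32, :])
    acc2 = fp16_matmul(P_hat[:, 32:], V_hat[32:, :])
    acc_fp32 = acc1.astype(np.float32) + acc2.astype(np.float32)
    return acc_fp32 * np.float32(delta_P * delta_V)

rng = np.random.default_rng(0)
P = rng.random((16, 64), dtype=np.float32)                 # softmax-like block, values in [0, 1)
V = rng.standard_normal((64, 128)).astype(np.float32)      # head_dim = 128
ref = P.astype(np.float64) @ V.astype(np.float64)
out = pv_block_multiply(P, V)
print(np.abs(out - ref).max() / np.abs(ref).max())         # small relative error expected
```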
4. Empirical Evaluation
Microbenchmarks
On RTX 4090 and 5090 GPUs (head-dim = 128, sequence length up to 4k):
- The mma.f16.f8.f8.f16 instruction delivers roughly 2× the throughput of the FP16 tensor core path (mma.f16.f16.f16.f16).
- SageAttention2++ is roughly 2× faster than SageAttention2 on the PV stage and improves overall quantized attention kernel throughput by at least $1.3$×.
- Against FlashAttention2, SageAttention2++ delivers up to a 3.9× kernel speedup with the INT4+FP8 variant (3.0× with INT8+FP8).
End-to-End Results
SageAttention2++ has been integrated into state-of-the-art generative models, demonstrating the following speedups over FlashAttention2:
| Model Domain | Example Models | Speedup | Evaluation Metrics |
|---|---|---|---|
| Language | Llama 3.1 8B | 3.5–3.9× | Perplexity, Accuracy |
| Video | CogVideoX 2B, HunyuanVideo, Wan | 3.5–3.9× | CLIPSIM, CLIP-T, VQA-a, VQA-t, Flow-score |
| Image | Flux, Stable-Diffusion 3.5 | 3.5–3.9× | FID, sFID, CLIP, ImageReward |
Latency curves for SageAttention2++ remain below those for FlashAttention2 across all context lengths.
Accuracy Retention
- Cosine similarity and relative L1 error between SageAttention2++ and full-precision attention output match those of SageAttention2, with cosine similarity at 99.97%.
- End-to-end metric differences are negligible (ΔPpl < 0.01, ΔFID < 1.0), validating that range narrowing ($|\hat{P}| \le 112$, $|\hat{V}| \le 4.5$) preserves model fidelity.
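Both accuracy metrics are standard; a minimal NumPy formulation (assuming outputs are compared as flattened tensors, which is an assumption of this sketch rather than a detail from the paper) is:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two attention outputs, flattened to vectors."""
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def relative_l1(a: np.ndarray, ref: np.ndarray) -> float:
    """Relative L1 error of `a` with respect to the full-precision reference `ref`."""
    return float(np.abs(a - ref).sum() / np.abs(ref).sum())
```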
5. Numerical Stability, Trade-offs, and Resource Efficiency
- Numerical Stability: Limiting $|\hat{P}| \le 112$ and $|\hat{V}| \le 4.5$ guarantees accumulator safety and precludes FP16 overflow (a worked check follows this list), albeit with a reduction in dynamic range. Empirically, these constraints yield negligible error across typical attention distributions. In rare pathological cases exhibiting extremely peaky $P$ or $V$ distributions, quantization clipping may occur.
- Dynamic Range vs. Speed: Expanding the product $\max|\hat{P}| \cdot \max|\hat{V}|$ would require more frequent FP32 conversion, negating throughput gains. The selected ($112$, $4.5$) setting balances stability and computational efficiency.
- Memory and Energy: FP8 storage halves tensor memory relative to FP16. FP16 accumulation halves energy per operation versus FP32. Coupled with faster tensor core throughput, SageAttention2++ reduces both memory footprint and inference energy.
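As a worked check of the overflow bound referenced above, plugging in the narrowed ranges and the 64 products implied by two mma.m16n8k32 accumulations (an inference from Section 3 rather than a figure quoted from the paper) gives:

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)   # 65504.0

# Narrowed ranges |P_hat| <= 112 and |V_hat| <= 4.5, with 64 products held in FP16
# before promotion to FP32 (two mma.m16n8k32 accumulations of 32 products each).
print(64 * 112.0 * 4.5, "<=", FP16_MAX)      # 32256.0 <= 65504.0 -> no FP16 overflow
# The full E4M3 range would overflow on a single product:
print(448.0 * 448.0, ">", FP16_MAX)          # 200704.0 > 65504.0
```

This is the constraint that the ($112$, $4.5$) setting is chosen to satisfy.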
6. Conclusions and Prospective Directions
SageAttention2++ demonstrates acceleration of the PV MatMul by up to 2× over SageAttention2 and roughly 4× over FlashAttention's FP16 MatMul, with no meaningful loss in accuracy. Key methodological components include the range-narrowed FP8 quantization, delayed FP32 buffering, and kernel fusion for reduced conversion overhead. Notable avenues for future exploration include:
- Automatic per-layer selection of quantization range parameters to accommodate outliers,
- Investigation of lower-bit or mixed-precision (e.g., FP10) accumulators,
- Combination with sparse or linear attention variants to further mitigate scaling,
- Extending SageAttention2++ principles to mixed-precision training via back-propagation (Zhang et al., 27 May 2025).