SageAttention2++: Efficient Transformer Computation

Updated 28 December 2025
  • SageAttention2++ is an attention mechanism that introduces FP8 matrix multiplication with FP16 accumulation to significantly accelerate transformer models.
  • It targets the quadratic-cost PV matrix multiplication, delivering up to 4× faster MatMul throughput than standard FP16 tensor core paths with minimal accuracy loss.
  • The approach leverages advanced GPU tensor core instructions and quantization techniques, achieving up to 3.9× speedup across language, image, and video generation tasks.

SageAttention2++ is an attention computation mechanism designed for efficient acceleration of transformer models by leveraging quantization and advanced GPU tensor core instructions. Building upon SageAttention2, SageAttention2++ introduces FP8 (E4M3) matrix multiplication with FP16 accumulation, delivering substantial gains in throughput and memory efficiency without significant degradation in model accuracy. The methodology targets the quadratic complexity bottleneck of standard attention by optimizing the performance-critical PV matrix multiplication, making it suitable for large-scale language, image, and video generation models (Zhang et al., 27 May 2025).

1. Motivation and Predecessors

Standard scaled-dot-product attention exhibits $\mathcal{O}(L^2)$ time and memory complexity for sequence length $L$, imposing severe performance and feasibility constraints at the long context lengths typical of modern foundation models. FlashAttention and its successors reduced memory usage via tiling and online softmax computation but continued to depend on high-precision (FP16/FP32) accumulators for the core matrix multiplications (MatMuls).

SageAttention2 advanced this paradigm by quantizing the second attention MatMul (PV) to FP8 (E4M3) and deploying the GPU tensor core instruction mma.f32.f8.f8.f32 (FP8 inputs with FP32 accumulator). This quantization achieved approximately $2\times$ speedup over FP16 MatMul, albeit limited by the throughput ceiling of FP32 accumulators.

2. FP8 MatMul with FP16 Accumulation

The core enhancement in SageAttention2++ is the exploitation of a novel tensor core opcode introduced in NVIDIA Ada and later architectures: mma.f16.f8.f8.f16. This instruction multiplies FP8 (E4M3) operands and accumulates the result in FP16, offering increased computational throughput—measured at approximately $4\times$ faster than the traditional FP16×FP16→FP32 tensor core and $2\times$ faster than the mma.f32.f8.f8.f32 kernel used in SageAttention2.

Quantization Range Mapping and Accumulation Constraints

A real tensor $X$ is quantized to an 8-bit representation:

$$x_q = \mathrm{round}(X / s) + z$$

where $s$ is a positive scale factor and $z$ is a zero-point (set to 0 for symmetric FP8 quantization). The dequantized value is

$$\hat{X} = s \cdot (x_q - z)$$

In SageAttention2, the $P$ and $V$ tensors occupy the full E4M3 FP8 representable range $[-448, 448]$, with respective scale factors $\delta_P = \max|P| / 448$ and $\delta_V = \max|V| / 448$.
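
Python Sketch: Symmetric E4M3 Quantization

A minimal NumPy illustration of the symmetric ($z = 0$) mapping above, assuming the optional ml_dtypes package for an E4M3 dtype; function names and shapes are illustrative rather than the paper's implementation.

import numpy as np
import ml_dtypes  # assumed optional dependency providing float8_e4m3fn

E4M3_MAX = 448.0  # largest finite E4M3 value

def quantize_e4m3(x: np.ndarray):
    """Symmetric FP8 quantization: s = max|X| / 448, x_q = round(X / s)."""
    scale = np.abs(x).max() / E4M3_MAX                  # assumes a nonzero tensor
    x_q = (x / scale).astype(ml_dtypes.float8_e4m3fn)   # rounds to nearest E4M3 value
    return x_q, scale

def dequantize(x_q: np.ndarray, scale: float) -> np.ndarray:
    """X_hat = s * x_q (zero-point z = 0)."""
    return x_q.astype(np.float32) * scale

V = np.random.randn(64, 128).astype(np.float32)
V_q, s_V = quantize_e4m3(V)
print("mean abs dequantization error:", np.abs(dequantize(V_q, s_V) - V).mean())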

With FP16 accumulation width $[-65504, 65504]$ and $k = 32$ products per accumulation, overflow avoidance requires

$$|k \cdot p_{\max} \cdot v_{\max}| \leq 65504 \implies P_r \cdot V_r \leq 2047$$

Delayed FP32 buffering, in which two $k = 32$ FP16 accumulators are merged before conversion, further tightens this constraint to $P_r \cdot V_r \leq 1023$.
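
Python Sketch: FP16 Overflow Bounds

The bounds above follow from simple arithmetic; the following lines (illustrative only) reproduce the 2047 and 1023 limits and check the narrowed ranges adopted later ($P_r = 112$, $V_r = 4.5$).

FP16_MAX = 65504.0         # largest finite float16 value
K = 32                     # products per mma.m16n8k32 accumulation

print(FP16_MAX / K)        # 2047.0  -> single accumulation: P_r * V_r <= 2047
print(FP16_MAX / (2 * K))  # 1023.5  -> two merged accumulations: P_r * V_r <= 1023
print(112 * 4.5)           # 504.0, well inside the tighter bound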

3. Attention Pipeline and Implementation Modifications

SageAttention2++ retains the FlashAttention tiling and I/O-aware online softmax, with targeted adaptations to exploit quantized MatMul and advanced instruction-level parallelism:

  • Quantization Stages (see the scale sketch following this list):
    • $Q$, $K$ quantized to INT4 or INT8 per tile, using dynamic block-level scale factors.
    • $P$ quantized to FP8 (E4M3) per block, scale $\delta_P = \max|\widetilde{P}| / P_r$.
    • $V$ quantized to FP8 (E4M3) per channel, scale $\delta_V = \max|V_\text{col}| / V_r$.
  • PV MatMul Kernel:
    • Utilizes mma.m16n8k32 (FP8×FP8→FP16) for operating on quantized FP8 blocks.
    • Performs two sequential $k = 32$ accumulations, retaining partial sums in FP16 registers. Only after both accumulations are computed does conversion to FP32 occur, after which scaling by $\delta_P \cdot \delta_V$ is applied.
    • Fuses accumulation scheduling to reduce FP16→FP32 conversion overhead and global memory store barriers.
  • Memory Layout:
    • $P$ and $V$ buffers consist of 8-bit E4M3 values with paired 16-bit partial accumulators, minimizing conversion instructions and optimizing kernel scheduling (each CTA processes two FP16 accumulators before global store).
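
Python Sketch: Per-Block and Per-Channel Scales

A small illustration of the two scale granularities listed above, using the narrowed ranges $P_r = 112$ and $V_r = 4.5$ reported later in this article; tile shapes are arbitrary and the code is not the paper's implementation.

import numpy as np

P_R, V_R = 112.0, 4.5   # narrowed FP8 ranges (see Section 5)

# P~: one post-softmax block (rows: queries in the tile, cols: keys in the tile)
P_tilde = np.random.rand(64, 64).astype(np.float32)
delta_P = np.abs(P_tilde).max() / P_R         # one scale per block

# V: one value tile (rows: keys, cols: head channels)
V = np.random.randn(64, 128).astype(np.float32)
delta_V = np.abs(V).max(axis=0) / V_R         # one scale per channel (column)

print(delta_P, delta_V.shape)                 # scalar and a (128,)-shaped array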

Pseudocode Sketch: PV Block Multiply

for each tile (i, j):
    load P_i (FP8 E4M3), V_j (FP8 E4M3)                        # quantized operands
    # two sequential k=32 slices, chained through the FP16 accumulator
    acc_fp16 = mma.m16n8k32(P_i[0..31],  V_j[0..31],  0)
    acc_fp16 = mma.m16n8k32(P_i[32..63], V_j[32..63], acc_fp16)
    acc_fp32 = to_fp32(acc_fp16)                               # delayed FP32 buffering
    O_block  = acc_fp32 * (δ_P * δ_V)                          # undo quantization scales
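
Python Sketch: Emulating the PV Block Multiply

The pseudocode above can be approximated on the CPU for intuition. The sketch below assumes the optional ml_dtypes package for E4M3 rounding, uses a single per-tensor scale for $V$ for brevity, and only approximates register-level FP16 accumulation (NumPy may upcast internally); names and shapes are illustrative.

import numpy as np
import ml_dtypes  # assumed optional dependency providing float8_e4m3fn

P_R, V_R = 112.0, 4.5   # narrowed quantization ranges reported in the paper

def quant(x, r):
    # Scale so that max|x| maps to the target range r, then round to E4M3.
    scale = np.abs(x).max() / r
    return (x / scale).astype(ml_dtypes.float8_e4m3fn), scale

rng = np.random.default_rng(0)
P = rng.random((16, 64)).astype(np.float32)           # toy softmax block, values in [0, 1]
V = rng.standard_normal((64, 8)).astype(np.float32)   # toy value tile

P_q, d_P = quant(P, P_R)   # per-block scale
V_q, d_V = quant(V, V_R)   # per-tensor scale here; the kernel uses per-channel scales

# Two k=32 slices accumulated in float16, mimicking the chained mma accumulator.
acc = np.zeros((16, 8), dtype=np.float16)
for k0 in (0, 32):
    a = P_q[:, k0:k0 + 32].astype(np.float16)
    b = V_q[k0:k0 + 32, :].astype(np.float16)
    acc = (acc + (a @ b).astype(np.float16)).astype(np.float16)

# Delayed conversion to FP32, then undo the quantization scales.
O_block = acc.astype(np.float32) * (d_P * d_V)
print("max abs error vs FP32 reference:", np.abs(O_block - P @ V).max())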

4. Empirical Evaluation

Microbenchmarks

On RTX 4090 and 5090 GPUs (head-dim = 128, sequence length up to 4k):

  • The mma.f16.f8.f8.f16 instruction is $4\times$ faster than the FP16-input, FP32-accumulate tensor core path (mma.f32.f16.f16.f32).
  • SageAttention2++ is $1.7\times$ faster than SageAttention2 on the PV stage and improves quantized attention kernel throughput by $1.3$–$1.5\times$.
  • Against FlashAttention2 (FP16), SageAttention2++ delivers up to $3.9\times$ kernel speedup with the INT4+FP8 variant ($\sim 3.0\times$ with INT8+FP8).

End-to-End Results

SageAttention2++ has been integrated into state-of-the-art generative models, demonstrating the following speedups over FlashAttention2:

Model Domain | Example Models | Speedup | Evaluation Metrics
Language | Llama 3.1 8B | 3.5–3.9× | Perplexity, Accuracy
Video | CogVideoX 2B, HunyuanVideo, Wan | 3.5–3.9× | CLIPSIM, CLIP-T, VQA-a, VQA-t, Flow-score
Image | Flux, Stable-Diffusion 3.5 | 3.5–3.9× | FID, sFID, CLIP, ImageReward

Latency curves for SageAttention2++ remain below those for FlashAttention2 across all context lengths.

Accuracy Retention

  • Cosine similarity and relative $L_1$ error between SageAttention2++ and FP32 attention output: $> 99.97\%$ and $L_1 \approx 0.0186$, respectively (identical to SageAttention2); a metric sketch follows this list.
  • End-to-end metric differences are negligible ($\Delta\text{Ppl} < 0.01$, $\Delta\text{FID} < 1.0$), validating that range narrowing ($P_r = 112$, $V_r = 4.5$) preserves model fidelity.
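
Python Sketch: Kernel Accuracy Metrics

A minimal sketch of how such a kernel-level comparison against an FP32 baseline could be computed; the relative $L_1$ definition used here (sum of absolute differences over sum of absolute reference values) is a common convention and assumed, not quoted from the paper.

import numpy as np

def cosine_similarity(o: np.ndarray, o_ref: np.ndarray) -> float:
    # Flatten both outputs and compare them as vectors in float64.
    a, b = o.ravel().astype(np.float64), o_ref.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def relative_l1(o: np.ndarray, o_ref: np.ndarray) -> float:
    # One common definition: sum |O - O_ref| / sum |O_ref|.
    return float(np.abs(o - o_ref).sum() / np.abs(o_ref).sum())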

5. Numerical Stability, Trade-offs, and Resource Efficiency

  • Numerical Stability: Limiting $|p| \leq 112$, $|v| \leq 4.5$ guarantees accumulator safety and precludes FP16 overflow, albeit with a reduction in dynamic range. Empirically, these constraints yield negligible error across typical attention distributions; in rare pathological cases exhibiting extremely peaky $P$ or $V$, quantization clipping may occur.
  • Dynamic Range vs. Speed: Expanding the product $P_r \cdot V_r$ would require more frequent FP32 conversion, negating throughput gains. The selected $(112, 4.5)$ setting balances stability and computational efficiency.
  • Memory and Energy: FP8 storage halves tensor memory relative to FP16, and FP16 accumulation halves energy per operation versus FP32. Coupled with $4\times$ faster tensor core throughput, SageAttention2++ reduces both memory footprint and inference energy.

6. Conclusions and Prospective Directions

SageAttention2++ accelerates the PV MatMul by up to $1.7\times$ over SageAttention2 and the full attention kernel by up to $3.9\times$ over FlashAttention2, with no meaningful loss in accuracy. Key methodological components include range-narrowed FP8 quantization, delayed FP32 buffering, and kernel fusion for reduced conversion overhead. Notable avenues for future exploration include:

  • Automatic per-layer selection of quantization range parameters to accommodate outliers,
  • Investigation of lower-bit or mixed-precision (e.g., FP10) accumulators,
  • Combination with sparse or linear attention variants to further mitigate $\mathcal{O}(L^2)$ scaling,
  • Extending SageAttention2++ principles to mixed-precision training via back-propagation (Zhang et al., 27 May 2025).