INT-FlashAttention: INT8 Self-Attention Operator

Updated 26 January 2026
  • INT-FlashAttention is a novel self-attention operator that uses fully INT8 quantization to reduce memory traffic and accelerate inference for long-sequence workloads.
  • It combines token-level symmetric quantization with IO-aware GPU tiling, achieving up to 73% faster inference than FP16 FlashAttention and roughly 82% lower quantization error than FP8-based attention baselines.
  • Benchmarks on A100 and RTX 4090 GPUs demonstrate 50% memory savings for Q, K, V storage and substantial improvements in both speed and quantization accuracy for LLM inference.

INT-FlashAttention is an exact self-attention operator for LLM inference, implementing fully INT8 quantization compatible with the FlashAttention forward workflow. Designed for NVIDIA Ampere GPUs where FP8 Tensor Core routines are unavailable, INT-FlashAttention leverages token-level quantization and IO-aware GPU tiling to simultaneously minimize memory traffic, maximize hardware utilization, and accelerate wall-clock runtime for long-sequence attention workloads. It achieves up to 73% faster inference than FlashAttention with FP16 inputs and 82% lower quantization error than FlashAttention with FP8 inputs, while retaining compatibility with other quantization formats.

1. Motivation and GPU Memory Hierarchy

Transformer self-attention conventionally incurs quadratic $O(N^2)$ time and memory complexity in the sequence length $N$. Standard attention layers materialize the score and probability matrices $S = QK^T$ and $P = \mathrm{softmax}(S)$ in GPU high-bandwidth memory (HBM), making them bandwidth-bound and limiting feasible sequence lengths. FlashAttention (Dao et al., 2022) reorganizes the computation to stream blocks of $Q$, $K$, $V$ between HBM and on-chip SRAM, keeping intermediate softmax statistics and the output $O$ in fast memory and reducing the memory footprint to $O(N)$.

Post-training quantization (PTQ) to reduced precision (e.g., FP16, FP8, or INT8) lowers memory usage and energy consumption, provided the underlying hardware exposes fast matrix multiplication (GEMM) primitives for these types. NVIDIA Ampere GPUs such as the A100 provide highly optimized INT8 Tensor Core GEMMs but no FP8 Tensor Cores. INT-FlashAttention exploits this by quantizing all $Q$, $K$, $V$ activations to INT8, storing them in HBM, and performing all matmuls and memory traffic in INT8, thus doubling the tile size per block and halving memory IO relative to FP16 FlashAttention.
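
For a rough sense of the IO saving, the sketch below (an illustrative calculation, not from the paper; the helper qkv_bytes and all dimensions are hypothetical) computes the HBM traffic needed to stream $Q$, $K$, $V$ once in FP16 versus INT8. Halving the element size halves this traffic and lets each SRAM tile hold twice as many elements.

# Back-of-the-envelope HBM traffic for streaming Q, K, V once through attention.
# The dimensions below are hypothetical, chosen only to illustrate the 2x reduction.
def qkv_bytes(seq_len: int, head_dim: int, num_heads: int, bytes_per_elem: int) -> int:
    """Total bytes occupied by Q, K and V for one attention layer."""
    return 3 * seq_len * head_dim * num_heads * bytes_per_elem

N, d, h = 16_384, 128, 32                        # example long-sequence configuration
fp16 = qkv_bytes(N, d, h, bytes_per_elem=2)      # FP16: 2 bytes per element
int8 = qkv_bytes(N, d, h, bytes_per_elem=1)      # INT8: 1 byte per element
print(f"FP16 Q/K/V: {fp16 / 2**20:.0f} MiB, INT8 Q/K/V: {int8 / 2**20:.0f} MiB")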

2. Quantization Scheme

INT-FlashAttention uses symmetric linear quantization with the zero-point fixed at zero. Quantization is performed at token (row) granularity for $Q$ and $K$, and globally for $V$ (a minimal code sketch follows the list):

  • Define the INT8 representable range $R = 127$.
  • For each row $i$ of $Q$ (with entries $Q_{i,j} \in \mathbb{R}$), compute the scale $s_{Q,i} = \max_j |Q_{i,j}|/R$. The quantized value is $Q^{[8]}_{i,j} = \mathrm{clip}(\mathrm{round}(Q_{i,j}/s_{Q,i}), -R, R)$.
  • Analogously for $K$: $s_{K,j} = \max_\ell |K_{j,\ell}|/R$ and $K^{[8]}_{j,\ell} = \mathrm{clip}(\mathrm{round}(K_{j,\ell}/s_{K,j}), -R, R)$.
  • For $V$, use a single scale: $s_V = \max_{i,j} |V_{i,j}|/R$ and $V^{[8]}_{i,j} = \mathrm{clip}(\mathrm{round}(V_{i,j}/s_V), -R, R)$.
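
A minimal numpy sketch of this scheme; the function names and shapes are illustrative, and the small guard against all-zero rows is an added safety measure rather than part of the formulas above:

import numpy as np

R = 127  # INT8 representable range

def quantize_per_row(X):
    """Symmetric per-token (row-wise) quantization with scale s_i = max_j |X_ij| / R."""
    scale = np.maximum(np.abs(X).max(axis=-1, keepdims=True), 1e-12) / R  # guard zero rows
    q = np.clip(np.round(X / scale), -R, R).astype(np.int8)
    return q, scale

def quantize_per_tensor(X):
    """Symmetric whole-tensor quantization (used for V)."""
    scale = max(np.abs(X).max() / R, 1e-12)
    q = np.clip(np.round(X / scale), -R, R).astype(np.int8)
    return q, scale

# Example: quantize and dequantize a random activation block (shapes illustrative).
Q = np.random.default_rng(0).standard_normal((8, 64)).astype(np.float32)
Q8, s_Q = quantize_per_row(Q)
print("max per-element error:", np.abs(Q8.astype(np.float32) * s_Q - Q).max())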

At inference, the INT8 matmul is scaled to approximate the floating-point result:

$$(Q^{[8]} K^{[8]\,T})_{u,v} \approx \frac{1}{s_{Q,u}\, s_{K,v}} (Q K^T)_{u,v}.$$

The softmax output $P_{u,v}$ is quantized with the fixed scale $s_P = 1/R$, since $P \in (0,1]$: $P^{[8]}_{u,v} = \mathrm{round}(R \exp(S_{u,v} - m_u))$, where $m_u$ is the running row maximum maintained by the online softmax.
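
The following numerical check (a sketch in which numpy stands in for the INT8 Tensor Core GEMM; shapes are illustrative) verifies the dequantization identity and the fixed-scale quantization of the softmax exponentials:

import numpy as np

R = 127
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 64)).astype(np.float32)   # illustrative shapes
K = rng.standard_normal((6, 64)).astype(np.float32)

# Per-row scales and INT8 values, following the definitions above.
s_Q = np.abs(Q).max(axis=1, keepdims=True) / R
s_K = np.abs(K).max(axis=1, keepdims=True) / R
Q8 = np.clip(np.round(Q / s_Q), -R, R).astype(np.int8)
K8 = np.clip(np.round(K / s_K), -R, R).astype(np.int8)

# INT8 x INT8 -> INT32 product (numpy stands in for the Tensor Core GEMM),
# then rescale by the outer product of the row scales to recover Q K^T.
S_int32 = Q8.astype(np.int32) @ K8.astype(np.int32).T
S = S_int32.astype(np.float32) * (s_Q @ s_K.T)
print("max |S - QK^T|:", np.abs(S - Q @ K.T).max())    # small dequantization error

# Softmax exponentials quantized with the fixed scale s_P = 1/R.
m = S.max(axis=1, keepdims=True)                       # row maxima
P8 = np.round(R * np.exp(S - m))                       # P^{[8]} values in [0, R]
P = P8 / R                                             # unnormalized probabilities; the
                                                       # normalizer l is applied in the forward pass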

3. Forward Pass Algorithm

The INT-FlashAttention forward pass augments standard FlashAttention (Dao et al., 2022) as follows:

for i in 0 ... ceil(N/B_r) - 1:
    Qi8 = Q^8[iB_r : (i+1)B_r]; siQ = s_Q[iB_r : (i+1)B_r]
    m = -∞ × ones(B_r); l = 0 × ones(B_r)           # float32 softmax statistics
    Oacc = zeros(B_r, d)                            # float32 accumulator in ℝ^{B_r×d}
    for j in 0 ... ceil(N/B_c) - 1:
        Kj8 = K^8[jB_c : (j+1)B_c]; Vj8 = V^8[jB_c : (j+1)B_c]
        sjK = s_K[jB_c : (j+1)B_c]
        S_int32 = INT8GEMM(Qi8, Kj8^T)              # INT8×INT8 → INT32
        S = (siQ[:, None] * sjK[None, :]) * S_int32 # dequantize to float32 with the row scales
        new_m = max(m, rowmax(S))
        P_block^8 = round(R * exp(S - new_m[:, None]))
        l = l * exp(m - new_m) + rowsum(P_block^8)  # l carries the factor R
        Oacc = diag(exp(m - new_m)) * Oacc + s_V * INT8GEMM(P_block^8, Vj8)  # also R-scaled
        m = new_m
    O[iB_r : (i+1)B_r] = diag(1/l) * Oacc           # the R factors cancel here

Key distinctions from FP16 FlashAttention include: storage of $Q$, $K$, $V$ as INT8 in HBM, matmuls via INT8×INT8→INT32 Tensor Core kernels, and on-the-fly INT8 quantization of the softmax exponentials.
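
For concreteness, here is a hedged numpy reference of the loop above. It reproduces the arithmetic only: plain INT32 numpy matmuls stand in for the INT8 Tensor Core GEMMs, block sizes and shapes are illustrative, and the $1/\sqrt{d}$ scaling is omitted to mirror the pseudocode rather than a production kernel.

import numpy as np

R = 127  # INT8 representable range

def quantize_rows(X):
    """Per-row symmetric INT8 quantization; returns (int8 values, per-row float scales)."""
    s = np.maximum(np.abs(X).max(axis=1, keepdims=True), 1e-12) / R  # guard all-zero rows
    return np.clip(np.round(X / s), -R, R).astype(np.int8), s

def int_flash_attention_forward(Q, K, V, B_r=64, B_c=64):
    """Numpy reference of the tiled INT8 forward pass sketched above (not a GPU kernel)."""
    N, d = Q.shape
    Q8, s_Q = quantize_rows(Q)
    K8, s_K = quantize_rows(K)
    s_V = max(np.abs(V).max() / R, 1e-12)                     # single tensor-wide scale for V
    V8 = np.clip(np.round(V / s_V), -R, R).astype(np.int8)

    O = np.empty((N, d), dtype=np.float32)
    for i in range(0, N, B_r):
        Qi8, siQ = Q8[i:i + B_r], s_Q[i:i + B_r]
        m = np.full(Qi8.shape[0], -np.inf, dtype=np.float32)  # running row maxima
        l = np.zeros(Qi8.shape[0], dtype=np.float32)          # running (R-scaled) normalizers
        Oacc = np.zeros((Qi8.shape[0], d), dtype=np.float32)
        for j in range(0, N, B_c):
            Kj8, sjK, Vj8 = K8[j:j + B_c], s_K[j:j + B_c], V8[j:j + B_c]
            S_int32 = Qi8.astype(np.int32) @ Kj8.astype(np.int32).T  # "INT8 GEMM"
            S = S_int32.astype(np.float32) * (siQ @ sjK.T)           # dequantized scores
            new_m = np.maximum(m, S.max(axis=1))
            P8 = np.round(R * np.exp(S - new_m[:, None]))            # INT8 exponentials in [0, R]
            l = l * np.exp(m - new_m) + P8.sum(axis=1)
            Oacc = (Oacc * np.exp(m - new_m)[:, None]
                    + s_V * (P8.astype(np.int32) @ Vj8.astype(np.int32)).astype(np.float32))
            m = new_m
        O[i:i + B_r] = Oacc / l[:, None]                       # R factors in Oacc and l cancel
    return O

# Compare against a plain float32 attention (no 1/sqrt(d) scaling, matching the pseudocode).
rng = np.random.default_rng(0)
N, d = 256, 64
Q, K, V = (rng.standard_normal((N, d)).astype(np.float32) for _ in range(3))
O_int8 = int_flash_attention_forward(Q, K, V)
S_ref = Q @ K.T
P_ref = np.exp(S_ref - S_ref.max(axis=1, keepdims=True))
O_ref = (P_ref / P_ref.sum(axis=1, keepdims=True)) @ V
print("mean relative error:", np.abs(O_int8 - O_ref).mean() / np.abs(O_ref).mean())

The printed mean relative error gives a rough feel for the accuracy cost of fully INT8 attention on random inputs; it is not a substitute for the benchmarks below.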

4. Token-Level Range Estimation and Error Control

Whole-tensor quantization leads to substantial clipping error due to dynamic range variability across individual tokens (rows). INT-FlashAttention records per-row maxima during PTQ calibration, $s_{Q,i} = \max_j |Q_{i,j}|/R$ (and similarly for $K$), empirically limiting worst-case clipping to under 0.3% of token values. The online softmax statistics $(m_i, l_i)$ are maintained in FP32, while the exponential values are quantized to INT8 thanks to their bounded dynamic range.
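
The effect of quantization granularity can be illustrated with a small synthetic sketch in which a few "outlier" tokens have a much wider dynamic range than the rest (a hypothetical pattern, not the paper's calibration data):

import numpy as np

R = 127
rng = np.random.default_rng(1)
# Synthetic activations: most rows ~N(0,1), every 128th row has 20x the range.
Q = rng.standard_normal((1024, 128)).astype(np.float32)
Q[::128] *= 20.0

def mre_after_quantization(X, scale):
    """Mean relative error of a symmetric quantize/dequantize round trip."""
    Xq = np.clip(np.round(X / scale), -R, R) * scale
    return np.abs(Xq - X).mean() / np.abs(X).mean()

per_tensor_scale = np.abs(Q).max() / R                     # one scale for the whole tensor
per_row_scale = np.abs(Q).max(axis=1, keepdims=True) / R   # one scale per token (row)

print("whole-tensor MRE:", mre_after_quantization(Q, per_tensor_scale))
print("per-token MRE:   ", mre_after_quantization(Q, per_row_scale))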

End-to-end mean relative error (MRE) is:

  • under 1% for half-INT8 (Q, K in INT8; V in FP16)
  • about 4% for fully INT8
  • 7–9% for block-level FP8 quantization

A plausible implication is that token-level INT8 quantization achieves superior accuracy compared to block-level FP8 formats for standard LLM attention inputs.

5. Performance Benchmarks

On A100 and RTX 4090 GPUs, INT-FlashAttention exhibits substantial empirical speedup and reduced quantization error.

Inference Speed

Sequence Length (N)   FlashAttention FP16 (baseline runtime)   INT-FlashAttention INT8 (relative runtime)   Speedup
1K                    1.00×                                    0.69×                                        31%
2K                    1.00×                                    0.48×                                        52%
4K                    1.00×                                    0.34×                                        66%
8K                    1.00×                                    0.28×                                        72%
16K                   1.00×                                    0.27×                                        73%

Quantization Error (Mean Relative Error, MRE)

Distribution      FP8 Block-Level   INT-FlashAttn Half-INT8   INT-FlashAttn Full-INT8
N(0,1)            7.5%              0.8%                      4.2%
U(-0.5,0.5)       9.0%              0.3%                      1.7%

Storing Q, K, V in INT8 (1 byte per element vs. 2 bytes for FP16) yields a 50% memory saving for these tensors. Intermediate softmax blocks are also held in INT8.

6. Compatibility, Limitations, and Trade-offs

The token-level quantization workflow generalizes to INT4, INT2, or any $b$-bit format by setting $R = 2^{b-1} - 1$ and re-running calibration; however, INT4 matmul kernels currently lag in maturity. $V$ uses a global scale; extending it to per-block or per-row quantization may further reduce quantization error. The error-speed trade-off between half-INT8 ($Q$, $K$ in INT8; $V$ in FP16) and fully INT8 offers options for accuracy-sensitive deployments.
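
A minimal sketch of this $b$-bit generalization, with the representable range parameterized as $R = 2^{b-1} - 1$ (illustrative only; kernel availability for the resulting formats is a separate concern):

import numpy as np

def symmetric_range(bits: int) -> int:
    """Representable magnitude R = 2^(b-1) - 1 of a signed b-bit integer."""
    return 2 ** (bits - 1) - 1

def quantize_rows(X, bits=8):
    """Per-token symmetric quantization for an arbitrary bit width."""
    R = symmetric_range(bits)                          # 127 for INT8, 7 for INT4, 1 for INT2
    s = np.maximum(np.abs(X).max(axis=1, keepdims=True), 1e-12) / R
    return np.clip(np.round(X / s), -R, R), s

X = np.random.default_rng(0).standard_normal((4, 64)).astype(np.float32)
for b in (8, 4, 2):
    Xq, s = quantize_rows(X, bits=b)
    mre = np.abs(Xq * s - X).mean() / np.abs(X).mean()
    print(f"INT{b}: mean relative error {mre:.3f}")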

Softmax exponentials can underflow or overflow for extremely long sequences; INT-FlashAttention mitigates this through float32 accumulator rescaling (the running maximum $m$ and normalizer $l$ are kept in FP32). The operator is best suited to inference on Ampere GPUs; on Hopper architectures with FP8 Tensor Cores, FP8-based FlashAttention may be preferable.

7. Relation to Prior Work

FlashAttention (Dao et al., 2022) introduced IO-aware tiling and online softmax, reducing memory footprint and enabling exact long-context attention. INT-FlashAttention advances this paradigm by integrating fully INT8 quantization, maximizing the performance of integer GEMM hardware present in Ampere GPUs and extending the utility of FlashAttention for production-scale inference. INT-FlashAttention is the first operator to implement fully INT8 input attention computation with a forward workflow compatible with FlashAttention tiling and online softmax.

A plausible implication is that future Transformer inference workloads on memory- and bandwidth-constrained hardware will favor architectures supporting token-level quantization schemes, as demonstrated in INT-FlashAttention (Chen et al., 2024).
