Kernel Fusion and Arithmetic Intensity

Updated 30 March 2026

Kernel fusion is a technique that merges multiple GPU or CPU kernels to retain intermediate results on-chip and reduce redundant off-chip memory accesses.
Arithmetic intensity, defined as the ratio of FLOPs to bytes moved, is increased through fusion strategies, shifting performance from memory-bound to compute-bound regimes.
Fusion methodologies integrate tiling, loop nest analysis, and cost modeling to achieve empirical speedups of 2× to 10× across diverse computational workloads.

Kernel fusion is a code transformation and code generation strategy that merges multiple GPU or CPU kernels into single, larger computational units, thereby reducing the volume of off-chip memory traffic and increasing the arithmetic intensity of the resulting computations. Arithmetic intensity (AI), defined as the ratio of the total number of floating-point operations (FLOPs) to the total number of bytes moved between off-chip (or upper-level) memory and the compute units, is a principal metric in roofline-based performance modeling. Fusion systematically raises AI by keeping intermediate results on-chip rather than in DRAM, unlocking higher throughput, especially on modern architectures where compute throughput outpaces memory bandwidth.

1. Defining Arithmetic Intensity and Its Role in Performance

Arithmetic intensity is expressed as

$I = \frac{\text{Total FLOPs}}{\text{Total Bytes Moved}}$

In a roofline model, peak attainable performance is bottlenecked by either computational throughput (compute-bound regime) or memory bandwidth (memory-bound regime); thus, kernels with low AI are memory-bound, while high-AI kernels can potentially achieve a higher fraction of peak FLOP/s (Zhang et al., 27 Jun 2025, Long et al., 2019, Filipovič et al., 2013, Niu et al., 2021, Yadav et al., 2024). For example, in a General Matrix-Matrix Multiply (GEMM) tile,

$I = \frac{W}{B}$

where $W$ is the number of FLOPs and $B$ is the number of bytes moved between DRAM/main memory and on-chip storage (Zhang et al., 27 Jun 2025).
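The definition and the roofline bound can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the cited works: the function names and the machine parameters (10 TFLOP/s peak, 1 TB/s bandwidth) are assumptions chosen only to make the arithmetic easy to follow.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """I = W / B, in FLOPs per byte of off-chip traffic."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * bandwidth)


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Intensity at which a kernel stops being memory-bound."""
    return peak_flops / bandwidth


# Hypothetical machine (illustrative numbers only).
PEAK = 10e12   # FLOP/s
BW = 1e12      # bytes/s

# Elementwise add of two float32 vectors of length n:
# n FLOPs, two loads and one store of 4 bytes each per element.
n = 1_000_000
add_ai = arithmetic_intensity(flops=n, bytes_moved=12 * n)

print(f"AI of vector add: {add_ai:.4f} FLOPs/byte")
print(f"ridge point:      {ridge_point(PEAK, BW):.1f} FLOPs/byte")
print(f"attainable:       {attainable_flops(add_ai, PEAK, BW):.3e} FLOP/s")
```

Under these assumed numbers the vector add sits far to the left of the ridge point, so its attainable throughput is set by bandwidth rather than by the compute roof.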

Moving a kernel up the roofline, from the memory-bound toward the compute-bound regime, by increasing its AI is a principal target of fusion strategies. AI gains of 2×–5×, with corresponding speedups, have been shown empirically across diverse workloads (Zhang et al., 27 Jun 2025, Long et al., 2019, Filipovič et al., 2013, Niu et al., 2021).

2. Principles and Formal Models of Kernel Fusion

The essential idea in kernel fusion is to eliminate redundant memory transactions across consecutive kernels. In the unfused scenario, each kernel in a chain of $n$ kernels performs its own global memory loads and stores:

$\text{Total Bytes}_{\text{unfused}} = \sum_{i=1}^{n} (\text{Bytes Read}_i + \text{Bytes Written}_i)$

After fusion, intermediate values are retained on-chip (registers, shared memory, or distributed shared memory, DSM), so only the initial loads and final stores touch off-chip memory:

$\text{Total Bytes}_{\text{fused}} = \text{Bytes Read}_1 + \text{Bytes Written}_n$

Because the FLOP count is unchanged, AI then increases by a factor roughly equal to the number of kernels fused, provided each kernel moves a comparable volume of data (Sewall et al., 2017, Filipovič et al., 2013, Niu et al., 2021, Amoros et al., 9 Aug 2025).
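A short Python sketch makes the byte accounting concrete. The `Kernel` record and the helper names are hypothetical, and the model assumes the idealized chain described above, in which each kernel consumes only its predecessor's output.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Kernel:
    flops: int
    bytes_read: int
    bytes_written: int


def unfused_bytes(chain: Sequence[Kernel]) -> int:
    """Every kernel loads its input from and stores its output to off-chip memory."""
    return sum(k.bytes_read + k.bytes_written for k in chain)


def fused_bytes(chain: Sequence[Kernel]) -> int:
    """Only the first load and the last store go off-chip; intermediates stay on-chip."""
    return chain[0].bytes_read + chain[-1].bytes_written


def ai_gain(chain: Sequence[Kernel]) -> float:
    """Ratio of fused to unfused AI; FLOPs cancel, leaving the byte ratio."""
    return unfused_bytes(chain) / fused_bytes(chain)


# Four elementwise float32 kernels over n elements (1 FLOP per element each).
n = 1 << 20
chain = [Kernel(flops=n, bytes_read=4 * n, bytes_written=4 * n) for _ in range(4)]

print(unfused_bytes(chain), fused_bytes(chain), ai_gain(chain))
```

For this uniform chain the gain equals the chain length, matching the rule of thumb above. Chains whose kernels read extra operands or change the data size will deviate from it.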

Fusion frameworks formalize this as an optimization problem, where either an integer linear program (ILP) (Long et al., 2019), a beam search (Zheng et al., 2020), or a combinatorial search space with pruning is constructed (Huang et al., 15 Dec 2025, Zhang et al., 27 Jun 2025). Objective functions operate on estimated or measured AI increases, saved memory, and occupancy/resource penalties.
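To illustrate the shape of such an optimization problem, the sketch below partitions a linear chain of kernels into contiguous fused groups so that off-chip traffic is minimized under an on-chip capacity limit. It is a deliberately simplified stand-in (dynamic programming over a chain) rather than the ILP, beam search, or pruned search used by the cited frameworks. The cost model, the additive footprint assumption, and all names are hypothetical.

```python
from functools import lru_cache
from typing import List, Sequence, Tuple


def plan_fusion(
    io_bytes: Sequence[Tuple[int, int]],  # (bytes_read, bytes_written) per kernel
    footprint: Sequence[int],             # assumed on-chip bytes needed per kernel
    capacity: int,                        # assumed on-chip capacity per fused group
) -> Tuple[int, List[Tuple[int, int]]]:
    """Return (total off-chip bytes, groups as half-open [start, end) index ranges)."""
    n = len(io_bytes)

    @lru_cache(maxsize=None)
    def best(start: int):
        if start == n:
            return 0, ()
        result = None
        used = 0
        for end in range(start + 1, n + 1):
            used += footprint[end - 1]
            # A single kernel is always allowed; larger groups must fit on-chip.
            if used > capacity and end > start + 1:
                break
            # A fused group pays only for its first read and its last write.
            cost = io_bytes[start][0] + io_bytes[end - 1][1]
            tail_cost, tail_groups = best(end)
            candidate = (cost + tail_cost, ((start, end),) + tail_groups)
            if result is None or candidate[0] < result[0]:
                result = candidate
        return result

    total, groups = best(0)
    return total, list(groups)


io = [(100, 100)] * 4
fp = [30, 30, 30, 30]
tight = plan_fusion(io, fp, capacity=64)     # at most two kernels per group
roomy = plan_fusion(io, fp, capacity=1000)   # everything fits in one group
print(tight, roomy)
```

A production cost model would also include the occupancy and resource penalties mentioned above. This objective counts bytes only.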

3. Fusion and Tiling Methodologies

State-of-the-art kernel fusion frameworks (e.g., MCFuser (Zhang et al., 27 Jun 2025), FlashFuser (Huang et al., 15 Dec 2025), FusionStitching (Long et al., 2019, Zheng et al., 2020), Fused Kernel Library (Amoros et al., 9 Aug 2025), and DNNFusion (Niu et al., 2021)) follow a multi-layered approach:

Tiling Strategy: Decompose iteration spaces into tiles, assign tile shapes, and select loop orderings. Tile sizes and orderings have a direct effect on both the compute ($W$) and memory ($B$) terms of the AI computation (Zhang et al., 27 Jun 2025).
Loop Nest and DAG Analysis: Express the computation as a loop-nest DAG or task-based IR to make dependency, sharing, and data reuse explicit (Sewall et al., 2017, Yadav et al., 2024).
Fusion Plan Generation: Solve for maximal sets of fusible subgraphs/patterns, guided by resource constraints (e.g., shared memory, registers, DSM capacity) and legal fusion rules (no data-dep cycles or cross-barrier fusions) (Long et al., 2019, Huang et al., 15 Dec 2025, Niu et al., 2021, Yadav et al., 2024).
On-Chip Allocation and Scheduling: For each fusion candidate, optimize on-chip memory allocation across registers, shared/local memory, and DSM, using buffer sharing, storage contraction, and memory-manager tactics. This maximizes data reuse and limits costly off-chip transfers (Zhang et al., 27 Jun 2025, Huang et al., 15 Dec 2025).
Code Generation: Emit a single, parameterized kernel for the fused computation, directly mapping data movement patterns into code and managing synchronization as required (Long et al., 2019, Amoros et al., 9 Aug 2025).
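The dependence of AI on tile shape in the tiling step can be sketched for a single GEMM output tile. The model below is a common textbook simplification, not the cost model of any cited framework: it assumes each $T_m \times T_n$ output tile streams a $T_m \times K$ panel of A and a $K \times T_n$ panel of B from off-chip memory once, writes the tile once, and stages one K-slab of each panel in shared memory. All function names and the 48 KiB budget are illustrative.

```python
from typing import Iterable


def tile_intensity(tm: int, tn: int, k: int, dtype_bytes: int) -> float:
    """AI of one output tile: 2*tm*tn*k FLOPs over panel loads plus tile store."""
    flops = 2 * tm * tn * k
    bytes_moved = dtype_bytes * (tm * k + k * tn + tm * tn)
    return flops / bytes_moved


def staging_bytes(tm: int, tn: int, k_slab: int, dtype_bytes: int) -> int:
    """Shared memory needed to hold one K-slab of the A and B panels."""
    return dtype_bytes * (tm * k_slab + k_slab * tn)


def best_square_tile(
    k: int, dtype_bytes: int, smem_bytes: int, k_slab: int, candidates: Iterable[int]
) -> int:
    """Pick the feasible square tile size with the highest modeled AI."""
    feasible = [
        t for t in candidates
        if staging_bytes(t, t, k_slab, dtype_bytes) <= smem_bytes
    ]
    if not feasible:
        raise ValueError("no candidate tile fits in the shared-memory budget")
    return max(feasible, key=lambda t: tile_intensity(t, t, k, dtype_bytes))


sizes = (16, 32, 64, 128, 256, 512)
choice = best_square_tile(k=4096, dtype_bytes=2, smem_bytes=48 * 1024,
                          k_slab=32, candidates=sizes)
for t in sizes:
    print(t, round(tile_intensity(t, t, 4096, 2), 2))
print("chosen tile:", choice)
```

In this model AI grows with tile size, so the on-chip budget rather than the arithmetic decides the tile shape. That is why tiling and on-chip allocation are planned together in the steps above.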

4. Quantitative Effects: AI Gain and Performance

Kernel fusion yields AI improvements and associated speedups as documented in multiple empirical studies:

MCFuser achieves AI inflations from $\varphi \approx 2$ (unfused, small $K$) up to $\varphi \approx 50$ (fully fused, large tile, no redundant access), shifting throughput from $0.1P$ to the compute roof on A100 (Zhang et al., 27 Jun 2025).
FusionStitching boosts AI by a mean of 2.4× per fused kernel (range 1.5–4.2×), with end-to-end speedups up to 5.7× on DL workloads (Long et al., 2019).
DNNFusion reports a 2×–5× increase in AI and up to 3.8× speedups on full models, as in fusing Conv-BN-ReLU or Attention blocks (Niu et al., 2021).
FlashFuser (exploiting DSM) lifts AI by ~2.4×, reduces DRAM traffic by 58%, and achieves per-kernel speedups of 3.3–6.4× over the state of the art (Huang et al., 15 Dec 2025).
TurboFNO captures a 5–10× AI boost by fusing FFT, GEMM, and iFFT operations and coalescing all shared-memory traffic, reducing kernel launches from 5 to 1 (Wu et al., 16 Apr 2025).
Classic BLAS fusion doubles AI in BiCGK and GEMVER sequences and achieves up to 2.6× speedups over CUBLAS (Filipovič et al., 2013).
Stencil code fusions can raise AI from 0.04 to 0.12 FLOPs/byte (3×) and produce 4× speedups on multi-GPU runs (Yadav et al., 2024).

5. Resource Constraints and Trade-Offs

Fusion is constrained by on-chip storage capacity (registers, shared memory, DSM), by register and shared-memory pressure (which lowers occupancy), and by the need for synchronization (e.g., after partial reductions). Over-fusing can reduce SM occupancy, which diminishes latency hiding (Zhang et al., 27 Jun 2025, Zheng et al., 2020, Amoros et al., 9 Aug 2025). For deep fusion chains or data-dependent kernels, code generation limits, resource limits, or launch cost can dominate, so cost models need heuristics or analytic penalties to avoid over-fusion (Huang et al., 15 Dec 2025, Long et al., 2019). Fusing memory-bound operators with compute-bound ones generally realizes the most AI gain (Niu et al., 2021). DSM enables fusion at scales previously limited by shared memory (SMEM), e.g., in multi-GEMM FFN blocks or large convolution chains (Huang et al., 15 Dec 2025).
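The occupancy trade-off can be sketched with a simplified resident-block calculation. The per-SM limits below are illustrative defaults and should be treated as assumptions. Real occupancy calculators also account for allocation granularity and architecture-specific rules, which this sketch ignores. The `SMLimits` type and both functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    # Illustrative per-SM limits (assumed values, not a hardware reference).
    max_threads: int = 2048
    max_blocks: int = 32
    registers: int = 65536
    shared_bytes: int = 164 * 1024


def blocks_per_sm(threads: int, regs_per_thread: int, smem_per_block: int,
                  sm: SMLimits = SMLimits()) -> int:
    """Resident blocks per SM: the tightest of the thread, register, and smem limits."""
    by_threads = sm.max_threads // threads
    by_regs = sm.registers // (regs_per_thread * threads)
    by_smem = sm.shared_bytes // smem_per_block if smem_per_block else sm.max_blocks
    return min(sm.max_blocks, by_threads, by_regs, by_smem)


def occupancy(threads: int, regs_per_thread: int, smem_per_block: int,
              sm: SMLimits = SMLimits()) -> float:
    """Fraction of the SM's thread slots that are occupied."""
    blocks = blocks_per_sm(threads, regs_per_thread, smem_per_block, sm)
    return blocks * threads / sm.max_threads


# A light unfused kernel versus a heavily fused one that keeps more state on-chip.
light = occupancy(threads=256, regs_per_thread=32, smem_per_block=16 * 1024)
heavy = occupancy(threads=256, regs_per_thread=96, smem_per_block=64 * 1024)
print(f"unfused occupancy: {light:.2f}, fused occupancy: {heavy:.2f}")
```

Under these assumed numbers the fused kernel keeps only a quarter of the thread slots busy. A fusion cost model has to weigh that penalty against the saved off-chip traffic.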

6. Variants and Extensions Across Domains

Fusion is implemented in various programming models:

Automatic compile-time fusion via metaprogramming (FKL, C++17) enables arbitrary-depth fusion for point-wise, reduction, and batched kernels (Amoros et al., 9 Aug 2025).
JIT-IR fusion (MLIR, runtime fusion in distributed/task-based systems) allows cross-library kernel composition, as in Diffuse (Yadav et al., 2024).
Domain-specific generator frameworks (e.g., for PDEs or spectral operators) optimize fusion and storage contraction in nested loop and tensor-product settings, raising AI in stencil/PDE codes (Sewall et al., 2017, Trojak et al., 2021).
Video/data pipelines demonstrate 2–3× AI improvements and throughput gains for multi-stage fused image-processing (Adnan et al., 2015).

The convergence of tiling, DAG abstraction, resource-aware planning, and cost modeling is central to all high-performance fusion systems.

7. Empirical Benchmarks and Roofline Analysis

Empirical results across diverse domains systematically corroborate the core AI/speedup relationship:

Batched GEMM chains (MCFuser): up to 5.9× kernel speedup over Ansor, up to 8.1× over PyTorch, and up to 3.0× over FlashAttention (Zhang et al., 27 Jun 2025).
FNO Fourier layers (TurboFNO): up to 1.5× boost over cuFFT+cuBLAS baselines, entirely attributable to elimination of 4–6 global memory passes (Wu et al., 16 Apr 2025).
Video pipelines: 67% reduction in global memory traffic and up to 3× increase in arithmetic intensity, mapping to doubled or tripled throughput (Adnan et al., 2015).
Diffusion/PDE codes: fusion raises intensity and moves kernels from the bandwidth-bound to the compute-limited regime, with strong correlation between predicted AI and measured GFLOP/s (Sewall et al., 2017, Trojak et al., 2021).
Distributed tasks (Diffuse): raising AI by 2–5× produces up to 10× end-to-end speedups on GPU clusters (Yadav et al., 2024).

A consistent finding is that AI uplift, enabled by fusion, correlates strongly with performance in memory-constrained regimes.
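The roofline model gives a simple upper bound on how much of an AI gain can turn into speedup. As a sketch, let $P_{\text{peak}}$ be the peak compute rate and $\beta$ the off-chip bandwidth (symbols introduced here for illustration, not taken from the cited works), so that attainable performance at intensity $I$ is

$P_{\text{att}}(I) = \min(P_{\text{peak}},\ \beta I)$

For a fusion that leaves the FLOP count unchanged, the bound on speedup is then

$S = \frac{\min(P_{\text{peak}},\ \beta I_{\text{fused}})}{\min(P_{\text{peak}},\ \beta I_{\text{unfused}})}$

While both kernels lie below the ridge point $I^{*} = P_{\text{peak}} / \beta$, this reduces to $S = I_{\text{fused}} / I_{\text{unfused}}$, so speedup tracks the AI gain directly. Once the fused kernel crosses the ridge, the bound saturates at $S = I^{*} / I_{\text{unfused}}$. For a hypothetical machine with $I^{*} = 10$ FLOPs/byte and $I_{\text{unfused}} = 2$, raising AI to 8 gives $S = 4$, while raising it to 40 gives only $S = 5$. Measured speedups can deviate from this bound, since the roofline ignores effects such as launch overhead and occupancy.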

References:

"MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators" (Zhang et al., 27 Jun 2025)
"FusionStitching: Boosting Execution Efficiency of Memory Intensive Computations for DL Workloads" (Long et al., 2019)
"Optimizing CUDA Code By Kernel Fusion---Application on BLAS" (Filipovič et al., 2013)
"DNNFusion: Accelerating Deep Neural Networks Execution with Advanced Operator Fusion" (Niu et al., 2021)
"Composing Distributed Computations Through Task and Kernel Fusion" (Yadav et al., 2024)
"The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries" (Amoros et al., 9 Aug 2025)
"FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads" (Zheng et al., 2020)
"High-Performance Code Generation though Fusion and Vectorization" (Sewall et al., 2017)
"Efficient Kernel Fusion Techniques for Massive Video Data Analysis on GPGPUs" (Adnan et al., 2015)
"FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection" (Huang et al., 15 Dec 2025)
"TurboFNO: High-Performance Fourier Neural Operator with Fused FFT-GEMM-iFFT on GPU" (Wu et al., 16 Apr 2025)
"Hyperbolic Diffusion in Flux Reconstruction: Optimisation through Kernel Fusion within Tensor-Product Elements" (Trojak et al., 2021)
"Kernel Fusion in Atomistic Spin Dynamics Simulations on Nvidia GPUs using Tensor Core" (Chen et al., 2023)