Architecture-Aware Kernel Fusion

Updated 21 April 2026

Architecture-aware kernel fusion combines multiple GPU kernels into a single kernel to reduce memory traffic and launch overhead.
It uses detailed hardware parameters such as register counts, shared memory capacity, and occupancy to choose fusion strategies and guide code generation.
Compiler techniques, including planning guided by integer linear programming (ILP) and JIT compilation, are used to obtain speedups while balancing resource use on modern GPUs.

Architecture-aware kernel fusion refers to the systematic combination of multiple GPU kernels, whether sequentially dependent or independent, into a single, larger fused kernel while tuning the mapping of computations to the target hardware’s resources (registers, shared memory, threadblocks, warps, data movement hierarchies). The primary goals are to reduce off-chip memory traffic, minimize kernel-launch overheads, maximize parallelism, and balance resource pressure for sustained high throughput. What distinguishes architecture-aware strategies is that fusion decisions and code generation are driven by observed or predicted characteristics of the specific GPU microarchitecture (e.g., registers per block, shared memory capacity, SM count, scheduler behavior, on-chip interconnect capabilities) rather than by static or hardware-agnostic heuristics. This enables substantial speedups over both naive kernel scheduling and more basic fusion approaches, and it is increasingly important as hardware complexity and the compute/memory imbalance grow.

1. Fundamentals of Kernel Fusion: Terminology and Variants

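The term "kernel fusion" traditionally encompasses two main orthogonal approaches, vertical and horizontal fusion, which modern frameworks increasingly combine into generalized fusion: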

Vertical Fusion (VF): Sequentially dependent kernels (producer–consumer pairs, e.g., elementwise + reduction) are merged so that a single thread computes all operations in series for its assigned data partition, bypassing global round-trips for intermediates. This reduces off-chip memory traffic and kernel launch overhead but is constrained by per-thread resource limits and must respect synchronization barriers (Li et al., 2020, Zhang et al., 12 Feb 2026).
Horizontal Fusion (HF): Multiple independent or weakly-coupled kernels are mapped to disjoint thread-index ranges within a block or across blocks in a single kernel launch. This exposes richer thread-level parallelism and allows the warp scheduler to interleave instructions from kernels with differing resource demands (e.g., memory-bound and compute-bound), thus masking instruction latency more effectively. HF can trade some block-level parallelism for increased latency tolerance and occupancy (Li et al., 2020, Amoros et al., 9 Aug 2025).
Generalized Fusion: Modern frameworks blend these approaches, supporting arbitrarily complex fusion patterns, including nested or batched kernels, fusing deep operator graphs (e.g., GEMM + pointwise + reduction), and leveraging novel hardware resources such as distributed shared memory (DSM) or hierarchical caches (Huang et al., 15 Dec 2025, Zhang et al., 12 Feb 2026).

These approaches differ not only in their idiomatic code structures but in their architectural implications: vertical fusion is constrained by resource aggregation per thread, while horizontal fusion is bounded by total block/thread-grid occupancy.
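The traffic saving that vertical fusion targets can be illustrated with a toy model. The sketch below is not GPU code: it assumes a hypothetical `GlobalMemory` counter standing in for off-chip memory and a producer–consumer pair (elementwise square followed by a sum reduction), and it only counts element accesses with and without the intermediate buffer.

```python
"""Toy model of vertical fusion: count global-memory element accesses."""


class GlobalMemory:
    """Hypothetical stand-in for off-chip memory that counts accesses."""

    def __init__(self):
        self.reads = 0
        self.writes = 0

    def load(self, buf, i):
        self.reads += 1
        return buf[i]

    def store(self, buf, i, value):
        self.writes += 1
        buf[i] = value


def unfused(x, mem):
    """Two 'kernels': elementwise square, then a sum reduction."""
    tmp = [0.0] * len(x)
    for i in range(len(x)):          # kernel 1 (producer) writes tmp
        mem.store(tmp, i, mem.load(x, i) ** 2)
    acc = 0.0
    for i in range(len(x)):          # kernel 2 (consumer) reads tmp back
        acc += mem.load(tmp, i)
    out = [0.0]
    mem.store(out, 0, acc)
    return out[0]


def fused(x, mem):
    """One 'kernel': the intermediate stays in a local, register-like value."""
    acc = 0.0
    for i in range(len(x)):
        acc += mem.load(x, i) ** 2   # no round-trip through global memory
    out = [0.0]
    mem.store(out, 0, acc)
    return out[0]


x = [1.0, 2.0, 3.0, 4.0]
m_unfused, m_fused = GlobalMemory(), GlobalMemory()
r_unfused = unfused(x, m_unfused)
r_fused = fused(x, m_fused)
print("unfused:", r_unfused, m_unfused.reads, m_unfused.writes)
print("fused:  ", r_fused, m_fused.reads, m_fused.writes)
```

For an input of n elements the unfused version performs 2n reads and n + 1 writes, while the fused version performs n reads and one write. The model ignores caching, coalescing, and launch overhead, so it shows only the direction of the effect, not its size on real hardware.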

2. Architecture Awareness: Modeling and Resource Constraints

Architecture-aware fusion is predicated on explicit modeling of the GPU’s crucial architectural parameters and bottlenecks. The fusion framework typically reasons about:

Registers per thread/block ( $N_{\text{Regs}}$ ): Determines how many "in-flight" instructions, accumulator arrays, or persistent buffers can be maintained without spilling to slower memory, directly impacting kernel occupancy and performance. Automatic register capping or allocation strategies are used to search for optimal values within hardware limits (Li et al., 2020, Huang et al., 15 Dec 2025).
Shared memory per block ( $SMShMem$ ): Used for storing intermediate computation tiles, synchronization, or staging buffer for kernel segments. Fusion plans are pruned or tuned if the aggregate shared memory use of a candidate fused kernel would exceed device limits (Huang et al., 15 Dec 2025, Zheng et al., 2020).
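Occupancy ( $O$ ): Defined as the number of active warps or blocks per SM relative to the hardware maximum, computed as $O = \min(\lfloor R_{\max}/R_{\mathrm{used}} \rfloor,\lfloor S_{\max}/S_{\mathrm{used}} \rfloor,\lfloor T_{\max}/T_{\mathrm{blk}} \rfloor)/\text{max\_warps\_per\_SM}$ (Zheng et al., 2020), where $R$, $S$, and $T$ denote registers, shared memory, and threads, the subscript $\max$ marks the per-SM limit, and $R_{\mathrm{used}}$, $S_{\mathrm{used}}$, and $T_{\mathrm{blk}}$ are the per-block requirements. The $\min$ term counts resident blocks per SM, so it must be scaled by the number of warps per block when the ratio is taken against $\text{max\_warps\_per\_SM}$. Fusion candidates are required to maintain high occupancy unless resource overcommit is justified by increased instruction-level parallelism.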
Instruction latency and issue-slot utilization: Architecture-aware horizontal fusion aims to compose kernels with complementary resource profiles, such that warp/subgroup schedulers can interleave waiting and active instructions to maximize utilization—measured empirically as issue slot utilization (e.g., via nvprof) (Li et al., 2020).
On-chip interconnects and DSM: On newer GPUs (e.g., NVIDIA Hopper), large-scale fusion leverages distributed shared memory for cluster-wide collectives (e.g., all-reduce, shuffle, reduce-scatter) to accommodate larger intermediate tiles and reduce HBM traffic, demanding DSM-aware cost models and communication primitives (Huang et al., 15 Dec 2025).
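The resource limits listed above interact through occupancy, and a small calculation makes the trade-off concrete. The following is a minimal sketch, assuming a hypothetical `SMLimits` record with made-up per-SM limits; it ignores register and shared-memory allocation granularity, which real devices apply, and it converts resident blocks to warps explicitly before dividing by the warp limit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Hypothetical per-SM limits; values are illustrative, not a real device."""
    regs: int          # 32-bit registers per SM
    shmem_bytes: int   # shared memory per SM, in bytes
    max_threads: int   # resident threads per SM
    max_blocks: int    # resident blocks per SM
    warp_size: int = 32


def occupancy(limits, regs_per_thread, shmem_per_block, threads_per_block):
    """Return (resident blocks per SM, occupancy as a fraction of max warps)."""
    by_regs = limits.regs // (regs_per_thread * threads_per_block)
    by_shmem = (limits.shmem_bytes // shmem_per_block
                if shmem_per_block > 0 else limits.max_blocks)
    by_threads = limits.max_threads // threads_per_block
    blocks = min(by_regs, by_shmem, by_threads, limits.max_blocks)
    warps_per_block = -(-threads_per_block // limits.warp_size)  # ceiling
    max_warps = limits.max_threads // limits.warp_size
    return blocks, blocks * warps_per_block / max_warps


sm = SMLimits(regs=65536, shmem_bytes=98304, max_threads=2048, max_blocks=32)

# A kernel before fusion, and a fused candidate that needs more registers.
base = occupancy(sm, regs_per_thread=64, shmem_per_block=16384,
                 threads_per_block=256)
fused_candidate = occupancy(sm, regs_per_thread=96, shmem_per_block=16384,
                            threads_per_block=256)
print(base, fused_candidate)
```

With these assumed numbers, raising register use from 64 to 96 per thread halves the resident blocks, which is the kind of regression a fusion planner has to weigh against the memory traffic it saves.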

Mathematically, optimal fusion is cast as a search problem: maximizing a gain function (latency or throughput improvement) subject to hardware constraints, often formalized as ILP or combinatorial optimization (Long et al., 2019, Zheng et al., 2020).

3. Methodologies: Compiler Techniques and Fusion Algorithms

Recent frameworks provide a variety of fusion planning and code generation techniques:

ILP-guided fusion planning: FusionStitching and related works decompose the operator DAG into candidate fusion patterns (subgraphs), assign each a gain model (memory savings, launch reductions, execution time), and solve an integer linear program (ILP) to maximize total gain under resource and acyclicity constraints (Long et al., 2019, Adnan et al., 2015). Heuristics prune the fusion search space, focusing on large patterns or empirically useful combinations.
Automatic code generation with hardware modeling: Compilers such as HFuse or Fused Kernel Library (FKL) analyze individual kernel resource footprints, generate fused kernels for different thread/block partitions, and empirically profile them to select the best configuration. Horizontal fusion is realized by partitioning block thread-ids into disjoint sets—guarded branches per kernel body—and partial barrier synchronization via inline PTX (Li et al., 2020, Amoros et al., 9 Aug 2025).
Template metaprogramming and JIT: Libraries like FKL use C++17 template metaprogramming to emit fully inlined, fused global kernels where resource constraints are enforced at compile time, yielding a single specialized kernel per application, batch, or operator sequence. The framework can combine horizontal and vertical fusion via batch-dimension mapping and per-thread register allocation, with static-assert guards against exceeding per-SM limits (Amoros et al., 9 Aug 2025).
DSM-aware search and execution planning: On architectures with DSM, fusion frameworks (e.g., FlashFuser) formalize inter-block communication as explicit primitives, quantify data-movement at each memory hierarchy level, and search for tile/schedule/group parameters that minimize end-to-end latency subject to cache and DSM bandwidth/capacity constraints (Huang et al., 15 Dec 2025).
Resource-use analysis and code generation: For each generated fusion plan, resource allocation is planned at register, shared memory, and (where necessary) global memory levels, applying buffer re-use and live-range analysis to minimize peak usage (Zheng et al., 2020, Trojak et al., 2021).
Cost models and empirical tuning: Model-driven approaches rely on roofline arithmetic intensity estimates, empirical performance data (microbenchmarks), and cache or bandwidth measurements to determine when further fusion is beneficial versus detrimental due to occupancy or cache pressure (Filipovič et al., 2013, Zhang et al., 12 Feb 2026).
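The thread-id partitioning used for horizontal fusion, described in the list above, can be emulated sequentially. This is a sketch under stated assumptions: `dispatch` and `run_fused_block` are hypothetical helper names, the two kernel bodies are plain Python callables, and the loop stands in for threads of one block, so no barriers are modeled.

```python
def dispatch(tid, d1, d2):
    """Map a thread index of the fused block to (kernel name, local index).

    Threads [0, d1) run kernel K1; threads [d1, d1 + d2) run kernel K2.
    """
    if not 0 <= tid < d1 + d2:
        raise ValueError("thread index outside the fused block")
    if tid < d1:
        return ("K1", tid)
    return ("K2", tid - d1)


def run_fused_block(d1, d2, k1_body, k2_body):
    """Sequential emulation of one fused block with guarded branches."""
    for tid in range(d1 + d2):
        kernel, local = dispatch(tid, d1, d2)
        if kernel == "K1":
            k1_body(local)
        else:
            k2_body(local)


# Two independent toy kernels: one scales an array, one increments another.
a = [1, 2, 3, 4]
b = [10, 20]


def k1_body(i):
    a[i] *= 2


def k2_body(i):
    b[i] += 1


run_fused_block(len(a), len(b), k1_body, k2_body)
print(a, b)
```

On a GPU the two branches execute concurrently in different warps, and any barrier inside one kernel body must synchronize only that kernel's thread range, which is why the partial barriers mentioned above are needed. Choosing `d1` and `d2` as multiples of the warp size keeps each warp inside a single branch and avoids divergence.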

4. Performance Outcomes and Empirical Results

In practice, architecture-aware kernel fusion yields substantial quantitative benefits:

Speedups: Depending on workload and hardware, reported speedups range widely:
- Horizontal fusion via HFuse: 2.5% to 60.8%, with "mixed" memory-compute pairs exceeding 30% and cases of up to 65.8% with register capping (Li et al., 2020).
- Fused Kernel Library: achieves geometric-mean speedups from 2× to over 1,000× on compute/memory-bound microbenchmarks; real-world CV pipelines show 2×–200× acceleration (Amoros et al., 9 Aug 2025).
- FusionStitching: up to 5.7× over a TensorFlow baseline and 1.4× over the prior state of the art (Long et al., 2019).
- FlashFuser (DSM-aware): reduces global memory traffic by 58% and achieves up to 4.1× kernel speedup, 1.24× end-to-end, on transformer and convolutional chains (Huang et al., 15 Dec 2025).
- Domain-specific applications (e.g., hyperbolic diffusion): measured 3–4× kernel speedup; 2.3× end-to-end acceleration for 3D CFD (Trojak et al., 2021).
- Deep fusion in transformer MLPs: up to 13.2% kernel-level speedup; 5–10% end-to-end decoding improvement (Zhang et al., 12 Feb 2026, Bikshandi et al., 2023).
Off-chip traffic reduction: Fused kernels entirely eliminate redundant intermediate reads/writes, which is especially valuable for memory-bound workloads or architectures with high FLOPS/B ratios, and further improves L1/L2 cache hit rates (Huang et al., 15 Dec 2025, Zhang et al., 12 Feb 2026).
Occupancy vs. resource pressure: The best thread-space partitions and register caps vary significantly per architecture, demonstrating the necessity for hardware-specific tuning. Fusion plans that overload shared or register resources may reduce block-level parallelism and, if not carefully profiled, cause regressions, especially in compute- or memory-bound homogeneous kernels (Li et al., 2020, Long et al., 2019).
Algorithmic implications: On modern hardware, the maximum achievable fusion speedup is typically bounded by the ratio of off-chip memory operations (pre/post-fusion) and the device’s ability to overlap communication with computation.
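The bound on fusion speedup can be made concrete with a simple roofline estimate. The sketch below assumes perfect overlap of compute and memory transfer, ignores launch overhead, and uses hypothetical device numbers (`PEAK_FLOPS`, `BANDWIDTH`) and workload sizes chosen only for illustration.

```python
def roofline_time(flops, bytes_moved, peak_flops, bandwidth):
    """Execution-time estimate assuming compute and memory fully overlap."""
    return max(flops / peak_flops, bytes_moved / bandwidth)


def fusion_speedup(flops, bytes_pre, bytes_post, peak_flops, bandwidth):
    """Estimated speedup when fusion changes only the off-chip traffic."""
    t_pre = roofline_time(flops, bytes_pre, peak_flops, bandwidth)
    t_post = roofline_time(flops, bytes_post, peak_flops, bandwidth)
    return t_pre / t_post


PEAK_FLOPS = 1e13   # hypothetical peak, FLOP/s
BANDWIDTH = 1e12    # hypothetical off-chip bandwidth, bytes/s

# Memory-bound before and after fusion: speedup equals the traffic ratio.
mem_bound = fusion_speedup(2e9, 4e9, 1e9, PEAK_FLOPS, BANDWIDTH)

# Fusion removes so much traffic that the kernel becomes compute-bound,
# so the speedup is capped below the 40x traffic ratio.
capped = fusion_speedup(2e9, 4e9, 1e8, PEAK_FLOPS, BANDWIDTH)
print(mem_bound, capped)
```

In the first case the estimate is about 4×, matching the ratio of bytes moved before and after fusion. In the second the traffic ratio is 40×, but the estimate stops at about 20× because compute time becomes the limit.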

5. Extensions and Advanced Topics

Modern architecture-aware fusion frameworks are extending beyond basic elementwise/reduction/few-GEMM patterns to encompass:

Large-scale fusion across memory hierarchy boundaries: With DSM (e.g., NVIDIA H100), fusion can now combine clusters of kernels whose intermediate data exceeds the capacity of a single SM’s shared memory, using inter-core collectives for cache-resident operator chains (Huang et al., 15 Dec 2025).
Integration with domain-specific compilers and deep learning systems: FusionStitching and related techniques have been deployed in production on AI clusters, automating the discovery and deployment of optimal fusion plans for complex NN operator graphs (Zheng et al., 2020).
Hierarchical and graph-based planning: Fusion candidates are selected via dynamic programming, beam search, or subgraph enumeration, guided by empirical cost models and hardware capability queries (Long et al., 2019, Zheng et al., 2020, Zhang et al., 12 Feb 2026).
Template-based and declarative fusion: Libraries enabling high-level composition of GPU operators with automatic architecture-aware fusion during compilation deliver high performance and usability across diverse application domains (Amoros et al., 9 Aug 2025, Sewall et al., 2017).
Advanced synchronization: Fine-grained, partial barriers and cooperative groups are used in fused kernels to preserve correct ordering without serializing independent work, further exposing instruction-level parallelism (Li et al., 2020, Huang et al., 15 Dec 2025).

6. Representative Algorithmic and Mathematical Formulations

Key mathematical and pseudocode constructs from the literature include:

Occupancy constraint for partitioned horizontal fusion (Li et al., 2020):

$b_1 = \left\lfloor\frac{SMNRegs}{d_1 \cdot NRegs(K_1)}\right\rfloor ;\quad b_2 = \left\lfloor\frac{SMNRegs}{d_2 \cdot NRegs(K_2)}\right\rfloor ;\quad b_0 = \min\left( b_1, b_2, \left\lfloor\frac{SMShMem}{ShMem(F)}\right\rfloor, \left\lfloor\frac{SMNThreads}{d_0}\right\rfloor \right)$

where $d_0 = d_1 + d_2$ . Register cap: $r_0 = \left\lfloor\frac{SMNRegs}{b_0 \cdot d_0}\right\rfloor$ .
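As a worked instance of the occupancy constraint and register cap above, the following sketch evaluates $b_1$, $b_2$, $b_0$, and $r_0$ directly. The function name `hfuse_partition` and all numeric inputs are assumptions made for illustration; only the formulas come from the text.

```python
def hfuse_partition(sm_nregs, sm_shmem, sm_nthreads,
                    d1, nregs_k1, d2, nregs_k2, shmem_fused):
    """Return (b0, r0): resident fused blocks per SM and the register cap.

    d1 and d2 are the thread counts given to kernels K1 and K2 inside one
    fused block; shmem_fused is the shared memory of the fused kernel F.
    """
    d0 = d1 + d2
    b1 = sm_nregs // (d1 * nregs_k1)
    b2 = sm_nregs // (d2 * nregs_k2)
    bounds = [b1, b2, sm_nthreads // d0]
    if shmem_fused > 0:                 # no shared-memory bound if unused
        bounds.append(sm_shmem // shmem_fused)
    b0 = min(bounds)
    if b0 == 0:
        raise ValueError("fused block does not fit on one SM")
    r0 = sm_nregs // (b0 * d0)
    return b0, r0


# Hypothetical SM and kernel parameters.
b0, r0 = hfuse_partition(sm_nregs=65536, sm_shmem=98304, sm_nthreads=2048,
                         d1=128, nregs_k1=32, d2=128, nregs_k2=64,
                         shmem_fused=8192)
print(b0, r0)
```

With these inputs the thread limit and $K_2$'s register demand both allow 8 resident blocks, and the resulting cap of 32 registers per thread is below the 64 that $K_2$ uses on its own. Whether the extra occupancy outweighs the spilling this would cause is exactly what the profiling step has to decide.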

ILP for fusion planning (Long et al., 2019):

$\begin{aligned} \text{maximize}\quad & \sum_{j=1}^{k} X_j\,f(P_j) \\ \text{subject to}\quad & X_u + X_v \leq 1 \quad \forall\,P_u \cap P_v \neq \varnothing \\ & \text{cycle cuts as needed} \\ & X_j \in \{0,1\} \end{aligned}$
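For small instances the 0/1 program above can be solved by exhaustive enumeration, which makes the objective and the overlap constraint easy to inspect. This is a minimal sketch, not the solver used in the cited work: the function name `plan_fusion`, the operator names, and the gain values are hypothetical, and the cycle-cut constraints are omitted.

```python
from itertools import product


def plan_fusion(patterns, gains):
    """Choose non-overlapping fusion patterns that maximize total gain.

    patterns: list of frozensets of operator names (the candidates P_j).
    gains:    list of floats, gains[j] = f(P_j).
    Returns (indices of chosen patterns, total gain).
    """
    best_choice, best_gain = (), 0.0
    for x in product((0, 1), repeat=len(patterns)):
        chosen = [j for j, xj in enumerate(x) if xj]
        # Enforce X_u + X_v <= 1 whenever P_u and P_v share an operator.
        overlaps = any(patterns[u] & patterns[v]
                       for i, u in enumerate(chosen)
                       for v in chosen[i + 1:])
        if overlaps:
            continue
        total = sum(gains[j] for j in chosen)
        if total > best_gain:
            best_choice, best_gain = tuple(chosen), total
    return best_choice, best_gain


# Hypothetical candidates over the operator chain mul -> add -> relu -> reduce.
patterns = [
    frozenset({"mul", "add"}),
    frozenset({"add", "relu"}),
    frozenset({"relu", "reduce"}),
    frozenset({"mul", "add", "relu"}),
]
gains = [3.0, 4.0, 2.5, 5.0]

plan, total_gain = plan_fusion(patterns, gains)
print(plan, total_gain)
```

Here the two disjoint two-operator patterns together beat the single three-operator pattern, which shows why a greedy "largest pattern first" rule is not always optimal. Enumeration costs $2^k$ evaluations, so realistic graphs need an ILP solver or the pruning heuristics described earlier.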

DSM data movement and cost model (Huang et al., 15 Dec 2025):

$T_{\text{comm}}(X,K) = T_{\text{lat}}(K) + X/B_{\text{dsm}}(K)$

Overall objective (minimax): $\min_{s, t} \max(C_{\text{comp}}, \max_{\ell} C_\ell )$ , subject to the capacity constraint of each memory-hierarchy level $\ell$ .
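The communication model and the minimax objective can be evaluated directly for a handful of candidate plans. The sketch below is illustrative only: it treats $K$ as a cluster-size-like parameter, which is an assumption, and the latency table, bandwidth table, and candidate costs are invented numbers in arbitrary but consistent units.

```python
# Hypothetical per-K tables: latency (microseconds) and DSM bandwidth
# (bytes per microsecond). These are not measurements of any device.
T_LAT = {2: 1.0, 4: 1.5}
B_DSM = {2: 2000.0, 4: 1600.0}


def t_comm(x_bytes, k):
    """T_comm(X, K) = T_lat(K) + X / B_dsm(K)."""
    return T_LAT[k] + x_bytes / B_DSM[k]


def plan_cost(c_comp, level_costs):
    """Minimax cost: the slowest of compute and every memory level."""
    return max(c_comp, max(level_costs.values()))


def best_plan(candidates):
    """Return the name of the candidate plan with the smallest cost."""
    return min(candidates, key=lambda name: plan_cost(*candidates[name]))


# Each candidate: (compute cost, {memory level: transfer cost}).
candidates = {
    "cluster2": (10.0, {"hbm": 18.0, "dsm": t_comm(8000, 2)}),
    "cluster4": (12.0, {"hbm": 9.0, "dsm": t_comm(8000, 4)}),
}

choice = best_plan(candidates)
print(choice, plan_cost(*candidates[choice]))
```

In this made-up example the larger cluster pays more DSM latency and slightly more compute, but it halves HBM traffic, so its bottleneck cost is lower. A real search would also reject candidates whose tiles exceed the capacity of any memory level before comparing costs.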

Memory-traffic reduction for deep fused transformer MLPs (Zhang et al., 12 Feb 2026): the traffic model compares the off-chip traffic of the naïve 4-kernel implementation with that of the fused kernel and reports the resulting reduction.

7. Limitations and Current Frontiers

While architecture-aware fusion frameworks now dominate the optimization of both compute- and memory-bound GPU workloads, challenges persist:

Complexity of search space: The combinatorial nature of fusion possibilities across deep operator graphs and diverse resource constraints makes global optimality computationally difficult. In practice, heuristics, pattern-based generators, and local profile-guided search are essential (Long et al., 2019, Zheng et al., 2020).
Heterogeneity of workloads: Not all kernel pairs benefit from fusion: when both are highly compute-bound, the aggregate resource pressure can collapse occupancy and negate the gains. Conversely, fusing a compute-bound kernel with a memory-bound kernel is often highly profitable (Li et al., 2020).
Portability: While template-based and declarative fusion abstractions raise portability, specialized tuning per architecture (e.g., tile size, threadblock shape) remains crucial for optimality (Amoros et al., 9 Aug 2025, Bikshandi et al., 2023).
Synchronization and dependencies: Boundary conditions, reductions, and large-scale inter-block dependencies may require auxiliary communication/synchronization constructs, which can limit achievable fusion, especially across operator types or when global reductions are involved (Li et al., 2020, Huang et al., 15 Dec 2025, Adnan et al., 2015).

In summary, architecture-aware kernel fusion encompasses a suite of code transformation, planning, and code generation strategies that maximize the efficiency of GPU computing by aligning fusion granularity and scheduling with the hardware’s detailed resource structure. The result is both broad and deep algorithmic acceleration across modern DNNs, scientific HPC, and real-time signal processing domains, with demonstrated speedups spanning from tens of percent to orders of magnitude relative to conventional, non-fused baseline executions (Li et al., 2020, Huang et al., 15 Dec 2025, Long et al., 2019, Amoros et al., 9 Aug 2025, Zhang et al., 12 Feb 2026).