Kernel Fusion in GPU Computing
- Kernel Fusion is the process of consolidating multiple GPU kernels into a single composite kernel that minimizes global memory traffic and kernel launch overhead.
- It leverages on-chip memory to cache intermediate data, reducing redundant global memory fetches and delivering substantial speedups for routines such as BLAS and iterative solvers.
- Advanced compiler techniques automate fusion strategies by analyzing dependency graphs and resource constraints, thereby optimizing memory locality and hardware utilization.
Kernel fusion refers to the programmatic consolidation of two or more computational kernels—typically GPU or accelerator kernels—into a single, composite kernel, with the primary objective of minimizing global memory traffic, kernel launch overhead, and intermediate temporary storage. By keeping shared or intermediate data in on-chip memory (registers or shared memory), kernel fusion directly addresses the bandwidth bottleneck prevalent in memory-bound computations such as BLAS routines and iterative solvers. Automated fusion compilers and advanced fusion strategies extend these benefits, yielding substantial speedups, improved memory locality, and better GPU resource utilization across a broad class of numerical and scientific computations.
1. Rationale and Principles of Kernel Fusion
Kernel fusion is motivated by the observation that the arithmetic throughput of modern GPUs is often significantly underutilized due to bandwidth limitations imposed by off-chip global memory transfers. Many GPU workloads (e.g., in linear algebra, signal processing, and scientific simulation) exhibit kernel sequences with significant data dependencies but limited computational intensity per kernel. In traditional execution, each kernel stages intermediate results through global memory, incurring the cost of redundant memory fetch/store cycles and synchronization via kernel launches.
Fusing such dependent kernels enables direct reuse of data within the on-chip memory hierarchy. For instance, consider an operation decomposed as $f(g(x), h(y))$: in a naive implementation, $g(x)$ and $h(y)$ are materialized in global memory before being read by $f$, while in the fused version their outputs are cached on-chip until needed. This leads to dramatic reductions in data traffic, especially when the intermediates are large and the computation per element is modest.
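To make this concrete, here is a minimal CUDA sketch of the difference, using a hypothetical two-stage elementwise pipeline (the kernel names and operations are illustrative, not drawn from the paper):

```cuda
// Unfused: the intermediate t is staged through global memory
// (read x, write t, second launch, read t, write y).
__global__ void stage1_scale(const float* x, float* t, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) t[i] = a * x[i];
}

__global__ void stage2_offset(const float* t, float* y, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = t[i] + b;
}

// Fused: the intermediate lives in a register, so global traffic
// shrinks to (read x, write y) and one kernel launch is saved.
__global__ void fused_scale_offset(const float* x, float* y,
                                   float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a * x[i];   // intermediate kept on-chip
        y[i] = t + b;
    }
}
```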
Conceptually, kernel fusion can often be formalized as the composition of "data-parallel" primitives such as map and reduce. Using higher-order notation, a BLAS-1 dot product is

$$\text{reduce}(+,\ \text{map}(\cdot,\ x,\ y)),$$

which is readily implemented as a single fused kernel holding the partial products in registers or shared memory.
2. Application to BLAS and GPU Linear Algebra
The impact of kernel fusion is particularly clear in the context of BLAS routines on GPUs, as detailed in "Optimizing CUDA Code By Kernel Fusion—Application on BLAS" (Filipovič et al., 2013). BLAS-1 (vector-vector) and BLAS-2 (matrix-vector) routines typically decompose into a sequence of memory-bound steps—vector loads, elementwise computation, reductions—each implemented as a distinct kernel. For the dot product, the unfused path writes out the result of the elementwise product before summing, while the fused version accumulates partial sums on-chip.
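A minimal sketch of such a fused dot-product kernel follows. It uses one common pattern—a block-level tree reduction in shared memory plus a final atomic—and is not necessarily the code the compiler emits; it assumes a power-of-two block size, a launch with `blockDim.x * sizeof(float)` bytes of dynamic shared memory, and a zero-initialized `result`:

```cuda
// Fused dot product: elementwise multiply and reduction in one kernel,
// so the vector of partial products is never written to global memory.
__global__ void fused_dot(const float* x, const float* y, float* result, int n) {
    extern __shared__ float partial[];                 // one slot per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? x[i] * y[i] : 0.0f;
    __syncthreads();
    // Tree reduction over this block's partial products in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(result, partial[0]);  // combine block sums
}
```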
In BLAS-2 routines such as SGEMV ($y \leftarrow \alpha A x + \beta y$), a full BLAS library call may need to fetch a row or column of $A$ from memory multiple times across operations, unless fusion is applied. In advanced patterns like BiCGK, which require both $q = A p$ and $s = A^{T} r$ in succession, a fused kernel can share the matrix tile across both operations:
```
// Pseudocode of fused BiCGK
allocate shared: A_tile, p_local, r_local
for each tile {
    load A_tile, p_local, r_local from global memory
    // Both q = A p and s = A^T r computed here, with A_tile reused
    compute q and s with parallel reductions
    store q and s
}
```
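A concrete, untiled CUDA sketch of the same idea is shown below (one thread per row; `s` is accumulated with atomics and assumed zero-initialized). It illustrates the data reuse, not the compiler's tiled output:

```cuda
// Fused BiCGK: each element of A is read once and contributes to both
// q = A p and s = A^T r, instead of being fetched by two separate kernels.
__global__ void fused_bicgk(const float* A, const float* p, const float* r,
                            float* q, float* s, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float qi = 0.0f;
    float ri = r[row];                   // reused for the A^T r contribution
    for (int j = 0; j < cols; ++j) {
        float a = A[row * cols + j];     // single global read of A[row][j]
        qi += a * p[j];                  // contribution to q = A p
        atomicAdd(&s[j], a * ri);        // contribution to s = A^T r
    }
    q[row] = qi;
}
```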
3. Source-to-Source Compiler Techniques
Automating the fusion of map/reduce operations and their combinations is non-trivial, as explored through a source-to-source compiler in (Filipovič et al., 2013). The compiler's principal components are:
- Elementary function library: Routines are encoded in a special CUDA form, with metadata expressing their thread/data mapping and barrier requirements.
- Dependency graph construction: The compiler builds a graph representing data flow and identifies fusible subgraphs, matching compatible thread mappings and local barrier needs.
- Fusion policy exploration and performance modeling: The compiler enumerates possible fusion strategies, choosing parameters (block size, serial iteration count) using empirical performance models (cost = computation time + data transfer time); a sketch of such a cost estimate follows this list.
- Code generation: Each identified fusible subgraph is "glued" into a single GPU kernel, with global memory accesses minimized and shared memory re-use maximized. Local synchronization primitives (e.g., __syncthreads()) are inserted as required.
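The following is an illustrative sketch of an additive cost estimate of this kind; the structure and inputs are assumptions for exposition, not the compiler's actual, empirically calibrated model:

```cpp
// Illustrative fusion cost model: cost = computation time + data-transfer time.
struct KernelVariant {
    double flops;            // arithmetic work performed by the (fused) kernel
    double bytes_moved;      // global-memory traffic it generates
    double launch_overhead;  // fixed per-launch cost in seconds
};

double estimate_cost(const KernelVariant& k,
                     double peak_flops,       // device FLOP/s
                     double peak_bandwidth)   // device bytes/s
{
    double compute_time  = k.flops / peak_flops;
    double transfer_time = k.bytes_moved / peak_bandwidth;
    return compute_time + transfer_time + k.launch_overhead;
}
// A fusion candidate wins when its single fused cost is below the summed
// costs of the unfused kernels it replaces.
```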
For example, the fused kernel generated by the compiler combines coalesced loads from global memory into shared memory staging buffers, followed by an in-shared-memory matrix-vector multiply and final global memory write-back:
```cuda
__global__ void fused_sgemv_kernel(...) {
    __shared__ float s_A[...], s_x[...];
    // Load tile into s_A and s_x
    __syncthreads();
    // Compute result tilewise
    float result = 0.0f;
    for (int k = 0; k < tile_size; ++k) {
        result += s_A[...] * s_x[k];
    }
    y[...] = result;
}
```
4. Performance Implications and Empirical Results
Fused kernels generated by this approach achieve notable speedups over CUBLAS and hand-optimized multi-kernel baselines, especially for memory-bound, bandwidth-limited workloads. Reported empirical metrics include:
| BLAS Sequence | Speedup over CUBLAS |
|---|---|
| GEMVER | up to 2.61× |
| BiCGK | ≈1.61× |
| AXPYDOT (BLAS-1) | nearly 2× |
Fused implementations in these cases attained >75% of theoretical memory bandwidth for the GPU. The performance benefits derive both from reduced global memory accesses and amortized kernel launch overhead. In resource-bound scenarios, fusion can turn a sequence of memory-bound operations into a pipeline that approaches bandwidth saturation.
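As a rough, illustrative element-traffic count (not a figure from the paper) for the dot product of length-$n$ vectors: the unfused path writes the elementwise product vector $t$ and reads it back for the reduction, while the fused path touches only the inputs:

$$
\text{traffic}_{\text{unfused}} \approx \underbrace{2n}_{\text{read } x,\,y} + \underbrace{n}_{\text{write } t} + \underbrace{n}_{\text{read } t} = 4n,
\qquad
\text{traffic}_{\text{fused}} \approx 2n .
$$

Each eliminated intermediate thus removes one full write and one full read of an $n$-element vector, which is the mechanism behind the bandwidth-bound speedups summarized in the table above.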
5. Implementation Trade-offs and Limitations
While the observed benefits of kernel fusion are significant, there are non-trivial trade-offs:
- Shared memory and register pressure: Fusing multiple routines increases on-chip memory demands, which may limit occupancy or restrict block size.
- Synchronization: When thread-to-data mappings differ between routines, local barriers or partial synchronizations must be inserted, which can add overhead or serialize parts of the computation.
- Fusion granularity: Aggressive fusion of many routines can hit hardware resource ceilings and, paradoxically, reduce performance because fewer thread blocks can be scheduled concurrently.
- Code complexity/maintainability: Automated source-to-source approaches mitigate this, but hand-crafted fused kernels remain complex to tune.
The compiler-based approach mitigates some of these issues by performing static analysis, simulation, and tuning at code generation time, selecting fusion strategies that balance data locality against hardware limitations.
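One practical way to check whether a fused kernel's larger shared-memory or register footprint limits occupancy is the CUDA occupancy API; a minimal sketch (the kernel body, block size, and shared-memory size are placeholders):

```cuda
#include <cstdio>

// Placeholder fused kernel with a deliberately large static shared-memory
// footprint, standing in for a kernel that fuses several routines.
__global__ void fused_kernel_sketch(const float* in, float* out, int n) {
    __shared__ float tile[4096];                 // 16 KB of shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x];
}

int main() {
    int blocks_per_sm = 0;
    // How many 256-thread blocks fit on one SM, given the kernel's register
    // and shared-memory usage? Low values signal that fusion's resource
    // demands are throttling occupancy.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  fused_kernel_sketch, 256, 0);
    std::printf("resident blocks per SM: %d\n", blocks_per_sm);
    return 0;
}
```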
6. Theoretical Formulations
Key formalizations employed in kernel fusion for map/reduce/fold operations include:
$$
\begin{aligned}
&\text{map}(f, L_1, \ldots, L_n) = \big[\, f(e_1^1, \ldots, e_1^n),\ \ldots,\ f(e_m^1, \ldots, e_m^n) \,\big] \\
&\text{reduce}(+,\ \text{map}(\cdot,\ x,\ y)) && \text{(dot product)} \\
&y = \text{map}\big(A_i \mapsto \text{reduce}(+,\ \text{map}(\cdot,\ A_i,\ x)),\ A\big) && \text{(matrix-vector product)}
\end{aligned}
$$
These higher-order representations support compiler-level reasoning about when and how fusion can be applied; they are further exemplified in the compiler's dependency analysis and fusion policies.
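The same composition can also be expressed with fused library primitives. As a library-level illustration (not the paper's source-to-source compiler), Thrust's `inner_product` evaluates reduce(+, map(·, x, y)) in a single pass over the data:

```cuda
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    thrust::device_vector<float> x(n, 1.0f);
    thrust::device_vector<float> y(n, 2.0f);
    // Elementwise multiply and summation happen in one fused pass;
    // no intermediate vector of products is materialized.
    float dot = thrust::inner_product(x.begin(), x.end(), y.begin(), 0.0f);
    std::printf("dot = %f\n", dot);   // expected: 2 * n = 2097152
    return 0;
}
```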
7. Broader Impact and Future Directions
The demonstrated compiler-driven kernel fusion methodology establishes the feasibility of systematically fusing map, reduce, and their combinations for BLAS and similar workloads. The resulting fused kernels match or outperform vendor libraries on memory-bound sequences. The approach generalizes to many data-parallel numeric applications where memory locality can be improved by staged computation, provided that thread-to-data mapping and synchronization can be resolved.
Future directions include:
- Extending the fusion strategy for even more irregular computation or higher-order kernels (beyond map/reduce), such as those arising in machine learning workloads.
- Integrating hardware-awareness: adapting fusion strategies for hardware with hierarchical memory, software-managed caches, and complex synchronization models.
- Combining fusion with auto-tuning and dynamic scheduling to handle inputs or workloads whose optimal fusion policy varies at runtime.
These advances have practical implications for high-level domain-specific compilers and libraries, further closing the gap between theoretical arithmetic peak and achieved performance for memory-bound scientific computation.