Kernel Fusion in GPU Computing
- Kernel Fusion is the process of consolidating multiple GPU kernels into a single composite kernel that minimizes global memory traffic and kernel launch overhead.
- It leverages on-chip memory to cache intermediate data, reducing redundant global memory fetches and delivering substantial speedups for routines such as BLAS and iterative solvers.
- Advanced compiler techniques automate fusion strategies by analyzing dependency graphs and resource constraints, thereby optimizing memory locality and hardware utilization.
Kernel fusion refers to the programmatic consolidation of two or more computational kernels—typically GPU or accelerator kernels—into a single, composite kernel, with the primary objective of minimizing global memory traffic, kernel launch overhead, and intermediate temporary storage. By keeping shared or intermediate data in on-chip memory (registers or shared memory), kernel fusion directly addresses the bandwidth bottleneck prevalent in memory-bound computations such as BLAS routines and iterative solvers. Automated fusion compilers and advanced fusion strategies extend these benefits, yielding substantial speedups, improved memory locality, and better GPU resource utilization across a broad class of numerical and scientific computations.
1. Rationale and Principles of Kernel Fusion
Kernel fusion is motivated by the observation that the arithmetic throughput of modern GPUs is often significantly underutilized due to bandwidth limitations imposed by off-chip global memory transfers. Many GPU workloads (e.g., in linear algebra, signal processing, and scientific simulation) exhibit kernel sequences with significant data dependencies but limited computational intensity per kernel. In traditional execution, each kernel stages intermediate results through global memory, incurring the cost of redundant memory fetch/store cycles and synchronization via kernel launches.
Fusing such dependent kernels enables direct reuse of data within the on-chip memory hierarchy. For instance, consider an operation decomposed as $f(g(x), h(y))$: in a naive implementation, $g(x)$ and $h(y)$ are materialized in global memory before being read by $f$, while in the fused version their outputs are cached on-chip until needed. This leads to dramatic reductions in data traffic, especially when the intermediates are large and the computation per element is modest.
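To make this concrete, here is a minimal CUDA sketch of the difference, using a hypothetical two-stage elementwise pipeline (the kernel names and operations are illustrative, not drawn from the paper):

```cuda
// Unfused: the intermediate t is staged through global memory
// (read x, write t, second launch, read t, write y).
__global__ void stage1_scale(const float* x, float* t, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) t[i] = a * x[i];
}

__global__ void stage2_offset(const float* t, float* y, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = t[i] + b;
}

// Fused: the intermediate lives in a register, so global traffic
// shrinks to (read x, write y) and one kernel launch is saved.
__global__ void fused_scale_offset(const float* x, float* y,
                                   float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a * x[i];   // intermediate kept on-chip
        y[i] = t + b;
    }
}
```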
Conceptually, kernel fusion can often be formalized as the composition of "data-parallel" primitives such as map and reduce. Using higher-order notation, a BLAS-1 dot product is

$$\text{reduce}(+,\ \text{map}(\cdot,\ x,\ y)),$$

which is readily implemented as a single fused kernel holding the partial products in registers or shared memory.
2. Application to BLAS and GPU Linear Algebra
The impact of kernel fusion is particularly clear in the context of BLAS routines on GPUs, as detailed in "Optimizing CUDA Code By Kernel Fusion—Application on BLAS" (Filipovič et al., 2013). BLAS-1 (vector-vector) and BLAS-2 (matrix-vector) routines typically decompose into a sequence of memory-bound steps—vector loads, elementwise computation, reductions—each implemented as a distinct kernel. For the dot product, the unfused path writes out the result of the elementwise product before summing, while the fused version accumulates partial sums on-chip.
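A minimal sketch of such a fused dot-product kernel follows. It uses one common pattern—a block-level tree reduction in shared memory plus a final atomic—and is not necessarily the code the compiler emits; it assumes a power-of-two block size, a launch with `blockDim.x * sizeof(float)` bytes of dynamic shared memory, and a zero-initialized `result`:

```cuda
// Fused dot product: elementwise multiply and reduction in one kernel,
// so the vector of partial products is never written to global memory.
__global__ void fused_dot(const float* x, const float* y, float* result, int n) {
    extern __shared__ float partial[];                 // one slot per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? x[i] * y[i] : 0.0f;
    __syncthreads();
    // Tree reduction over this block's partial products in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(result, partial[0]);  // combine block sums
}
```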
In BLAS-2 routines such as SGEMV ($y \leftarrow \alpha A x + \beta y$), a full BLAS library call may need to fetch a row or column of $A$ from memory multiple times across operations, unless fusion is applied. In advanced patterns like BiCGK, which require both $q = A p$ and $s = A^{T} r$ in succession, a fused kernel can share the matrix tile across both operations:
```
// Pseudocode of fused BiCGK
allocate shared: A_tile, p_local, r_local
for each tile {
    load A_tile, p_local, r_local from global memory
    // Both q = A p and s = A^T r computed here, with A_tile reused
    compute q and s with parallel reductions
    store q and s
}
```
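A concrete, untiled CUDA sketch of the same idea is shown below (one thread per row; `s` is accumulated with atomics and assumed zero-initialized). It illustrates the data reuse, not the compiler's tiled output:

```cuda
// Fused BiCGK: each element of A is read once and contributes to both
// q = A p and s = A^T r, instead of being fetched by two separate kernels.
__global__ void fused_bicgk(const float* A, const float* p, const float* r,
                            float* q, float* s, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float qi = 0.0f;
    float ri = r[row];                   // reused for the A^T r contribution
    for (int j = 0; j < cols; ++j) {
        float a = A[row * cols + j];     // single global read of A[row][j]
        qi += a * p[j];                  // contribution to q = A p
        atomicAdd(&s[j], a * ri);        // contribution to s = A^T r
    }
    q[row] = qi;
}
```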
3. Source-to-Source Compiler Techniques
Automating the fusion of map/reduce operations and their combinations is non-trivial, as explored through a source-to-source compiler in (Filipovič et al., 2013). The compiler's principal components are:
- Elementary function library: Routines are encoded in a special CUDA form, with metadata expressing their thread/data mapping and barrier requirements.
- Dependency graph construction: The compiler builds a graph representing data flow and identifies fusible subgraphs, matching compatible thread mappings and local barrier needs.
- Fusion policy exploration and performance modeling: The compiler enumerates possible fusion strategies, choosing parameters (block size, serial iteration count) using empirical performance models (cost = computation time + data transfer time); a sketch of such a cost estimate follows this list.
- Code generation: Each identified fusible subgraph is "glued" into a single GPU kernel, with global memory accesses minimized and shared memory re-use maximized. Local synchronization primitives (e.g., __syncthreads()) are inserted as required.
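The following is an illustrative sketch of an additive cost estimate of this kind; the structure and inputs are assumptions for exposition, not the compiler's actual, empirically calibrated model:

```cpp
// Illustrative fusion cost model: cost = computation time + data-transfer time.
struct KernelVariant {
    double flops;            // arithmetic work performed by the (fused) kernel
    double bytes_moved;      // global-memory traffic it generates
    double launch_overhead;  // fixed per-launch cost in seconds
};

double estimate_cost(const KernelVariant& k,
                     double peak_flops,       // device FLOP/s
                     double peak_bandwidth)   // device bytes/s
{
    double compute_time  = k.flops / peak_flops;
    double transfer_time = k.bytes_moved / peak_bandwidth;
    return compute_time + transfer_time + k.launch_overhead;
}
// A fusion candidate wins when its single fused cost is below the summed
// costs of the unfused kernels it replaces.
```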
For example, the fused kernel generated by the compiler combines coalesced loads from global memory into shared memory staging buffers, followed by an in-shared-memory matrix-vector multiply and final global memory write-back:
```cuda
__global__ void fused_sgemv_kernel(...) {
    __shared__ float s_A[...], s_x[...];
    // Load tile into s_A and s_x
    __syncthreads();
    // Compute result tilewise
    float result = 0.0f;
    for (int k = 0; k < tile_size; ++k) {
        result += s_A[...] * s_x[k];
    }
    y[...] = result;
}
```
4. Performance Implications and Empirical Results
Fused kernels generated by this approach achieve notable speedups over CUBLAS and hand-optimized multi-kernel baselines, especially for memory-bound, bandwidth-limited workloads. Reported empirical metrics include:
| BLAS Sequence | Speedup over CUBLAS |
|---|---|
| GEMVER | up to 2.61× |
| BiCGK | ≈1.61× |
| AXPYDOT (BLAS-1) | nearly 2× |
Fused implementations in these cases attained >75% of theoretical memory bandwidth for the GPU. The performance benefits derive both from reduced global memory accesses and amortized kernel launch overhead. In resource-bound scenarios, fusion can turn a sequence of memory-bound operations into a pipeline that approaches bandwidth saturation.
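As a rough, illustrative element-traffic count (not a figure from the paper) for the dot product of length-$n$ vectors: the unfused path writes the elementwise product vector $t$ and reads it back for the reduction, while the fused path touches only the inputs:

$$
\text{traffic}_{\text{unfused}} \approx \underbrace{2n}_{\text{read } x,\,y} + \underbrace{n}_{\text{write } t} + \underbrace{n}_{\text{read } t} = 4n,
\qquad
\text{traffic}_{\text{fused}} \approx 2n .
$$

Each eliminated intermediate thus removes one full write and one full read of an $n$-element vector, which is the mechanism behind the bandwidth-bound speedups summarized in the table above.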
5. Implementation Trade-offs and Limitations
While the observed benefits of kernel fusion are significant, there are non-trivial trade-offs:
- Shared memory and register pressure: Fusing multiple routines increases on-chip memory demands, which may limit occupancy or restrict block size.
- Synchronization: When thread-to-data mappings differ between routines, local barriers or partial synchronizations must be inserted, which can add overhead or serialize parts of the computation.
- Fusion granularity: Aggressive fusion of many routines can hit hardware resource ceilings and, paradoxically, reduce performance because fewer thread blocks can be scheduled concurrently.
- Code complexity/maintainability: Automated source-to-source approaches mitigate this, but hand-crafted fused kernels remain complex to tune.
The compiler-based approach mitigates some of these issues by performing static analysis, simulation, and tuning at code generation time, selecting fusion strategies that balance data locality against hardware limitations.
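One practical way to check whether a fused kernel's larger shared-memory or register footprint limits occupancy is the CUDA occupancy API; a minimal sketch (the kernel body, block size, and shared-memory size are placeholders):

```cuda
#include <cstdio>

// Placeholder fused kernel with a deliberately large static shared-memory
// footprint, standing in for a kernel that fuses several routines.
__global__ void fused_kernel_sketch(const float* in, float* out, int n) {
    __shared__ float tile[4096];                 // 16 KB of shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x];
}

int main() {
    int blocks_per_sm = 0;
    // How many 256-thread blocks fit on one SM, given the kernel's register
    // and shared-memory usage? Low values signal that fusion's resource
    // demands are throttling occupancy.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                  fused_kernel_sketch, 256, 0);
    std::printf("resident blocks per SM: %d\n", blocks_per_sm);
    return 0;
}
```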
6. Theoretical Formulations
Key formalizations employed in kernel fusion for map/reduce/fold operations include:
$$
\begin{aligned}
&\text{map}(f, L_1, \ldots, L_n) = \big[\, f(e_1^1, \ldots, e_1^n),\ \ldots,\ f(e_m^1, \ldots, e_m^n) \,\big] \\
&\text{reduce}(+,\ \text{map}(\cdot,\ x,\ y)) && \text{(dot product)} \\
&y = \text{map}\big(A_i \mapsto \text{reduce}(+,\ \text{map}(\cdot,\ A_i,\ x)),\ A\big) && \text{(matrix-vector product)}
\end{aligned}
$$
These higher-order representations support compiler-level reasoning about when and how fusion can be applied; they are further exemplified in the compiler's dependency analysis and fusion policies.
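The same composition can also be expressed with fused library primitives. As a library-level illustration (not the paper's source-to-source compiler), Thrust's `inner_product` evaluates reduce(+, map(·, x, y)) in a single pass over the data:

```cuda
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    thrust::device_vector<float> x(n, 1.0f);
    thrust::device_vector<float> y(n, 2.0f);
    // Elementwise multiply and summation happen in one fused pass;
    // no intermediate vector of products is materialized.
    float dot = thrust::inner_product(x.begin(), x.end(), y.begin(), 0.0f);
    std::printf("dot = %f\n", dot);   // expected: 2 * n = 2097152
    return 0;
}
```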
7. Broader Impact and Future Directions
The demonstrated compiler-driven kernel fusion methodology establishes the feasibility of systematically fusing map, reduce, and their combinations for BLAS and similar workloads. The resulting fused kernels match or outperform vendor libraries on memory-bound sequences. The approach generalizes to many data-parallel numeric applications where memory locality can be improved by staged computation, provided that thread-to-data mapping and synchronization can be resolved.
Future directions include:
- Extending the fusion strategy for even more irregular computation or higher-order kernels (beyond map/reduce), such as those arising in machine learning workloads.
- Integrating hardware-awareness: adapting fusion strategies for hardware with hierarchical memory, software-managed caches, and complex synchronization models.
- Combining fusion with auto-tuning and dynamic scheduling to handle inputs or workloads whose optimal fusion policy varies at runtime.
These advances have practical implications for high-level domain-specific compilers and libraries, further closing the gap between theoretical arithmetic peak and achieved performance for memory-bound scientific computation.