
Fused Kernel Operations

Updated 23 September 2025
  • Fused kernel operations are the systematic integration of multiple GPU kernels into a single kernel to improve memory locality and minimize global memory traffic.
  • Automatic fusion methodologies decompose kernels into load, compute, and store phases and use cost modeling to select fusion plans, yielding speedups of up to roughly 3x.
  • This approach leverages on-chip resources like registers and shared memory while balancing tradeoffs between resource consumption and throughput in various computational workloads.

Fused kernel operations refer to the systematic combination of multiple GPU computational kernels into a single kernel in order to increase memory locality, reduce global memory traffic, minimize kernel launch overhead, exploit on-chip memory (registers/shared memory), and—more generally—unlock higher fractions of device throughput, particularly in bandwidth-bound settings or operator sequences with nontrivial data reuse. Starting from foundational GPU code optimization themes, research in this area has substantively advanced both the theory and practice of automated kernel fusion, as well as the supporting compiler infrastructure and performance modelling, for a range of workloads including BLAS routines, image/video pipelines, high-performance scientific code, deep learning workloads, and distributed machine learning settings.

1. Principles of Kernel Fusion

At the core of kernel fusion is the observation that, compared to arithmetic throughput, GPU global memory bandwidth is severely limited. Kernel sequences (particularly those expressible as map, reduce, or their nested combinations) incur substantial global memory traffic when each kernel reads its entire input and writes its output to global memory before subsequent kernels reuse that data. By fusing kernels, intermediate results are kept on-chip as long as thread-to-data assignments and dependencies allow—either in registers (where the same thread consumes the output) or in shared memory (with necessary local barriers to manage intra-block data reuse).
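
As a minimal CUDA sketch of this principle (illustrative kernels, not drawn from the cited papers), consider two element-wise operations: unfused, the intermediate array makes a round trip through global memory; fused, it stays in a register.

```cuda
// Unfused: two launches; the intermediate 'tmp' is written to and re-read from global memory.
__global__ void scale(const float* x, float* tmp, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * x[i];
}
__global__ void add_bias(const float* tmp, const float* b, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = tmp[i] + b[i];
}

// Fused: one launch; the intermediate value lives in a register, eliminating the
// global-memory traffic for 'tmp' and one kernel-launch overhead.
__global__ void scale_add_bias(const float* x, const float* b, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a * x[i];   // intermediate kept on-chip
        y[i] = t + b[i];
    }
}
```

Because both operations share the same thread-to-data mapping, no shared memory or barriers are needed; the fusion only changes where the intermediate lives.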

This abstraction extends not just to unary operations (map), but also to reduction patterns and nested constructs. For example, a map of reductions or a reduction following a map can be fused, as long as their dependency pattern allows per-thread-block partial reductions and avoidance of global synchronization (Filipovič et al., 2013). In image applications, fusion encompasses chains of spatial and per-pixel operations, subject to thread and data dependency constraints (Adnan et al., 2015).
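
For the "map of reductions" case, a hedged sketch (names and block size are illustrative): assigning one thread block per row of a matrix lets each row sum complete with block-local synchronization only, so no global barrier is ever required.

```cuda
// Row-wise sums (a map over rows of reductions over columns) fused into one kernel.
// Assumes blockDim.x is a power of two; launch as
//   row_sums<<<rows, 256, 256 * sizeof(float)>>>(A, out, rows, cols);
__global__ void row_sums(const float* A, float* out, int rows, int cols) {
    extern __shared__ float partial[];             // blockDim.x floats
    int row = blockIdx.x;
    if (row >= rows) return;                       // whole block exits together
    float acc = 0.0f;
    // Map phase: each thread accumulates a strided slice of its row in a register.
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        acc += A[row * cols + c];
    partial[threadIdx.x] = acc;
    __syncthreads();
    // Reduce phase: block-local tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[row] = partial[0];
}
```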

2. Automatic Fusion Methodologies

Automatic kernel fusion systems typically span several phases:

  • Decomposition and Partitioning: Each kernel is decomposed into load, compute, and store routines with explicit resource/memory requirement metadata. A dependency graph is constructed for a sequence of operations (e.g., those corresponding to BLAS calls), and candidate fusible sub-graphs representing legal fusions that preserve program semantics are identified (Filipovič et al., 2013).
  • Implementation Patterns: Kernels that operate with the same thread-to-data mapping allow retention of data in registers; otherwise, shared memory is used and local barriers are introduced (e.g., __syncthreads() in CUDA). For reduction patterns, partial sums are computed within blocks before a final reduction stage (Filipovič et al., 2013, Adnan et al., 2015).
  • Fusion Plan Exploration and Cost Modeling: The configuration space of possible fusions (including order, block size, and memory/pipeline schedule) is traversed exhaustively or with heuristics. Each candidate is scored by a predictive cost or latency model that benchmarks per-routine bandwidth and resource use and estimates overall time given occupancy and memory constraints; a simplified cost-estimate sketch follows this list. Empirically, predicted and observed optima align closely, although poorly chosen plans can reach only 30–40% of the best achievable performance (Filipovič et al., 2013).
  • Code Generation: The fused kernel is synthesized by gluing routine sequences, allocating intermediate data to registers or shared memory according to the computed mapping, and generating kernel launch parameters accordingly. Source-to-source compilers and DSLs automate this process (Filipovič et al., 2013, Adnan et al., 2015, Sewall et al., 2017).
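
A deliberately simplified, host-side sketch of the cost-model step referenced above: it scores a fusion candidate purely by the global-memory bytes it moves at an assumed sustained bandwidth. All numbers are hypothetical, and the models in the cited work additionally account for occupancy, launch overhead, and measured per-routine bandwidth.

```cuda
// Host-side C++ (e.g., in a .cu file): bandwidth-only time estimates for a linear
// chain of routines in which each routine's only input is the previous routine's output.
#include <cstdio>
#include <vector>

struct Routine {
    size_t bytes_in;    // bytes loaded from global memory
    size_t bytes_out;   // bytes stored to global memory
};

// Unfused: every routine loads its input from and stores its output to global memory.
double estimate_unfused_ms(const std::vector<Routine>& chain, double gbps) {
    size_t bytes = 0;
    for (const Routine& r : chain) bytes += r.bytes_in + r.bytes_out;
    return bytes / (gbps * 1e6);   // GB/s -> bytes per millisecond
}

// Fused: intermediates stay on-chip, so only the first load and the last store remain.
double estimate_fused_ms(const std::vector<Routine>& chain, double gbps) {
    size_t bytes = chain.front().bytes_in + chain.back().bytes_out;
    return bytes / (gbps * 1e6);
}

int main() {
    size_t n = size_t(1) << 24;                                  // hypothetical vector length
    std::vector<Routine> chain = {{4 * n, 4 * n}, {4 * n, 4 * n}, {4 * n, 4 * n}};
    double bw = 400.0;                                           // assumed sustained GB/s
    std::printf("unfused ~%.2f ms, fused ~%.2f ms\n",
                estimate_unfused_ms(chain, bw), estimate_fused_ms(chain, bw));
    return 0;
}
```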

3. Memory Locality and Resource Tradeoffs

The principal quantitative benefit of fusion is improved temporal locality. For example, in fused GEMVER, the number of global memory transfers is reduced by a factor proportional to the number of eliminated intermediate operations: CUBLAS achieves 31.9 GFlop/s while the fused version attains 83.4 GFlop/s, a 2.61x speedup (Filipovič et al., 2013). Similar advantages are seen in image/video sequences, where bandwidth reductions of 33–44% yield 2–3x execution-time improvements (Adnan et al., 2015). Fused kernels typically utilize >75% of the GPU's peak memory bandwidth, indicating that the bottleneck shifts close to the memory system's theoretical limits.

However, fusing too many operations into a single kernel may increase register and shared memory usage, reducing occupancy (the number of simultaneously active warps or thread blocks). This can partially offset memory gains. Automated approaches thus model these resource constraints and either avoid oversubscription or balance tradeoffs in the cost model (Filipovič et al., 2013, Sewall et al., 2017).
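
The resource side of this tradeoff can be inspected directly with the CUDA occupancy API, as in the sketch below; the kernel body and the shared-memory figure are placeholders standing in for whatever a particular fusion actually consumes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for a fused kernel; a real one would use more registers and shared memory.
__global__ void fused_kernel(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = 2.0f * x[i] + 1.0f;
}

int main() {
    int block_size = 256;
    size_t dynamic_smem = 8 * 1024;   // hypothetical shared memory added per block by fusion
    int blocks_per_sm = 0;
    // Reports how many blocks fit per SM given the kernel's register footprint and the
    // requested dynamic shared memory; fewer resident blocks means lower occupancy.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, fused_kernel,
                                                  block_size, dynamic_smem);
    std::printf("resident blocks per SM: %d\n", blocks_per_sm);
    return 0;
}
```

Comparing this figure for the fused and unfused variants shows whether the extra on-chip state has pushed occupancy below the level the cost model assumed.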

4. Applications to BLAS and Scientific Kernels

Numerous BLAS-1 and BLAS-2 routines are naturally reformulated as fusion-friendly sequences. A dot product is decomposed as a map (element-wise multiplication) followed by a reduce (sum). Matrix–vector multiplications (e.g., y = Ax) follow nested map-reduce forms, which are partitioned into tiles to fit on-chip resources.
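
A hedged sketch of that decomposition for the dot product: the element-wise multiply (map) and the summation (reduce) are fused into one kernel, with per-block partial sums combined by an atomic add; a second finalization kernel is an equally valid choice.

```cuda
// Fused dot product: map (x[i] * y[i]) and reduce (sum) in a single pass over the inputs.
// '*result' must be zero-initialized before launch; assumes blockDim.x is a power of two.
__global__ void fused_dot(const float* x, const float* y, float* result, int n) {
    extern __shared__ float partial[];              // blockDim.x floats
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        acc += x[i] * y[i];                         // map phase, accumulated in a register
    partial[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // block-local tree reduction
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(result, partial[0]);   // combine per-block partials
}
```

The unfused form would materialize the full product vector in global memory before a separate reduction kernel read it back.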

Compiler-driven fusion also allows two matrix–vector products such as y = A·p and z = Aᵀ·r (one traversing A by rows, the other by columns) to execute adjacently, exploiting tiles of A staged in shared memory for both operations without writing intermediate results to global memory (Filipovič et al., 2013). Redundant loads, stores, and barriers are minimized.
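
A deliberately simplified sketch of that idea (generic names; the compiler-generated kernels in the cited work stage tiles of A in shared memory and use block-level reductions rather than per-element atomics): each element of A is read from global memory once and contributes to both products.

```cuda
// Computes y = A * p and z = A^T * r in one pass over A.
// 'y' and 'z' must be zero-initialized before launch; A is n x n, row-major.
__global__ void fused_Ap_ATr(const float* A, const float* p, const float* r,
                             float* y, float* z, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float a = A[row * n + col];       // single global-memory read of this element of A
        atomicAdd(&y[row], a * p[col]);   // row-oriented use: y = A * p
        atomicAdd(&z[col], a * r[row]);   // column-oriented use: z = A^T * r
    }
}
```

Unfused, A would be streamed from global memory twice, once per product; the fusion halves that traffic at the cost of atomic accumulation (or, in the tiled variant, shared-memory staging and barriers).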

In scientific computing, similar graph-based and vectorization-oriented fusion paradigms merge multiple stencil, normalization, or hydrodynamic kernels into a single loop nest, reducing intermediate storage and enabling better vectorization; fusing nine kernels in Hydro2D in this way yields dramatic bandwidth savings (Sewall et al., 2017).

5. Fusion in Image and Video Pipelines

Fusion in the image/video domain often involves stages with thread-to-thread and thread-to-multithread dependencies, where care must be taken with memory tiling and synchronization. The fusion transformation typically (a code sketch follows this list):

  1. Copies an input region (“box”) from global memory to shared memory,
  2. Converts global memory accesses within kernel segments to shared memory,
  3. Concatenates code segments with inserted synchronization for thread-to-multithread dependencies,
  4. Writes the final result back to global memory (Adnan et al., 2015).
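
The four steps can be sketched for a hypothetical two-stage pipeline, a 3x3 box blur followed by a per-pixel threshold; the tile size, halo handling, and stage choice are illustrative rather than taken from the cited paper.

```cuda
#define TILE 16
#define HALO 1   // a 3x3 neighborhood needs a 1-pixel border around the tile

// Fused blur + threshold: the blurred intermediate never leaves shared memory.
// Launch with blockDim = dim3(TILE + 2*HALO, TILE + 2*HALO) and
// gridDim = dim3(ceil(width / TILE), ceil(height / TILE)).
__global__ void blur_then_threshold(const unsigned char* in, unsigned char* out,
                                    int width, int height, float level) {
    __shared__ float box[TILE + 2 * HALO][TILE + 2 * HALO];

    int gx = blockIdx.x * TILE + threadIdx.x - HALO;   // global coordinates incl. halo
    int gy = blockIdx.y * TILE + threadIdx.y - HALO;

    // Step 1: copy the input box (tile plus halo) from global to shared memory.
    int cx = min(max(gx, 0), width - 1);               // clamp reads at image borders
    int cy = min(max(gy, 0), height - 1);
    box[threadIdx.y][threadIdx.x] = (float)in[cy * width + cx];
    __syncthreads();                                   // Step 3: barrier between stages

    // Steps 2 and 4: interior threads read the blur inputs from shared memory,
    // apply the per-pixel threshold, and write only the final result to global memory.
    if (threadIdx.x >= HALO && threadIdx.x < TILE + HALO &&
        threadIdx.y >= HALO && threadIdx.y < TILE + HALO &&
        gx < width && gy < height) {
        float s = 0.0f;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                s += box[threadIdx.y + dy][threadIdx.x + dx];
        float blurred = s / 9.0f;                      // intermediate stays on-chip
        out[gy * width + gx] = (blurred > level) ? 255 : 0;
    }
}
```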

The partitioning of a kernel sequence is formally described as a constrained optimization problem: minimize E = Σᵢ XᵢCᵢ, where Xᵢ indicates whether candidate partition i is selected and Cᵢ is its estimated cost, subject to the selected partitions covering the full pipeline (Adnan et al., 2015). Resulting fused kernels in feature tracking attain up to 3x throughput improvement over non-fused pipelines and are critical for real-time neuroscience and industrial vision workloads.
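
A hedged, host-side sketch of that selection for a linear chain of kernels: contiguous candidate partitions are enumerated and the minimum-cost covering is found by dynamic programming over cut points; the cost function below is a placeholder for the model described earlier, and all constants are hypothetical.

```cuda
// Host-side C++: choose cut points in a linear chain of kernels so that the selected
// fused partitions cover the whole pipeline and the summed cost E is minimal.
#include <algorithm>
#include <cstdio>
#include <limits>
#include <vector>

// Placeholder cost C_i of fusing kernels [start, end) into one partition.
// A real model would estimate bytes moved, occupancy, and launch overhead instead.
double partition_cost(int start, int end) {
    double launch = 0.005;                                     // per-launch overhead (ms)
    double work = 0.10 * (end - start);                        // compute/memory time (ms)
    double pressure = 0.02 * (end - start) * (end - start);    // resource-pressure penalty
    return launch + work + pressure;
}

// best[k] = minimal cost of covering the first k kernels with contiguous partitions.
double best_total_cost(int num_kernels) {
    std::vector<double> best(num_kernels + 1, std::numeric_limits<double>::infinity());
    best[0] = 0.0;
    for (int end = 1; end <= num_kernels; ++end)
        for (int start = 0; start < end; ++start)              // last partition is [start, end)
            best[end] = std::min(best[end], best[start] + partition_cost(start, end));
    return best[num_kernels];
}

int main() {
    std::printf("minimal estimated cost for a 6-kernel pipeline: %.3f ms\n",
                best_total_cost(6));
    return 0;
}
```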

6. Limitations and Performance Modeling

Although kernel fusion addresses bandwidth limitations and reduces unnecessary data movement, it must be carefully adapted to cases with complex data-sharing and dependency patterns. If data is accessed by threads other than those that produced it, shared memory and local synchronization are required. Disparate routines (with different thread/block mappings) may inhibit fusion or incur higher management overhead (Filipovič et al., 2013).

Automated compilers deploy performance modeling, often based on routine microbenchmarks and static resource estimation, to guide the fusion process. While most generated candidates match or nearly match empirically optimal implementations, with top candidates often reaching more than 98% of the optimum, suboptimal plans remain, which underscores the need for robust cost models and heuristics for practical deployment (Filipovič et al., 2013).

7. Implications, Research Impact, and Extensions

The compiler-based methodologies introduced in these works fundamentally advanced the field by enabling automatic generation of fused kernels for map/reduce/nested workloads, broad BLAS routines, and scientific pipelines. By systematically restructuring code to retain shared data on-chip, these approaches move kernel performance closer to theoretical bandwidth ceilings, open up new avenues for composite operator specialization, and circumvent some limitations of even highly tuned vendor libraries.

Extensions of these ideas, including support for mixed reduction/broadcast pipelines, advanced cost models, domain-specialized scheduling (as in image/video or scientific computing), and integration with high-level computational DSLs, further demonstrate the continuing evolution and importance of principled, automated fused kernel operation methodologies in high-performance GPU programming and domain-specific compiler development.
