Horizontal Fusion for GPU Optimization
- Horizontal Fusion is a GPU optimization technique that aggregates independent kernel invocations into a single launch to reduce overhead and boost thread-level parallelism.
- It is implemented using compile-time C++ metaprogramming and source-to-source transformations that restructure thread indexing and synchronization for maximal efficiency.
- Empirical studies show that HF can achieve speedups from 1x to 66x for batched workloads, and even greater performance when combined with vertical fusion.
Horizontal Fusion (HF) is a GPU optimization technique that aggregates multiple independent invocations of a kernel, typically across different data planes or batch elements, into a single, larger kernel launch. Unlike traditional kernel fusion, which stitches dependent kernel operations in sequence (vertical fusion), HF seeks to increase thread-level parallelism, minimize kernel launch overhead, and optimize memory bandwidth by leveraging the hardware’s ability to interleave independent instruction streams. HF is implemented both through source-to-source compiler transformations and compile-time metaprogramming, and has demonstrated substantial performance improvements for workloads characterized by independent, repeated kernel executions (Amoros et al., 9 Aug 2025, Li et al., 2020).
1. Contrasting Horizontal Fusion and Vertical Fusion
Vertical Fusion (VF) combines a sequence of dependent data-parallel operations such that intermediate results are retained in on-chip memory (SRAM) rather than round-tripping through DRAM. In practice, VF replaces:
```
launch(Op₁); launch(Op₂); ... launch(Opₙ);
```
with a single
```
launch(fused Op₁+Op₂+...+Opₙ);
```
Horizontal Fusion, by contrast, aggregates independent invocations of the same kernel, each operating on different data “planes” or batch elements. It replaces:
```
for i in [0..B−1]:
    launch(Op, data[i])
```
with a single
```
launch(fusedBatch<Op,B>)
```
This distinction is summarized below:
| Technique | Data Dependency | Kernel Launch Pattern | Resource Optimization |
|---|---|---|---|
| Vertical Fusion | Dependent | Sequential | Intermediate data in registers; DRAM round-trips eliminated |
| Horizontal Fusion | Independent | Parallel/Batched | Kernel launch overhead minimized; memory access coalescing; increased parallelism |
2. Implementation Methodologies
a) Compile-Time C++17 Template Composition
The Fused Kernel Library (Amoros et al., 9 Aug 2025) implements HF via reusable, compile-time fusionable components in C++17. Kernels are constructed using abstractions:
- MOp/COp: Memory and compute operations, as connectable building blocks.
- IOp: Instantiable operations carrying per-kernel or batched parameters.
- DPP: Data-parallel pattern for thread organization.
HF is realized through templates such as:
```cpp
template <typename Mop, size_t B>
struct BatchRead {
    typename Mop::ParamsType _ps[B];
    __device__ typename Mop::OutputType exec(Point t) const {
        return Mop::exec(t, _ps[t.z]);
    }
};
```
The thread organization uses the grid's third dimension (z) to allocate a unique batch plane to each kernel instance. All operations are resolved statically; the actual device code (SASS) contains no dispatch logic, resulting in maximal efficiency.
b) Source-to-Source Compiler Transformation
HFuse (Li et al., 2020) implements HF by partitioning thread blocks into intervals, each corresponding to a distinct kernel. The transformation:
- Defines a fused kernel whose block dimension is d₁ + d₂ for kernels K₁ and K₂.
- Rewrites thread indexing and block dimensions so that threads in [0, d₁) execute K₁ and threads in [d₁, d₁ + d₂) execute K₂.
- Rewrites synchronization barriers to ensure thread-subset correctness using PTX "bar.sync" instructions.
- Empirically profiles candidate configurations for optimal thread splits and register allocations.
This approach increases the number of eligible warps for the hardware’s warp scheduler, enabling efficient latency hiding through interleaving of instruction streams.
3. Performance Models and Experimental Results
The theoretical and observed benefits of HF are quantified as follows:
Let B be the batch size (number of planes), t_launch the per-kernel launch overhead, and t_exec the execution time of a single kernel instance.
- Unfused sequential execution: T_seq = B · (t_launch + t_exec)
- Horizontally fused execution: T_fused ≈ t_launch + t_exec, assuming the hardware can schedule all B planes concurrently
- Ideal speedup: S = T_seq / T_fused ≈ B (until DRAM bandwidth is saturated) (Amoros et al., 9 Aug 2025).
Empirical findings demonstrate:
- HF alone delivers speedups that grow with batch size (measured on batches of 60×120 images).
- Combined VF+HF yields substantially larger speedups on an RTX 4090 compared to both OpenCV-CUDA (single pipeline) and CUDA Graphs (Amoros et al., 9 Aug 2025).
- HFuse measured consistent speedups, with the largest gains for kernel pairs with complementary resource utilization (e.g., a memory-bound kernel fused with a compute-bound one) (Li et al., 2020).
4. Algorithmic Structures and Pseudocode
Fused Kernel Library utilizes compile-time recursion on variadic template instantiations. For horizontal fusion, the core pseudocode for batch reads is:
```cpp
template <typename Mop, size_t B>
struct BatchRead {
    typename Mop::ParamsType _ps[B];
    __device__ typename Mop::OutputType exec(Point t) const {
        return Mop::exec(t, _ps[t.z]);
    }
};
```
The host configures the launch with dim3 grid(..., ..., B) and launches the kernel once; each device thread then dispatches based on its own z coordinate (t.z) (Amoros et al., 9 Aug 2025).
HFuse partitions block dimensions, substitutes thread indices, and guards thread intervals:
```
for each tid in [0, d₁ + d₂):
    if (tid < d₁) { run S₁ }
    else { run S₂ }
```
Barriers and synchronization are rewritten using inline PTX to ensure correctness of intra-kernel and inter-kernel thread interactions (Li et al., 2020).
5. Practical Applications and Design Trade-Offs
HF is especially impactful for batched workloads common in deep learning and signal processing, where numerous independent instances of the same operation must be applied (e.g., batch image processing, elementwise transforms).
Key advantages:
- Kernel launch overhead is eliminated for each batch element, reducing PCIe and launch latency.
- Memory bandwidth utilization approaches theoretical peak as independent memory streams are coalesced.
- Intermediate buffers are eliminated from DRAM, reducing overall memory requirements, particularly at increased resolutions (Amoros et al., 9 Aug 2025).
Trade-offs and limitations:
- Register usage per-thread may limit occupancy; profiling is required to balance spill vs. parallelism.
- Branch divergence is introduced from guarded thread intervals, though impact is minimal if block sizes are warp-aligned.
- Barrier rewriting relies on features like PTX “bar.sync,” which may not be portable across future ISAs (Li et al., 2020).
- Maximum achievable speedup is bounded by hardware memory bandwidth and on-chip resource constraints.
6. Extensions, Integration, and Future Directions
Suggestions for maximizing HF utility include:
- Integrating HF strategies as passes in existing ML compilers (TVM, XLA, Glow, TASO).
- Extending multi-interval strategies beyond two kernels, to n-way partitioned blocks.
- Combining HF with vertical fusion and ND-range or cooperative-group techniques for broader fusion granularity.
- Leveraging automatic profiling to assess register consumption and shared-memory utilization, ensuring optimal occupancy and minimizing spill.
Evidence suggests that HF can be made robust, fully automated, and synergistic with existing cost-based GPU optimization frameworks (Amoros et al., 9 Aug 2025, Li et al., 2020).
7. Summary and Implications
Horizontal Fusion provides a complementary dimension to traditional vertical fusion by exploiting thread-level parallelism across independent kernel invocations. Realizations such as the Fused Kernel Library and HFuse demonstrate that the approach is feasible with both compile-time metaprogramming (requiring no custom compilers) and source-to-source compilation. HF is particularly effective for batched GPU workloads and accelerates compute pipelines wherever kernel launch overhead and memory bandwidth constraints are limiting factors. Its combination with vertical fusion leads to multiplicative performance gains and large reductions in intermediate memory footprint. Further integration into high-level compilers and the expansion to more dynamic workloads remains a significant area for development (Amoros et al., 9 Aug 2025, Li et al., 2020).