Horizontal Fusion for GPU Optimization

Updated 16 December 2025
  • Horizontal Fusion is a GPU optimization technique that aggregates independent kernel invocations into a single launch to reduce overhead and boost thread-level parallelism.
  • It is implemented using compile-time C++ metaprogramming and source-to-source transformations that restructure thread indexing and synchronization for maximal efficiency.
  • Empirical studies show that HF can achieve speedups from 1x to 66x for batched workloads, and even greater performance when combined with vertical fusion.

Horizontal Fusion (HF) is a GPU optimization technique that aggregates multiple independent invocations of a kernel, typically across different data planes or batch elements, into a single, larger kernel launch. Unlike traditional kernel fusion, which stitches dependent kernel operations in sequence (vertical fusion), HF seeks to increase thread-level parallelism, minimize kernel launch overhead, and optimize memory bandwidth by leveraging the hardware’s ability to interleave independent instruction streams. HF is implemented both through source-to-source compiler transformations and compile-time metaprogramming, and has demonstrated substantial performance improvements for workloads characterized by independent, repeated kernel executions (Amoros et al., 9 Aug 2025, Li et al., 2020).

1. Contrasting Horizontal Fusion and Vertical Fusion

Vertical Fusion (VF) combines a sequence of dependent data-parallel operations such that intermediate results are retained in on-chip memory (SRAM) rather than round-tripping through DRAM. In practice, VF replaces:

launch(Op₁); launch(Op₂); ... launch(Opₙ);

with

launch(fused Op₁+Op₂+...+Opₙ);

thereby reducing DRAM traffic from n round-trips to a single memory access (Amoros et al., 9 Aug 2025).

Horizontal Fusion, by contrast, aggregates B independent invocations of the same kernel Op, each operating on different data “planes” or batch elements:

for i in [0..B−1]:
    launch(Op, data[i])

is fused to

launch(fusedBatch<Op, B>)

Here, the third grid dimension (blockIdx.z) is used to multiplex the B kernel operations in a single launch, with each thread block operating on its assigned batch element independently (Amoros et al., 9 Aug 2025, Li et al., 2020).
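
A minimal CUDA sketch of this pattern (illustrative only; the kernel name, arguments, and element-wise operation below are hypothetical and not taken from either paper) looks like:

__global__ void fusedBatchScale(float* const* outs, const float* const* ins,
                                const float* scales, int n) {
    const int b   = blockIdx.z;                        // batch plane selected by the grid's z dimension
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        outs[b][idx] = scales[b] * ins[b][idx];        // same op, independent data per plane
}

// Single launch covering all B independent invocations:
//   dim3 grid((n + 255) / 256, 1, B);
//   fusedBatchScale<<<grid, 256>>>(d_outs, d_ins, d_scales, n);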

This distinction is summarized below:

Technique | Data Dependency | Kernel Launch Pattern | Resource Optimization
Vertical Fusion | Dependent | Sequential | Intermediate data in registers; DRAM round-trips eliminated
Horizontal Fusion | Independent | Parallel/Batched | Kernel launch overhead minimized; memory access coalescing; increased parallelism

2. Implementation Methodologies

a) Compile-Time C++17 Template Composition

The Fused Kernel Library (Amoros et al., 9 Aug 2025) implements HF via reusable, compile-time-fusible components in C++17. Kernels are constructed from the following abstractions:

  • MOp/COp: Memory and compute operations, as connectable building blocks.
  • IOp: Instantiable operations carrying per-kernel or batched parameters.
  • DPP: Data-parallel pattern for thread organization.

HF is realized through templates such as:

template <typename Mop, size_t B>
struct BatchRead {
    typename Mop::ParamsType _ps[B];   // one parameter record per batch plane
    __device__ typename Mop::OutputType exec(Point t) const {
        return Mop::exec(t, _ps[t.z]); // t.z selects the batch plane
    }
};

The thread organization uses the grid's third dimension (blockIdx.z) to allocate a unique batch plane to each kernel instance. All operations are resolved statically; the actual device code (SASS) contains no dispatch logic, resulting in maximal efficiency.
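
As a hedged illustration of how such a component plugs in (the concrete operation below is hypothetical and not part of the library; it assumes Point exposes x, y, z coordinates), a memory operation only needs to expose ParamsType, OutputType, and a static exec:

// Hypothetical per-plane read operation compatible with BatchRead.
struct PlaneRead {
    struct ParamsType { const float* src; int pitch; };
    using OutputType = float;
    static __device__ OutputType exec(Point t, const ParamsType& p) {
        return p.src[t.y * p.pitch + t.x];   // strided 2D load within one plane
    }
};

// BatchRead<PlaneRead, 8> then carries eight ParamsType records and selects one
// via t.z; the dispatch is resolved entirely at compile time.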

b) Source-to-Source Compiler Transformation

HFuse (Li et al., 2020) implements HF by partitioning thread blocks into intervals, each corresponding to a distinct kernel. The transformation:

  1. Defines a fused kernel with $blockDim_{fused} = d_1 + d_2$ for kernels $K_1$ and $K_2$.
  2. Rewrites thread indexing and block dimensions so that threads with $tid < d_1$ execute $K_1$ and threads with $tid \ge d_1$ execute $K_2$.
  3. Rewrites synchronization barriers to ensure thread-subset correctness using PTX "bar.sync" instructions.
  4. Empirically profiles candidate configurations for optimal thread splits and register allocations.

This approach increases the number of eligible warps for the hardware’s warp scheduler, enabling efficient latency hiding through interleaving of instruction streams.
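
A hand-written sketch of what the transformed kernel roughly looks like (illustrative; actual HFuse output is generated automatically and also handles shared memory and barrier rewriting) is shown below:

// Sketch of a horizontally fused kernel in the HFuse style: the first d1 threads
// of each block run kernel A's body, the remaining d2 threads run kernel B's body.
__global__ void fused_A_B(float* a_out, const float* a_in, int a_n,
                          float* b_out, const float* b_in, int b_n) {
    const int d1 = 256;                               // example thread split
    if (threadIdx.x < d1) {
        // Body of A, indices remapped to the interval [0, d1)
        int i = blockIdx.x * d1 + threadIdx.x;
        if (i < a_n) a_out[i] = 2.0f * a_in[i];
    } else {
        // Body of B, thread index shifted down by d1
        int tid = threadIdx.x - d1;
        int d2  = blockDim.x - d1;
        int i   = blockIdx.x * d2 + tid;
        if (i < b_n) b_out[i] = b_in[i] + 1.0f;
    }
}
// Launched with blockDim.x = d1 + d2 so both instruction streams coexist in each
// block, giving the warp scheduler more independent warps to interleave.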

3. Performance Models and Experimental Results

The theoretical and observed benefits of HF are quantified as follows:

Let B be the batch size (number of planes).

  • Unfused sequential execution:

$T_{seq}(B) = B \cdot [T_{launch} + T_{mem} + T_{compute}]$

  • Horizontally fused execution:

$T_{HF}(B) \approx T_{launch} + [T_{mem} + T_{compute}]$

  • Ideal speedup:

$S(B) = \frac{T_{seq}(B)}{T_{HF}(B)} \approx B$ (until DRAM bandwidth is saturated) (Amoros et al., 9 Aug 2025).
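
As a purely illustrative calculation with hypothetical timings (not measurements from either paper): taking $T_{launch} = 5\,\mu\mathrm{s}$, $T_{mem} + T_{compute} = 20\,\mu\mathrm{s}$, and $B = 100$ gives $T_{seq} = 100 \cdot 25\,\mu\mathrm{s} = 2.5\,\mathrm{ms}$ versus $T_{HF} \approx 25\,\mu\mathrm{s}$, i.e. $S \approx 100$, tracking the ideal $S(B) \approx B$ until DRAM bandwidth saturates.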

Empirical findings demonstrate:

  • HF-only delivers speedups from 1× up to 66× for batch sizes 1 ≤ B ≤ 1191 (60×120 images).
  • Combined VF+HF yields speedups of up to 20,931× on an RTX 4090 compared to OpenCV-CUDA (single pipeline), and up to 2,527× compared to CUDA Graphs (Amoros et al., 9 Aug 2025).
  • HFuse measured speedups from 2.5% to 60.8%, with maximum gains for kernel pairs characterized by complementary resource utilization (e.g., memory-bound fused with compute-bound) (Li et al., 2020).

4. Algorithmic Structures and Pseudocode

The Fused Kernel Library uses compile-time recursion over variadic template instantiations. For horizontal fusion, the core pseudocode for batch reads is:

template <typename Mop, size_t B>
struct BatchRead {
    typename Mop::ParamsType _ps[B];   // one parameter record per batch plane
    __device__ typename Mop::OutputType exec(Point t) const {
        return Mop::exec(t, _ps[t.z]); // t.z selects the batch plane
    }
};
The host prepares arrays of parameter records, instantiates the fused read and write ops, specifies dim3 grid(..., ..., B), and launches the kernel. Each device thread dispatches based on its z coordinate (t.z) (Amoros et al., 9 Aug 2025).
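
A hedged host-side sketch of that flow (BatchRead and PlaneRead as sketched above; devicePlane() and fusedKernel are hypothetical placeholders, and the library's actual launch helpers may differ):

constexpr size_t B = 8;                               // number of batch planes
const int width = 60, height = 120;

BatchRead<PlaneRead, B> reads{};
for (size_t i = 0; i < B; ++i)
    reads._ps[i] = { devicePlane(i), /*pitch=*/width };   // one parameter record per plane

dim3 block(32, 8, 1);
dim3 grid((width  + block.x - 1) / block.x,
          (height + block.y - 1) / block.y,
          B);                                         // grid.z multiplexes the batch
// fusedKernel<<<grid, block>>>(reads /*, write ops, ... */);   // single launch for all B planes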

HFuse partitions block dimensions, substitutes thread indices, and guards thread intervals:

for each tid in [0, d₁ + d₂):
    if (tid < d₁) { run S₁ }
    else          { run S₂ }

Barriers and synchronization are rewritten using inline PTX to ensure correctness of intra-kernel and inter-kernel thread interactions (Li et al., 2020).
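
For instance, a hedged sketch of how a barrier restricted to one thread interval can be expressed with inline PTX (the barrier IDs and thread counts HFuse actually emits are chosen per configuration):

// Synchronize only num_threads threads on the named barrier barrier_id,
// instead of __syncthreads(), which would involve every thread in the block.
__device__ __forceinline__ void sync_interval(unsigned barrier_id, unsigned num_threads) {
    asm volatile("bar.sync %0, %1;" :: "r"(barrier_id), "r"(num_threads));
}

// Inside the fused kernel:
//   if (threadIdx.x < d1) { ...; sync_interval(1, d1); ... }
//   else                  { ...; sync_interval(2, d2); ... }
// PTX requires the thread count to be a multiple of the warp size (32).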

5. Practical Applications and Design Trade-Offs

HF is especially impactful for batched workloads common in deep learning and signal processing, where numerous independent instances of the same operation must be applied (e.g., batch image processing, elementwise transforms).

Key advantages:

  • Kernel launch overhead is eliminated for each batch element, reducing PCIe and launch latency.
  • Memory bandwidth utilization approaches theoretical peak as independent memory streams are coalesced.
  • Intermediate buffers are eliminated from DRAM, reducing overall memory requirements, particularly at higher resolutions; e.g., ~259 KB saved for 60×120 images (Amoros et al., 9 Aug 2025).

Trade-offs and limitations:

  • Register usage per-thread may limit occupancy; profiling is required to balance spill vs. parallelism.
  • Branch divergence is introduced by the guarded thread intervals, though the impact is minimal if the interval boundaries are warp-aligned.
  • Barrier rewriting relies on features like PTX “bar.sync,” which may not be portable across future ISAs (Li et al., 2020).
  • Maximum achievable speedup is bounded by hardware memory bandwidth and on-chip resource constraints.

6. Extensions, Integration, and Future Directions

Suggestions for maximizing HF utility include:

  • Integrating HF strategies as passes in existing ML compilers (TVM, XLA, Glow, TASO).
  • Extending multi-interval strategies beyond two kernels, to n-way partitioned blocks.
  • Combining HF with vertical fusion and ND-range or cooperative-group techniques for broader fusion granularity.
  • Leveraging automatic profiling to assess register consumption and shared-memory utilization, ensuring optimal occupancy and minimizing spill.

Evidence suggests that HF can be made robust, fully automated, and synergistic with existing cost-based GPU optimization frameworks (Amoros et al., 9 Aug 2025, Li et al., 2020).

7. Summary and Implications

Horizontal Fusion provides a complementary dimension to traditional vertical fusion by exploiting thread-level parallelism across independent kernel invocations. Realizations such as the Fused Kernel Library and HFuse demonstrate that the approach is feasible with both compile-time metaprogramming (requiring no custom compilers) and source-to-source compilation. HF is particularly effective for batched GPU workloads and accelerates compute pipelines wherever kernel launch overhead and memory bandwidth constraints are limiting factors. Its combination with vertical fusion leads to multiplicative performance gains and large reductions in intermediate memory footprint. Further integration into high-level compilers and the expansion to more dynamic workloads remains a significant area for development (Amoros et al., 9 Aug 2025, Li et al., 2020).

References (2)

  • Amoros et al., 9 Aug 2025.
  • Li et al., 2020.
