Automated Kernel Fusion Compilers
- Automated kernel fusion compilers are systems that transform multiple GPU kernel calls into a single optimized kernel by leveraging static metaprogramming, polyhedral transformation, and JIT methodologies.
- They enhance device utilization by reducing data movement, improving memory locality, and optimizing resource management through vertical and horizontal fusion models.
- These compilers achieve significant speedups in deep learning and scientific computing by applying cost-aware schedule selection and dynamic synchronization while managing register and shared-memory constraints.
Automated kernel fusion compilers transform a sequence or collection of GPU kernel calls into a smaller set of highly optimized kernels, ideally a single one. This optimization improves device utilization, enhances memory locality, and reduces both data-movement and kernel-launch overheads across a range of workloads, particularly in deep learning, scientific computing, and high-throughput data analytics. Recent compilers advance beyond template-based or hand-crafted approaches by exploiting static metaprogramming, polyhedral transformation, domain-specific intermediate representations, and autotuning or analytical models to deliver on-demand, hardware-aware fusion for pointwise, reduction-heavy, and memory- or compute-bound operators. These systems, illustrated by frameworks such as the Fused Kernel Library (FKL), FusionStitching, Neptune, and others, represent the present state of the art in automated kernel fusion for GPUs (Amoros et al., 9 Aug 2025, Zheng et al., 2020, Zhao et al., 9 Oct 2025).
1. Abstraction Principles and Fusion Interfaces
Modern automated kernel fusion compilers decompose the problem into layers: high-level composable abstractions, static or JIT-activated fusion logic, and code generation. FKL, for instance, introduces three abstraction layers (sketched in code after the list):
- Ops: Zero-state C++ structs representing atomic memory (MemOp) or compute (ComputeOp) operations, each with a pure device-side exec(...) method operating on non-pointer PODs destined for register allocation.
- Instantiable Ops (IOp): Lightweight wrappers that bind parameters at runtime, supporting varying arities and enclosing shape or pointer metadata.
- Data-Parallel Patterns (DPP): High-level skeletons (e.g., TransformDPP) that compose chains of IOps into a single global kernel, ensuring pure register flow for intermediate data.
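A minimal sketch of how these layers might compose, using the terminology above. The struct layouts, names, and signatures (Mul, Add, IOp, transform_dpp) are illustrative assumptions rather than FKL's actual API:

```cpp
#include <cuda_runtime.h>

// ComputeOps: stateless structs with a pure device-side exec over register-resident PODs.
struct Mul {
    struct Params { float factor; };
    __device__ static float exec(float x, Params p) { return x * p.factor; }
};
struct Add {
    struct Params { float addend; };
    __device__ static float exec(float x, Params p) { return x + p.addend; }
};

// IOp: lightweight wrapper binding runtime parameters to a stateless Op.
template <typename Op>
struct IOp {
    typename Op::Params params;
    __device__ float operator()(float x) const { return Op::exec(x, params); }
};

// TransformDPP-like skeleton: one global kernel that threads each element
// through the whole IOp chain in registers (one global read, one global write).
template <typename... Chain>
__global__ void transform_dpp(const float* in, float* out, int n, Chain... ops) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];            // sole read from device-global memory
    ((v = ops(v)), ...);        // C++17 fold: fused, fully inlined op chain
    out[i] = v;                 // sole write to device-global memory
}
```

A fused launch then looks like `transform_dpp<<<grid, 256>>>(d_in, d_out, n, IOp<Mul>{{2.0f}}, IOp<Add>{{1.0f}});`, with all intermediate values held in registers.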
Other systems, such as Neptune, employ a schedule-driven, operator DAG-based IR in which fusion opportunities are made explicit via user- or template-guided transformations on loop nests, reduction patterns, and elementwise chains (Zhao et al., 9 Oct 2025). In frameworks like FusionStitching, the input graph is partitioned into potential fusion candidates based on DAG analysis and then grouped into patterns amenable to codegen and memory reuse optimization (Zheng et al., 2020).
2. Compile-Time and JIT Fusion Methodologies
Approaches to fusion span the static–dynamic continuum:
- Static Metaprogramming: FKL exclusively uses C++17 template metaprogramming and type traits to build a compile-time chain of exec evaluations, statically unrolling kernel logic along user-defined sequences. This approach leverages recursive template expansion, static_for-style reflection, and type-dispatch constructs to achieve full fusion at compile time, with no custom compiler or runtime reflection (Amoros et al., 9 Aug 2025); see the sketch after this list.
- JIT and Dynamic Codegen: Compilers like FusionStitching (Zheng et al., 2020) or Neptune (Zhao et al., 9 Oct 2025) operate in a dynamic environment where fusion decisions depend on runtime shape or dependency analysis. FusionStitching, for example, employs an approximate dynamic-programming explorer and beam search to extract optimal patterns, then generates CUDA or LLVM IR via code templates and schedule enumeration.
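As an illustration of the static route, a fused chain can be unrolled purely through recursive template expansion, with no runtime dispatch. The `FusedChain` helper below is hypothetical and assumes Op structs following the `exec`/`Params` convention from the earlier sketch:

```cpp
// Recursive template expansion: each instantiation inlines one exec call,
// so the entire fused body is resolved at compile time.
template <typename First, typename... Rest>
struct FusedChain {
    template <typename... ParamsRest>
    __device__ static float apply(float v, typename First::Params p,
                                  ParamsRest... ps) {
        return FusedChain<Rest...>::apply(First::exec(v, p), ps...);
    }
};

// Base case: a single remaining Op terminates the recursion.
template <typename Last>
struct FusedChain<Last> {
    __device__ static float apply(float v, typename Last::Params p) {
        return Last::exec(v, p);
    }
};

// Usage inside a kernel body (Mul/Add as in the previous sketch):
//   float y = FusedChain<Mul, Add>::apply(x, Mul::Params{2.0f}, Add::Params{1.0f});
```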
Both strategies implement cost-aware schedule selection and legality checks on shared-memory and register pressure, and insert any required explicit synchronization (e.g., via inline PTX bar.sync, as sketched below) as dictated by communication or parallelism constraints (Li et al., 2020).
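For illustration, an explicit named-barrier synchronization can be emitted as inline PTX. The helper below is a generic sketch (the barrier id and thread count are placeholder values), not the exact code any of these compilers generates:

```cpp
// Named-barrier sync over a subset of a block's threads, useful when
// horizontally fused sub-kernels must synchronize independently of each other.
__device__ __forceinline__ void partial_barrier() {
    asm volatile("bar.sync 1, 128;" : : : "memory");
}
```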
3. Horizontal and Vertical Fusion Models
Automated kernel fusion recognizes two main forms:
- Vertical Fusion (VF): Classic sequential chaining, where outputs of one kernel feed the next, enabling removal of intermediate device-global memory accesses. FKL's TransformDPP chains arbitrary sequences of IOps such that all intermediates persist in registers, not DRAM, until the terminal write.
- Horizontal Fusion (HF): Merges independent or weakly coupled kernels by interleaving their thread spaces within a single kernel launch. Systems such as HFuse (Li et al., 2020) and FKL's BatchRead/BatchWrite pattern generate composite blocks where, for example, blockIdx.z indexes independent work items (see the sketch after this list). This concurrently raises thread-level parallelism and can hide instruction- or memory-bound stalls by sharing SM resources.
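A minimal horizontal-fusion sketch in the spirit of the BatchRead/BatchWrite pattern, where the grid's Z dimension selects an independent work item; the kernel and parameter names are illustrative:

```cpp
#include <cuda_runtime.h>

// Horizontally fused elementwise kernel: each (x, z) coordinate handles element
// x of independent batch z, so many small launches collapse into one.
__global__ void hfused_muladd(const float* const* in, float* const* out,
                              int n, float factor, float addend) {
    const int i     = blockIdx.x * blockDim.x + threadIdx.x;
    const int batch = blockIdx.z;              // independent work-item index
    if (i >= n) return;
    out[batch][i] = in[batch][i] * factor + addend;
}

// Launch: Z-dimension batching replaces num_batches separate kernel launches.
// dim3 grid((n + 255) / 256, 1, num_batches);
// hfused_muladd<<<grid, 256>>>(d_in_ptrs, d_out_ptrs, n, 2.0f, 1.0f);
```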
The synergy between vertical and horizontal fusion enables, for example, FKL to process up to 50 batches of 10,000 Mul+Add operations with speedups up to 20,900× relative to back-to-back launches on current NVIDIA hardware (Amoros et al., 9 Aug 2025). Launch parameters are configured automatically at compile time (via constexpr grid/block inference); horizontal fusion can target optimal Z-dimension batching until architectural limits are reached.
4. Memory Locality, Register Pressure, and Fusion Constraints
A defining property of automated fusion is its impact on memory–register locality:
- Register-Resident Intermediates: By structuring fusion such that IOps only pass POD values in arguments/returns, intermediates remain on-chip, eliminating global memory spill except at input/output boundaries.
- Absence of Intermediate Buffers: In a fused chain of N kernels, DRAM traffic is effectively reduced from roughly N reads and N writes (one of each per kernel) to 1 read and 1 write. For bandwidth-bound chains the attainable speedup is thus on the order of N, subject to compute or other bottlenecks (Amoros et al., 9 Aug 2025).
- Register/Shared-Memory Constraints: The total registers or SHMEM per thread/block must not exceed device limits. Compilers statically or dynamically estimate the sum across all fused operators and, if the budget is exceeded, fall back to smaller fusion units to avoid spilling and performance regression (a host-side sketch of such a check follows this list).
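The budget check can be sketched on the host side as a simple aggregation against device limits. The additive per-op register model and the 256-thread block assumption below are placeholders, not any particular compiler's policy:

```cpp
#include <cuda_runtime.h>
#include <numeric>
#include <vector>

// Decide whether a candidate chain of ops fits the per-thread register budget;
// if not, the planner falls back to splitting the chain into smaller units.
bool fits_register_budget(const std::vector<int>& regs_per_op, int device_id) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device_id);

    // Placeholder model: assume fused register use is roughly additive and
    // cap per-thread registers so a 256-thread block still fits in one block's budget.
    const int est_regs = std::accumulate(regs_per_op.begin(), regs_per_op.end(), 0);
    const int regs_per_thread_limit = prop.regsPerBlock / 256;
    return est_regs <= regs_per_thread_limit;
}
```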
FusionStitching extends the model to multiple levels of data reuse: thread-local (register), warp-level (register-shuffle), and block-level (shared memory), with the code generator selecting among them by dominance-tree lifetime analysis and occupancy-aware cost modeling (Zheng et al., 2020).
5. Fusion Decision Algorithms and Code Generation
Automated fusion compilers implement multi-stage planning:
- Pattern Enumeration and Cost Modeling: For any fusion candidate (subgraph), the compiler considers the legal orderings, evaluates predicted register/shared-memory consumption, and scores candidates via models incorporating memory traffic, compute time, launch overhead, and occupancy (with occupancy taken as the minimum across limiting factors such as registers, shared memory, and threads per SM). For example, FKL relies on compile-time NVCC estimation; FusionStitching and MCFuser integrate analytical or low-overhead cost models (Amoros et al., 9 Aug 2025, Zheng et al., 2020, Zhang et al., 27 Jun 2025). A stripped-down cost-model sketch follows this list.
- Code Emission: FKL emits a single global C++ kernel for each fused pattern; HFuse emits code generated via AST transformation passes that guard and branch kernel-specific bodies according to thread/block indices (Li et al., 2020). JIT-based compilers (e.g., FusionStitching, Neptune) lower fused DAGs into LLVM IR/PTX or CUDA C, binding memory and thread mappings per pattern.
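A stripped-down version of such a cost model, with occupancy taken as the minimum across limiting factors and a traffic-plus-launch-overhead score. All constants, field names, and the scoring form are illustrative assumptions:

```cpp
#include <algorithm>

// Illustrative occupancy estimate: blocks per SM are limited separately by
// registers, shared memory, and the hardware block cap; occupancy follows the
// tightest constraint.
struct KernelEstimate {
    int regs_per_thread, smem_per_block, threads_per_block;
    double bytes_moved, launch_overhead_s;
};

double occupancy(const KernelEstimate& k, int regs_per_sm, int smem_per_sm,
                 int max_blocks_per_sm, int max_threads_per_sm) {
    const int by_regs = regs_per_sm /
                        std::max(k.regs_per_thread * k.threads_per_block, 1);
    const int by_smem = k.smem_per_block ? smem_per_sm / k.smem_per_block
                                         : max_blocks_per_sm;
    const int blocks  = std::min({by_regs, by_smem, max_blocks_per_sm});
    return double(blocks * k.threads_per_block) / max_threads_per_sm;
}

// Score a fusion candidate: predicted memory time plus launch overhead, scaled
// by how well the fused kernel keeps the SMs occupied (lower is better).
double score(const KernelEstimate& k, double bandwidth_bytes_per_s, double occ) {
    const double mem_time = k.bytes_moved / bandwidth_bytes_per_s;
    return (mem_time + k.launch_overhead_s) / std::max(occ, 1e-6);
}
```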
Legal fusion is determined by data dependence (no global memory barrier required; no cross-thread or cross-operator races), shape compatibility, and satisfaction of per-kernel resource budgets. Thread-local, shared, and global synchronizations are inserted only as needed for correctness.
6. Performance Evaluation and Empirical Results
Empirical studies on automated kernel fusion compilers demonstrate significant speedups across a variety of hardware and workload types:
| System | Workloads | Peak Speedup | Average Speedup | Notable Outcomes |
|---|---|---|---|---|
| Fused Kernel Library (FKL) (Amoros et al., 9 Aug 2025) | Mul+Add chains, image batches | 20,900× | 2×–1,000× | Up to 50×–200× for single/double precision |
| FusionStitching (Zheng et al., 2020) | BERT, DIEN, Transformer, etc. | 2.21× | 1.45× | 7,000 GPU-hr/month saved in prod. cluster |
| HFuse (Li et al., 2020) | Deep-learning, crypto | 60.8% | up to 2× | Best when fusing memory + compute-bound |
| DLFusion (Liu et al., 2020) | ONNX DNNs (ResNet, VGG, etc.) | 7.9× | — | Speedup close to oracle, tuning in ms |
| Neptune (Zhao et al., 9 Oct 2025) | Self-attention, reductions | 1.35× | 35% over FlashAttention | Cross-GPU, reduction-tiled fusion |
| MCFuser (Zhang et al., 27 Jun 2025) | GEMM chains, BERT, attention | 5.9× | — | 70× tuning speedup over Ansor |
The systems operate at or near the memory–bandwidth roofline, with performance gains being maximized when fusion eliminates global memory round trips for bandwidth-bound workloads. In compute-bound settings, fusion is limited by increased register pressure and the potential onset of streaming-multiprocessor occupancy limitations. Diminishing or negative returns are observed when over-fusing compute-heavy operators, emphasizing the importance of analytical or empirical models for fusion planning.
7. Limitations, Extensions, and Future Directions
Automated kernel fusion compilers face several known limitations:
- Resource Budget Overflows: Excessive vertical fusion may breach register or shared-memory budgets, causing spills and loss of performance (Amoros et al., 9 Aug 2025, Zheng et al., 2020).
- Complex Pattern Coverage: Most production compilers currently implement only a subset of parallel patterns (e.g., TransformDPP in FKL, reduction-centric fusion in Neptune). Generalization to arbitrary DAGs or mixed parallelism (e.g., convolutions, n-ary reductions, irregular sparse operators) requires additional custom DPPs or IR transformations (Amoros et al., 9 Aug 2025, Zhao et al., 9 Oct 2025, Cheshmi et al., 2021).
- Shape and Static Analysis Requirements: Some systems, notably Neptune, require compile-time shape knowledge for maximal fusion; dynamic-shape JITs and polyhedral compilers aim to relax this but may incur large compile times or code size (Zhao et al., 9 Oct 2025, Zheng et al., 2020).
- Hardware Portability: Although the methods generalize in principle, analytical and regression-based cost models must be retuned to device-specific register/shared-memory budgets, tile sizes, and specialized accelerator features (e.g., tensor cores) (Zhang et al., 27 Jun 2025, Bhaskaracharya et al., 2020).
- Autotuning Overheads: Despite efforts like MCFuser’s analytical search, exhaustive fusion and schedule tuning over high-dimensional parameter spaces can remain compute-intensive for large operator graphs; evolutionary or hierarchical search mitigates but does not eliminate these costs (Zhang et al., 27 Jun 2025).
- Challenging Operator Types: Full cross-operator fusion for reduction-heavy, loop-carried, or non-homogeneous parallelism (e.g., softmax, prefix-sum, sparse iterators) requires advanced algebraic correction or dependency-breaking techniques, as demonstrated in Neptune (Zhao et al., 9 Oct 2025).
Ongoing and future research targets modular DPP libraries, improved cross-DAG fusion, dynamic cost model integration, generalized dependency analysis (for sparse/irregular dataflows), and hardware-agnostic portability layers.
Automated kernel fusion compilers, exemplified by FKL and related systems, have fundamentally altered best practices for GPU code generation, enabling transparent, on-demand, and near-optimal fusion for complex operator pipelines while exposing minimal manual complexity to library users and significantly raising hardware efficiency (Amoros et al., 9 Aug 2025, Zheng et al., 2020, Li et al., 2020, Zhao et al., 9 Oct 2025).