Macro-Kernel Fusion: Techniques & Impact

Updated 14 February 2026
  • Macro-kernel fusion is a technique that fuses multiple compute kernels into a single composite kernel to reduce memory overhead and enhance performance in high-performance computing workloads.
  • It minimizes off-chip memory traffic by maximizing on-chip data reuse and reducing kernel-launch overhead, which benefits GPU and deep learning applications.
  • Advanced compiler analyses and resource-aware scheduling enable macro-kernel fusion to efficiently handle complex operator chains and distributed workflows.

Macro-kernel fusion refers to the systematic transformation of multiple computational kernels—especially in GPU, deep learning, and high-performance computing workloads—into a single, composite “macro-kernel.” This composite kernel executes fused computational stages with minimal off-chip memory traffic, maximizing in-core data reuse and reducing kernel-launch overhead. Macro-kernel fusion exploits both hardware-level features (on-chip memory hierarchies and inter-core collective communication) and advanced compiler analyses to go beyond classic “micro-kernel” fusion, enabling end-to-end fusing of large operator chains, complex computational graphs, and even distributed workflows.
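
As a minimal sketch of the idea (hypothetical CUDA kernels, not drawn from any of the cited frameworks), the fragment below contrasts two chained elementwise stages launched separately, with the intermediate materialized in global memory, against a single fused macro-kernel that keeps the intermediate in a register:

```cuda
#include <cuda_runtime.h>

__global__ void scale(const float* x, float* tmp, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * x[i];               // intermediate written to DRAM
}

__global__ void add_bias(const float* tmp, float* y, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = tmp[i] + b;               // intermediate read back from DRAM
}

// Fused macro-kernel: one launch, the intermediate never leaves a register.
// DRAM traffic drops from 4*n floats (read x, write tmp, read tmp, write y)
// to 2*n floats (read x, write y).
__global__ void scale_add_bias_fused(const float* x, float* y,
                                     float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a * x[i];                     // stage 1, held on-chip
        y[i] = t + b;                           // stage 2 consumes it directly
    }
}
```

For this two-stage chain the fused version halves DRAM traffic and removes one kernel launch; macro-kernel fusion applies the same principle to much longer chains, using shared memory or DSM tiles rather than single registers.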

1. Principles and Motivation

Macro-kernel fusion targets the chronic memory-bandwidth bottleneck in modern computational architectures, where computation throughput outpaces memory subsystem improvements. By fusing operations, it minimizes global memory transactions and maximizes use of registers, scratchpad (shared) memory, and inter-core communication channels.

Key objectives:

  • Locality maximization: Intermediate results are held in registers or on-chip buffers, avoiding DRAM round-trips.
  • Kernel-launch minimization: Fewer launches reduce PCIe/API overhead and sustain persistent computation paths.
  • Exploitation of advanced on-chip resources: On architectures like NVIDIA Hopper (H100), distributed shared memory (DSM) spanning multiple SMs enables fusion beyond single-SM buffer limits (Huang et al., 15 Dec 2025).
  • Programmability: Automated or template-based fusion methods allow general users and library developers to realize these gains without manual kernel engineering (Amoros et al., 9 Aug 2025, Filipovič et al., 2013).

2. Algorithms and Architectures

2.1. Fusion Abstractions and Patterns

Two fundamental patterns are recognized:

  • Vertical Fusion (VF): Sequentially dependent operations (e.g., chained BLAS1/2 kernels, computation pipelines) are fused so that each data element flows through the entire pipeline in a single pass, accumulating all transformations before final storage (Amoros et al., 9 Aug 2025, Filipovič et al., 2013).
  • Horizontal Fusion (HF): Multiple independent invocations of the same kernel (e.g., batched calls over disjoint data) are grouped into a single launch, maximizing occupancy and DRAM throughput (Amoros et al., 9 Aug 2025).
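
A hedged sketch of the horizontal pattern just described (the batched saxpy and the pointer-array indexing are illustrative choices, not any framework's actual API):

```cuda
#include <cuda_runtime.h>

// Unfused: this kernel is launched B times, once per (x, y) buffer pair.
__global__ void saxpy(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Horizontally fused: one launch covers all B independent calls; blockIdx.y
// selects the batch, so occupancy is driven by the whole set at once.
__global__ void saxpy_batched(const float* const* xs, float* const* ys,
                              float a, int n) {
    int b = blockIdx.y;                          // which independent invocation
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) ys[b][i] = a * xs[b][i] + ys[b][i];
}

// Launch (xs/ys are device arrays of B device pointers):
//   dim3 grid((n + 255) / 256, B);
//   saxpy_batched<<<grid, 256>>>(xs, ys, a, n);
```

The fused launch exposes all B × n elements to the hardware scheduler at once, which is what drives the occupancy and DRAM-throughput gains cited above.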

Fusion frameworks construct explicit or implicit representations to formalize fusibility:

  • DAG and dataflow graphs: Operator dependencies are encoded in directed acyclic graphs, supporting dependency analysis, fusibility checks, and partitioning strategies (Filipovič et al., 2013, Sewall et al., 2017).
  • Cluster/task graphs: For massive or persistent kernels, SM-granularity task graphs with explicit event dependencies enable end-to-end fusion at the mega-kernel scale (Cheng et al., 22 Dec 2025).
  • Intermediate Representations (IR): Systems for distributed fusion (e.g., Diffuse) model both distributed data and computation symbolically, allowing fusion across tasks and libraries without per-node materialization (Yadav et al., 2024).

2.2. Fusion Algorithmic Steps

Fusing kernels requires:

  • Dependency analysis: building a DAG or dataflow graph of the candidate operators and identifying producer–consumer chains (Section 2.1).
  • Legality (fusibility) checks: verifying that dependency types, synchronization requirements, and dataflow shape permit fusion; fan-out or non-pointwise reductions may block it.
  • Resource-aware mapping: assigning intermediates to registers, shared memory, or DSM while respecting per-SM capacity and occupancy targets.
  • Schedule and tile selection: choosing tile sizes and pipeline structure, typically via analytical models or a pruned empirical search.
  • Code generation: emitting the composite kernel through template metaprogramming, compiler IR lowering, or in-kernel task scheduling.
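
A hypothetical host-side sketch of these steps (the OpNode and plan_fusion_group names are illustrative, not taken from any cited system); it greedily extends a fusion group along a producer–consumer chain while legality and shared-memory budget checks hold:

```cuda
#include <cstddef>
#include <vector>

struct OpNode {
    bool   elementwise;      // pointwise ops are trivially fusible
    int    consumers;        // fan-out > 1 blocks simple chain fusion
    size_t scratch_bytes;    // on-chip bytes needed for its intermediate
    int    next;             // index of the single consumer, or -1
};

// Greedy chain fusion: extend the group while each op is elementwise,
// has a single consumer, and the group still fits the shared-memory budget.
std::vector<int> plan_fusion_group(const std::vector<OpNode>& dag,
                                   int start, size_t smem_budget) {
    std::vector<int> group;
    size_t used = 0;
    for (int i = start; i != -1; i = dag[i].next) {
        const OpNode& op = dag[i];
        if (!op.elementwise || op.consumers > 1) break;       // legality check
        if (used + op.scratch_bytes > smem_budget) break;     // resource check
        used += op.scratch_bytes;
        group.push_back(i);                                   // accept into macro-kernel
    }
    return group;   // handed off to code generation (templates, MLIR, ...)
}
```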

3. Key Techniques in Recent Frameworks

| Framework | Fusion Domain | Key Innovation |
| --- | --- | --- |
| FlashFuser (Huang et al., 15 Dec 2025) | Deep learning, GEMM | DSM-based collectives, tile-based analysis |
| MCFuser (Zhang et al., 27 Jun 2025) | Memory-bound operator chains | Exhaustive tile search + DAG hoisting |
| Fused Kernel Library (Amoros et al., 9 Aug 2025) | C++ GPU libraries | Compile-time VF/HF metaprogramming |
| Diffuse (Yadav et al., 2024) | Distributed/stateless tasks | IR-driven multi-task fusion, MLIR codegen |
| TGX/MPK (Cheng et al., 22 Dec 2025) | Multi-SM persistent kernels | SM-level task/event graphs, in-kernel scheduling |

FlashFuser expands the scale of feasible fusion by modeling SM clusters with DSM-backed communication abstractions (all-reduce, shuffle, reduce-scatter) and unifying resource mapping across tiles, fusing multi-GEMM chains previously impossible due to scratchpad limits (Huang et al., 15 Dec 2025). MCFuser systematically builds and prunes fusion search spaces using tiling expressions, DAG analysis, and analytical models, aggressively fusing memory-bound, compute-intensive (MBCI) operator chains (Zhang et al., 27 Jun 2025). The Fused Kernel Library’s compile-time approach facilitates on-demand, type-safe fusion for arbitrary operation chains with precise resource modeling (Amoros et al., 9 Aug 2025). TGX/MPK generalizes macro-kernel fusion to persistent, distributed mega-kernels whose scheduling, pipeline overlap, and dependency management are resolved purely intra-kernel (Cheng et al., 22 Dec 2025). Diffuse applies macro-kernel fusion to distributed, task-based programming via a scale-free distributed IR, enabling massive kernel- and task-fusion across both library and function boundaries (Yadav et al., 2024).
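
As a generic illustration of the DSM mechanism that FlashFuser-style fusion builds on (this is not FlashFuser's code; the reduction is deliberately simplified and the kernel requires compilation for sm_90), two thread blocks in a cluster combine partial results through distributed shared memory instead of a global-memory round-trip:

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1)
cluster_pair_sum(const float* x, float* out, int n) {
    __shared__ float partial;
    cg::cluster_group cluster = cg::this_cluster();

    // Each block reduces its strided slice (serial sum by thread 0 for brevity).
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = blockIdx.x; i < n; i += gridDim.x) s += x[i];
        partial = s;
    }
    cluster.sync();   // partials are now visible across the cluster

    // Rank 0 reads its peer's shared memory directly over DSM and combines.
    if (cluster.block_rank() == 0 && threadIdx.x == 0) {
        float* peer = cluster.map_shared_rank(&partial, 1);
        out[blockIdx.x / 2] = partial + *peer;   // no DRAM staging of partials
    }
}
```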

4. Empirical Impact and Performance

Macro-kernel fusion often yields multi-fold improvements in both raw kernel performance and end-to-end throughput.

  • FlashFuser (Huang et al., 15 Dec 2025): On NVIDIA H100, reduces DRAM traffic by 58%, delivers up to 4.1× kernel speedup over state-of-the-art compilers, and achieves 1.24× end-to-end speedup on LLM workloads.
  • MCFuser (Zhang et al., 27 Jun 2025): On NVIDIA A100/RTX3080, achieves up to 5.9× kernel speedup over Ansor and reduces tuning time by up to 139×; end-to-end BERT inference speedups average 1.45×.
  • Fused Kernel Library (Amoros et al., 9 Aug 2025): Reports speedups of up to 185× from vertical fusion, 66× from horizontal fusion, and over 20,000× for combined macro-kernels on high-FLOP/byte hardware, along with dramatic reductions in CPU-side overhead.
  • HFAV (Sewall et al., 2017): 2–4× speedups for bandwidth-bound nested loops compared to auto-vectorized code, and competitive with hand-tuned routines.
  • Diffuse (Yadav et al., 2024): 1.86× geometric mean application-level speedup across up to 128 GPUs, with cases (Black–Scholes) exceeding 10×.

Performance gains depend strongly on how much of the workload is memory-bound, how well on-chip resources are utilized, and how effectively the fusion planner prunes its search space, whether analytically or empirically.

5. Implementation Constraints and Limitations

Macro-kernel fusion is governed by several intrinsic constraints:

  • On-chip resource limits: Excessive fusion can exhaust registers, shared/DSM allocation, or increase code size, reducing occupancy and potentially negating benefits (Amoros et al., 9 Aug 2025, Huang et al., 15 Dec 2025); see the occupancy-check sketch after this list.
  • Fusibility constraints: Dependency types (e.g., fan-out, non-pointwise reduction), required synchronization, or dataflow shape may prevent legal fusion (Adnan et al., 2015, Yadav et al., 2024).
  • Hardware specificity: Techniques exploiting DSM, task-level persistent scheduling, or advanced collectives may not generalize across GPU architectures or require fallback variants (Huang et al., 15 Dec 2025, Cheng et al., 22 Dec 2025).
  • Complexity in distributed settings: Large-scale or multi-library distributed workloads require careful analysis of partitioning, symbolic dependencies, and communication steps to avoid introducing illegal data races or excessive synchronization (Yadav et al., 2024).
  • Algorithmic domain specificity: Many high-performance implementations are tailored or most effective for linear operator chains, tensor contractions, and specific “hot path” dataflows (Zhang et al., 27 Jun 2025, Sewall et al., 2017).
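
A hedged sketch of acting on the first constraint above: before adopting a fused kernel, query how its shared-memory footprint affects occupancy (the fused_chain kernel and its sizes are illustrative, not from any cited framework):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fused_chain(const float* x, float* y, int n) {
    extern __shared__ float stage[];             // scratch for fused intermediates
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        stage[threadIdx.x] = 2.0f * x[i];        // stage 1 stays on-chip
        y[i] = stage[threadIdx.x] + 1.0f;        // stage 2 consumes it directly
    }
}

int main() {
    int blockSize = 256;
    size_t dynSmem = blockSize * sizeof(float);  // per-block scratch for the fused stages

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, fused_chain,
                                                  blockSize, dynSmem);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = float(blocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("fused kernel occupancy: %.2f\n", occupancy);
    // If occupancy collapses, split the chain or spill some intermediates.
    return 0;
}
```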

6. Applications and Broader Implications

Macro-kernel fusion has expanded the scope of what is feasible in on-chip pipeline design for deep learning, scientific simulation, and massive distributed analytics.

  • Deep learning operators: Multi-GEMM, FFN, and attention module fusion with in-core accumulation is now routine in LLM and transformer inference (Huang et al., 15 Dec 2025, Zhang et al., 27 Jun 2025).
  • Stencils and PDE solvers: Fusion of flux evaluations, divergence assembly, and update steps in a single pass yields near-roofline performance in numerical simulation codes (Trojak et al., 2021, Sewall et al., 2017); a one-dimensional sketch follows this list.
  • Sparse and iterative solvers: Pipelined macros reduce host-device barriers and redundant loads/stores, accelerating small-to-medium system solves (Rupp et al., 2014).
  • Distributed and persistent workflows: Automated task/kernel fusion in systems like Diffuse and MPK reduces launch overheads and improves end-to-end resource utilization, enabling high-level languages and library design to compete with optimized MPI (Yadav et al., 2024, Cheng et al., 22 Dec 2025).
  • AutoML and feature learning: Deep learning architectures apply macro-fusion to learned kernel composition and fusion, as in multiple-kernel learning and network regularization (Song et al., 2016).
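
A hedged one-dimensional illustration of the stencil pattern above (the central-average "flux" is a placeholder, not a particular numerical scheme): flux evaluation and the state update are fused into one pass, so the flux array is never materialized in global memory.

```cuda
#include <cuda_runtime.h>

__global__ void fused_flux_update(const float* u, float* u_new,
                                  float dt_over_dx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        float flux_left  = 0.5f * (u[i - 1] + u[i]);    // would be kernel 1 unfused
        float flux_right = 0.5f * (u[i] + u[i + 1]);
        u_new[i] = u[i] - dt_over_dx * (flux_right - flux_left);  // update fused in
    }
}
```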

7. Methodological Guidelines and Best Practices

Best practices for macro-kernel fusion derived from empirical and algorithmic studies include:

  • Fuse where chains are memory-bound: the largest gains come from eliminating DRAM round-trips for intermediates, not from fusing compute-bound stages.
  • Respect on-chip budgets: track register, shared-memory, and DSM usage of the fused kernel and back off when occupancy would collapse.
  • Verify legality via dataflow analysis: encode operator dependencies in a DAG or IR and reject groups with problematic fan-out, reductions, or synchronization.
  • Prune the search space: use analytical performance models or lightweight empirical search to select tile sizes and fusion boundaries rather than exhaustive tuning.
  • Prefer automated or template-based fusion frameworks over hand-written fused kernels, and provide fallbacks for hardware-specific features such as DSM or persistent scheduling.

Macro-kernel fusion synthesizes compiler theory, system-level resource modeling, and domain-specialized algorithmic design to enable scalable, high-performance data-locality across the entire stack, shifting workloads from memory-bound to compute-bound and narrowing the gap to hardware limits.
