
FlashFuser: DSM-Enhanced Kernel Fusion

Updated 22 December 2025
  • FlashFuser is a compiler framework that uses distributed shared memory (DSM) on modern GPUs to fuse compute-intensive deep learning operators.
  • It implements novel DSM communication primitives and a dataflow analyzer to minimize global memory traffic and expedite kernel execution.
  • The unified search engine prunes vast design spaces, achieving significant kernel speedups and reducing DRAM traffic in memory-bound workloads.

FlashFuser is a compiler framework that expands the scope of kernel fusion for compute-intensive deep learning operators by leveraging the Distributed Shared Memory (DSM) capabilities of contemporary GPUs such as the NVIDIA H100. Classic kernel fusion techniques are constrained by the limited scratchpad memory (SMEM) on each streaming multiprocessor (SM), causing fusion to fail for operators with large intermediate tensors (e.g., the two successive GEMMs in Transformer Feed-Forward Networks, FFNs). FlashFuser introduces a DSM-based communication abstraction, a dataflow analyzer generalized to the distributed memory hierarchy, and a unified analytical-plus-empirical search engine, which together enable substantial reductions in global memory access and yield significant kernel and end-to-end speedups in memory-bound workloads (Huang et al., 15 Dec 2025).

1. Hardware Context and Bottlenecks

The compute/memory bandwidth scaling disparity has created a pronounced memory wall on recent GPUs. The NVIDIA H100 delivers a peak FP16 throughput of approximately 1000 TFLOPS against 3 TB/s of HBM bandwidth; relative to the previous-generation A100 (300 TFLOPS, 2 TB/s), compute throughput grew 3.3× while bandwidth grew only 1.5×. This widening gap renders tensor-dense operators, such as the FFN components in large transformer models, memory-bound: at sequence length 512, FFN layers can account for 40%–60% of total inference time.
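This gap can be made concrete with a roofline-style calculation. The sketch below uses only the throughput and bandwidth figures quoted above; the "ridge point" (the minimum arithmetic intensity at which a kernel becomes compute-bound) is a standard roofline quantity, not a number reported by the paper.

```python
def ridge_point(peak_flops: float, hbm_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / hbm_bandwidth

# Peak FP16 throughput (FLOP/s) and HBM bandwidth (bytes/s) quoted in the text.
a100 = ridge_point(300e12, 2e12)     # ~150 FLOP/byte
h100 = ridge_point(1000e12, 3e12)    # ~333 FLOP/byte

print(f"A100 ridge point: {a100:.0f} FLOP/byte")
print(f"H100 ridge point: {h100:.0f} FLOP/byte")
print(f"Ridge point grew {h100 / a100:.1f}x; kernels near the A100 ridge "
      f"point become memory-bound on H100")
```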

The H100 architecture introduces an L1.5 cache tier by interconnecting SMEM across SMs within a cluster, forming a DSM—a high-bandwidth, low-latency, on-chip memory pool accessible by all SMs in a cluster (up to 16 on H100). This augmentation enables higher aggregate scratchpad capacity per cluster than the 227 KB available on a single SM. However, legacy fusion frameworks (e.g., Chimera, BOLT, CUTLASS) are unable to exploit DSM, and abort fusion when intermediate data exceeds local SMEM. This behavior leads to frequent, expensive round-trips through global memory, observed as performance collapses for large-hidden-dimension FFN layers in models such as GPT-6.7B (Huang et al., 15 Dec 2025).

2. DSM-Based Communication Abstraction

FlashFuser formalizes a collective communication abstraction for DSM, encapsulating all intra-cluster data exchange required by fused GEMM chains and similar operator graphs. Three primitives are introduced:

  • dsm_all_exchange: Implements a general collective exchange, invoked after a GEMM reduction when the cluster tiling parameter cls_k > 1. Each block holds a partial sum $C^{(p)}_{i}$; dsm_all_exchange reduces or multiplies these partials to aggregate the full tile as $C_{i} = \bigoplus_{p} C^{(p)}_{i}$, with $\oplus$ being + or × depending on the operator type (see the sketch after this list).
  • dsm_shuffle: Redistributes matrix tiles from one block to many, enabling data realignment for subsequent fused GEMMs (e.g., mapping C-rows to blocks computing D×C).
  • dsm_reduce_scatter: After the second GEMM, executes a local reduction within shuffle groups, followed by a global atomic reduction via the Hopper TMA (cp.reduce.async.bulk) primitive if multi-cluster fusion is required.
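The functional semantics of these collectives can be modeled on the host with NumPy. The sketch below mirrors the primitives' names, but the list-of-arrays signatures and the toy cluster layout are illustrative assumptions; the real primitives move tiles between SM scratchpads within a cluster rather than operating on Python lists.

```python
import numpy as np

def dsm_all_exchange(partials, op=np.add):
    """Combine per-block partial tiles C_i^(p) into the full tile
    C_i = op(...) over all partials, where op is + or * depending on the operator."""
    out = partials[0].copy()
    for p in partials[1:]:
        out = op(out, p)
    return out

def dsm_shuffle(tiles, dest_of):
    """Redistribute tiles across blocks: block b receives tiles[dest_of[b]]
    (one-to-many realignment between the two GEMMs of a fused chain)."""
    return [tiles[dest_of[b]] for b in range(len(dest_of))]

def dsm_reduce_scatter(tiles, group_size):
    """Reduce tiles locally within each shuffle group; in the real kernel a
    global atomic reduction (TMA cp.reduce.async.bulk) would follow when
    fusing across clusters."""
    groups = [tiles[i:i + group_size] for i in range(0, len(tiles), group_size)]
    return [dsm_all_exchange(group) for group in groups]

# Toy usage: 4 blocks of a cluster each hold a partial sum of one 2x2 tile.
partials = [np.full((2, 2), float(p + 1)) for p in range(4)]
print(dsm_all_exchange(partials))   # elementwise sum: every entry equals 10.0
```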

Cluster and shuffle group formations are governed by parameters cls_shuffle = cls_l / cls_k and cls_reduce = (cls_n·cls_k) / cls_l. Empirical measurements of DSM bandwidth ($B_{dsm}$) and latency ($L_{dsm}$) as a function of cluster size inform the cost models. These primitives are designed to achieve near-peak DSM bandwidth utilization (up to 80–90% for clusters of size ≤ 8) (Huang et al., 15 Dec 2025).

3. Dataflow Analyzer and Multi-Level Scheduling

The dataflow analyzer in FlashFuser generalizes tensor mapping and scheduling across the full device memory hierarchy: registers → SMEM → DSM → global memory. The analyzer operates on a fused subgraph $g$ with loop nest dimensions $X = \{x_0, \ldots, x_{J-1}\}$ (e.g., $M, N, K, L$), a candidate schedule $s$ (a permutation of $X$ with spatial/temporal annotations), candidate tile sizes $t$, and hardware-specific cache/bandwidth parameters.

The pseudo-algorithm systematically computes, for each tensor $T$, its block footprint, identifies its placement and potential spilling across memory levels, and calculates the corresponding data volumes $D_V = \{V_{reg}, V_{smem}, V_{dsm}, V_{global}\}$. The central cost model for candidate plan evaluation is

$$C_l = \frac{V_l}{B_l}$$

for all memory levels ll, and the objective is

$$\min_{tile,\ schedule}\ \max_{l}\ \left( V_l / B_l \right)$$

subject to the per-level utilization $U_l(tile)$ not exceeding the hardware capacity $Cap_l$. This enables cost-predictive plan evaluation prior to compilation and hardware profiling.
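A minimal sketch of this plan-evaluation step is shown below. It assumes each candidate plan has already been summarized by its per-level data volumes and utilizations; the derivation of footprints from tile sizes and schedules is elided, and the bandwidth/capacity numbers are placeholders rather than measured H100 values.

```python
from dataclasses import dataclass

LEVELS = ("reg", "smem", "dsm", "global")
# Placeholder bandwidths (bytes/s) and capacities (bytes), not measured values.
BANDWIDTH = {"reg": 100e12, "smem": 20e12, "dsm": 10e12, "global": 3e12}
CAPACITY = {"reg": 256 << 10, "smem": 227 << 10,
            "dsm": 16 * (227 << 10), "global": float("inf")}

@dataclass
class Plan:
    tile: dict          # e.g. {"m": 128, "n": 128, "k": 64, "l": 128}
    schedule: str       # loop order, e.g. "mnkl"
    volume: dict        # V_l: bytes moved through each memory level
    utilization: dict   # U_l: bytes resident at each memory level

def cost(plan: Plan) -> float:
    """Predicted latency: the slowest level dominates, C_l = V_l / B_l."""
    if any(plan.utilization[l] > CAPACITY[l] for l in LEVELS):
        return float("inf")   # violates a per-level capacity constraint Cap_l
    return max(plan.volume[l] / BANDWIDTH[l] for l in LEVELS)

def best_plan(candidates):
    """min over (tile, schedule) candidates of max_l V_l / B_l."""
    return min(candidates, key=cost)
```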

Loop scheduling is unified via a single $J$-dimensional nest, with spatial dimensions distributed across CTAs and temporal dimensions resulting in intra-CTA sequential access. This facilitates flexible reordering (e.g., mnkl vs. mnlk), directly influencing tile spillage and data movement patterns (Huang et al., 15 Dec 2025).
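The mechanics of this unified nest can be illustrated with a short sketch; the dimension tile counts, the split into spatial and temporal dimensions, and the iterator shape are illustrative assumptions rather than FlashFuser's actual code generation.

```python
from itertools import product

TILES = {"m": 4, "n": 2, "k": 8, "l": 2}   # tiles per loop dimension (illustrative)
SPATIAL = ("m", "n")                        # distributed over the CTA grid
TEMPORAL = "kl"                             # "kl" gives an mnkl schedule, "lk" gives mnlk

def iterate(spatial=SPATIAL, temporal=TEMPORAL):
    """Yield (cta_id, loop_position) pairs: spatial dims select a CTA, temporal
    dims are walked sequentially inside that CTA."""
    for cta_id, s_idx in enumerate(product(*(range(TILES[d]) for d in spatial))):
        for t_idx in product(*(range(TILES[d]) for d in temporal)):
            pos = dict(zip(spatial, s_idx)) | dict(zip(temporal, t_idx))
            yield cta_id, pos

# First few iterations executed by CTA 0 under the mnkl ordering.
for cta_id, pos in list(iterate())[:3]:
    print(cta_id, pos)
```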

4. Unified Search Engine and Design Space Pruning

The expansion of fusion into DSM significantly enlarges the configuration search space: for large FFN workloads, possible (cls_m, cls_n, cls_k, cls_l) tuples yield ≈2.75×10¹³ candidates. FlashFuser reduces the search dimensionality through domain-specific, rule-based pruning:

Pruning step        Remaining candidates
Raw search space    2.75×10¹³
Rule 1 applied      1.14×10⁸
+ Rule 2            2.47×10⁷
+ Rule 3            1.44×10⁷
+ Rule 4            9.62×10⁶
+ Rule 5            1.15×10⁶
  • Rules include exact division of tile sizes, cluster-resource constraints, innermost scheduling for activation dimensions, dependency satisfaction, and per-level capacity restrictiveness.
  • Following pruning, top-K cost-guided enumeration (with K = 11) identifies the lowest-latency candidates as predicted by the cost models; only these are compiled and empirically profiled (see the sketch after this list).
  • This hybrid approach yields search accelerations of 12–68× compared to brute-force methods (Huang et al., 15 Dec 2025).
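A hedged sketch of the prune-then-measure flow appears below. The candidate encoding as (cls_m, cls_n, cls_k, cls_l) tuples follows the text and the divisibility checks mirror the cls_shuffle / cls_reduce definitions from Section 2, but the enumeration range, the cluster-resource bound, and the callback interfaces are illustrative assumptions rather than FlashFuser's exact rules.

```python
from itertools import product

MAX_CLUSTER = 16   # H100 clusters span up to 16 SMs

def candidates():
    """Enumerate raw (cls_m, cls_n, cls_k, cls_l) cluster-tiling tuples."""
    sizes = (1, 2, 4, 8, 16)
    return product(sizes, repeat=4)

def passes_rules(cand):
    cls_m, cls_n, cls_k, cls_l = cand
    # Cluster-resource constraint (illustrative form of rule-based pruning).
    if cls_m * cls_n * cls_k > MAX_CLUSTER:
        return False
    # cls_shuffle = cls_l / cls_k and cls_reduce = (cls_n * cls_k) / cls_l
    # must both be integers for shuffle/reduce group formation (Section 2).
    if cls_l % cls_k != 0 or (cls_n * cls_k) % cls_l != 0:
        return False
    return True

def search(predicted_latency, compile_and_profile, k=11):
    """Prune with rules, rank survivors by the analytical cost model, then
    compile and profile only the top-K candidates on hardware."""
    pruned = [c for c in candidates() if passes_rules(c)]
    top_k = sorted(pruned, key=predicted_latency)[:k]
    return min(top_k, key=compile_and_profile)
```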

5. Experimental Evaluation

FlashFuser was evaluated on an NVIDIA H100 (SXM), using CUDA 12.4, PyTorch 2.6, TVM 0.9, and Triton 3.2. Across a suite of operator subgraphs (GEMM chains, convolution chains, Gated-FFN):

  • GEMM chains: mean kernel speedup 5.4× over BOLT, 4.6× over Chimera, 3.1× over PyTorch.
  • Convolution chains: 6.3× over BOLT, 6.4× over Chimera, 3.9× over PyTorch.
  • Gated-FFN: up to 4.1× over Chimera.
  • DSM primitives utilized 80–90% of theoretical bandwidth for cluster sizes up to 8.
  • Nsight Compute profiling confirmed FlashFuser reduces DRAM traffic by 58% on average.
  • An ablation revealed that dataflow analysis alone (no DSM, no cost search) yields a 1.52× speedup over baseline; adding DSM communication with randomly chosen configurations yields 2.11×; the full engine achieves 3.29×.
  • End-to-end inference speedups: average 1.24× over SGLang for Llama-7B, Qwen2.5-3B, GPT-6.7B; for models with 32B–70B parameters, 1.16–1.22× gains were observed.

These results establish the viability of systematic inter-SM DSM-enabled fusion as a means to surmount the memory wall in memory-bound deep learning inference (Huang et al., 15 Dec 2025).

6. Limitations and Scope of Applicability

  • Acceleration is limited for fully compute-bound kernels, where operators saturate the compute roof independent of memory bandwidth (roofline analysis).
  • DSM benefits decrease for clusters larger than 8 due to increased cross-core latency; practical cluster size is therefore hardware-limited.
  • For models or workloads with small $M$ or $N$, the overhead of DSM collectives may outweigh their benefits, resulting in minimal gain.
  • FlashFuser’s approach is directly reliant on the existence and quality of inter-core DSM hardware; generalization to other architectures will require hardware support for similar collective primitives.

A plausible implication is that as future GPU architectures continue to enhance on-chip connectivity and DSM bandwidth, compiler frameworks like FlashFuser may become central to computational graph optimization under increasingly severe memory bottlenecks (Huang et al., 15 Dec 2025).
