FusePlanner: GPU Convolution Fusion Optimization
- FusePlanner is a compile-time tool that automates the fusion of adjacent depthwise and pointwise convolutions, reducing global memory access on GPUs.
- It employs analytic cost models and hardware-aware parameter tuning to optimize CUDA kernels by selecting beneficial fusion opportunities.
- Empirical results show up to 3.7× latency speedup and 83% reduction in memory access for modern CNNs and vision transformers.
FusePlanner is a compile-time planning tool designed to automate and optimize the fusion of adjacent depthwise and pointwise convolutional layers in deep neural network inference on GPUs. By utilizing analytic cost modeling and hardware-aware parameter selection, FusePlanner minimizes global memory traffic and maximizes the efficiency of inference engines, with a particular focus on memory-bound workloads found in modern compact convolutional neural networks (CNNs) and vision transformers (ViTs) (Qararyah et al., 2024).
1. Problem Setting and Motivation
Deep neural network inference on GPUs often involves sequences of depthwise (DW) and pointwise (PW) convolutions. Although these operations have fewer parameters and require less computation than standard convolutions, their low compute-to-memory-access ratio makes memory bandwidth, rather than arithmetic throughput, the principal performance bottleneck. Fusing adjacent DW and PW operations into a single CUDA kernel, referred to as a Fused Convolutional Module (FCM), can significantly reduce global memory access (GMA), provided the fusion is hardware- and workload-aware.
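To make the compute-to-memory-access ratio concrete, the sketch below estimates FLOPs per byte of compulsory global-memory traffic for a standard, a depthwise, and a pointwise convolution of the same shape. It is an illustration rather than the paper's model: the function name `conv_intensity`, the layer shape, stride 1 with same padding, FP32 data, and the assumption that each tensor is moved exactly once are all choices made here.

```python
def conv_intensity(h, w, c_in, c_out, k, depthwise=False, bytes_per_elem=4):
    """FLOPs per byte of compulsory global-memory traffic for one conv layer.

    Assumptions (illustrative only): stride 1, 'same' padding, and every
    tensor (input, weights, output) crosses global memory exactly once.
    """
    if depthwise:
        c_out = c_in                      # one filter per input channel
        macs = h * w * c_in * k * k
        weights = c_in * k * k
    else:
        macs = h * w * c_in * c_out * k * k
        weights = c_in * c_out * k * k
    traffic_bytes = (h * w * c_in + h * w * c_out + weights) * bytes_per_elem
    return 2 * macs / traffic_bytes       # 1 MAC = 2 FLOPs


# Hypothetical 56x56 feature map with 128 input and 128 output channels.
standard = conv_intensity(56, 56, 128, 128, k=3)
depthwise = conv_intensity(56, 56, 128, 128, k=3, depthwise=True)
pointwise = conv_intensity(56, 56, 128, 128, k=1)
```

Under these assumptions the depthwise layer performs only about two FLOPs per byte moved and the pointwise layer about 31, versus well over 200 for the standard convolution, which is why DW and PW layers tend to be bandwidth-limited.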
Existing methods either were not explicitly optimized to minimize global memory accesses, did not support fusing depthwise convolutions, or performed fusion only for non-convolutional elementwise operations. FusePlanner addresses these gaps by providing a systematic approach for selecting beneficial fusion opportunities and optimal fused kernel parameters (Qararyah et al., 2024).
2. High-Level Workflow and System Architecture
FusePlanner operates at compile time, ingesting both the computational graph of a model (expressed as a directed acyclic graph over DW and PW convolutions, with optional normalization/activation) and a hardware description of the target GPU.
The core workflow is as follows:
- Model Parsing: The model DAG is parsed to extract per-layer parameters including input/output feature map dimensions, filter sizes, stride, padding, and data type.
- Hardware Query: GPU characteristics—number of streaming multiprocessors (SMs), per-SM L1 cache/shared memory, L2 cache, and peak DRAM bandwidth—are read from the user or queried automatically.
- Unfused Layer Analysis: For each convolutional layer, a layer-by-layer cost model searches a discrete set of tile sizes to estimate the minimal GMA achievable by an unfused kernel.
- Enumerating Fusion Candidates: All eligible adjacent fusion opportunities (DWPW, PWDW, PWDW_R, PWPW) are considered. For each, the fused-kernel cost model is used, adjusting tile parameters to fit cache and occupancy constraints.
- Benefit Assessment: If fused execution yields strictly lower GMA than separate kernels, fusion is marked beneficial.
- Fusion Decision Algorithm: A simple interval-packing or dynamic programming step selects a maximal set of non-overlapping beneficial fusions for deployment.
- Schedule Generation: The output is a per-layer schedule, specifying which kernel to use and the corresponding tile/block parameters.
- Kernel Instantiation: Pre-written, parameterized CUDA kernels (FCMs) are compiled and linked according to the planner output.
This approach eliminates the need for performance profiling loops and hand-tuning, offering a drop-in planning mechanism for both existing and novel GPU targets (Qararyah et al., 2024).
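As a rough illustration of the parsing and candidate-enumeration steps listed above, the following sketch represents each layer with the per-layer parameters the planner extracts and lists the adjacent pairs that match a supported fusion type. The `ConvLayer` class, its field names, and `fusion_candidates` are hypothetical and not taken from the tool. The sketch assumes a linear chain of layers and leaves out the PWDW_R variant, whose eligibility rule is not described here.

```python
from dataclasses import dataclass

# Pair types named in the workflow; PWDW_R is intentionally omitted.
SUPPORTED_PAIRS = {"DWPW", "PWDW", "PWPW"}


@dataclass(frozen=True)
class ConvLayer:
    """Hypothetical per-layer record produced by model parsing."""
    name: str
    kind: str          # "DW" or "PW"
    in_h: int
    in_w: int
    in_c: int
    out_c: int
    kernel: int = 1
    stride: int = 1


def fusion_candidates(layers):
    """Return (index_of_first_layer, pair_type) for every eligible adjacent pair."""
    candidates = []
    for i in range(len(layers) - 1):
        pair = layers[i].kind + layers[i + 1].kind
        if pair in SUPPORTED_PAIRS:
            candidates.append((i, pair))
    return candidates


# A MobileNet-like fragment with made-up shapes.
chain = [
    ConvLayer("expand", "PW", 56, 56, 32, 128),
    ConvLayer("dw3x3", "DW", 56, 56, 128, 128, kernel=3),
    ConvLayer("project", "PW", 56, 56, 128, 32),
    ConvLayer("expand2", "PW", 56, 56, 32, 128),
]
candidates = fusion_candidates(chain)
```

Candidates overlap at this stage (the depthwise layer appears in two pairs); resolving such conflicts is the job of the later fusion decision step.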
3. Analytic Cost Models and Parameter Search
FusePlanner leverages analytic models to estimate memory access patterns and costs for both unfused and fused executions, under two primary assumptions: (A) warp-level coalesced stride-1 memory accesses, and (B) output-stationary, local-weight-stationary dataflows.
The models are expressed in terms of:
- Input feature map width, height, and channels.
- Filter dimensions.
- Sizes of the input/output feature maps and weights.
- Tile sizes: spatial and channel subdivisions chosen from grid-aligned and power-of-two divisors, respectively.
For each execution type (LBL for layer-by-layer, FCM for fused kernel), closed-form expressions compute GMA, incorporating overlap for sliding-window access, cache occupancy (L1Sz), and tile counts (for maximizing SM utilization). Fused variants (e.g., PWDW, DWPW, PWPW) omit intermediate writes, further reducing traffic.
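The effect of omitting the intermediate write can be seen in an idealized element count for a DW followed by a PW convolution. This is a simplified bound, not the paper's closed-form expressions: it assumes stride 1, same padding, and that every tensor is transferred exactly once, so it ignores tiling overlap, weight re-reads, and cache effects. The function name and the layer shape are illustrative.

```python
def ideal_gma_dwpw(h, w, c_in, c_out, k=3):
    """Idealized global-memory element counts for DW(kxk) followed by PW(1x1).

    Returns (unfused, fused, intermediate). Assumes each tensor moves once.
    """
    ifm = h * w * c_in          # input feature map
    mid = h * w * c_in          # DW output, which is also the PW input
    ofm = h * w * c_out         # output feature map
    w_dw = k * k * c_in         # depthwise weights
    w_pw = c_in * c_out         # pointwise weights

    # Layer-by-layer: the intermediate is written by DW and read back by PW.
    unfused = (ifm + w_dw + mid) + (mid + w_pw + ofm)
    # Fused: the intermediate stays on chip and never reaches global memory.
    fused = ifm + w_dw + w_pw + ofm
    return unfused, fused, mid


unfused, fused, mid = ideal_gma_dwpw(56, 56, 128, 128)
saving = 1 - fused / unfused
```

For this hypothetical shape the fused count is roughly half the unfused one, because the intermediate tensor is as large as the input and output. Real savings depend on the tiling terms that this sketch leaves out.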
Parameter search is discrete, low-dimensional, and hardware-constrained:
- Tile sizes are selected such that all required buffers (input, output, weights, communication) fit within L1/shared, and total tiles ≥ #SMs.
- The minimal GMA over this search space is found for each opportunity in milliseconds.
This mechanism enables rapid, hardware-parameterized customization of fusion choices and kernel implementations (Qararyah et al., 2024).
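A minimal sketch of such a constrained search is shown below for a single pointwise convolution. The traffic model is a deliberately simple stand-in for the paper's expressions: each tile loads its input slice and weight slice and writes its output slice once, with no reuse across tiles. The function names, the buffer accounting, and the hardware numbers (64 KiB of L1/shared memory, 22 SMs) are assumptions made for illustration.

```python
from itertools import product


def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]


def pow2_divisors(n):
    return [d for d in divisors(n) if d & (d - 1) == 0]


def search_pw_tiles(h, w, c_in, c_out, l1_bytes, num_sms, bytes_per_elem=4):
    """Exhaustively pick (tile_h, tile_w, tile_c_out) minimizing modeled GMA.

    Returns (gma_bytes, (th, tw, tc)) or None if no tiling is feasible.
    """
    best = None
    for th, tw, tc in product(divisors(h), divisors(w), pow2_divisors(c_out)):
        in_buf = th * tw * c_in      # input slice for this tile
        w_buf = c_in * tc            # weights for tc output channels
        out_buf = th * tw * tc       # output slice produced by this tile
        per_tile = in_buf + w_buf + out_buf
        if per_tile * bytes_per_elem > l1_bytes:
            continue                 # buffers must fit in L1/shared memory
        n_tiles = (h // th) * (w // tw) * (c_out // tc)
        if n_tiles < num_sms:
            continue                 # need at least one tile per SM
        gma = n_tiles * per_tile * bytes_per_elem
        if best is None or gma < best[0]:
            best = (gma, (th, tw, tc))
    return best


best = search_pw_tiles(56, 56, 128, 128, l1_bytes=64 * 1024, num_sms=22)
```

Because the candidate set is a product of a few divisor lists, the loop visits only a few hundred configurations per layer, which is consistent with a search that completes in milliseconds.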
4. Fusion Selection and Optimization Algorithm
FusePlanner’s fusion-selection step constructs an optimal (greedy or dynamic-programming) cover of the model DAG with non-overlapping maximal fusions, ensuring total memory access is minimized.
The algorithmic process is:
- For each layer, record the minimal achievable unfused (LBL) GMA.
- For each legal fusion window, compare fused GMA against the sum of constituent LBL GMAs.
- Retain only fusions where fused < unfused GMA.
- Construct a DP array where each entry records the minimal total GMA attainable given fusions ending at that position.
- Backtrack to identify the exact kernel schedule (fused vs. unfused) and associated kernel parameters.
Heuristic pruning restricts the search to small fusion windows (typically two adjacent convolutions) and excludes infeasible tile sizes or fusions lacking memory savings.
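The selection steps above can be sketched as a one-dimensional dynamic program. The sketch assumes a linear chain of layers and fusion windows of exactly two adjacent layers; `plan_fusions` and its argument layout are hypothetical, and the GMA values in the example are made up.

```python
def plan_fusions(lbl_gma, fused_gma):
    """Choose non-overlapping two-layer fusions that minimize total GMA.

    lbl_gma[i]   : minimal unfused (LBL) GMA of layer i.
    fused_gma[i] : GMA of fusing layers i and i+1, or None if not eligible.
    Returns (total_gma, schedule) where schedule entries are
    ("LBL", i) or ("FCM", i, i + 1).
    """
    n = len(lbl_gma)
    best = [0] * (n + 1)         # best[i]: minimal GMA for the first i layers
    choice = [None] * (n + 1)    # kernel type that ends at position i
    for i in range(1, n + 1):
        best[i] = best[i - 1] + lbl_gma[i - 1]
        choice[i] = "LBL"
        if i >= 2:
            f = fused_gma[i - 2]
            # Keep a fusion only if it is strictly cheaper than its two layers.
            if f is not None and f < lbl_gma[i - 2] + lbl_gma[i - 1]:
                if best[i - 2] + f < best[i]:
                    best[i] = best[i - 2] + f
                    choice[i] = "FCM"
    # Backtrack from the last layer to recover the schedule.
    schedule, i = [], n
    while i > 0:
        if choice[i] == "FCM":
            schedule.append(("FCM", i - 2, i - 1))
            i -= 2
        else:
            schedule.append(("LBL", i - 1))
            i -= 1
    schedule.reverse()
    return best[n], schedule


total, schedule = plan_fusions([10, 10, 10], [15, 12])
```

In this example both pairs are beneficial but overlap on the middle layer. Taking the first pair as soon as it looks good would give a total of 25, whereas the dynamic program prefers the second pair for a total of 22.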
The following table summarizes decision outputs for each model segment:
| Segment Type | Kernel Type | Tiling/Thread Params |
|---|---|---|
| Fused (e.g., DWPW) | FCM CUDA Kernel | Chosen per planner |
| Unfused | Custom LBL Kernel | Chosen per planner |
All assignments are hardware- and workload-specific (Qararyah et al., 2024).
5. Implementation and Integration
FusePlanner is implemented in Python and C++. Its toolchain includes:
- Front End: Parses models exported from TensorFlow or JSON layer descriptions.
- Hardware Description: Accepts user-provided or auto-discovered GPU specifications.
- Planning Module: Produces per-layer fusion decisions and parameter choices.
- Kernel Library: Supplies optimized CUDA kernels for standard fusion types, parameterized by block/grid configuration and data layout.
- API: Command-line or C interface generates a “schedule” header, e.g., specifying kernel instantiations for build-time linkage.
At runtime, inference executes the generated sequence, alternating between FCMs and LBL kernels as dictated by the plan. Every deployment can be rapidly re-parameterized for new networks or GPU configurations without additional tuning (Qararyah et al., 2024).
6. Empirical Results and Impact
Experimental evaluations across three distinct GPUs (NVIDIA GTX 1660, RTX A4000, Jetson AGX Orin) and a representative suite of compact CNNs (MobileNetV1/V2, Xception, ProxylessNAS) and convolutional ViTs (CeiT, CMT) demonstrate the following:
- FCMs, as selected by FusePlanner, achieve up to 3.7× latency speedup and up to 83% global memory access reduction vs. cuDNN.
- Against custom LBL kernels, FCMs yield up to 1.8× speedup (1.3–1.4× on average).
- Complete models fused and planned by FusePlanner, compared against TVM (using cuDNN), show up to 1.8× end-to-end inference speedup, with energy consumption reduced to 34–59% of TVM's.
- The largest gains are recorded for memory-bound operator pairs, though compute-bound segments also benefit from memory savings (Qararyah et al., 2024).
A plausible implication is that FusePlanner’s methodology can be generalized beyond DW/PW fusions and adopted by other GPU-based inference compilers or integrated into end-to-end DL toolchains.
7. Practical Usage and Limitations
Typical end-to-end usage involves:
- Exporting the model architecture.
- Providing hardware specifications (or auto-detection).
- Running FusePlanner to determine fusion boundaries and kernel parameters.
- Compiling and linking against the FCM kernel library.
- Deploying the fused model binary for inference.
No hand-tuning or profiling is required, and the approach is portable across models and GPU architectures. The tool is open source (Qararyah et al., 2024).
A noted limitation is that only small consecutive windows (usually two layers) are considered for fusion to maintain manageable search spaces. Further, cost models are analytic and may omit rare pathologies in device behavior or memory system interactions, though the planner takes a conservative approach regarding tile feasibility and does not fuse unless benefits are strictly established.
References:
- "Fusing Depthwise and Pointwise Convolutions for Efficient Inference on GPUs" (Qararyah et al., 2024)