Fused FFT-GEMM-iFFT GPU Kernels

Updated 8 March 2026

Fused FFT-GEMM-iFFT GPU kernels are integrated compute units that execute forward FFT, GEMM, and inverse FFT in one pass, eliminating intermediate global memory storage.
They utilize compile-time metaprogramming and optimized data-parallel patterns to maximize register use, shared memory occupancy, and fine-grained parallelism.
Reported benchmarks show roughly 1.8×–2.2× speedups over staged library calls in scientific computing and machine learning tasks, attributed to reduced data movement and kernel launch overhead.

Fused FFT-GEMM-iFFT GPU kernels are monolithic GPU kernels that execute a sequence of forward Fast Fourier Transform (FFT), General Matrix-Matrix Multiplication (GEMM), and inverse FFT (iFFT) operations in a single pass, eliminating all intermediate storage in global memory and minimizing kernel launch overhead. By orchestrating these functionally distinct but mathematically compatible transforms within one optimized kernel, such designs maximize utilization of on-chip memory (registers and shared memory), exploit fine-grained parallelism, and reduce data-movement bottlenecks that arise in traditional staged library calls. These techniques are foundational for accelerating scientific computing and machine learning workloads, particularly spectral neural operators and high-throughput signal processing pipelines (Amoros et al., 9 Aug 2025, Wu et al., 16 Apr 2025).

1. Mathematical Formulation

The fused FFT-GEMM-iFFT workflow computes

forward FFT: $X[k] = \sum_{n=0}^{N-1} x[n]\,e^{-2\pi i k n/N}$ for $k = 0, \ldots, N-1$
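frequency-domain transform (a GEMM on two $N\times N$ complex matrices): $C[p,q] = \sum_{r=0}^{N-1} A[p,r] \cdot B[r,q]$ for $p, q = 0, \ldots, N-1$
inverse FFT (applied to each column of $C$, with $C[k]$ denoting that column's $k$-th entry): $x'[n] = \frac{1}{N}\sum_{k=0}^{N-1} C[k]\,e^{2\pi i k n/N}$ for $n = 0, \ldots, N-1$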

In the context of Fourier Neural Operators or scientific simulation, the pipeline typically applies a batched FFT to the input, multiplies (elementwise or as a batched GEMM) by a learned or physical kernel, and reconstructs the spatial result via iFFT (Amoros et al., 9 Aug 2025, Wu et al., 16 Apr 2025).
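As a point of reference for what a fused kernel has to reproduce, the three stages can be written as a staged, unfused pipeline in plain Python. This is only a sketch under stated assumptions: the function names are hypothetical, the DFT is the naive $O(N^2)$ form of the formulas above rather than a fast transform, and the spectra are assumed to be stacked as the columns of $B$ so that the inverse transform runs down each column of $C$.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT using the sign and 1/N conventions of the formulas above."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / n if inverse else 1.0
    return [
        scale * sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]

def gemm(a, b):
    """C[p][q] = sum_r A[p][r] * B[r][q] for nested-list matrices."""
    return [
        [sum(a[p][r] * b[r][q] for r in range(len(b))) for q in range(len(b[0]))]
        for p in range(len(a))
    ]

def staged_fft_gemm_ifft(signals, a):
    """Forward DFT of every signal, one GEMM over the spectra, inverse DFT per column."""
    spectra = [dft(s) for s in signals]           # batch x N
    b = [list(col) for col in zip(*spectra)]      # N x batch, i.e. B[r][q] (assumed layout)
    c = gemm(a, b)                                # N x batch, i.e. C[p][q]
    return [dft([row[q] for row in c], inverse=True) for q in range(len(signals))]

# With A equal to the identity the pipeline must return its input unchanged.
n = 4
identity = [[1.0 if p == r else 0.0 for r in range(n)] for p in range(n)]
signals = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, -1.0]]
result = staged_fft_gemm_ifft(signals, identity)
```

Each intermediate list here (`spectra`, `b`, `c`) corresponds to a buffer that a staged GPU implementation would place in global memory; the fused designs discussed below aim to keep those buffers on chip.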

Fused designs may include internal spectral truncation (frequency domain masking: $\mathcal T_{k_{\max}}\hat f$), zero-padding, or pruning directly in the fused kernel to avoid additional data movement.
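The masking operator $\mathcal T_{k_{\max}}$ can be illustrated on a full complex spectrum as follows. This is a minimal sketch, assuming the mask keeps bins whose frequency magnitude $\min(k, N-k)$ is at most $k_{\max}$; the function names are hypothetical, and real implementations (for example ones based on a real-input FFT) may index the retained modes differently.

```python
def truncate_modes(spectrum, k_max):
    """Zero every DFT bin whose frequency magnitude min(k, N - k) exceeds k_max."""
    n = len(spectrum)
    return [v if min(k, n - k) <= k_max else 0j for k, v in enumerate(spectrum)]

def retained_bins(n, k_max):
    """Indices that would still have to be computed and kept on chip after masking."""
    return [k for k in range(n) if min(k, n - k) <= k_max]

spectrum = [complex(k, 0) for k in range(8)]
masked = truncate_modes(spectrum, 2)
kept = retained_bins(8, 2)   # [0, 1, 2, 6, 7]
```

Because the discarded bins are known before the kernel runs, a fused kernel can skip producing or storing them at all, which is where the saving in data movement comes from.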

2. Abstractions and Metaprogramming Methodologies

The Fused Kernel Library (FKL) approach abstracts GPU compute via three composable layers:

Ops: Arithmetic or memory operations (unary/binary device functions).
IOps (Instantiable Ops): Ops with encapsulated runtime parameters, such as pointers and transformation descriptors.
DPPs (Data-Parallel Patterns): Device "driver" functions that manage thread/block decomposition, memory allocation, and execute a sequence of IOp::exec steps.

A fused kernel is generated as a templated global function parameterized by a sequence of IOps and a DPP, with inline static reflection, metaprogramming, and compile-time trait logic to generate fully inlined code (Amoros et al., 9 Aug 2025). This removes the need for precompiled or hand-fused kernels and exposes a library-like, type-safe C++17 interface for arbitrary fusion patterns.

For instance, chaining FFT, GEMM, and iFFT involves:

Defining separate IOp-wrapped template parameters for each compute.
Expressing their data dependencies and fusion sequence in a custom DPP, e.g., FFTGEMMIFFT_DPP.
Providing parameter structs in host code and launching via a universal fk::launch<DPP<IOp...>>, where the kernel configures itself for problem shape and resource requirements at compile time.
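The relationship between the three layers can be mimicked in a few lines of Python. This is an analogy only, not the FKL API: FKL composes the chain at compile time with C++ templates, whereas the sketch composes it at run time, and every name in it (`IOp`, `scale`, `shift`, `transform_dpp`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass(frozen=True)
class IOp:
    """An op bundled with its runtime parameters (stand-in for an Instantiable Op)."""
    fn: Callable
    params: Tuple = ()

    def exec(self, value):
        return self.fn(value, self.params)

def scale(value, params):
    """Op: multiply by a factor supplied through the parameters."""
    return value * params[0]

def shift(value, params):
    """Op: add an offset supplied through the parameters."""
    return value + params[0]

def transform_dpp(data, iops):
    """Data-parallel pattern: read once, run the whole chain locally, write once."""
    out = []
    for element in data:            # one iteration plays the role of one GPU thread
        local = element             # "register" value, never stored between ops
        for iop in iops:
            local = iop.exec(local)
        out.append(local)           # single write of the final result
    return out

chain = [IOp(scale, (2.0,)), IOp(shift, (1.0,))]
fused = transform_dpp([1.0, 2.0, 3.0], chain)   # [3.0, 5.0, 7.0]
```

The point of the pattern is that the intermediate value after `scale` is never written to a shared buffer; only the driver function touches input and output storage.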

3. Kernel Architecture: Memory Layout, Tiling, and Shared Memory Scheduling

Efficient fused kernels are critically dependent on thread blocking, register allocation, and shared memory management:

FFT: Utilizes 1D or 2D Stockham decomposition; each thread typically processes a tile or stripe of the input, holding intermediate stages in registers.
GEMM: Employs 2D tiling, with shared memory double-buffering of $A$ and $B$ tiles per block, so that all partial products can accumulate in registers with minimal spills.
iFFT: Mirrors the FFT scheme but in reverse, reusing the same block structure and potentially sharing the same temporary memory.
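The Stockham scheme mentioned for the FFT stage can be sketched sequentially. This is a generic radix-2 Stockham autosort FFT, written under the assumption of power-of-two lengths; it is not taken from either cited implementation. The two ping-pong buffers stand in for the register or shared-memory storage a GPU kernel would use, and no bit-reversal pass is needed because each stage writes its outputs in sorted order.

```python
import cmath

def stockham_fft(x):
    """Radix-2 Stockham autosort FFT for power-of-two lengths (forward transform)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    src, dst = list(x), [0j] * n        # ping-pong buffers
    length, stride = n, 1
    while length > 1:
        half = length // 2
        for p in range(half):
            twiddle = cmath.exp(-2j * cmath.pi * p / length)
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * twiddle
        src, dst = dst, src             # output of this stage feeds the next one
        length, stride = half, stride * 2
    return src

impulse = stockham_fft([1, 0, 0, 0, 0, 0, 0, 0])   # flat spectrum of ones
constant = stockham_fft([1] * 8)                    # all energy in bin 0
```

In a GPU version the two inner loops are what gets distributed across threads, with each thread owning a fixed set of `(p, q)` butterflies per stage.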

Compile-time tuning (via constexpr evaluation in the DPP) auto-selects tile sizes (tileM, tileN, tileK), shared memory footprint, and launch grid/block dimensions, trading per-block resource use (registers, shared memory capacity) against the number of blocks that can be resident on each SM, i.e. occupancy (Amoros et al., 9 Aug 2025).
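The kind of decision such tuning makes can be shown with a small selection routine. This is a sketch only: the source does not describe FKL's actual heuristic, so the footprint formula (double-buffered $A$ and $B$ tiles of 8-byte single-precision complex values), the candidate list, the 48 KB budget, and the function names are all assumptions made for illustration.

```python
def gemm_smem_bytes(tile_m, tile_n, tile_k, bytes_per_element=8, stages=2):
    """Shared memory for one A tile (tile_m x tile_k) and one B tile (tile_k x tile_n),
    replicated `stages` times for double buffering."""
    return stages * (tile_m * tile_k + tile_k * tile_n) * bytes_per_element

def pick_tile(candidates, smem_budget_bytes):
    """Return the candidate with the most output elements per block that fits the budget."""
    fitting = [t for t in candidates if gemm_smem_bytes(*t) <= smem_budget_bytes]
    if not fitting:
        raise ValueError("no candidate tile fits in shared memory")
    return max(fitting, key=lambda t: (t[0] * t[1], t[2]))

candidates = [(32, 32, 8), (64, 64, 8), (128, 128, 16)]
budget = 48 * 1024                      # assumed per-block shared memory budget
best = pick_tile(candidates, budget)    # (64, 64, 8)
```

Here the largest candidate needs 64 KB and is rejected, so the 64 × 64 × 8 tile at 16 KB is chosen. A real tuner would also account for registers per thread and for the shared memory the FFT stages need.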

In practical deployments such as TurboFNO (Wu et al., 16 Apr 2025), bank-conflict–free shared-memory swizzling is employed:

FFT output is written into shared memory in column-major (GEMM-ready) layout, with thread index remapping to maximize bank utilization.
GEMM output (before iFFT) is laid out so that columns can be read efficiently by the iFFT, with staggered read/write access patterns that achieve 100% shared memory bank utilization.
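Why a remapping is needed at all can be seen from a simple bank model. The sketch below uses a generic XOR swizzle on a 32 × 32 tile of 4-byte words; it is not TurboFNO's actual layout, which the source does not spell out and which operates on 8-byte complex values. The 32-bank, 4-byte-word model and the function names are assumptions.

```python
NUM_BANKS = 32   # assumed: 32 banks of 4-byte words
TILE = 32        # 32 x 32 tile of 4-byte words, stored row-major

def bank_naive(row, col):
    """Bank hit by element (row, col) in a plain row-major tile."""
    return (row * TILE + col) % NUM_BANKS

def bank_swizzled(row, col):
    """Bank hit when the column index is XORed with the row index before storing."""
    return (row * TILE + (col ^ row)) % NUM_BANKS

def worst_conflict(banks):
    """Largest number of accesses in one warp-wide request that hit the same bank."""
    return max(banks.count(b) for b in set(banks))

column = 5
naive = worst_conflict([bank_naive(r, column) for r in range(TILE)])         # 32
swizzled = worst_conflict([bank_swizzled(r, column) for r in range(TILE)])   # 1
row_access = worst_conflict([bank_swizzled(7, c) for c in range(TILE)])      # 1
```

In the plain layout a column read sends all 32 threads to the same bank, so the access is serialized 32 ways. After the swizzle both row and column accesses touch 32 distinct banks, which is the property a layout shared between a row-oriented producer and a column-oriented consumer needs.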

4. Sample Kernel Skeletons and Data Residency

A canonical fused kernel generated via FKL or hand-written (TurboFNO) will:

Load each thread's input into registers.
Apply FFT (typically in multiple "butterfly" stages, unrolled in template instantiations), retaining partial results in registers or shared memory.
Store partial FFT outputs into shared memory tiles as the GEMM $A$ matrix; load $B$ tiles as needed.
Perform GEMM accumulation in registers using loaded tiles, writing partial/complete results to shared memory when necessary.
Apply iFFT on register/shared-memory intermediates, again with all computation performed without returning to global memory.
Output the final results to DRAM/global memory.

All intermediate data flows (partial spectra, tile accumulators) remain local to registers/shared memory. No intermediate stage is written back to DRAM, removing redundant global memory traffic and associated bandwidth/latency (Amoros et al., 9 Aug 2025, Wu et al., 16 Apr 2025).
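The difference in data residency can be made concrete with a toy model that counts element transfers to and from a stand-in for global memory. This is an illustrative sketch, not a kernel: the `GlobalMemory` class and both pipeline functions are hypothetical, the DFT is the naive form, and the GEMM step is replaced by an elementwise spectral multiply to keep the example short.

```python
import cmath

class GlobalMemory:
    """Toy stand-in for DRAM that counts element reads and writes."""
    def __init__(self):
        self.reads = 0
        self.writes = 0

    def load(self, buffer):
        self.reads += len(buffer)
        return list(buffer)

    def store(self, values):
        self.writes += len(values)
        return list(values)

def dft(x, inverse=False):
    n = len(x)
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / n if inverse else 1.0
    return [scale * sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
                        for m in range(n)) for k in range(n)]

def staged(mem, x_global, w_global):
    """Three separate 'kernels'; every intermediate makes a round trip through memory."""
    spectrum_global = mem.store(dft(mem.load(x_global)))
    spectrum, weights = mem.load(spectrum_global), mem.load(w_global)
    product_global = mem.store([s * w for s, w in zip(spectrum, weights)])
    return mem.store(dft(mem.load(product_global), inverse=True))

def fused(mem, x_global, w_global):
    """One 'kernel'; intermediates stay in local variables (registers / shared memory)."""
    x, weights = mem.load(x_global), mem.load(w_global)
    product = [s * w for s, w in zip(dft(x), weights)]
    return mem.store(dft(product, inverse=True))

x = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 3.0, 1.0]
w = [1.0, 0.5, 0.25, 0.0, 0.0, 0.0, 0.25, 0.5]
staged_mem, fused_mem = GlobalMemory(), GlobalMemory()
y_staged = staged(staged_mem, x, w)
y_fused = fused(fused_mem, x, w)
print(staged_mem.reads, staged_mem.writes)   # 32 24
print(fused_mem.reads, fused_mem.writes)     # 16 8
```

Both variants produce the same output; only the number of transfers differs, because the two intermediate arrays never leave the fused function.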

5. Performance Analysis

Fused kernels achieve substantial speedup over standard staged invocation of FFT and GEMM libraries (e.g., cuFFT → cuBLAS → cuFFT), primarily by eliminating redundant reads/writes and kernel launch latencies. Representative benchmarks:

On NVIDIA RTX 4090 (N=8192, batch=64) (Amoros et al., 9 Aug 2025):

Operation	Time
cuFFT (forward)	5.2 ms
cuBLAS::GEMM	12.8 ms
cuFFT (inverse)	5.3 ms
Total, staged	~23.3 ms
FusedFFT_GEMM_iFFT (fused kernel)	10.4 ms
End-to-end speedup	2.24×

On NVIDIA A100-40GB PCIe (Wu et al., 16 Apr 2025):

Problem	Baseline (cuFFT+cuBLAS)	Fused TurboFNO	Speedup
1D: BS=8, K=64	1.20 ms, 250 GFLOPs	0.68 ms, 440 GFLOPs	1.76×
2D: BS=8, K=64	2.80 ms, 180 GFLOPs	1.45 ms, 350 GFLOPs	1.93×
2D: BS=64, K=128	6.90 ms, 240 GFLOPs	3.50 ms, 480 GFLOPs	1.97×
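The speedup figures in both tables follow directly from the listed timings, which the short check below recomputes. Only numbers quoted in the tables are used; the function and variable names are illustrative.

```python
def speedup(baseline_ms, fused_ms):
    """Ratio of staged (baseline) time to fused time."""
    return baseline_ms / fused_ms

# RTX 4090 table: the staged total is the sum of the three library calls.
staged_rtx4090_ms = 5.2 + 12.8 + 5.3

timings_ms = {
    "RTX 4090, N=8192, batch=64": (staged_rtx4090_ms, 10.4),
    "A100, 1D, BS=8, K=64": (1.20, 0.68),
    "A100, 2D, BS=8, K=64": (2.80, 1.45),
    "A100, 2D, BS=64, K=128": (6.90, 3.50),
}

speedups = {name: round(speedup(base, fused), 2) for name, (base, fused) in timings_ms.items()}
for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```

Across these four configurations the measured speedups range from 1.76× to 2.24×.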

Fused kernels reduce global memory traffic by approximately 66%, increase shared-memory bandwidth utilization and occupancy, and improve roofline operational intensity, moving the pipeline from the bandwidth-bound toward the compute-bound regime (Amoros et al., 9 Aug 2025, Wu et al., 16 Apr 2025).
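One simple accounting that is consistent with a figure of about 66% treats each of the three staged kernels as reading one array and writing one array of equal size, while the fused kernel reads the input once and writes the output once. This is an assumed model, not the cited papers' derivation: it ignores the weight matrix, caching, and truncation, and the function names are hypothetical.

```python
def traffic_reduction(stages, transfers_per_stage=2):
    """Fraction of global-memory transfers removed by fusing `stages` kernels into one.

    Assumes every staged kernel reads one array and writes one array of equal size.
    """
    staged = stages * transfers_per_stage
    fused = transfers_per_stage          # one read of the input, one write of the output
    return 1.0 - fused / staged

def operational_intensity(flops, bytes_moved):
    """Roofline x-axis: floating-point operations per byte of global-memory traffic."""
    return flops / bytes_moved

reduction = traffic_reduction(stages=3)   # 1 - 2/6, about 0.667
# Same arithmetic work, one third of the traffic: intensity rises by a factor of three.
gain = operational_intensity(1.0, 2.0) / operational_intensity(1.0, 6.0)
```

Under this model the arithmetic work is unchanged, so cutting traffic to one third triples the operational intensity, which is what moves a pipeline to the right on a roofline plot.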

6. Implementation Trade-offs and Limitations

Key limitations and considerations in fused FFT-GEMM-iFFT kernel design include:

Code size and compile time: Extensive use of template recursion (especially for log₂N FFT stages, GEMM unrolling) results in large kernels and may trigger compiler instantiation depth limits.
Register and shared memory pressure: High register use (e.g., ~32 complex registers per thread in FKL, ~64 float2/thread in TurboFNO) can reduce SM occupancy or cause spills. Shared-memory allocation per block must be managed to avoid overallocation.
Parameter flexibility: Many fusion parameters (FFT length N, tiling factors) must be compile-time constants for maximal inlining and performance; dynamic shapes may require kernel recompilation or launching multiple code paths.
Fusion depth and pattern consistency: All ops must share a common DPP/threading pattern; divergent memory access or incompatible threading preclude fusion in the current FKL model.
Hardware specificity: Kernel configurations (tile size, unrolling, swizzling patterns) often require hardware-specific tuning; optimal settings for NVIDIA may not transfer to AMD or Intel GPUs (Amoros et al., 9 Aug 2025, Wu et al., 16 Apr 2025).
Numerical accuracy: In-kernel truncation (spectral masking at $|k| > k_\mathrm{max}$) acts as a low-pass filter; for FNO, $k_\mathrm{max}$ at $25$–$50\%$ of $N$ typically increases solution error by $<0.1\%$. Butterfly pruning in the FFT introduces negligible floating-point error ($<1$ ULP) (Wu et al., 16 Apr 2025).
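The register-pressure item in the list above can be turned into a rough bound on resident threads. The sketch assumes an A100-class SM with 65,536 32-bit registers and a limit of 2,048 resident threads, single-precision complex values occupying two registers each, and no allocation granularity; it counts only the data registers quoted above, so real kernels, which also need registers for indices and addresses, would sit below these bounds.

```python
REGS_PER_SM = 65536          # assumed: 32-bit registers per SM (A100-class)
MAX_THREADS_PER_SM = 2048    # assumed: hardware limit on resident threads per SM

def resident_threads(data_regs_per_thread):
    """Upper bound on resident threads per SM from data registers alone."""
    return min(MAX_THREADS_PER_SM, REGS_PER_SM // data_regs_per_thread)

def occupancy_bound(data_regs_per_thread):
    """The same bound expressed as a fraction of the thread limit."""
    return resident_threads(data_regs_per_thread) / MAX_THREADS_PER_SM

fkl_threads = resident_threads(32 * 2)        # ~32 complex values, assumed single precision
turbofno_threads = resident_threads(64 * 2)   # ~64 float2 values
print(fkl_threads, occupancy_bound(32 * 2))        # 1024 0.5
print(turbofno_threads, occupancy_bound(64 * 2))   # 512 0.25
```

Even before any other register use is counted, holding 64 float2 values per thread caps occupancy at a quarter of the thread limit under these assumptions, which is why tile sizes and per-thread workloads have to be chosen together.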

7. Applications and Broader Context

Fused FFT-GEMM-iFFT GPU kernels are critical in domains where compound spectral transforms are bottlenecked by memory bandwidth and kernel launch overhead, such as:

Fourier Neural Operators (FNOs) and spectral PDE solvers (Wu et al., 16 Apr 2025)
Signal and image processing pipelines requiring spectral filtering and pointwise transform chains
Scientific computing workloads featuring convolutions, cross-correlation, or Green’s function evaluation in spectral domains

Their compositionality and programmability, as realized in frameworks like FKL, provide high-level abstractions over hardware-specific optimization, facilitating rapid prototyping and deployment of bespoke compute pipelines with minimal overhead (Amoros et al., 9 Aug 2025). The reported speedups of roughly 1.8×–2.2×, together with improved memory and compute efficiency, make these kernels attractive for GPU-accelerated libraries and ML frameworks.

References:

(Amoros et al., 9 Aug 2025) The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries
(Wu et al., 16 Apr 2025) TurboFNO: High-Performance Fourier Neural Operator with Fused FFT-GEMM-iFFT on GPU
