TurboFNO: High-Performance GPU FNO
- TurboFNO is a high-performance, GPU-optimized framework that fuses FFT, spectral filtering, CGEMM, and iFFT into one kernel to reduce global memory overhead.
- It employs custom FFT truncation, hidden-dimension transforms, and shared memory swizzling to optimize critical spectral operations and maximize on-chip efficiency.
- TurboFNO achieves up to 250% speedup over standard FNO implementations, enabling real-time simulations of complex PDEs such as compressible Rayleigh–Taylor turbulence.
TurboFNO is a high-performance, architecture-aware implementation of the Fourier Neural Operator (FNO) framework, designed to efficiently learn solution operators for partial differential equations (PDEs) on modern NVIDIA GPUs. It achieves significant acceleration by fusing all core spectral operations—Fast Fourier Transform (FFT), elementwise spectral filtering, complex-valued matrix multiplication (CGEMM), zero-padding, and the inverse FFT (iFFT)—into a single GPU kernel and by embedding domain- and task-tailored modeling strategies for scientific computing, such as large-eddy simulation (LES) of compressible turbulence (Wu et al., 16 Apr 2025, Luo et al., 2024).
1. Fourier Neural Operators: Formulation and Standard GPU Implementation
The Fourier Neural Operator extends neural network architectures to infinite-dimensional operator learning. Given an input feature map $v_t$, a single FNO layer computes

$$v_{t+1}(x) = \sigma\!\left( W v_t(x) + \mathcal{F}^{-1}\big( R \cdot \mathcal{F}(v_t) \big)(x) \right),$$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ are forward and inverse FFTs (acting along spatial axes), $R$ is a trainable complex-valued filter applied after spectral mode truncation, $W$ is a learned pointwise linear map, and $\sigma$ denotes a pointwise nonlinearity (typically ReLU) (Wu et al., 16 Apr 2025).
On standard GPU architectures, this computation is decomposed into a series of kernel launches:
- Forward FFT via cuFFT,
- Copying low-frequency modes to a smaller buffer,
- CGEMM in frequency space,
- Zero-padding results back to full resolution,
- iFFT for spatial reconstruction,
- Pointwise addition of the linear term $W v_t$ and application of the nonlinearity $\sigma$.
This staged pipeline results in high global-memory traffic, underutilizes on-chip memory, and incurs heavy kernel-launch overheads.
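As a concrete illustration of this staged pipeline, the following NumPy sketch runs each stage as a separate array operation on a 1D analogue (the shapes, mode count, and random filter are illustrative assumptions, not the reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, modes = 64, 8, 16          # grid points, channels, retained low-frequency modes

v = rng.standard_normal((C, N))                  # input feature map v_t
R = rng.standard_normal((modes, C, C)) + 1j * rng.standard_normal((modes, C, C))
W = rng.standard_normal((C, C))                  # pointwise linear map

# 1. forward FFT along the spatial axis (cuFFT stage)
v_hat = np.fft.rfft(v, axis=-1)                  # shape (C, N//2 + 1)
# 2. copy low-frequency modes to a smaller buffer (truncation stage)
v_low = v_hat[:, :modes]
# 3. CGEMM in frequency space: per-mode complex channel mixing
out_low = np.einsum('kij,jk->ik', R, v_low)      # shape (C, modes)
# 4. zero-pad the result back to full spectral resolution
out_hat = np.zeros_like(v_hat)
out_hat[:, :modes] = out_low
# 5. inverse FFT for spatial reconstruction
spectral = np.fft.irfft(out_hat, n=N, axis=-1)
# 6. pointwise linear term plus nonlinearity (ReLU)
v_next = np.maximum(W @ v + spectral, 0.0)
```

Each numbered stage corresponds to one kernel launch in the standard GPU pipeline; every intermediate array (`v_hat`, `v_low`, `out_hat`, `spectral`) stands in for a round trip through global memory.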
2. Architecture-Aware Kernel Fusion in TurboFNO
TurboFNO replaces the modular, multi-kernel pipeline of standard FNO with a single fused CUDA kernel that eliminates all intermediate global-memory access between FFT, CGEMM, and iFFT. Each thread block in TurboFNO executes the following sequence (Wu et al., 16 Apr 2025):
- Loads a tile of real input data from global memory,
- Performs a forward FFT, embedding high-frequency truncation and zero-padding,
- Immediately accumulates the spectral representation into CGEMM’s shared-memory operand,
- Iterates over the hidden (channel) dimension—a "k-loop" that aligns FFT computation and matrix multiplications,
- Completes matrix accumulation,
- Writes GEMM output into shared memory using a memory-bank-optimized "swizzle" pattern,
- Performs the inverse FFT on the result, again in-place in shared memory,
- Writes the final real-space result to global memory after biasing and nonlinearity.
The core kernel pseudocode is:
```
threadblock_C = 0
// 1. FFT, truncate, pad
A_frag = FFT_Truncate_Pad( Gmem_input[threadblock_C] )
__syncthreads()
for k_chunk in 0 .. HiddenDim step k_tb:
    if k_chunk > 0:
        A_frag = FFT_Truncate_Pad( Gmem_input[threadblock_C + k_chunk] )
        __syncthreads()
    // 2. CGEMM accumulate
    C_frag += A_frag × B_frag[k_chunk]
end for
__syncthreads()
// 3. swizzled shared-memory store
swizzled_store( Shared_C, C_frag )
__syncthreads()
// 4. iFFT
out_real = iFFT( Shared_C )
store( Gmem_output[threadblock_C], σ( out_real + W_frag ) )
```
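The k-loop accumulation above can be checked in miniature: accumulating one chunk of the hidden dimension at a time yields the same result as a single full GEMM. A NumPy sketch, with made-up tile sizes (not the CUDA kernel itself):

```python
import numpy as np

rng = np.random.default_rng(1)
M, HiddenDim, Npts, k_tb = 32, 64, 24, 16   # illustrative tile sizes

A = rng.standard_normal((M, HiddenDim)) + 1j * rng.standard_normal((M, HiddenDim))
B = rng.standard_normal((HiddenDim, Npts)) + 1j * rng.standard_normal((HiddenDim, Npts))

# emulate the fused kernel's k-loop: accumulate one k_tb-wide chunk at a time
C_frag = np.zeros((M, Npts), dtype=complex)
for k_chunk in range(0, HiddenDim, k_tb):
    A_frag = A[:, k_chunk:k_chunk + k_tb]    # stands in for FFT_Truncate_Pad output
    B_frag = B[k_chunk:k_chunk + k_tb, :]
    C_frag += A_frag @ B_frag                # CGEMM accumulate

# the chunked loop matches the unfused full product
assert np.allclose(C_frag, A @ B)
```

This associativity is what lets TurboFNO stream FFT output tiles directly into the GEMM accumulator without ever materializing the full spectral operand in global memory.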
3. FFT, GEMM, and Shared Memory Optimization Strategies
TurboFNO's fused kernel architecture depends on three major innovations to maximize throughput and minimize contention (Wu et al., 16 Apr 2025):
- Custom FFT kernel with embedded truncation, zero-padding, and butterfly operation pruning: Only the required low-frequency modes are computed and stored, skipping high-frequency bin operations and reducing arithmetic load where possible. If, for example, only 25% of FFT modes are required, the majority of complex multiply–add operations are skipped entirely. This is not achievable using generic cuFFT backends, which require separate copy/truncate/pad stages.
- FFT along the hidden channel (latent) dimension: Classic 2D FFTs operate along X and Y axes. To fuse with the k-dimension of GEMM, TurboFNO performs transforms along the HiddenDim axis, so that each FFT block aligns with a CGEMM A-tile. This mapping allows direct tile-wise dataflow between FFT and matrix-multiply steps within each thread block.
- Shared memory swizzling: Two swizzling modes are applied. First, for FFT → CGEMM, memory layouts are re-mapped such that each thread stores consecutive elements in column-major order, achieving 100% shared-memory bank utilization and avoiding bank conflicts in both FFT and GEMM accesses. After GEMM, for CGEMM → iFFT, row-block stores are staggered by thread-index-dependent offsets, again ensuring 100% utilization with zero bank conflicts, even with zero-padding.
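The payoff of embedded truncation can be seen with a small NumPy experiment: evaluating only the retained low-frequency bins directly reproduces the corresponding bins of a full FFT, while computing only `modes` of the `N` output bins. This is an illustrative sketch; the real kernel prunes butterfly stages inside the FFT rather than evaluating a naive DFT:

```python
import numpy as np

rng = np.random.default_rng(2)
N, modes = 64, 16                      # keep 25% of the spectrum
x = rng.standard_normal(N)

# naive truncated DFT: evaluate only the `modes` lowest-frequency bins
k = np.arange(modes)[:, None]
n = np.arange(N)[None, :]
dft_trunc = (np.exp(-2j * np.pi * k * n / N) * x).sum(axis=-1)

# matches the first `modes` bins of a full FFT, without computing the rest
assert np.allclose(dft_trunc, np.fft.fft(x)[:modes])
```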
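TurboFNO's exact swizzle functions are not reproduced in the paper excerpt above; the following Python sketch demonstrates the general principle under the standard NVIDIA model of 32 four-byte shared-memory banks. A naive column access by a warp serializes 32-way, while an XOR-style remap (one common conflict-free scheme, used here as a hypothetical stand-in) touches all 32 banks:

```python
NUM_BANKS = 32            # NVIDIA shared memory: 32 banks, one 4-byte word each

def bank(word_addr: int) -> int:
    """Bank index of a 4-byte word address under the standard bank model."""
    return word_addr % NUM_BANKS

col = 0  # the warp writes column `col` of a 32x32 float tile, one row per thread

# naive row-major layout: thread t touches word t*32 + col -> all hit one bank
naive = [bank(t * 32 + col) for t in range(32)]
assert len(set(naive)) == 1            # 32-way bank conflict

# XOR swizzle: thread t touches word t*32 + (col ^ t) -> 32 distinct banks
swizzled = [bank(t * 32 + (col ^ t)) for t in range(32)]
assert len(set(swizzled)) == NUM_BANKS  # 100% bank utilization, zero conflicts
```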
4. TurboFNO for Compressible Rayleigh–Taylor Turbulence
TurboFNO has been specialized for the large-eddy simulation of three-dimensional, compressible Rayleigh–Taylor (RT) turbulence, which involves modeling six coupled physical fields: density, three velocity components, temperature, and heavy-fluid concentration (Luo et al., 2024). The enhancements are:
- Multifield prediction: Inputs and outputs are six-dimensional vector fields on coarse grids.
- RMS normalization scheme: At each inference step, every field is normalized by its root-mean-square value computed from the previous step.
- Temporal input windows: TurboFNO's input at each step is a window of five normalized time-slices; the output is the following (sixth) time-slice.
- Coarse-graining and large time-steps: Ground truth from high-fidelity DNS is coarse-grained and subsampled so that each FNO prediction advances the field by 60 DNS time-steps, facilitating long physical-time integration.
- Architectural configuration: The operator employs four Fourier layers, each operating on a fixed latent channel width with truncation to a reduced set of low-frequency modes, together with channel-wise linear lifting and projection layers and ReLU activations.
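The RMS normalization scheme above can be sketched in a few lines of NumPy (field count matches the six-field setup; the grid size here is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
fields = rng.standard_normal((6, 32, 32, 32)) * 5.0   # six fields on a coarse 3D grid

# RMS normalization: divide each field by its own root-mean-square
# computed over all grid points of the previous step
rms = np.sqrt(np.mean(fields ** 2, axis=(1, 2, 3), keepdims=True))
normalized = fields / rms

# each normalized field now has unit RMS, putting all six fields on a common scale
assert np.allclose(np.sqrt(np.mean(normalized ** 2, axis=(1, 2, 3))), 1.0)
```

Keeping the per-field `rms` values around is what allows predictions to be mapped back to physical units after inference.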
5. Learning and Validation Methodology
TurboFNO is trained on 305 randomized RT-unstable initializations. For each simulation, the initial portion of the data is coarse-grained and downsampled, providing 300 sequences for training and 5 for testing (Luo et al., 2024). At each training step, a five-step history is mapped to the next normalized field, with the relative error across all fields and grid points serving as the loss. The Adam optimizer is used (batch size 10, 50 epochs), with convergence typically reached by epoch 30–40. The model exhibits strong generalization: when applied without retraining to higher-Reynolds-number DNS, TurboFNO captures scaling behavior and statistics across turbulent regimes.
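A relative-error loss of this kind can be sketched as follows. This is a hedged reading: the paper's exact aggregation over fields may differ; here each field's L2 error is normalized by that field's norm and the six terms are averaged:

```python
import numpy as np

def relative_error_loss(pred, true, eps=1e-12):
    """Mean over fields of ||pred_f - true_f||_2 / ||true_f||_2 (sketch, not the paper's exact form)."""
    pred = pred.reshape(pred.shape[0], -1)       # (fields, grid points)
    true = true.reshape(true.shape[0], -1)
    num = np.linalg.norm(pred - true, axis=1)    # per-field error norm
    den = np.linalg.norm(true, axis=1) + eps     # per-field reference norm
    return float(np.mean(num / den))

rng = np.random.default_rng(4)
true = rng.standard_normal((6, 16, 16, 16))
assert relative_error_loss(true, true) == 0.0    # perfect prediction costs nothing
assert relative_error_loss(1.1 * true, true) > 0 # a 10% scaled error is penalized
```

Normalizing per field keeps large-magnitude fields (e.g. density) from dominating the gradient signal over small-magnitude ones.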
A detailed workflow for inference is:
- Normalize each field in the history window by prior-step RMS value,
- Stack these as the model input tensor,
- Lift via a linear layer to the latent dimension,
- Iterate four spectral layers (with truncated FFT, learned filters, and ReLU),
- Project back to six fields and denormalize using stored RMS,
- Slide the window forward for prediction.
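The rollout above can be sketched end-to-end with a stand-in model. The real network is the four-layer spectral operator; here `model` is a hypothetical placeholder, so the sketch shows only the windowing, normalization, and denormalization bookkeeping:

```python
import numpy as np

FIELDS, W, GRID = 6, 5, (16, 16, 16)   # six fields, five-step history (grid illustrative)

def model(window):
    """Hypothetical stand-in for the trained FNO: returns a normalized next step."""
    return window[-1] * 0.9            # placeholder dynamics, not the real operator

def rollout(history, n_steps):
    history = list(history)            # list of (FIELDS, *GRID) arrays
    for _ in range(n_steps):
        prev = history[-1]
        # normalize every slice in the window by the prior step's per-field RMS
        rms = np.sqrt(np.mean(prev ** 2, axis=(1, 2, 3), keepdims=True))
        window = np.stack([f / rms for f in history[-W:]])
        next_norm = model(window)                 # predict the next normalized fields
        history.append(next_norm * rms)           # denormalize, then slide the window
    return history

rng = np.random.default_rng(5)
init = [rng.standard_normal((FIELDS, *GRID)) for _ in range(W)]
traj = rollout(init, n_steps=3)
assert len(traj) == W + 3
assert traj[-1].shape == (FIELDS, *GRID)
```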
6. Computational Performance and Comparative Analysis
On the NVIDIA A100 GPU, TurboFNO's custom FFT matches or exceeds cuFFT throughput for 1D and 2D FNO layers, and custom CGEMM matches cuBLAS performance for large, tall-skinny matrix shapes (relevant in FNO applications). The fused FFT–GEMM–iFFT kernel delivers:
- Up to 150% speedup in 1D FNO layers, and up to 250% in selected 1D configurations, over cuFFT+cuBLAS and PyTorch baselines;
- 44% average improvement in 1D, 67% in 2D FNO layers over a range of batch sizes and hidden dimensions;
- 100% speedup in 1D layers from FFT pruning and truncation alone;
- GPU inference over 200–600× faster than traditional LES methods on multicore CPUs for RT turbulence;
- Near-perfect shared-memory bank utilization across all fused computation stages.
| Implementation | Speedup over PyTorch (FNO) | Key Optimization |
|---|---|---|
| FFT pruning+truncation (1D) | up to 100% | Built-in low-frequency filtering |
| Full FFT–GEMM–iFFT fusion (1D) | up to 150–250% | Single CUDA kernel, no intermediate global mem ops |
| FFT pruning+truncation (2D) | 50–100% | Application of custom FFT kernel |
| Full FFT–GEMM–iFFT fusion (2D) | 50–105% | Fused thread-block computation, shared mem swizzle |
TurboFNO's architecture-aware fusion and resource utilization set a new performance baseline for neural spectral operators, enabling real-time high-fidelity simulations in complex scientific domains (Wu et al., 16 Apr 2025, Luo et al., 2024).
7. Applications, Significance, and Extensions
TurboFNO advances the use of spectral neural operators by combining data-driven accuracy with best-in-class computational efficiency, particularly in large-scale simulation scenarios. In compressible RT turbulence, it achieves closure accuracy superior to classical subgrid-scale (SGS) models, reproducing statistical metrics (mixing heights, kinetic energy, enstrophy, Reynolds and Mach numbers) with small errors relative to filtered DNS, and showing robust generalization to higher Reynolds numbers without retraining (Luo et al., 2024).
A plausible implication is that TurboFNO's fusion paradigm is directly extensible to other PDE learning contexts and FNO variants with spectral layers, wherever GPU memory bandwidth and kernel-launch costs dominate performance. Its strategy of shared-memory-centric computation, bank-conflict-free layouts, and algorithmic pruning for task-specific requirements (such as spectral truncation) provides a framework for future, architecture-tailored neural operator implementations.