
TurboFNO: High-Performance Fourier Neural Operator with Fused FFT-GEMM-iFFT on GPU (2504.11681v1)

Published 16 Apr 2025 in cs.DC

Abstract: Fourier Neural Operators (FNO) are widely used for learning partial differential equation solution operators. However, FNO lacks architecture-aware optimizations, with its Fourier layers executing FFT, filtering, GEMM, zero padding, and iFFT as separate stages, incurring multiple kernel launches and significant global memory traffic. We propose TurboFNO, the first fully fused FFT-GEMM-iFFT GPU kernel with built-in FFT optimizations. We first develop FFT and GEMM kernels from scratch, achieving performance comparable to or faster than the closed-source SOTA cuBLAS and cuFFT. Additionally, our FFT kernel integrates built-in high-frequency truncation, input zero-padding, and pruning features to avoid additional memory copy kernels. To fuse the FFT and GEMM workloads, we propose an FFT variant in which a single thread block iterates over the hidden dimension, aligning with the k-loop in GEMM. Additionally, we design two shared memory swizzling patterns to achieve 100% memory bank utilization when forwarding FFT output to GEMM and to enable the iFFT to retrieve GEMM results directly from shared memory. Experimental results on an NVIDIA A100 GPU show that TurboFNO outperforms PyTorch, cuBLAS, and cuFFT by up to 150%.

Summary

  • The paper introduces TurboFNO, a fully fused GPU kernel that integrates FFT, frequency truncation, CGEMM, and iFFT to minimize kernel launch overhead and reduce data transfers.
  • It reports significant speedups—averaging 44–67% and reaching up to 150–250% over traditional PyTorch setups—through custom CUDA optimizations.
  • The approach leverages advanced memory management techniques, including shared memory forwarding and swizzling, to prune redundant computations and enhance scalability.

This paper introduces TurboFNO, a high-performance GPU kernel designed to accelerate Fourier Neural Operators (FNOs) by fusing the core computational stages: Fast Fourier Transform (FFT), Complex General Matrix Multiplication (CGEMM), and inverse Fast Fourier Transform (iFFT). Standard FNO implementations, often relying on libraries like PyTorch with cuFFT and cuBLAS, suffer from performance bottlenecks due to multiple kernel launches and excessive data movement to and from GPU global memory between stages (FFT, frequency truncation/filtering, GEMM, zero-padding, iFFT).
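
For reference, a library-based spectral layer of this kind boils down to the call sequence sketched below. This is a minimal illustration, not the paper's code: it assumes a [batch, hidden, n] buffer layout, pre-created cuFFT plans and a cuBLAS handle, weights shared across the retained modes (a real FNO applies a different weight per mode), and strided cudaMemcpy2D copies standing in for the dedicated truncation/padding kernels real implementations launch. Each numbered stage is a separate launch whose operands pass through global memory, which is exactly the traffic TurboFNO fuses away.

```cuda
// Unfused baseline FNO spectral layer (illustrative sketch, not TurboFNO).
// Assumes: plan_fwd/plan_inv created with cufftPlan1d(&plan, n, CUFFT_C2C, batch * hidden),
// all device buffers preallocated, inverse-FFT normalization omitted for brevity.
#include <cuda_runtime.h>
#include <cufft.h>
#include <cublas_v2.h>

void fno_spectral_layer_baseline(cufftHandle plan_fwd, cufftHandle plan_inv,
                                 cublasHandle_t blas,
                                 cufftComplex* x,          // [batch][hidden][n] input/output
                                 cufftComplex* freq,       // [batch][hidden][n] spectrum
                                 cufftComplex* kept_buf,   // [batch][hidden][kept] truncated
                                 cufftComplex* mixed,      // [batch][hidden][kept] after CGEMM
                                 cufftComplex* padded,     // [batch][hidden][n] zero-padded
                                 const cuComplex* weights, // [hidden][hidden], shared per mode
                                 int n, int kept, int batch, int hidden) {
  const cuComplex one  = make_cuComplex(1.0f, 0.0f);
  const cuComplex zero = make_cuComplex(0.0f, 0.0f);
  const int rows = batch * hidden;  // number of 1D signals

  // 1) Forward FFT along the spatial dimension.
  cufftExecC2C(plan_fwd, x, freq, CUFFT_FORWARD);

  // 2) Frequency truncation: keep the first `kept` modes of every signal (extra copy kernel).
  cudaMemcpy2D(kept_buf, kept * sizeof(cufftComplex),
               freq,     n    * sizeof(cufftComplex),
               kept * sizeof(cufftComplex), rows, cudaMemcpyDeviceToDevice);

  // 3) Channel-mixing CGEMM on the retained modes, batched over `batch`.
  cublasCgemmStridedBatched(blas, CUBLAS_OP_N, CUBLAS_OP_N,
                            kept, hidden, hidden,
                            &one,
                            kept_buf, kept,   (long long)kept * hidden,
                            weights,  hidden, 0,
                            &zero,
                            mixed,    kept,   (long long)kept * hidden,
                            batch);

  // 4) Zero-pad the truncated spectrum back to length n (another extra copy kernel).
  cudaMemset(padded, 0, (size_t)rows * n * sizeof(cufftComplex));
  cudaMemcpy2D(padded, n    * sizeof(cufftComplex),
               mixed,  kept * sizeof(cufftComplex),
               kept * sizeof(cufftComplex), rows, cudaMemcpyDeviceToDevice);

  // 5) Inverse FFT back to the spatial domain.
  cufftExecC2C(plan_inv, padded, x, CUFFT_INVERSE);
}
```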

TurboFNO addresses these inefficiencies through several key innovations:

  1. Fully Fused Kernel: The central contribution is a single GPU kernel that integrates the entire FFT -> Truncation -> CGEMM -> Zero-Padding -> iFFT pipeline. This fusion minimizes kernel launch overhead and significantly reduces reads and writes to global memory by keeping intermediate data in faster shared memory or registers.
  2. Custom High-Performance Kernels: To enable fusion, the authors developed custom CUDA kernels for FFT and CGEMM from scratch. These kernels are optimized for performance and achieve speeds comparable to or exceeding the proprietary cuFFT and cuBLAS libraries, while providing the flexibility needed for integration. The CGEMM uses a standard blocked approach targeting CUDA cores.
  3. Built-in FFT Optimizations:
    • Truncation & Zero-Padding: The custom FFT kernel natively supports frequency truncation (keeping only low-frequency modes, common in FNOs) and zero-padding (needed before iFFT). This eliminates the need for separate memory copy kernels used in standard implementations, directly reducing global memory traffic during the FFT and iFFT stages.
    • Pruning: By knowing which high-frequency outputs are discarded due to truncation, the FFT computation itself is pruned. Redundant butterfly operations, especially multiplications, that only contribute to the discarded frequencies are skipped, reducing the computational load.
  4. Dataflow Alignment for Fusion:
    • FFT-GEMM Alignment: A crucial step for fusion involves modifying the FFT execution pattern. Instead of performing both 1D FFT stages along spatial dimensions, the second stage (or the single stage in 1D FNO) is redesigned to compute along the hidden dimension. This aligns the FFT's data processing direction with the k-loop of the subsequent GEMM operation.
    • Shared Memory Forwarding: The output of the modified FFT is written directly into shared memory in a layout that matches the expected input tile (operand A) for the CGEMM kernel. This bypasses global memory entirely for the FFT output.
  5. Shared Memory Swizzling: To prevent performance degradation from shared memory bank conflicts during data forwarding between stages, TurboFNO implements two specific swizzling patterns (a generic illustration of the trick follows after this list):
    • FFT -> GEMM: A layout is used where consecutive threads write consecutive elements from the same FFT signal (pencil). While this is non-standard for FFT output, it aligns perfectly with GEMM's column-major access for operand A. To avoid bank conflicts during the FFT's write-to-shared-memory phase with this layout, thread-ID-based address offsets (addr += tid or addr += tid / 2) are applied to ensure threads access different banks simultaneously, achieving 100% bank utilization.
    • GEMM -> iFFT: When writing the GEMM result matrix (C) to shared memory for the iFFT stage, a similar bank conflict issue arises from the warp's thread mapping to the output tile. A swizzling pattern using an offset based on threadIdx.x / 4 is applied to stagger memory accesses and maintain 100% bank utilization.
  6. iFFT as Epilogue: The iFFT computation is integrated as an epilogue directly following the GEMM computation within the fused kernel. The GEMM result is consumed from shared memory by the iFFT, avoiding a final write of the GEMM result to global memory before the inverse transform.
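
The bank-conflict-avoidance trick in item 5 can be shown in isolation. The sketch below is a generic illustration, not TurboFNO's kernel: it assumes a hypothetical 32x32 float tile and a rotation offset derived from the thread ID, the same family of swizzle as the addr += tid and threadIdx.x / 4 offsets described above, whose exact form in the paper depends on the FFT radix and CGEMM tile shapes.

```cuda
// Generic illustration of thread-ID-based shared-memory swizzling (not TurboFNO's kernel).
// Scenario: each thread of a warp stores one element of a logical tile column.
// Without swizzling, all 32 stores land in the same bank (32-way conflict);
// rotating the column index by the row index spreads them across all 32 banks.
#define TILE 32  // hypothetical tile width; shared memory has 32 four-byte banks

__device__ __forceinline__ int swizzle(int row, int col) {
  // Rotate each row by its row index so that a warp touching one logical column
  // hits 32 distinct banks. The reader must apply the same mapping.
  return row * TILE + ((col + row) & (TILE - 1));
}

__global__ void swizzle_demo(const float* __restrict__ in, float* __restrict__ out) {
  __shared__ float tile[TILE * TILE];
  int t = threadIdx.x;  // 0..31, a single warp for simplicity

  // Producer (stand-in for the FFT's write-to-shared phase):
  // thread t writes row t of every logical column; banks differ per thread.
  for (int col = 0; col < TILE; ++col)
    tile[swizzle(t, col)] = in[col * TILE + t];
  __syncwarp();

  // Consumer (stand-in for the GEMM's operand-A read):
  // thread t reads logical column t; the matching swizzle keeps reads conflict-free too.
  for (int row = 0; row < TILE; ++row)
    out[t * TILE + row] = tile[swizzle(row, t)];
}

// Launch (schematic): swizzle_demo<<<1, 32>>>(d_in, d_out);
// The kernel simply copies in -> out through the swizzled tile.
```

Without the row-dependent rotation, each column access would map all 32 threads of the warp to the same bank and serialize into 32 transactions; with it, the warp covers all 32 banks per access, which is the 100% bank utilization the paper reports for both forwarding directions.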

Implementation Details:

  • The implementation targets CUDA cores on NVIDIA GPUs (tested on A100).
  • The CGEMM uses parameters like block sizes (e.g., M_tb=32, N_tb=32, K_tb=8) and warp-level tile sizes (e.g., m_w=32, n_w=16), implemented using templated CUDA code for flexibility.
  • The Stockham FFT formulation is used for coalesced global memory reads.
  • Pseudocode (Figure 12) illustrates how the memory operations for GEMM operand A and result C are replaced by FFT and iFFT calls within the fused kernel structure (a schematic skeleton in that spirit follows below).
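
As a rough companion to that pseudocode, the skeleton below reflects one reading of the fused structure rather than the released implementation: the FFT stage and iFFT epilogue are reduced to placeholder shared-memory copies so the sketch compiles, the thread-to-output mapping is simplified to one element per thread instead of the paper's warp tiling, and all global-memory layouts are schematic.

```cuda
// Structural sketch of the fused FFT -> CGEMM -> iFFT kernel (one reading of the paper's
// Figure-12 pseudocode, not the released code). Tile sizes follow the reported
// configuration (M_tb = N_tb = 32, K_tb = 8); K is assumed divisible by K_TB.
#include <cuComplex.h>

constexpr int M_TB = 32, N_TB = 32, K_TB = 8;  // thread-block tile, as reported

// Placeholder: in TurboFNO this runs one Stockham FFT stage (with built-in truncation
// and pruning) and writes its output, swizzled, into the shared-memory tile that a
// conventional GEMM would fill with a global-memory load of operand A.
__device__ void fft_stage_into_smem(cuComplex (&tileA)[M_TB][K_TB],
                                    const cuComplex* __restrict__ input, int k_block) {
  for (int i = threadIdx.x; i < M_TB * K_TB; i += blockDim.x)
    tileA[i / K_TB][i % K_TB] = input[k_block * M_TB * K_TB + i];  // placeholder copy
}

// Placeholder: in TurboFNO this is the iFFT epilogue that consumes the CGEMM result
// directly from shared memory (zero-padding the truncated modes) instead of storing C.
__device__ void ifft_epilogue_from_smem(const cuComplex (&tileC)[M_TB][N_TB],
                                        cuComplex* __restrict__ output) {
  for (int i = threadIdx.x; i < M_TB * N_TB; i += blockDim.x)
    output[i] = tileC[i / N_TB][i % N_TB];  // placeholder copy
}

__global__ void fused_fft_cgemm_ifft_sketch(const cuComplex* __restrict__ A_signals,
                                            const cuComplex* __restrict__ B_weights,
                                            cuComplex* __restrict__ out, int K) {
  __shared__ cuComplex tileA[M_TB][K_TB];
  __shared__ cuComplex tileB[K_TB][N_TB];
  __shared__ cuComplex tileC[M_TB][N_TB];

  // One accumulator per thread (block of M_TB * N_TB threads, one C element each).
  cuComplex acc = make_cuComplex(0.0f, 0.0f);
  int row = threadIdx.x % M_TB, col = threadIdx.x / M_TB;

  // Main k-loop over the hidden dimension: the operand-A load is replaced by an FFT stage.
  for (int kb = 0; kb < K / K_TB; ++kb) {
    fft_stage_into_smem(tileA, A_signals, kb);            // was: global-memory load of A tile
    for (int i = threadIdx.x; i < K_TB * N_TB; i += blockDim.x)
      tileB[i / N_TB][i % N_TB] = B_weights[kb * K_TB * N_TB + i];  // weights tile (schematic)
    __syncthreads();
    for (int kk = 0; kk < K_TB; ++kk)                     // rank-K_TB update
      acc = cuCaddf(acc, cuCmulf(tileA[row][kk], tileB[kk][col]));
    __syncthreads();
  }

  tileC[row][col] = acc;
  __syncthreads();
  ifft_epilogue_from_smem(tileC, out);                    // was: global-memory store of C tile
}

// Launch (schematic): fused_fft_cgemm_ifft_sketch<<<1, M_TB * N_TB>>>(A, B, C, hidden);
```

In the actual kernel, the two placeholders are where the FFT stages and the zero-padded iFFT epilogue run, so the GEMM's A operand and C result stay in shared memory instead of round-tripping through global memory between the transforms.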

Evaluation:

TurboFNO was compared against a PyTorch baseline (using cuFFT/cuBLAS) and a custom CUDA C baseline mimicking PyTorch's separate kernel calls.

  • Performance Gains: TurboFNO demonstrated significant speedups for both 1D and 2D FNO problems.
    • Average speedup over PyTorch was ~44-67%.
    • Maximum speedup reached up to 150-250% depending on the configuration.
  • Optimization Impact:
    • FFT pruning, truncation, and zero-padding provided substantial benefits (often >50% speedup alone), especially in 2D FNO where FFT overhead is larger.
    • Kernel fusion (FFT-GEMM, GEMM-iFFT, full FFT-GEMM-iFFT) provided additional gains, particularly in 1D FNO (~10-20% extra speedup from full fusion). In 2D FNO, the fusion gains were smaller (few percent) as the optimized FFT stage dominated runtime, but full fusion still consistently outperformed partial fusion or FFT optimization alone.
  • Scalability: Performance benefits were generally more pronounced for larger batch sizes and input dimensions, where memory bandwidth limitations are more critical.
  • Robustness: Slowdowns compared to PyTorch only occurred in scenarios with very small batch sizes combined with large hidden dimensions, where the specific thread block mapping strategy of TurboFNO might lead to underutilization.

In summary, TurboFNO presents a practical approach to accelerate FNOs by co-designing and fusing the FFT, GEMM, and iFFT stages into a single, highly optimized GPU kernel. By eliminating intermediate memory transfers, pruning redundant computations, and carefully managing shared memory access patterns, it achieves significant performance improvements over standard library-based implementations, particularly for large-scale scientific applications.
