- The paper introduces a GPU-optimized FFT that integrates two-sided ABFT for fault tolerance, outperforming cuFFT by up to 300%.
- It employs a novel, padding-free and template-based design using memory coalescing and twiddle factor optimization for superior hardware utilization.
- Extensive benchmarks on NVIDIA GPUs show that TurboFFT maintains robust error correction with only 7-15% overhead even under frequent error injections.
The paper under review introduces TurboFFT, a novel Fast Fourier Transform (FFT) prototype designed for high-performance execution on graphics processing units (GPUs) with built-in fault tolerance. The research addresses two primary shortcomings of existing FFT libraries: inefficient computational execution on GPUs and the absence of robust fault tolerance against soft errors.
Key Contributions
The primary contributions of TurboFFT can be delineated as follows:
- High-Performance Architecture-Aware Design: TurboFFT offers a new FFT implementation that rivals industry-standard libraries such as NVIDIA's cuFFT. The authors present an architecture-aware, padding-free, and template-based design to maximize hardware resource utilization. This approach includes memory coalescing strategies, twiddle factor optimization, and shared memory utilization without padding but with swizzling. These design choices allow TurboFFT to surpass existing solutions in key benchmarks, particularly for smaller FFT sizes on an NVIDIA A100 GPU.
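To make the padding-free swizzling idea concrete, the following is a minimal illustrative sketch (not TurboFFT's actual code, whose exact index mapping the review does not reproduce) of the common XOR-based swizzle used to avoid shared-memory bank conflicts. On NVIDIA GPUs, shared memory is split into 32 banks; a warp reading one column of a row-major 32-wide tile hits the same bank 32 times, serializing the access. XOR-ing the column index with the row index spreads a column walk across all banks without padding the tile.

```python
# Hypothetical sketch: XOR swizzle for bank-conflict-free shared memory access.
# All names here are illustrative, not from the paper.

NUM_BANKS = 32  # shared-memory banks on current NVIDIA GPUs

def bank(index):
    """Bank serviced by the 4-byte word at this flat shared-memory index."""
    return index % NUM_BANKS

def naive_index(row, col, width=32):
    """Row-major layout: every element of a fixed column maps to one bank."""
    return row * width + col

def swizzled_index(row, col, width=32):
    """XOR the column with the row: a column walk now touches all 32 banks."""
    return row * width + (col ^ (row % NUM_BANKS))

# A warp reading one column (fixed col, rows 0..31) of a 32x32 tile:
col = 5
naive_banks = {bank(naive_index(r, col)) for r in range(32)}
swizzled_banks = {bank(swizzled_index(r, col)) for r in range(32)}
print(len(naive_banks), len(swizzled_banks))  # prints: 1 32
```

The naive layout serializes the column read into 32 bank transactions, while the swizzled layout services it in one, matching the conflict-free behavior that padding would buy but without the extra shared memory.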
- Two-Sided Algorithm-Based Fault Tolerance (ABFT): The paper explores a two-sided ABFT scheme enabling online fault correction without additional global memory overhead. The two-sided strategy, which encodes error location at both the thread and threadblock levels, allows errors to be detected and corrected with minimal overhead. This two-level strategy amortizes the costlier threadblock-level reductions, achieving error resilience with a mere 7-15% performance overhead even under frequent error injections.
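The checksum principle behind ABFT for an FFT can be illustrated with a much simpler scheme than the paper's two-sided, location-encoding design: because the DFT is linear, the identity sum_j y[j] = n * x[0] holds for any length-n transform, so one cheap output-side reduction can flag a corrupted output element. The sketch below (all names hypothetical; correction here is by recomputation rather than the paper's online correction) demonstrates detecting an injected soft error with this invariant.

```python
# Minimal checksum-based fault detection for a DFT. This is an illustrative
# single-checksum scheme, NOT the paper's two-sided ABFT, which additionally
# encodes error location at the thread and threadblock levels.

import cmath

def dft(x):
    """Reference O(n^2) DFT; stands in for the GPU FFT kernel."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def fft_with_abft(x, inject_error=False, tol=1e-6):
    n = len(x)
    y = dft(x)
    if inject_error:
        y[3] += 10.0  # simulated transient soft error in one output element
    # Checksum invariant of the DFT: sum of all outputs equals n * x[0].
    if abs(sum(y) - n * x[0]) > tol * max(1.0, abs(n * x[0])):
        y = dft(x)    # error detected: correct by recomputing the transform
    return y

x = [complex(k + 1, 0) for k in range(8)]
clean = fft_with_abft(x)
recovered = fft_with_abft(x, inject_error=True)  # detects and repairs the fault
```

The check costs one reduction over the output, which mirrors why ABFT overhead can stay low relative to the transform itself; TurboFFT's contribution is making such checks two-sided and hierarchical so that detection also localizes the fault.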
- Extensive Benchmarking and Performance Validation: Empirical results demonstrate that TurboFFT runs up to 300% faster than cuFFT in scenarios without fault tolerance and remains comparably efficient with fault tolerance enabled. The authors validate their claims with extensive single- and double-precision floating-point tests on two GPU architectures, the NVIDIA A100 and the Tesla T4 (Turing).
Implications and Future Directions
The contributions of TurboFFT are significant for both computational science and applied fields reliant on high-efficiency FFT computations. The ability to execute highly optimized FFT operations on GPUs while also maintaining resilience against soft errors will prove advantageous for applications in scientific computing, signal processing, and large-scale simulations where FFT workloads dominate computational costs.
The research suggests a possible paradigm shift in FFT library development toward integrating fault tolerance mechanisms within the computational framework itself rather than as an added layer of complexity, thereby maintaining high throughput and low overhead. The templated design further suggests an avenue for adaptive FFT library implementations capable of tuning themselves to different hardware architectures or application-specific workloads.
As future work, researchers might extend TurboFFT's fault-tolerant framework to other computational techniques, such as scientific data compression and broader signal processing methods. Additionally, with the rapid evolution of GPU architectures, ongoing evaluation of TurboFFT against new hardware capabilities will reveal further optimization opportunities.
In conclusion, TurboFFT presents a commendable advancement in FFT computation methodologies on GPUs. By simultaneously addressing performance and reliability concerns, the authors offer a compelling solution that is both practically relevant and theoretically robust. Researchers and practitioners engaging with GPU-based computational tasks will likely find significant value in adopting and extending this work.