- The paper introduces a dual-select FMA strategy that eliminates twiddle factor singularities by adaptively choosing between sine and cosine paths.
- It achieves a 235× reduction in maximum relative error for FP16 implementations without increasing the FMA operation count.
- The method generalizes to various FFT sizes and hardware platforms, serving as a robust drop-in replacement for low-precision FFT kernels.
Dual-Select FMA Butterfly for FFT: Eliminating Twiddle Factor Singularities with Bounded Precomputed Ratios
Overview
The paper introduces a numerically robust FMA-based computational kernel for the radix-2 FFT butterfly, which eliminates all twiddle factor singularities by adaptively selecting a precomputed ratio per twiddle factor. The approach removes the need for epsilon clamping or reliance on pathological subnormals near singular points where floating-point underflow would otherwise occur, and guarantees an absolute bound on the precomputed ratio for every twiddle factor. The resulting kernel delivers strong improvements in numerical stability, particularly in FP16 implementations, where prior approaches yield catastrophic error amplification.
Factorization Methods and Singularities
The core operation in the radix-2 FFT—combining two complex numbers with a complex twiddle multiplication—can be mapped to six FMA instructions, the proven theoretical minimum. The classical Linzer-Feig factorization reformulates the butterfly outputs using a precomputed ratio t=cotθ, extracting sinθ as the outer multiplier, but this introduces a singularity at θ=0 due to division by zero. Traditional mitigation employs artificial epsilon substitution, which injects systematic twiddle error and degrades numerical precision.
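The sine-path factorization described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the helper names (`fma`, `butterfly_sine_path`) are mine, and `fma` is emulated with `a*b + c` since the single-rounding behavior of hardware FMA is not needed to show the data flow.

```python
import cmath
import math

def fma(a, b, c):
    # Software stand-in for a hardware fused multiply-add,
    # which would round once instead of twice.
    return a * b + c

def butterfly_sine_path(ar, ai, br, bi, s, t):
    """Radix-2 butterfly A' = A + W*B, B' = A - W*B with W = e^{i*theta},
    factored Linzer-Feig style: s = sin(theta), t = cot(theta).
    Six FMA operations; singular at theta = 0, where cot(theta) diverges."""
    u = fma(t, br, -bi)   # Re(W*B)/sin = cot*Br - Bi
    v = fma(t, bi,  br)   # Im(W*B)/sin = cot*Bi + Br
    apr = fma(s, u, ar)   # A' = A + s*(u + i*v)
    api = fma(s, v, ai)
    bpr = fma(-s, u, ar)  # B' = A - s*(u + i*v)
    bpi = fma(-s, v, ai)
    return apr, api, bpr, bpi

# Sanity check against direct complex arithmetic at a benign angle.
theta = 1.0
w = cmath.exp(1j * theta)
a, b = 0.3 + 0.7j, -0.2 + 0.5j
s, t = math.sin(theta), math.cos(theta) / math.sin(theta)
apr, api, bpr, bpi = butterfly_sine_path(a.real, a.imag, b.real, b.imag, s, t)
assert abs(complex(apr, api) - (a + w * b)) < 1e-12
assert abs(complex(bpr, bpi) - (a - w * b)) < 1e-12
```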
A complementary cosine path extracts cosθ as the outer multiplier and precomputes t=tanθ. While this avoids the singularity at zero, it merely shifts the problem to θ=π/2, where a zero denominator again occurs. The crucial observation is that these pathological domains are non-overlapping: at any θ, at least one of |tanθ| and |cotθ| is less than or equal to one, and the smaller of the two reaches its maximum of exactly 1 at θ=π/4. This symmetry motivates the dual-select approach.
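The non-overlap claim can be checked numerically: sampling angles across (0, π/2), the smaller of |tanθ| and |cotθ| never exceeds 1 and peaks at θ=π/4. A minimal verification sketch:

```python
import math

# The two singular regions (tan near pi/2, cot near 0) never overlap:
# for every theta in (0, pi/2), min(|tan|, |cot|) <= 1, with the
# maximum attained at theta = pi/4.
angles = [k * math.pi / 2 / 10000 for k in range(1, 10000)]
ratios = [min(abs(math.tan(t)), abs(1.0 / math.tan(t))) for t in angles]
assert max(ratios) <= 1.0 + 1e-12
peak = max(range(len(ratios)), key=lambda i: ratios[i])
assert abs(angles[peak] - math.pi / 4) < 1e-3
```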
Dual-Select Ratio Selection
The proposed dual-select algorithm chooses, for each twiddle factor, between the sine and cosine path based on the relative magnitudes of |sinθ| and |cosθ|. If |cosθ| > |sinθ|, the cosine path (t=tanθ) is used; otherwise, the sine path (t=cotθ) is selected. This ensures that the precomputed ratio always satisfies |t| ≤ 1, categorically eliminating the possibility of large intermediate values or division by near-zero and guaranteeing bounded error propagation regardless of θ or twiddle index.
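A sketch of the dual-select kernel under the factorizations above; the function names and path encoding are my own illustrative choices, not taken from the paper, and `fma` is again a software stand-in:

```python
import cmath
import math

def fma(a, b, c):
    # Software stand-in for a hardware fused multiply-add.
    return a * b + c

def make_twiddle(theta):
    """Precompute (scale, ratio, use_cosine_path) for one twiddle angle.
    Whichever of |cos| and |sin| is larger becomes the outer scale,
    so the stored ratio always satisfies |ratio| <= 1."""
    s, c = math.sin(theta), math.cos(theta)
    if abs(c) > abs(s):
        return c, s / c, True   # cosine path: ratio = tan(theta)
    return s, c / s, False      # sine path:   ratio = cot(theta)

def butterfly_dual_select(ar, ai, br, bi, scale, ratio, cosine_path):
    """Six-FMA radix-2 butterfly, free of twiddle singularities."""
    if cosine_path:
        u = fma(ratio, bi, -br)   # -Re(W*B)/cos = tan*Bi - Br
        v = fma(ratio, br,  bi)   #  Im(W*B)/cos = tan*Br + Bi
        apr = fma(-scale, u, ar)
        api = fma(scale, v, ai)
        bpr = fma(scale, u, ar)
        bpi = fma(-scale, v, ai)
    else:
        u = fma(ratio, br, -bi)   # Re(W*B)/sin = cot*Br - Bi
        v = fma(ratio, bi,  br)   # Im(W*B)/sin = cot*Bi + Br
        apr = fma(scale, u, ar)
        api = fma(scale, v, ai)
        bpr = fma(-scale, u, ar)
        bpi = fma(-scale, v, ai)
    return apr, api, bpr, bpi
```

Note that at θ=0 the cosine path is taken with ratio tan(0)=0, and at θ=π/2 the sine path is taken with ratio cot(π/2)=0, so both former singularities become exact, benign values.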
Implementation overhead is negligible: it suffices to augment the precomputed twiddle table with a single selection flag per factor and modify the FMA code path accordingly. Storage requirements are trivial compared to the twiddle table size itself, and no runtime divergence penalty occurs on common SIMD/MIMD GPU hardware.
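The augmented table described above might look like the following sketch, where each twiddle entry carries two floats plus a one-byte path flag (the exact storage layout here is my assumption, not the paper's):

```python
import math

def build_twiddle_table(n):
    """Precompute the dual-select twiddle table for an n-point radix-2 FFT.
    Each of the n/2 entries stores (scale, ratio) plus a one-byte path flag;
    the flag is negligible next to the two floats per twiddle."""
    scales, ratios, flags = [], [], bytearray()
    for k in range(n // 2):
        theta = 2.0 * math.pi * k / n
        s, c = math.sin(theta), math.cos(theta)
        if abs(c) > abs(s):                 # cosine path: ratio = tan(theta)
            scales.append(c); ratios.append(s / c); flags.append(1)
        else:                               # sine path:   ratio = cot(theta)
            scales.append(s); ratios.append(c / s); flags.append(0)
    return scales, ratios, flags

scales, ratios, flags = build_twiddle_table(1024)
assert all(abs(r) <= 1.0 for r in ratios)   # bounded ratio at every index
```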
Numerical Precision Analysis
The dominant error source in FMA-based FFTs is determined by the magnitude of the precomputed ratio in the butterfly computation. For Linzer-Feig, the maximum ratio for a 1024-point transform is cot(2π/1024) ≈ 163, so machine epsilon is directly multiplied by this factor during error amplification. In contrast, the dual-select method produces a worst-case ratio of exactly 1.0. Cumulative error, particularly relevant for deep (large N) FFTs, compounds over the log₂N butterfly passes. For FP16 over 10 passes, the dual-select method yields a 235× reduction in the maximum relative error bound—dropping it from order unity (effectively unusable) to the sub-percent level.
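The worst-case ratio figures follow directly from the factorizations: classical Linzer-Feig must store cot of the smallest nonzero twiddle angle 2π/N, while dual-select caps the ratio at 1 (attained at θ=π/4). A quick check of the arithmetic:

```python
import math

# Worst-case precomputed ratio for an n-point FFT.
n = 1024
smallest_angle = 2.0 * math.pi / n
linzer_feig_worst = abs(math.cos(smallest_angle) / math.sin(smallest_angle))
dual_select_worst = 1.0
print(round(linzer_feig_worst))   # ~163 for n = 1024
```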
In FP32, both methods exhibit error below the floating point noise floor, so the improvement is most salient for low-precision (e.g., FP16) workloads where classical FFT implementations are otherwise numerically infeasible. This directly affects deployment on AI accelerators and GPU architectures with high-throughput, low-precision units.
Generality and Implementation Implications
A key theoretical contribution is that the dual-select principle is radix- and size-agnostic; for any valid twiddle product in a Stockham or Cooley-Tukey FFT, the optimal minimal factorization can be selected on a per-operation basis with no impact on operation count or control flow complexity. This generalizes the classical single-path FMA kernel to a fully robust computational primitive for the FFT.
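To illustrate the per-operation selection in a complete transform, here is a minimal iterative radix-2 Cooley-Tukey FFT built on the dual-select butterfly. This is a pedagogical sketch under my own assumptions (twiddles computed on the fly rather than from a precomputed table, software `fma`), not the paper's kernel:

```python
import math

def fma(a, b, c):
    return a * b + c  # software stand-in for hardware FMA

def dual_select_fft(x):
    """In-place iterative radix-2 DIT FFT using the dual-select butterfly."""
    n = len(x)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    size = 2
    while size <= n:
        half = size // 2
        for k in range(half):
            theta = -2.0 * math.pi * k / size   # forward transform
            s, c = math.sin(theta), math.cos(theta)
            if abs(c) > abs(s):                 # cosine path: ratio = tan
                scale, ratio, cos_path = c, s / c, True
            else:                               # sine path: ratio = cot
                scale, ratio, cos_path = s, c / s, False
            for start in range(0, n, size):
                a, b = x[start + k], x[start + k + half]
                ar, ai, br, bi = a.real, a.imag, b.real, b.imag
                if cos_path:
                    u = fma(ratio, bi, -br)
                    v = fma(ratio, br,  bi)
                    x[start + k] = complex(fma(-scale, u, ar), fma(scale, v, ai))
                    x[start + k + half] = complex(fma(scale, u, ar), fma(-scale, v, ai))
                else:
                    u = fma(ratio, br, -bi)
                    v = fma(ratio, bi,  br)
                    x[start + k] = complex(fma(scale, u, ar), fma(scale, v, ai))
                    x[start + k + half] = complex(fma(-scale, u, ar), fma(-scale, v, ai))
        size *= 2
    return x
```

Every butterfly here executes exactly six FMA-shaped operations regardless of which path is selected, so the selection changes only which precomputed values are loaded, not the operation count.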
The elimination of singularities without arithmetic clamping also yields an analytically predictable and sharply bounded worst-case error profile. The kernel is immediately applicable to scenarios where FFT is performed in mixed or half-precision, serving both as a drop-in replacement for existing GPU kernels and as a blueprint for hardware-accelerated FFT microcode.
Potential and Future Directions
By guaranteeing bounded amplification in FP16, the dual-select kernel facilitates practical deployment of high-throughput FFT-based algorithms in applications such as high-resolution radar, communications DSP, neural network inference (spectral layers), and scientific computing, where low-precision arithmetic is otherwise precluded by precision concerns.
Future directions may include extending the dual-select principle to higher-radix butterflies, fusing additional twiddle multiplications in multi-radix transforms, and integrating selection logic directly into memory-mapped twiddle tables or hardware FFT pipelines. Additionally, this work provides a foundation for further reductions in communication and memory bandwidth, as tighter error bounds support more aggressive quantization, batching, or vectorization strategies.
Conclusion
The dual-select FMA butterfly strategy efficiently and systematically eliminates all twiddle factor singularities in radix-2 FFT implementations by adaptively selecting between sine and cosine-based ratio factorizations. This achieves a uniformly bounded precomputed ratio for any twiddle factor and substantially reduces cumulative numerical error, especially in FP16 settings where classical methods are numerically unstable. The method incurs zero runtime or computational overhead and preserves the minimal FMA operation count, thus enabling high-throughput, low-precision FFT computation in modern hardware and establishing a new standard for robust FFT butterfly implementations.