- The paper introduces a dual-select FMA strategy that eliminates twiddle factor singularities by adaptively choosing between sine and cosine paths.
- It achieves a 235× reduction in maximum relative error for FP16 implementations without increasing the FMA operation count.
- The method generalizes to various FFT sizes and hardware platforms, serving as a robust drop-in replacement for low-precision FFT kernels.
Dual-Select FMA Butterfly for FFT: Eliminating Twiddle Factor Singularities with Bounded Precomputed Ratios
Overview
The paper introduces a numerically robust FMA-based computational kernel for the radix-2 FFT butterfly, which eliminates all twiddle factor singularities by adaptively selecting a precomputed ratio per twiddle factor. The approach removes the need for epsilon clamping or reliance on pathological subnormals near singular points where floating-point underflow would otherwise occur, and guarantees an absolute bound on the precomputed ratio for every twiddle factor. The resulting kernel delivers strong improvements in numerical stability, particularly in FP16 implementations, where prior approaches yield catastrophic error amplification.
Factorization Methods and Singularities
The core operation in the radix-2 FFT—combining two complex numbers with a complex twiddle multiplication—can be mapped to six FMA instructions, the proven theoretical minimum. The classical Linzer-Feig factorization reformulates the butterfly outputs using a precomputed ratio t=cotθ, extracting sinθ as the outer multiplier, but this introduces a singularity at θ=0 due to division by zero. Traditional mitigation employs artificial epsilon substitution, which injects systematic twiddle error and degrades numerical precision.
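The sine-path factorization described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the helper names (`fma`, `butterfly_sine_path`) are mine, and `fma` is emulated with `a*b + c` since the single-rounding behavior of hardware FMA is not needed to show the data flow.

```python
import cmath
import math

def fma(a, b, c):
    # Software stand-in for a hardware fused multiply-add,
    # which would round once instead of twice.
    return a * b + c

def butterfly_sine_path(ar, ai, br, bi, s, t):
    """Radix-2 butterfly A' = A + W*B, B' = A - W*B with W = e^{i*theta},
    factored Linzer-Feig style: s = sin(theta), t = cot(theta).
    Six FMA operations; singular at theta = 0, where cot(theta) diverges."""
    u = fma(t, br, -bi)   # Re(W*B)/sin = cot*Br - Bi
    v = fma(t, bi,  br)   # Im(W*B)/sin = cot*Bi + Br
    apr = fma(s, u, ar)   # A' = A + s*(u + i*v)
    api = fma(s, v, ai)
    bpr = fma(-s, u, ar)  # B' = A - s*(u + i*v)
    bpi = fma(-s, v, ai)
    return apr, api, bpr, bpi

# Sanity check against direct complex arithmetic at a benign angle.
theta = 1.0
w = cmath.exp(1j * theta)
a, b = 0.3 + 0.7j, -0.2 + 0.5j
s, t = math.sin(theta), math.cos(theta) / math.sin(theta)
apr, api, bpr, bpi = butterfly_sine_path(a.real, a.imag, b.real, b.imag, s, t)
assert abs(complex(apr, api) - (a + w * b)) < 1e-12
assert abs(complex(bpr, bpi) - (a - w * b)) < 1e-12
```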
A complementary cosine path extracts cosθ as the outer multiplier and precomputes t=tanθ. While this avoids the singularity at zero, it merely shifts the problem to θ=π/2, where a zero denominator again occurs. The crucial observation is that these pathological domains are non-overlapping: at any θ, at least one of |tanθ| and |cotθ| is less than or equal to one, and the smaller of the two reaches its maximum of exactly 1 at θ=π/4. This symmetry motivates the dual-select approach.
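The non-overlap claim can be checked numerically: sampling angles across (0, π/2), the smaller of |tanθ| and |cotθ| never exceeds 1 and peaks at θ=π/4. A minimal verification sketch:

```python
import math

# The two singular regions (tan near pi/2, cot near 0) never overlap:
# for every theta in (0, pi/2), min(|tan|, |cot|) <= 1, with the
# maximum attained at theta = pi/4.
angles = [k * math.pi / 2 / 10000 for k in range(1, 10000)]
ratios = [min(abs(math.tan(t)), abs(1.0 / math.tan(t))) for t in angles]
assert max(ratios) <= 1.0 + 1e-12
peak = max(range(len(ratios)), key=lambda i: ratios[i])
assert abs(angles[peak] - math.pi / 4) < 1e-3
```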
Dual-Select Ratio Selection
The proposed dual-select algorithm chooses, for each twiddle factor, between the sine and cosine path based on the relative magnitudes of |sinθ| and |cosθ|. If |cosθ| > |sinθ|, the cosine path (t=tanθ) is used; otherwise, the sine path (t=cotθ) is selected. This ensures that the precomputed ratio always satisfies |t| ≤ 1, categorically eliminating the possibility of large intermediate values or division by near-zero and guaranteeing bounded error propagation regardless of θ or twiddle index.
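A sketch of the dual-select kernel under the factorizations above; the function names and path encoding are my own illustrative choices, not taken from the paper, and `fma` is again a software stand-in:

```python
import cmath
import math

def fma(a, b, c):
    # Software stand-in for a hardware fused multiply-add.
    return a * b + c

def make_twiddle(theta):
    """Precompute (scale, ratio, use_cosine_path) for one twiddle angle.
    Whichever of |cos| and |sin| is larger becomes the outer scale,
    so the stored ratio always satisfies |ratio| <= 1."""
    s, c = math.sin(theta), math.cos(theta)
    if abs(c) > abs(s):
        return c, s / c, True   # cosine path: ratio = tan(theta)
    return s, c / s, False      # sine path:   ratio = cot(theta)

def butterfly_dual_select(ar, ai, br, bi, scale, ratio, cosine_path):
    """Six-FMA radix-2 butterfly, free of twiddle singularities."""
    if cosine_path:
        u = fma(ratio, bi, -br)   # -Re(W*B)/cos = tan*Bi - Br
        v = fma(ratio, br,  bi)   #  Im(W*B)/cos = tan*Br + Bi
        apr = fma(-scale, u, ar)
        api = fma(scale, v, ai)
        bpr = fma(scale, u, ar)
        bpi = fma(-scale, v, ai)
    else:
        u = fma(ratio, br, -bi)   # Re(W*B)/sin = cot*Br - Bi
        v = fma(ratio, bi,  br)   # Im(W*B)/sin = cot*Bi + Br
        apr = fma(scale, u, ar)
        api = fma(scale, v, ai)
        bpr = fma(-scale, u, ar)
        bpi = fma(-scale, v, ai)
    return apr, api, bpr, bpi
```

Note that at θ=0 the cosine path is taken with ratio tan(0)=0, and at θ=π/2 the sine path is taken with ratio cot(π/2)=0, so both former singularities become exact, benign values.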
Implementation overhead is negligible: it suffices to augment the precomputed twiddle table with a single selection flag per factor and modify the FMA code path accordingly. Storage requirements are trivial compared to the twiddle table size itself, and no runtime divergence penalty occurs on common SIMD/MIMD GPU hardware.
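The augmented table described above might look like the following sketch, where each twiddle entry carries two floats plus a one-byte path flag (the exact storage layout here is my assumption, not the paper's):

```python
import math

def build_twiddle_table(n):
    """Precompute the dual-select twiddle table for an n-point radix-2 FFT.
    Each of the n/2 entries stores (scale, ratio) plus a one-byte path flag;
    the flag is negligible next to the two floats per twiddle."""
    scales, ratios, flags = [], [], bytearray()
    for k in range(n // 2):
        theta = 2.0 * math.pi * k / n
        s, c = math.sin(theta), math.cos(theta)
        if abs(c) > abs(s):                 # cosine path: ratio = tan(theta)
            scales.append(c); ratios.append(s / c); flags.append(1)
        else:                               # sine path:   ratio = cot(theta)
            scales.append(s); ratios.append(c / s); flags.append(0)
    return scales, ratios, flags

scales, ratios, flags = build_twiddle_table(1024)
assert all(abs(r) <= 1.0 for r in ratios)   # bounded ratio at every index
```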
Numerical Precision Analysis
The dominant error source in FMA-based FFTs is determined by the magnitude of the precomputed ratio in the butterfly computation. For Linzer-Feig, the maximum ratio for a 1024-point transform is cot(2π/1024) ≈ 163, so machine epsilon is directly multiplied by this factor during error amplification. In contrast, the dual-select method produces a worst-case ratio of exactly 1.0. Cumulative error, particularly relevant for deep (large N) FFTs, compounds over the log₂N butterfly passes. For FP16 over 10 passes, the dual-select method yields a 235× reduction in the maximum relative error bound—dropping it from order unity (effectively unusable) to the sub-percent level.
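The worst-case ratio figures follow directly from the factorizations: classical Linzer-Feig must store cot of the smallest nonzero twiddle angle 2π/N, while dual-select caps the ratio at 1 (attained at θ=π/4). A quick check of the arithmetic:

```python
import math

# Worst-case precomputed ratio for an n-point FFT.
n = 1024
smallest_angle = 2.0 * math.pi / n
linzer_feig_worst = abs(math.cos(smallest_angle) / math.sin(smallest_angle))
dual_select_worst = 1.0
print(round(linzer_feig_worst))   # ~163 for n = 1024
```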
In FP32, both methods exhibit error below the floating point noise floor, so the improvement is most salient for low-precision (e.g., FP16) workloads where classical FFT implementations are otherwise numerically infeasible. This directly affects deployment on AI accelerators and GPU architectures with high-throughput, low-precision units.
Generality and Implementation Implications
A key theoretical contribution is that the dual-select principle is radix- and size-agnostic; for any valid twiddle product in a Stockham or Cooley-Tukey FFT, the optimal minimal factorization can be selected on a per-operation basis with no impact on operation count or control flow complexity. This generalizes the classical single-path FMA kernel to a fully robust computational primitive for the FFT.
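To illustrate the per-operation selection in a complete transform, here is a minimal iterative radix-2 Cooley-Tukey FFT built on the dual-select butterfly. This is a pedagogical sketch under my own assumptions (twiddles computed on the fly rather than from a precomputed table, software `fma`), not the paper's kernel:

```python
import math

def fma(a, b, c):
    return a * b + c  # software stand-in for hardware FMA

def dual_select_fft(x):
    """In-place iterative radix-2 DIT FFT using the dual-select butterfly."""
    n = len(x)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    size = 2
    while size <= n:
        half = size // 2
        for k in range(half):
            theta = -2.0 * math.pi * k / size   # forward transform
            s, c = math.sin(theta), math.cos(theta)
            if abs(c) > abs(s):                 # cosine path: ratio = tan
                scale, ratio, cos_path = c, s / c, True
            else:                               # sine path: ratio = cot
                scale, ratio, cos_path = s, c / s, False
            for start in range(0, n, size):
                a, b = x[start + k], x[start + k + half]
                ar, ai, br, bi = a.real, a.imag, b.real, b.imag
                if cos_path:
                    u = fma(ratio, bi, -br)
                    v = fma(ratio, br,  bi)
                    x[start + k] = complex(fma(-scale, u, ar), fma(scale, v, ai))
                    x[start + k + half] = complex(fma(scale, u, ar), fma(-scale, v, ai))
                else:
                    u = fma(ratio, br, -bi)
                    v = fma(ratio, bi,  br)
                    x[start + k] = complex(fma(scale, u, ar), fma(scale, v, ai))
                    x[start + k + half] = complex(fma(-scale, u, ar), fma(-scale, v, ai))
        size *= 2
    return x
```

Every butterfly here executes exactly six FMA-shaped operations regardless of which path is selected, so the selection changes only which precomputed values are loaded, not the operation count.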
The elimination of singularities without arithmetic clamping also yields an analytically predictable and sharply bounded worst-case error profile. The kernel is immediately applicable to scenarios where FFT is performed in mixed or half-precision, serving both as a drop-in replacement for existing GPU kernels and as a blueprint for hardware-accelerated FFT microcode.
Potential and Future Directions
By guaranteeing bounded amplification in FP16, the dual-select kernel facilitates practical deployment of high-throughput FFT-based algorithms in applications such as high-resolution radar, communications DSP, neural network inference (spectral layers), and scientific computing, where low-precision arithmetic is otherwise precluded by precision concerns.
Future directions may include extending the dual-select principle to higher-radix butterflies, fusing additional twiddle multiplications in multi-radix transforms, and integrating selection logic directly into memory-mapped twiddle tables or hardware FFT pipelines. Additionally, this work provides a foundation for further reductions in communication and memory bandwidth, as tighter error bounds support more aggressive quantization, batching, or vectorization strategies.
Conclusion
The dual-select FMA butterfly strategy efficiently and systematically eliminates all twiddle factor singularities in radix-2 FFT implementations by adaptively selecting between sine and cosine-based ratio factorizations. This achieves a uniformly bounded precomputed ratio for any twiddle factor and substantially reduces cumulative numerical error, especially in FP16 settings where classical methods are numerically unstable. The method incurs zero runtime or computational overhead and preserves the minimal FMA operation count, thus enabling high-throughput, low-precision FFT computation in modern hardware and establishing a new standard for robust FFT butterfly implementations.