
Permutation-Avoiding FFT Convolution

Updated 12 February 2026
  • The paper introduces a method that defers all index-reversal permutations to an offline step applied to a fixed filter, eliminating runtime reordering in FFT-based convolution.
  • The approach restructures the convolution kernel to rely solely on butterfly operations, improving arithmetic intensity and memory access patterns.
  • Empirical benchmarks demonstrate up to 3× speedup in 1D and consistent gains in higher dimensions, making the method well suited to memory-bound applications.

Permutation-avoiding FFT-based convolution is a computational approach that eliminates runtime index-reversal permutations in fast Fourier transform (FFT)-based convolution algorithms by deferring all permutation operations to an offline step applied to a fixed filter. This procedure is particularly relevant for repeated convolutions with a fixed filter, as it maximizes arithmetic intensity and substantially improves memory access patterns. The key innovation is to restructure the convolution kernel such that only butterfly operations are required at runtime, removing the primary memory-bound bottleneck of standard FFT implementations and enabling significant speedups in high-dimensional, large-scale convolution tasks (Venkovic et al., 15 Jun 2025).

1. Standard Cooley–Tukey FFT and the Role of Permutations

The one-dimensional discrete Fourier transform (DFT) of length $N$ is given by

$$X[k] = \sum_{n=0}^{N-1} x[n]\; e^{-2\pi i\, kn/N},$$

which can be expressed in matrix form as $X = F_N x$, where $(F_N)_{k,n} = \omega_N^{kn}$ and $\omega_N = e^{-2\pi i/N}$. The inverse DFT satisfies $x = F_N^{-1} X$, with $F_N^{-1} = \overline{F_N}/N$.

The radix-$r$ Cooley–Tukey algorithm, assuming $N = r^t$, factors the DFT matrix as $F_N = A_{r,N} P_{r,N}$, where $A_{r,N}$ is a product of "butterfly" blocks applied in sequence, and $P_{r,N}$ is a symmetric, involutive index-reversal permutation matrix (a "modulo-$r$ sort", or digit reordering). While the butterfly blocks $B_{r,k}$ exploit data reuse and structured memory access, the permutation $P_{r,N}$ induces strided, non-local memory accesses. This degrades arithmetic intensity, making standard FFT-based convolutions memory-bound rather than compute-bound (Venkovic et al., 15 Jun 2025).
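As a concrete illustration, the sketch below (the function name `unordered_fft` is ours, not the paper's) runs only the butterfly stages of a radix-2 decimation-in-frequency FFT and omits the index-reversal pass, so the DFT coefficients emerge in bit-reversed order, i.e. $P_{2,N} F_N x$:

```python
import numpy as np

def unordered_fft(x):
    """Radix-2 decimation-in-frequency FFT with the index-reversal pass
    omitted: only butterfly stages execute, and the DFT coefficients come
    out in bit-reversed order (the result equals P_{2,N} F_N x)."""
    a = np.asarray(x, dtype=complex).copy()
    N = a.size
    m = N
    while m > 1:                                        # one butterfly stage per level
        half = m // 2
        w = np.exp(-2j * np.pi * np.arange(half) / m)   # twiddle factors
        for s in range(0, N, m):
            u = a[s:s + half].copy()
            v = a[s + half:s + m]
            a[s:s + half] = u + v
            a[s + half:s + m] = (u - v) * w
        m = half
    return a

# the butterfly-only output is the ordered DFT read through
# the bit-reversal permutation
N = 16
t = N.bit_length() - 1
rev = [int(f"{i:0{t}b}"[::-1], 2) for i in range(N)]
x = np.random.default_rng(0).standard_normal(N)
y = unordered_fft(x)
assert np.allclose(y[rev], np.fft.fft(x))
```

Reading the output through the bit-reversal permutation recovers the ordered DFT, which is what the final assertion checks.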

2. Permutation-Avoiding Convolution: Theory and Reformulation

In standard FFT-based convolution, the convolution operator is implemented as

$$\mathcal F_g(x) = F_N^{-1}\, \mathrm{diag}(g)\, F_N\, x,$$

where $g$ is the Fourier transform of the convolution filter. Expanding the DFT and its inverse via the Cooley–Tukey factorization $F_N = AP$ and $F_N^{-1} = \overline{AP}/N$, the convolution operator becomes

$$\mathcal F_g(x) = \frac{1}{N}\, \overline{A}\, P\, \mathrm{diag}(g)\, P\, A^T x = \frac{1}{N}\, \overline{A}\, \mathrm{diag}(Pg)\, A^T x = \frac{1}{N}\, \overline{A}\, \big(\widehat{g} \circ (A^T x)\big).$$

Here, $\widehat{g} = Pg$ denotes the permuted filter spectrum, and "$\circ$" denotes the Hadamard (elementwise) product. Importantly, the permutation matrix and corresponding data reordering are applied only once, offline, to the filter $g$. The online procedure for each new input consists solely of:

  1. Unordered forward FFT: $y = A^T x$,
  2. Elementwise multiplication: $y \leftarrow \widehat{g} \circ y$,
  3. Unordered inverse FFT: output $\frac{1}{N}\, \overline{A}\, y$.

All runtime index-reversal permutations are thus eliminated. This reorganization is valid because the permutation matrix $P$ is symmetric and involutive, $P^T = P = P^{-1}$, and because $F_N$ is itself symmetric, so that $F_N = AP = PA^T$ (Venkovic et al., 15 Jun 2025).
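The identity can be verified with dense matrices. In this sketch $A$ is simply taken as $F_N P$ (so $F_N = AP$ holds by construction); it checks the algebra of the three-step procedure, not a fast butterfly implementation:

```python
import numpy as np

# Dense-matrix check of F_g(x) = (1/N) * conj(A) (g_hat ∘ (A^T x)),
# with g_hat = P g, for radix 2 and N = 16.
N = 16
t = N.bit_length() - 1
rev = [int(f"{i:0{t}b}"[::-1], 2) for i in range(N)]
P = np.eye(N)[rev]                       # bit-reversal permutation matrix
k = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(k, k) / N)   # DFT matrix F_N
A = F @ P                                # F_N = A P  =>  A = F_N P

# P is symmetric and involutive, as the reorganization requires
assert np.allclose(P, P.T) and np.allclose(P @ P, np.eye(N))

rng = np.random.default_rng(0)
x = rng.standard_normal(N)
h = rng.standard_normal(N)               # fixed filter
g = np.fft.fft(h)                        # filter spectrum
g_hat = P @ g                            # offline: permute once

y = A.T @ x                              # 1. unordered forward FFT
y = g_hat * y                            # 2. elementwise multiplication
out = np.conj(A) @ y / N                 # 3. unordered inverse FFT

ref = np.fft.ifft(np.fft.fft(x) * g)     # standard FFT-based convolution
assert np.allclose(out, ref)
```

The final assertion confirms that the permutation-free three-step pipeline reproduces the standard $F_N^{-1}\,\mathrm{diag}(g)\,F_N\,x$ result.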

3. Multi-Dimensional Implementation and Data Access Patterns

The permutation-avoiding reformulation generalizes to $d$-dimensional convolutions. For tensor-shaped data $\mathscr X \in \mathbb C^{n_1 \times \cdots \times n_d}$ and a fixed filter transform $\mathscr G$, define

$$A_{\mathbf n} = A_{r,n_d} \otimes \cdots \otimes A_{r,n_1},$$

where $\otimes$ is the Kronecker product. Offline, compute the permuted filter vector

$$\widehat{g} = (P_{r,n_d} \otimes \cdots \otimes P_{r,n_1})\, \mathrm{vec}(\mathscr G).$$

At runtime, for each input $\mathscr X$:

  1. Vectorize: $x \leftarrow \mathrm{vec}(\mathscr X)$.
  2. Forward butterfly: $y \leftarrow A_{\mathbf n}^T x$.
  3. Pointwise multiply: $y \leftarrow \widehat{g} \circ y$.
  4. Inverse butterfly: $z \leftarrow \overline{A_{\mathbf n}}\, y$.
  5. Output: $\mathrm{vec}^{-1}(z)/(n_1 \cdots n_d)$.

Butterfly operations proceed in block-stride order without data permutations, enabling regular memory accesses and high register reuse. The pointwise multiplication is fully vectorizable, further improving computational efficiency (Venkovic et al., 15 Jun 2025).
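The runtime steps above can be checked end-to-end in two dimensions. In this hedged numpy sketch the per-axis unordered transforms are emulated by an ordered FFT plus a digit-reversal gather (the helper names `rev_idx`, `AT`, and `A` are ours, not the paper's), so it verifies the separable algebra rather than the memory-access benefit; the vec/unvec steps are absorbed by operating on the tensor axis by axis:

```python
import numpy as np

def rev_idx(n):
    # radix-2 digit-reversal indices; n must be a power of two
    t = n.bit_length() - 1
    return np.array([int(f"{i:0{t}b}"[::-1], 2) for i in range(n)])

def AT(X, axis):
    # unordered forward transform along one axis: A^T v = P (F v),
    # emulated as an ordered FFT followed by a digit-reversal gather
    return np.take(np.fft.fft(X, axis=axis), rev_idx(X.shape[axis]), axis=axis)

def A(X, axis):
    # A v = F (P v): digit-reversal gather followed by an ordered FFT
    return np.fft.fft(np.take(X, rev_idx(X.shape[axis]), axis=axis), axis=axis)

# problem setup (all sizes powers of two)
rng = np.random.default_rng(1)
n1, n2 = 8, 16
X = rng.standard_normal((n1, n2))
H = rng.standard_normal((n1, n2))

# offline: permuted filter spectrum, the per-axis image of (P ⊗ P) vec(G)
G_hat = np.take(np.take(np.fft.fft2(H), rev_idx(n1), axis=0),
                rev_idx(n2), axis=1)

# runtime: forward butterflies per axis, pointwise multiply,
# inverse butterflies, then normalize
Y = AT(AT(X, 0), 1)
Y = G_hat * Y
Z = np.conj(A(A(np.conj(Y), 0), 1)) / (n1 * n2)

ref = np.fft.ifft2(np.fft.fft2(X) * np.fft.fft2(H))
assert np.allclose(Z, ref)
```

The assertion confirms agreement with the standard `fft2`/`ifft2` circular convolution; a production kernel would replace each fft-plus-gather pair with a butterfly-only stage.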

4. Computational Complexity, Memory Traffic, and Arithmetic Intensity

Floating-Point Operations

  • Each radix-$r$ 1D unordered FFT requires $\alpha_r N \log_2 N$ floating-point operations, where numerically $\alpha_2 = 5$, $\alpha_4 = 4.25$, $\alpha_8 = 4.08$.
  • In $d$ dimensions, the operation count is $\sum_{q=1}^d \alpha_r N \log_2 n_q$.

Memory Traffic

  • Standard 1D FFT: the permutation pass requires $2N$ data accesses and each of the $\log_r N$ butterfly stages requires another $2N$, for a total of $2N(1+\log_r N)$ accesses.
  • Permutation-avoiding 1D FFT: only the $\log_r N$ butterfly stages run at runtime, yielding $2N \log_r N$ accesses.

Arithmetic Intensity

Arithmetic intensity is defined as the ratio of floating-point operations to bytes moved (for complex doubles, 16 bytes per access):

$$I_r(N) = \frac{\alpha_r N \log_2 N}{16 \cdot 2N\,(1+\log_r N)} = \frac{\alpha_r \log_2 r}{32}\, \frac{\log_r N}{1+\log_r N}.$$

As $N \to \infty$, $I_r \to \alpha_r \log_2 r / 32$. Sample values:

  • Radix 2: 0.156 flops/byte
  • Radix 4: 0.266 flops/byte
  • Radix 8: 0.383 flops/byte

Permutation-avoiding kernels drop the permutation pass, replacing the $1+\log_r N$ factor in the denominator with $\log_r N$, so arithmetic intensity reaches the theoretical limit even for moderate $N$ (Venkovic et al., 15 Jun 2025).
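The counts above can be tabulated with a short script under the stated cost model (the helper name `intensity` is ours, not the paper's); it reproduces the sample limits and shows the permutation-avoiding kernel sitting at the asymptote for any $N$:

```python
import math

def intensity(alpha, r, N, with_permutation=True):
    # flops ≈ alpha * N * log2(N); each of the log_r(N) butterfly stages
    # moves 2N complex doubles (16 bytes each), plus one 2N-access
    # permutation pass in the standard kernel
    stages = math.log(N, r)
    accesses = 2 * N * (stages + (1 if with_permutation else 0))
    return alpha * N * math.log2(N) / (16 * accesses)

# asymptotic limits alpha_r * log2(r) / 32, in flops/byte
for alpha, r, target in [(5.0, 2, 0.156), (4.25, 4, 0.266), (4.08, 8, 0.383)]:
    assert abs(alpha * math.log2(r) / 32 - target) < 1e-3
    # the permutation-avoiding kernel exceeds the standard kernel's
    # intensity at any finite N
    assert intensity(alpha, r, 2**20, with_permutation=False) \
        > intensity(alpha, r, 2**20)
```

Note that without the permutation pass the ratio is exactly $\alpha_r \log_2 r / 32$, independent of $N$.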

5. Empirical Benchmarks and Performance Gains

Extensive benchmarking on two Intel systems compared permutation-avoiding kernels (“PA”) to a general-radix FFT implementation and FFTW with FFTW_ESTIMATE. Highlights:

| Task | Method | Time (s) | Speedup vs. standard | Speedup vs. FFTW_ESTIMATE |
|---|---|---|---|---|
| 1D ($N=2^{24}$, radix 4) | Standard (perm-full) | 1.055 | 1.00 | 0.92 |
| 1D ($N=2^{24}$, radix 4) | Perm-avoiding (PA) | 0.505 | 2.09× | 1.85× |
| 2D ($2^{12} \times 2^{12}$, radix 4) | Standard | 2.200 | 1.00 | — |
| 2D ($2^{12} \times 2^{12}$, radix 4) | Perm-avoiding (PA) | 2.071 | 1.06× | — |
| 2D ($2^{12} \times 2^{12}$, radix 4) | FFTW_ESTIMATE | 1.759 | 1.25× | — |
| 3D ($2^{8} \times 2^{8} \times 2^{8}$, radix 2) | Standard | 1.734 | 1.00 | — |
| 3D ($2^{8} \times 2^{8} \times 2^{8}$, radix 2) | Perm-avoiding (PA) | 1.664 | 1.04× | — |
| 3D ($2^{8} \times 2^{8} \times 2^{8}$, radix 2) | FFTW_ESTIMATE | 1.988 | 0.87× | — |
  • For large 1D convolutions, speedups approach $3\times$ compared to standard convolution (where permutation cost dominates total time).
  • In higher dimensions, speedups are consistently in the $5\%$–$10\%$ range, attributed to improved spatial locality.
  • Higher radices exhibit similar trends, with the greatest benefit in regimes where the butterfly cost remains memory-bound (Venkovic et al., 15 Jun 2025).

6. Recommendations and Application Context

Practical adoption entails the following:

  • Libraries should provide an API for "fixed-filter convolution" that receives a pre-permuted filter spectrum $\widehat{g}$.
  • The internal workflow should be divided into an unordered forward butterfly ($A^T$), an elementwise multiply, and an unordered inverse butterfly ($\overline{A}/N$).
  • For mixed-radix and prime-factor FFT algorithms, the required permutation generalizes via $P_{\rho,N}$ or, in the prime-factor case, by composing Kronecker permutations with a permutation $\Upsilon^T$.
  • The technique applies only to scenarios involving repeated convolution with an invariant filter, as a one-time $O(N)$ offline permutation must be performed on the filter.
  • Further optimizations include cache blocking (four-step), multi-threading, SIMD fusion, and GPU memory coalescing.
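The recommended workflow can be sketched as a small plan-style API. The class name `FixedFilterConvolver` and the emulation of the unordered transforms (an ordered numpy FFT followed by an index gather, standing in for butterfly-only kernels) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def _digit_reverse_indices(n):
    # radix-2 digit-reversal permutation indices; n must be a power of two
    t = n.bit_length() - 1
    return np.array([int(f"{i:0{t}b}"[::-1], 2) for i in range(n)])

class FixedFilterConvolver:
    """Plan-style fixed-filter circular convolution (1D, radix 2).

    Offline: permute the filter spectrum once (g_hat = P g).
    Online: unordered forward transform, pointwise multiply,
    unordered inverse transform.
    """

    def __init__(self, filt):
        n = len(filt)
        assert n & (n - 1) == 0, "length must be a power of two"
        self.n = n
        self._rev = _digit_reverse_indices(n)
        # one-time O(N) offline step: g_hat = P @ fft(filter)
        self.g_hat = np.fft.fft(filt)[self._rev]

    def __call__(self, x):
        # y = A^T x: the DFT in digit-reversed order (emulated here by an
        # ordered FFT plus a gather; a real kernel uses butterflies only)
        y = np.fft.fft(x)[self._rev]
        y *= self.g_hat                  # elementwise multiply by g_hat
        # output = (1/n) * conj(A) y, again emulated via the ordered FFT
        return np.conj(np.fft.fft(np.conj(y)[self._rev])) / self.n
```

A production kernel would replace each fft-plus-gather pair with genuine unordered butterflies, so no runtime reordering remains; the class still illustrates the intended API split between the offline filter preparation and the permutation-free online path.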

This approach is suited for production FFT libraries in domains where convolution with a fixed filter is common and large-scale, memory-bound scenarios dominate computational cost. Deferring index-reversal permutations to an offline filter rearrangement yields a permutation-free, memory-efficient online convolution kernel that maximizes arithmetic intensity (Venkovic et al., 15 Jun 2025).

7. Limitations and Extension Prospects

Permutation-avoiding FFT-based convolution requires that the convolution be applied multiple times with a constant filter, as the only runtime-avoided cost is in the repeated evaluation stage; the initial offline permutation is linear in the data size. While this method is highly effective for memory-bound FFT scenarios, it is not universally optimal: for one-off convolutions or with frequently changing filters, the standard approach may remain preferable. Potential enhancements involve combining permutation-avoiding techniques with established optimizations such as cache-blocked algorithms, multi-threading, SIMD twiddle fusion, and adaptation for GPU memory infrastructures. A plausible implication is that this paradigm could be beneficially incorporated into library APIs and hardware-oriented FFT frameworks, though further empirical studies may be required to refine its applicability across diverse architectures (Venkovic et al., 15 Jun 2025).

