Permutation-Avoiding FFT Convolution

Updated 12 February 2026

The paper introduces a method that defers all index-reversal permutations to an offline step applied to a fixed filter, eliminating runtime reordering in FFT-based convolution.
The approach restructures the convolution kernel to rely solely on butterfly operations at runtime, raising arithmetic intensity and making memory access more regular.
Empirical benchmarks demonstrate up to 3x speedup in 1D and modest but consistent gains in higher dimensions, making it well suited to memory-bound applications that reuse a fixed filter.

Permutation-avoiding FFT-based convolution is a computational approach that eliminates runtime index-reversal permutations in fast Fourier transform (FFT)-based convolution algorithms by deferring all permutation operations to an offline step applied to a fixed filter. This procedure is particularly relevant for repeated convolutions with a fixed filter, as it maximizes arithmetic intensity and substantially improves memory access patterns. The key innovation is to restructure the convolution kernel such that only butterfly operations are required at runtime, removing the primary memory-bound bottleneck of standard FFT implementations and enabling significant speedups in high-dimensional, large-scale convolution tasks (Venkovic et al., 15 Jun 2025).

1. Standard Cooley–Tukey FFT and the Role of Permutations

The one-dimensional discrete Fourier transform (DFT) of length $N$ is given by

$X[k] = \sum_{n=0}^{N-1} x[n]\;e^{-2\pi i\,kn/N},$

which can be expressed in matrix form as $X = F_N x$, where $(F_N)_{k,n} = \omega_N^{kn}$, $\omega_N = e^{-2\pi i/N}$. The inverse DFT satisfies $x = F_N^{-1}X$, with $F_N^{-1} = \overline{F_N}/N$.
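As a small illustration of these definitions, the sketch below builds the dense DFT matrix in pure Python and checks the inverse relation $F_N^{-1} = \overline{F_N}/N$ numerically. The helper names (`dft_matrix`, `matvec`, `inverse_dft`) are hypothetical and chosen only for this example; the dense $O(N^2)$ form is for checking, not for performance.

```python
import cmath

def dft_matrix(n):
    """Dense DFT matrix with entries omega_n**(k*m), omega_n = exp(-2*pi*i/n)."""
    return [[cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n)]
            for k in range(n)]

def matvec(mat, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]

def inverse_dft(X):
    """Apply F_N^{-1} = conj(F_N) / N."""
    n = len(X)
    F_inv = [[entry.conjugate() / n for entry in row] for row in dft_matrix(n)]
    return matvec(F_inv, X)

x = [1.0, 2.0, 0.0, -1.0]
X = matvec(dft_matrix(len(x)), x)   # forward DFT
x_back = inverse_dft(X)             # should reproduce x up to rounding
roundtrip_error = max(abs(a - b) for a, b in zip(x, x_back))
```

The round trip recovers `x` to floating-point accuracy, which is all the example is meant to show.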

The radix-$r$ Cooley–Tukey algorithm, assuming $N = r^t$, factors the DFT matrix as $F_N = A_{r,N} P_{r,N}$, where $A_{r,N}$ is a product of “butterfly” blocks applied in sequence, and $P_{r,N}$ is a symmetric, involutive index-reversal permutation matrix (“modulo-$r$ sort” or digit reordering). While butterfly blocks ($B_{r,k}$) exploit data reuse and structured memory access, permutations ($P_{r,N}$) induce stride-based, non-local memory patterns after each FFT stage. This degrades arithmetic intensity, making standard FFT-based convolutions memory-bound rather than compute-bound (Venkovic et al., 15 Jun 2025).
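The factorization can be made concrete for the radix-2 case. The following is a minimal sketch, not the paper's implementation: `bit_reversal` plays the role of $P_{2,N}$, `butterflies` applies the stages that make up $A_{2,N}$ (decimation in time), and `dft_naive` is a reference used only for checking. All function names are assumptions introduced here, and $N$ is assumed to be a power of two.

```python
import cmath

def bit_reversal(n):
    """Index map of the radix-2 permutation P: output position i reads input rev[i]."""
    bits = n.bit_length() - 1
    return [sum(((i >> k) & 1) << (bits - 1 - k) for k in range(bits))
            for i in range(n)]

def butterflies(v):
    """Apply A: log2(n) butterfly stages, with no data reordering."""
    v = list(v)
    n, size = len(v), 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / size)  # twiddle factor
                u, t = v[start + j], v[start + j + half] * w
                v[start + j], v[start + j + half] = u + t, u - t
        size *= 2
    return v

def fft_ordered(x):
    """F_N x = A (P x): permute first, then run the butterfly stages."""
    rev = bit_reversal(len(x))
    return butterflies([x[r] for r in rev])

def dft_naive(x):
    """O(N^2) reference DFT."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

x = [complex(m, 1 - m) for m in range(8)]
rev = bit_reversal(8)
err = max(abs(a - b) for a, b in zip(fft_ordered(x), dft_naive(x)))
```

Applying `rev` twice returns the identity ordering, which is the involutive property of $P$ that the permutation-avoiding reformulation relies on.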

2. Permutation-Avoiding Convolution: Theory and Reformulation

In standard FFT-based convolution, the convolution operator is implemented as

$\mathcal F_g(x) = F_N^{-1} \mathrm{diag}(g) F_N x,$

where $g$ is the Fourier transform of the convolution filter. Expanding the DFT and its inverse with the Cooley–Tukey factorization $F_N = A P$, and using the symmetry of $F_N$ (so that $F_N = F_N^T = P A^T$) together with $F_N^{-1} = \overline{A P}/N = \overline{A} P/N$ (as $P$ is real), the convolution operator becomes

$\mathcal F_g(x) = \frac{1}{N} \overline{A}\,P\,\mathrm{diag}(g)\,P\,A^T x = \frac{1}{N} \overline{A}\,\mathrm{diag}(P g)\,A^T x = \frac{1}{N} \overline{A}\,\bigl(\widehat{g} \circ (A^T x)\bigr).$

Here, $\widehat{g} = P g$ denotes the permuted filter spectrum, and “$\circ$” denotes the Hadamard (elementwise) product. Importantly, the permutation matrix and corresponding data reordering are applied only once, offline, to the filter $g$. The online procedure for each new input consists solely of:

Unordered forward FFT: $y = A^T x$,
Elementwise multiplication: $y \leftarrow \widehat{g} \circ y$,
Unordered inverse FFT: output $\frac{1}{N} \overline{A}\,y$.

All runtime index-reversal permutations are thus eliminated. The reorganization relies on the symmetry of $F_N$ and on the permutation matrix $P$ being symmetric and involutive, $P^T = P = P^{-1}$, which gives $P\,\mathrm{diag}(g)\,P = \mathrm{diag}(P g)$ (Venkovic et al., 15 Jun 2025).
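The three online steps can be sketched for radix 2 in pure Python. This is an illustrative sketch under stated assumptions rather than the paper's kernel: $N$ is a power of two, `forward_butterflies` applies $A^T$ (decimation-in-frequency stages, output left in bit-reversed order), `inverse_butterflies` applies the conjugate of $A$ (decimation-in-time stages with conjugated twiddles, consuming bit-reversed input), and the offline spectrum is obtained with a naive DFT for clarity. All function names are hypothetical.

```python
import cmath

def bit_reversal(n):
    bits = n.bit_length() - 1
    return [sum(((i >> k) & 1) << (bits - 1 - k) for k in range(bits))
            for i in range(n)]

def forward_butterflies(x):
    """y = A^T x: natural-order input, bit-reversed output, no permutation."""
    y = list(x)
    n = size = len(y)
    while size >= 2:
        half = size // 2
        for start in range(0, n, size):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / size)
                u, t = y[start + j], y[start + j + half]
                y[start + j], y[start + j + half] = u + t, (u - t) * w
        size //= 2
    return y

def inverse_butterflies(y):
    """z = conj(A) y: bit-reversed input, natural-order output, no permutation."""
    z = list(y)
    n, size = len(z), 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for j in range(half):
                w = cmath.exp(2j * cmath.pi * j / size)  # conjugated twiddle
                u, t = z[start + j], z[start + j + half] * w
                z[start + j], z[start + j + half] = u + t, u - t
        size *= 2
    return z

def permute_filter_spectrum(h):
    """Offline, once per filter: g = F_N h, then g_hat = P g."""
    n = len(h)
    g = [sum(h[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
         for k in range(n)]
    return [g[r] for r in bit_reversal(n)]

def pa_convolve(x, g_hat):
    """Online, per input: butterflies, pointwise product, butterflies, scale."""
    y = forward_butterflies(x)
    y = [gh * yi for gh, yi in zip(g_hat, y)]
    return [zi / len(x) for zi in inverse_butterflies(y)]

def circular_convolve(h, x):
    """O(N^2) reference circular convolution."""
    n = len(x)
    return [sum(h[m] * x[(k - m) % n] for m in range(n)) for k in range(n)]

h = [1.0, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
x = [float(m) for m in range(8)]
g_hat = permute_filter_spectrum(h)
err = max(abs(a - b) for a, b in zip(pa_convolve(x, g_hat), circular_convolve(h, x)))
```

Note that `pa_convolve` never indexes through a permutation table: the only reordering happens in `permute_filter_spectrum`, which runs once per filter.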

3. Multi-Dimensional Implementation and Data Access Patterns

The permutation-avoiding reformulation generalizes to $d$-dimensional convolutions. For tensor-shaped data $\mathscr X \in \mathbb C^{n_1 \times \cdots \times n_d}$ and a fixed filter transform $\mathscr G$, define

$A_{\mathbf n} = A_{r, n_d} \otimes \cdots \otimes A_{r, n_1},$

where $\otimes$ is the Kronecker product. Offline, compute the permuted filter vector

$\widehat{g} = (P_{r, n_d} \otimes \cdots \otimes P_{r, n_1})\,\mathrm{vec}(\mathscr G).$

At runtime, for each input $\mathscr X$:

Vectorize: $x \leftarrow \mathrm{vec}(\mathscr X)$.
Forward butterfly: $y \leftarrow (A_{\mathbf n})^T x$.
Pointwise multiply: $y \leftarrow \widehat g \circ y$.
Inverse butterfly: $z \leftarrow \overline{A_{\mathbf n}}\,y$.
Output: $\mathrm{vec}^{-1}(z)/(n_1 \cdots n_d)$.

Butterfly operations proceed in block-stride order without data permutations, enabling regular memory accesses and high register reuse. The pointwise multiplication is fully vectorizable, further improving computational efficiency (Venkovic et al., 15 Jun 2025).
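The runtime steps above can be sketched for $d = 2$ and radix 2 as follows. This is a hedged illustration, not the paper's implementation: instead of forming $\mathrm{vec}(\mathscr X)$ and Kronecker products explicitly, it applies the 1D butterfly maps along each axis of a nested list, which is equivalent; both dimensions are assumed to be powers of two. The names `fwd`, `inv`, `along_axes`, `permuted_filter_spectrum` and `pa_convolve_2d` are hypothetical.

```python
import cmath

def fwd(v):
    """A^T v: natural-order input, bit-reversed output."""
    v, n, size = list(v), len(v), len(v)
    while size >= 2:
        half = size // 2
        for s in range(0, n, size):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / size)
                u, t = v[s + j], v[s + j + half]
                v[s + j], v[s + j + half] = u + t, (u - t) * w
        size //= 2
    return v

def inv(v):
    """conj(A) v: bit-reversed input, natural-order output."""
    v, n, size = list(v), len(v), 2
    while size <= n:
        half = size // 2
        for s in range(0, n, size):
            for j in range(half):
                w = cmath.exp(2j * cmath.pi * j / size)
                u, t = v[s + j], v[s + j + half] * w
                v[s + j], v[s + j + half] = u + t, u - t
        size *= 2
    return v

def along_axes(T, f):
    """Apply the 1D map f along rows, then along columns, of a 2D nested list."""
    rows = [f(r) for r in T]
    cols = [f(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def bitrev(n):
    b = n.bit_length() - 1
    return [sum(((i >> k) & 1) << (b - 1 - k) for k in range(b)) for i in range(n)]

def dft(v):
    n = len(v)
    return [sum(v[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]

def permuted_filter_spectrum(H):
    """Offline: ordered 2D spectrum G, then the per-axis (Kronecker) permutation."""
    G = along_axes(H, dft)
    r1, r2 = bitrev(len(G)), bitrev(len(G[0]))
    return [[G[i][j] for j in r2] for i in r1]

def pa_convolve_2d(X, G_hat):
    """Online: forward butterflies, pointwise multiply, inverse butterflies, scale."""
    Y = along_axes(X, fwd)
    Y = [[g * y for g, y in zip(grow, yrow)] for grow, yrow in zip(G_hat, Y)]
    scale = len(X) * len(X[0])
    return [[z / scale for z in row] for row in along_axes(Y, inv)]

n1, n2 = 4, 8
X = [[float(i * n2 + j) for j in range(n2)] for i in range(n1)]
H = [[0.0] * n2 for _ in range(n1)]
H[0][0], H[1][0], H[0][2] = 1.0, 0.5, -0.25
out = pa_convolve_2d(X, permuted_filter_spectrum(H))
ref = [[sum(H[p][q] * X[(a - p) % n1][(b - q) % n2]
            for p in range(n1) for q in range(n2))
        for b in range(n2)] for a in range(n1)]
err = max(abs(out[a][b] - ref[a][b]) for a in range(n1) for b in range(n2))
```

The non-square $4 \times 8$ shape is chosen so that the per-axis bit-reversal maps differ, exercising the Kronecker structure of the offline permutation.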

4. Computational Complexity, Memory Traffic, and Arithmetic Intensity

Floating-Point Operations

Each radix-$r$ 1D unordered FFT requires $\alpha_r N \log_2 N$ floating-point operations, where numerically $\alpha_2 = 5$, $\alpha_4 = 4.25$, $\alpha_8 = 4.08$.
In $d$ dimensions, with $N = n_1 \cdots n_d$, the operation count is $\sum_{q=1}^d \alpha_r N \log_2 n_q$.

Memory Traffic

Standard 1D FFT: The permutation requires $2N$ data accesses, and each of the $\log_r N$ butterfly stages requires $2N$, resulting in a total of $2N(1+\log_r N)$ accesses.
Permutation-avoiding 1D FFT: Only the $\log_r N$ butterfly stages are required at runtime, yielding $2N\log_r N$ accesses.

Arithmetic Intensity

Arithmetic intensity is defined as the ratio of floating-point operations to bytes moved (for complex doubles, 16 bytes per access): $I_r(N) = \frac{\alpha_r N \log_2 N}{16 \cdot 2N(1+\log_r N)} = \frac{\alpha_r \log_2 r}{32} \frac{\log_r N}{1+\log_r N}.$ As $N \to \infty$, $I_r \to \alpha_r \log_2 r / 32$. Sample values:

Radix 2: 0.156 flops/byte
Radix 4: 0.266 flops/byte
Radix 8: 0.383 flops/byte

Permutation-avoiding kernels remove the additive $1$ (the permutation pass) from the $1+\log_r N$ factor in the denominator, so under this model the arithmetic intensity attains its limiting value for every $N$ rather than only asymptotically (Venkovic et al., 15 Jun 2025).
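The intensity model can be evaluated numerically with a few lines of Python. The sketch assumes that $\alpha_r$ counts flops per $N \log_2 N$ (the reading under which the sample values listed above are reproduced), 16 bytes per complex-double access, and $2N$ accesses per pass over the data; the function names are hypothetical.

```python
import math

ALPHA = {2: 5.0, 4: 4.25, 8: 4.08}   # flop constants quoted in the text
BYTES_PER_ACCESS = 16                # one complex double

def stages(r, n):
    """Number of radix-r butterfly stages, log_r(n), for n a power of r."""
    return round(math.log2(n) / math.log2(r))

def flops(r, n):
    """Assumed flop model: alpha_r * n * log2(n)."""
    return ALPHA[r] * n * math.log2(n)

def standard_intensity(r, n):
    """Flops per byte with one permutation pass plus log_r(n) butterfly passes."""
    return flops(r, n) / (BYTES_PER_ACCESS * 2 * n * (1 + stages(r, n)))

def pa_intensity(r, n):
    """Flops per byte with butterfly passes only (permutation done offline)."""
    return flops(r, n) / (BYTES_PER_ACCESS * 2 * n * stages(r, n))

n = 2 ** 24
table = {r: (standard_intensity(r, n), pa_intensity(r, n)) for r in (2, 4, 8)}
```

Under these assumptions `pa_intensity` is independent of `n` and equals $\alpha_r \log_2 r / 32$, while `standard_intensity` approaches the same value from below as `n` grows.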

5. Empirical Benchmarks and Performance Gains

Extensive benchmarking on two Intel systems compared permutation-avoiding kernels (“PA”) to a general-radix FFT implementation and FFTW with FFTW_ESTIMATE. Highlights:

| Task | Method | Time (s) | Speedup vs. Standard | Speedup vs. FFTW_ESTIMATE |
| --- | --- | --- | --- | --- |
| 1D ($N=2^{24}$, radix 4) | Standard (perm-full) | 1.055 | 1.00× | 0.92× |
| | Perm-avoiding (PA) | 0.505 | 2.09× | 1.85× |
| 2D ($2^{12} \times 2^{12}$, radix 4) | Standard | 2.200 | 1.00× | |
| | Perm-avoiding (PA) | 2.071 | 1.06× | |
| | FFTW_ESTIMATE | 1.759 | 1.25× | |
| 3D ($2^8 \times 2^8 \times 2^8$, radix 2) | Standard | 1.734 | 1.00× | |
| | Perm-avoiding (PA) | 1.664 | 1.04× | |
| | FFTW_ESTIMATE | 1.988 | 0.87× | |

For large 1D convolutions, speedups approach $3\times$ compared to standard convolution (where permutation cost dominates total time).
In higher dimensions, speedups are modest, roughly $4\%$–$10\%$ ($1.06\times$ in 2D and $1.04\times$ in 3D for the cases tabulated above), and are attributed to improved spatial locality.
Higher radices exhibit similar trends, with optimal benefit in regimes where the butterfly cost remains memory-bound (Venkovic et al., 15 Jun 2025).

6. Recommendations and Application Context

Practical adoption entails the following:

Libraries should provide an API for “fixed-filter convolution” that receives a pre-permuted filter spectrum $\widehat g$.
The internal workflow should be divided into an unordered forward butterfly ($A^T$), elementwise multiply, and unordered inverse butterfly ($\overline{A}/N$).
For mixed-radix and prime-factor FFT algorithms, the required permutation is generalized via $P_{\rho, N}$ or, for prime-factor cases, by composing Kronecker permutations and a permutation $\Upsilon^T$.
The technique pays off only in scenarios involving repeated convolution with an invariant filter, because the one-time $O(N)$ offline permutation of the filter spectrum must be amortized over many inputs.
Further optimizations include cache blocking (four-step), multi-threading, SIMD fusion, and GPU memory coalescing.
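One possible shape for such a fixed-filter API is sketched below. The class name `FixedFilterConvolution` and its methods are hypothetical and do not correspond to any existing library; the sketch is limited to radix 2, power-of-two lengths, and pure Python, and it uses a naive DFT for the one-time setup.

```python
import cmath

def _forward(v):
    """A^T v: natural-order input, bit-reversed output."""
    v, n, size = list(v), len(v), len(v)
    while size >= 2:
        half = size // 2
        for s in range(0, n, size):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / size)
                u, t = v[s + j], v[s + j + half]
                v[s + j], v[s + j + half] = u + t, (u - t) * w
        size //= 2
    return v

def _inverse(v):
    """conj(A) v: bit-reversed input, natural-order output."""
    v, n, size = list(v), len(v), 2
    while size <= n:
        half = size // 2
        for s in range(0, n, size):
            for j in range(half):
                w = cmath.exp(2j * cmath.pi * j / size)
                u, t = v[s + j], v[s + j + half] * w
                v[s + j], v[s + j + half] = u + t, u - t
        size *= 2
    return v

class FixedFilterConvolution:
    """Hypothetical fixed-filter API holding a pre-permuted spectrum g_hat."""

    def __init__(self, g_hat):
        self.g_hat = list(g_hat)
        self.n = len(self.g_hat)

    @classmethod
    def from_filter(cls, h):
        """One-time offline setup: spectrum of h, then bit-reversal permutation."""
        n = len(h)
        g = [sum(h[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
             for k in range(n)]
        bits = n.bit_length() - 1
        rev = [sum(((i >> k) & 1) << (bits - 1 - k) for k in range(bits))
               for i in range(n)]
        return cls(g[r] for r in rev)

    def apply(self, x):
        """Permutation-free online path, reusable for every new input x."""
        y = [gh * yi for gh, yi in zip(self.g_hat, _forward(x))]
        return [z / self.n for z in _inverse(y)]

# A filter that delays its input by one sample (circularly).
shift = FixedFilterConvolution.from_filter([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
inputs = [[float(m + k) for m in range(8)] for k in range(3)]
outputs = [shift.apply(x) for x in inputs]   # same g_hat reused for all inputs
```

Separating `from_filter` from `apply` mirrors the offline/online split: the permutation cost is paid once in the constructor path and every later call to `apply` runs only butterflies and a pointwise product.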

This approach is suited for production FFT libraries in domains where convolution with a fixed filter is common and large-scale, memory-bound scenarios dominate computational cost. Deferring index-reversal permutations to an offline filter rearrangement yields a permutation-free, memory-efficient online convolution kernel that maximizes arithmetic intensity (Venkovic et al., 15 Jun 2025).

7. Limitations and Extension Prospects

Permutation-avoiding FFT-based convolution pays off only when the convolution is applied many times with a constant filter: the savings accrue in the repeated online evaluations, whereas the offline permutation of the filter spectrum is a one-time cost that is linear in the data size. While this method is highly effective for memory-bound FFT scenarios, it is not universally optimal: for one-off convolutions or with frequently changing filters, the standard approach may remain preferable. Potential enhancements involve combining permutation-avoiding techniques with established optimizations such as cache-blocked algorithms, multi-threading, SIMD twiddle fusion, and adaptation for GPU memory infrastructures. A plausible implication is that this paradigm could be beneficially incorporated into library APIs and hardware-oriented FFT frameworks, though further empirical studies may be required to refine its applicability across diverse architectures (Venkovic et al., 15 Jun 2025).


References (1)

Venkovic et al. (2025). Permutation-Avoiding FFT-Based Convolution.
