FFT-Based Block Diagonalization

Updated 12 April 2026

FFT-based block diagonalization uses the discrete Fourier transform (DFT) and its extensions to decouple structured matrices, such as circulant, Toeplitz, and tensor forms, for efficient computation.
The approach employs recursive algorithms such as the Split FFT, which reduce memory usage and enable parallel processing across independent subproblems.
Extensions to tensor and non-commutative algebras, including quaternion and octonion transforms, allow the technique to support complex applications in numerical linear algebra and PDE solvers.

FFT-based block diagonalization refers to a diverse set of techniques leveraging the algebraic properties of the discrete Fourier transform (DFT) and its generalizations to achieve efficient block diagonalization or decoupling of structured operators and matrices, typically with circulant, Toeplitz, block-circulant, or tensor product structure. These methodologies underpin a wide range of numerical and computational algorithms, particularly for high-dimensional linear algebra, PDE solvers, tensor analysis, and machine learning models. The FFT (Fast Fourier Transform) provides the computational backbone, transforming globally coupled linear operators into block or fully diagonal forms amenable to parallel, low-memory, and fast arithmetic.

1. Core Principles and Mathematical Foundations

Many structured matrices, particularly circulant and block Toeplitz matrices, are amenable to exact diagonalization or block diagonalization under conjugation by suitably designed Fourier-type matrices. In the prototypical complex circulant case,

$C x = F^{-1}\Lambda(Fx)$

where $C$ is an $n\times n$ circulant matrix, $F$ is the unitary DFT matrix, and $\Lambda$ is diagonal with entries given by the (unnormalized) DFT of $C$'s first column. Conjugation by $F$ therefore diagonalizes $C$, reducing matrix-vector multiplication to $O(n\log n)$ operations.
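The identity above can be checked numerically. The following is a minimal sketch, assuming NumPy and the convention that $C_{ij} = c_{(i-j) \bmod n}$ for first column $c$; the function names `circulant_matvec` and `circulant_dense` are illustrative and do not come from any cited work.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix with first column c by x using FFTs."""
    # Eigenvalues of C: the unnormalized DFT of its first column.
    lam = np.fft.fft(c)
    # C x = F^{-1} (Lambda (F x)); the normalization of F cancels out.
    return np.fft.ifft(lam * np.fft.fft(x))

def circulant_dense(c):
    """Dense reference matrix with C[i, j] = c[(i - j) mod n]."""
    n = len(c)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return c[idx]

rng = np.random.default_rng(0)
c = rng.standard_normal(8)
x = rng.standard_normal(8)

y_fast = circulant_matvec(c, x)      # O(n log n)
y_ref = circulant_dense(c) @ x       # O(n^2) reference
print(np.allclose(y_fast, y_ref))
```

For real inputs the result carries a negligible imaginary part from rounding, which a caller would typically discard.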

For more general block or multi-level structures, the DFT is extended via Kronecker products or tailored transforms (e.g., multidimensional FFTs, quaternion/octonion DFTs), block-diagonalizing the operator and reducing computational complexity. This principle generalizes to block circulant matrices with blocks in $\mathbb{C}^{m\times n}$, quaternion- and tensor-valued operators, and structured multidimensional arrays.

The diagonalizability of circulant structures by Fourier transforms is central: for circulant matrices over complex numbers, the DFT provides a full eigenbasis; for block Toeplitz or block circulant matrices, multidimensional FFTs or tensorized transforms yield block-diagonal (not fully diagonal) forms. For non-commutative structures (e.g., quaternions), only block diagonalization is generally possible, often requiring permutation or extension into higher algebras (octonions) (Zheng et al., 2022, Pan et al., 2023, Zhang et al., 12 Feb 2026).
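To illustrate the block-diagonal (rather than fully diagonal) outcome for block circulant matrices, the sketch below conjugates a small block circulant matrix by $F \otimes I_m$ and compares the result with the FFT of the block sequence. It assumes square complex blocks and the layout "block $(i,j)$ equals $A_{(i-j) \bmod n}$"; the helper names are hypothetical.

```python
import numpy as np

def block_circulant_dense(blocks):
    """Dense block circulant matrix; blocks has shape (n, m, m)."""
    n, m, _ = blocks.shape
    C = np.zeros((n * m, n * m), dtype=complex)
    for i in range(n):
        for j in range(n):
            C[i * m:(i + 1) * m, j * m:(j + 1) * m] = blocks[(i - j) % n]
    return C

def conjugate_by_kron_dft(blocks):
    """Return (F kron I) C (F kron I)^H for the unitary DFT matrix F."""
    n, m, _ = blocks.shape
    F = np.fft.fft(np.eye(n)) / np.sqrt(n)
    T = np.kron(F, np.eye(m))
    return T @ block_circulant_dense(blocks) @ T.conj().T

def expected_block_diagonal(blocks):
    """Block diagonal matrix whose k-th block is the k-th FFT coefficient of the block sequence."""
    n, m, _ = blocks.shape
    blocks_hat = np.fft.fft(blocks, axis=0)
    D = np.zeros((n * m, n * m), dtype=complex)
    for k in range(n):
        D[k * m:(k + 1) * m, k * m:(k + 1) * m] = blocks_hat[k]
    return D

rng = np.random.default_rng(0)
blocks = rng.standard_normal((4, 3, 3))
print(np.allclose(conjugate_by_kron_dft(blocks), expected_block_diagonal(blocks)))
```

Each diagonal block is a dense $m\times m$ matrix, so the transform decouples the $n$ frequencies but leaves the coupling inside each block intact.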

2. FFT-Based Block Diagonalization Algorithms

Toeplitz and Block Toeplitz Structures

For $d$ -level block Toeplitz matrices $C$ 0, the classical approach is "circulant embedding": each dimension is extended to size $C$ 1 by zero-padding, producing a $C$ 2-dimensional block-circulant matrix that can be diagonalized by an appropriate multidimensional FFT. The action reduces to diagonal multiplications in the frequency domain and two FFTs. This approach asymptotically costs $C$ 3 arithmetic and $C$ 4 memory (Siron et al., 2024).
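As a rough sketch of circulant embedding, the code below multiplies a multi-level Toeplitz operator by a vector using padded multidimensional FFTs. It assumes NumPy, scalar (not matrix-valued) entries, an embedding of size $2n_i - 1$ per axis, and a kernel array indexed by offset; the function names and the kernel layout are illustrative choices rather than the conventions of the cited paper.

```python
import numpy as np

def toeplitz_matvec_embedded(kernel, x):
    """Multi-level Toeplitz matvec via circulant embedding.

    kernel[k1 + n1 - 1, ..., kd + nd - 1] holds the entry for offset
    k = i - j, and x has shape (n1, ..., nd).
    """
    axes = tuple(range(x.ndim))
    # Move the zero offset to index 0 on every axis: this is the first
    # "column" of the circulant embedding of size 2n - 1 per axis.
    c = np.fft.ifftshift(kernel)
    x_hat = np.fft.fftn(x, s=c.shape, axes=axes)   # zero-padded FFT
    y_full = np.fft.ifftn(np.fft.fftn(c, axes=axes) * x_hat, axes=axes)
    # Project back onto the original index range.
    return y_full[tuple(slice(0, n) for n in x.shape)]

def two_level_toeplitz_dense(kernel, n1, n2):
    """Dense reference for the two-level case (row-major flattening)."""
    T = np.zeros((n1 * n2, n1 * n2), dtype=kernel.dtype)
    for i1 in range(n1):
        for i2 in range(n2):
            for j1 in range(n1):
                for j2 in range(n2):
                    T[i1 * n2 + i2, j1 * n2 + j2] = kernel[i1 - j1 + n1 - 1,
                                                           i2 - j2 + n2 - 1]
    return T

rng = np.random.default_rng(0)
n1, n2 = 4, 3
kernel = rng.standard_normal((2 * n1 - 1, 2 * n2 - 1))
x = rng.standard_normal((n1, n2))
y_fast = toeplitz_matvec_embedded(kernel, x)
y_ref = (two_level_toeplitz_dense(kernel, n1, n2) @ x.ravel()).reshape(n1, n2)
print(np.allclose(y_fast, y_ref))
```

The padded arrays are roughly $2^d$ times larger than the input, which is the memory overhead that motivates avoiding the full embedding.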

Recent algorithms, notably the "Split FFT" or "lazy embedding, eager projection" scheme, circumvent the full circulant embedding via recursive, dimensionwise even/odd splitting combined with judicious discarding of zeros and phase correction. At each level, two $C$ 5-shaped branches (even and odd) are handled recursively without materializing the larger $C$ 6 vectors:

Apply a 1D FFT along an active dimension
Split into "even" and "odd" branches using phase-shifting (diagonal operator $C$ 7)
Recursively process each branch, combine results by inverse FFT and merging
At the leaves, apply pointwise Toeplitz multipliers

This method yields vector storage $C$ 8 and computational costs proportional to

$C$ 9

with further reductions for symmetric/skew-symmetric systems (Siron et al., 2024).
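The even/odd splitting idea can be illustrated in one dimension. The sketch below is not the algorithm of Siron et al. but a single-level analogue under stated assumptions: the Toeplitz matrix is embedded in a circulant of size $2n$, the multipliers `lam` are precomputed once, and the phase vector `w` plays the role of the diagonal phase-shift operator. Only length-$n$ vectors are formed when applying the operator.

```python
import numpy as np

def embedding_eigenvalues(t_pos, t_neg):
    """Eigenvalues of the size-2n circulant embedding of a Toeplitz matrix.

    t_pos = [t_0, ..., t_{n-1}] (first column),
    t_neg = [t_{-1}, ..., t_{-(n-1)}] (first row without t_0).
    """
    c = np.concatenate([t_pos, [0.0], t_neg[::-1]])   # length 2n
    return np.fft.fft(c)

def toeplitz_matvec_split(lam, x):
    """Toeplitz matvec using two length-n branches instead of one length-2n FFT."""
    n = len(x)
    w = np.exp(-1j * np.pi * np.arange(n) / n)   # phase shift for the odd branch
    x_even = np.fft.fft(x)         # even frequencies of the zero-padded FFT
    x_odd = np.fft.fft(w * x)      # odd frequencies of the zero-padded FFT
    y_even = np.fft.ifft(lam[0::2] * x_even)
    y_odd = np.fft.ifft(lam[1::2] * x_odd)
    # Merge the branches; this equals the first n entries of the length-2n inverse FFT.
    return 0.5 * (y_even + np.conj(w) * y_odd)

def toeplitz_dense(t_pos, t_neg):
    """Dense reference with T[i, j] = t_{i - j}."""
    n = len(t_pos)
    T = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            T[i, j] = t_pos[i - j] if i >= j else t_neg[j - i - 1]
    return T

rng = np.random.default_rng(0)
n = 6
t_pos = rng.standard_normal(n)
t_neg = rng.standard_normal(n - 1)
x = rng.standard_normal(n)
lam = embedding_eigenvalues(t_pos, t_neg)
print(np.allclose(toeplitz_matvec_split(lam, x), toeplitz_dense(t_pos, t_neg) @ x))
```

The two branches are independent until the final merge, so they can be evaluated concurrently or as one batched FFT call.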

Parallelization

The recursive "branching" (even vs. odd) at each dimension naturally exposes independent subproblems. These can be assigned to independent threads, cores, or devices, and batched FFTs can be applied. Merging of branches introduces minor dependencies, but the overall approach scales effectively, with parallelization strategies trading off minimal memory against maximal concurrency (Siron et al., 2024).

Tensor and Quaternion Extensions

Tensor-based and quaternionic data structures necessitate further generalization due to non-commutativity and non-diagonalizability. For block circulant quaternion matrices $n\times n$ 0, the standard DFT fails to diagonalize $n\times n$ 1 due to the algebraic structure of $n\times n$ 2. Here, block diagonalization is achieved using specialized quaternion DFTs (QFFT) and permutation matrices $n\times n$ 3 to convert the transformed operator into block-diagonal form with $n\times n$ 4 and $n\times n$ 5 quaternion blocks. The resulting algorithm for inversion or SVD uses FFTs for rapid transformation, with complexity $n\times n$ 6 per block (Pan et al., 2023, Zheng et al., 2022).

When considering block circulant matrices with block structure or higher-order tensors (e.g., $n\times n$ 7), Kronecker or tensor FFTs and, in some cases, extension into the octonion algebra provide diagonalizing transforms:

For quaternion tensors, FFT block-diagonalizes the frontal slices, reducing the computation of products or inversions to independent operations on smaller matrices (Zhang et al., 12 Feb 2026).
Octonion DFTs resolve block diagonality for cases where quaternion DFTs are insufficient, enabling $n\times n$ 8 block diagonalization (Zheng et al., 2022).
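The obstruction in the quaternion circulant case can be seen with ordinary complex arithmetic. The sketch below assumes the Cayley-Dickson split $C = A + B\,\mathbf{j}$ with complex circulant $A$ and $B$, and uses the rule $\mathbf{j} z = \bar{z}\,\mathbf{j}$ for complex $z$. It is not the QFFT of the cited papers, only a numerical illustration under this convention that the standard DFT leaves a coupling between frequencies $k$ and $-k$.

```python
import numpy as np

def circulant(c):
    """Dense circulant matrix with first column c."""
    n = len(c)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return c[idx]

def transform_quaternion_circulant(a, b):
    """Return (DA, M) such that F (A + B j) F^H = DA + M j.

    A = circ(a) and B = circ(b) are complex; F is the unitary DFT matrix.
    """
    n = len(a)
    F = np.fft.fft(np.eye(n)) / np.sqrt(n)
    DA = F @ circulant(a) @ F.conj().T
    # j z = conj(z) j, so F (B j) F^H = (F B conj(F^H)) j = (F B F^T) j.
    M = F @ circulant(b) @ F.T
    return DA, M

def coupling_mask(n):
    """Boolean mask of the positions (k, -k mod n)."""
    mask = np.zeros((n, n), dtype=bool)
    k = np.arange(n)
    mask[k, (-k) % n] = True
    return mask

rng = np.random.default_rng(0)
n = 6
a = rng.standard_normal(n) + 1j * rng.standard_normal(n)
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)
DA, M = transform_quaternion_circulant(a, b)

print(np.allclose(DA, np.diag(np.diag(DA))))     # A part is diagonal
print(np.allclose(M[~coupling_mask(n)], 0.0))    # j part couples k with -k only
```

Under this convention, grouping index $k$ with $-k \bmod n$ by a permutation yields blocks of size one (where $k \equiv -k$) or two, which is the kind of block structure the permutation step is meant to expose.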

The table below summarizes selected FFT-based block diagonalization schemes:

Structure	Transform	Block Form	Complexity
Complex circulant	DFT	Diagonal	$O(n\log n)$
Multi-level block Toeplitz	Multi-D FFT, Split-FFT	Block diagonal (size $F$ 0 per block)	$F$ 1
Quaternion circulant	QFFT + perm.	$F$ 2, $F$ 3 blocks	$F$ 4
Block circulant quaternion	Octonion DFT	Full diagonal (in $F$ 5)	$F$ 6
Tensor / block circulant	Kronecker, tensor FFT	Block diagonal (per frontal slice)	$F$ 7

3. Applications in Numerical Linear Algebra and PDE Solvers

FFT-based block diagonalization is a foundational technique in high-performance direct and iterative solvers for structured linear systems. A notable application is in incompressible flow simulations, where pressure Poisson equations must be solved rapidly and repeatedly. In the context of multi-block finite-difference discretizations, block diagonalization via FFT along homogeneous grid directions reduces $F$ 8D coupled systems to batches of $F$ 9D problems (e.g., decoupled Helmholtz equations) (Costa, 2021). Modewise decoupling enables independent subproblem solution (well-suited for parallel and GPU hardware), with observed speedup factors of $\Lambda$ 0-- $\Lambda$ 1 and strong scaling up to $\Lambda$ 2 cores for $\Lambda$ 3 grids.
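The modewise decoupling can be sketched on a small model problem. The example below assumes a single-block 2D Poisson equation with second-order centered differences, periodic in $x$ and homogeneous Dirichlet in $y$; these choices, and the function names, are illustrative and much simpler than the multi-block setting of the cited solver. An FFT along $x$ turns the coupled 2D system into one independent 1D Helmholtz-type system per Fourier mode.

```python
import numpy as np

def second_difference_dirichlet(ny, dy):
    """1D second-difference matrix with homogeneous Dirichlet ends."""
    main = -2.0 * np.ones(ny)
    off = np.ones(ny - 1)
    return (np.diag(main) + np.diag(off, 1) + np.diag(off, -1)) / dy**2

def solve_poisson_fft_x(f, dx, dy):
    """Solve the discrete Poisson problem, periodic along axis 0."""
    nx, ny = f.shape
    # Eigenvalues of the periodic second difference along x.
    lam = (2.0 * np.cos(2.0 * np.pi * np.arange(nx) / nx) - 2.0) / dx**2
    f_hat = np.fft.fft(f, axis=0)
    D = second_difference_dirichlet(ny, dy)
    # One ny-by-ny system (D + lam_k I) u_hat_k = f_hat_k per mode k,
    # solved here as a single batched call.
    A = D[None, :, :] + lam[:, None, None] * np.eye(ny)[None, :, :]
    u_hat = np.linalg.solve(A, f_hat[:, :, None])[:, :, 0]
    return np.fft.ifft(u_hat, axis=0).real

rng = np.random.default_rng(0)
nx, ny, dx, dy = 8, 6, 0.1, 0.2
f = rng.standard_normal((nx, ny))
u = solve_poisson_fft_x(f, dx, dy)
print(u.shape)
```

In a production solver the per-mode systems would be tridiagonal and solved with a banded routine; the dense batched solve here only keeps the sketch short.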

Analogous ideas underpin efficient calculation of tensor contractions, T-products, and state space convolutions in signal processing, control, and machine learning (Zheng et al., 2022, Liang et al., 2024, Pan et al., 2023).

4. Block Diagonalization in Tensor and Non-Commutative Algebras

Tensor analysis for multi-modal data (e.g., color video processing) requires fast third-order tensor products. The T-product of two $\Lambda$ 4 and $\Lambda$ 5 quaternion tensors is performed by unfolding to block-circulant forms, applying FFT-based block diagonalization along the third dimension, conducting slice-wise matrix multiplications, and inverse FFT. The result is a reduction in arithmetic and memory complexity by a factor $\Lambda$ 6 versus naive computation (Zheng et al., 2022). The block diagonalization step is critical for extending SVD, LU, and polar decompositions from matrices to tensors and to non-commutative fields such as $\Lambda$ 7 and $\Lambda$ 8 (Zhang et al., 12 Feb 2026, Zheng et al., 2022).
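The T-product pipeline described above can be sketched for complex (or real) tensors; the quaternion case needs the additional machinery discussed earlier and is not covered here. The code assumes NumPy and the common definition of the T-product as a circular convolution of frontal slices along the third mode; the function names are hypothetical.

```python
import numpy as np

def t_product_fft(A, B):
    """T-product of A (n1 x n2 x n3) and B (n2 x n4 x n3) via FFT."""
    A_hat = np.fft.fft(A, axis=2)
    B_hat = np.fft.fft(B, axis=2)
    # Independent matrix product for each frontal slice in the Fourier domain.
    C_hat = np.einsum('ijk,jlk->ilk', A_hat, B_hat)
    return np.fft.ifft(C_hat, axis=2)

def t_product_direct(A, B):
    """Reference: circular convolution of frontal slices (block circulant form)."""
    n3 = A.shape[2]
    C = np.zeros((A.shape[0], B.shape[1], n3), dtype=complex)
    for k in range(n3):
        for j in range(n3):
            C[:, :, k] += A[:, :, (k - j) % n3] @ B[:, :, j]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))
B = rng.standard_normal((4, 2, 5))
print(np.allclose(t_product_fft(A, B), t_product_direct(A, B)))
```

The direct form performs $n_3^2$ slice products, while the FFT form performs $n_3$ of them plus the transforms, which is where the savings come from.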

In neural sequence modeling, e.g., the efficient State Space Model (eSSM), block diagonalization of the system matrix enables model decoupling, parameter reduction, and efficient batched convolution via the FFT. For $\Lambda$ 9 MIMO SSMs, diagonalizing or block-diagonalizing $C$ 0 reduces the recursion to $C$ 1 or $C$ 2 independent (or block-coupled) systems, with subsequent convolution accelerated in $C$ 3 or $C$ 4 time, where $C$ 5 is sequence length. This approach yields significant speedup and parameter reductions relative to LSTM or attention-based architectures (Liang et al., 2024).
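A minimal sketch of the diagonal case follows. It assumes a single-input, single-output system with a real diagonal state matrix and the recurrence $x_t = \Lambda x_{t-1} + B u_t$, $y_t = C x_t$; this is a simplification chosen for illustration and is not the eSSM architecture itself. Because the state matrix is diagonal, the output is a causal convolution of the input with a kernel that can be formed mode by mode and applied with FFTs.

```python
import numpy as np

def ssm_kernel(lam, B, C, L):
    """Convolution kernel K[t] = sum_i C_i * lam_i**t * B_i for t = 0..L-1."""
    powers = lam[None, :] ** np.arange(L)[:, None]   # shape (L, N)
    return powers @ (B * C)

def ssm_apply_fft(lam, B, C, u):
    """Output of the diagonal SSM computed as an FFT-based causal convolution."""
    L = len(u)
    K = ssm_kernel(lam, B, C, L)
    # Zero-pad to length 2L so the circular convolution equals the linear one.
    y = np.fft.irfft(np.fft.rfft(K, 2 * L) * np.fft.rfft(u, 2 * L), 2 * L)
    return y[:L]

def ssm_apply_recurrence(lam, B, C, u):
    """Reference: step the recurrence x_t = lam * x_{t-1} + B u_t, y_t = C . x_t."""
    x = np.zeros_like(lam)
    y = np.empty(len(u))
    for t, u_t in enumerate(u):
        x = lam * x + B * u_t
        y[t] = C @ x
    return y

rng = np.random.default_rng(0)
N, L = 4, 32
lam = rng.uniform(0.1, 0.9, N)     # stable real modes (assumption)
B = rng.standard_normal(N)
C = rng.standard_normal(N)
u = rng.standard_normal(L)
print(np.allclose(ssm_apply_fft(lam, B, C, u), ssm_apply_recurrence(lam, B, C, u)))
```

The recurrence costs $O(NL)$ sequential steps, whereas the convolution form costs $O(L\log L)$ once the kernel is available and parallelizes across the sequence.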

5. Limitations and Algebraic Obstacles

Although FFT-based block diagonalization is powerful in complex and block circulant settings, intrinsic obstacles emerge in non-commutative or non-associative algebras. In the quaternion case, general circulant matrices cannot be fully diagonalized by any unitary quaternion matrix ( $C$ 6 or its relatives), but only block-diagonalized, necessitating permutations or even octonion-valued transforms for full diagonalization (Zheng et al., 2022, Pan et al., 2023). These algebraic results constrain the class of operators for which FFT-based diagonalization achieves the full spectral decoupling available in complex cases.

Similar issues arise in the structure of the DFT itself: the discrete Fourier transform admits canonical block diagonalization (via the discrete oscillator transform, DOT), with fast $C$ 7 algorithms for the change of basis only in split torus cases (i.e., $C$ 8) (0808.3281). Where the symmetry group or underlying field does not permit enough commuting structure, block structure (not full diagonalization) is optimal.

6. Performance, Scalability, and Empirical Results

Empirical studies across multiple domains confirm the efficiency gains of FFT-based block diagonalization. In DNS solvers for incompressible flow, wall-clock time reductions of $C$ 9 to $C$ 0 have been demonstrated, with excellent scalability and robustness across diverse block topologies (Costa, 2021). In large-scale quaternion/tensor inversion and decomposition, FFT-based algorithms outperform naive inversion by $C$ 1 to $C$ 2 for $C$ 3, with error levels at or below those of dense linear algebra packages (Zhang et al., 12 Feb 2026). In neural sequence modeling, FFT-based block diagonalization enables $C$ 4-- $C$ 5 speedups in the convolution step and up to $C$ 6 parameter reduction in learned state matrices, with no observed loss in accuracy (Liang et al., 2024).

7. Extensions and Theoretical Significance

FFT-based block diagonalization unifies perspectives from spectral analysis, operator theory, computational harmonic analysis, and algorithmic design. Beyond its classical applications, the framework extends to tensor-valued data, non-commutative algebras, and the development of new spectral transforms (e.g., the discrete oscillator transform for the DFT) (0808.3281). Algebraic results characterize exactly when diagonalization is possible, while the computational strategies developed (lazy embedding, eager projection, Kronecker FFTs, octonion diagonalizers) provide templates for a wide array of numerically efficient, scalable algorithms across computational mathematics and applied data analysis.