FWHT: Fast Walsh–Hadamard Transform

Updated 7 April 2026

FWHT is an efficient, in-place O(N log N) algorithm for computing the Walsh–Hadamard transform using only real arithmetic and no multiplications.
Its simple butterfly recursion and hardware-friendly operations make FWHT a key tool in randomized numerical linear algebra, signal processing, and deep learning.
Advanced versions leverage sparsity, lookup-based optimizations, and quantum techniques to reduce complexity and enhance performance in large-scale computations.

The Fast Walsh–Hadamard Transform (FWHT) is an efficient, in-place $O(N\log N)$ algorithm for computing the linear transformation associated with the Walsh–Hadamard matrix, a real, symmetric matrix with entries $\pm1$ that is orthogonal up to a factor of $\sqrt{N}$. The FWHT underpins a broad spectrum of applications in randomized numerical linear algebra, signal processing, sparse transform computation, quantum information, deep learning, and hardware acceleration, where its structure (no multiplications, no complex arithmetic, and a kernel whose entries all have unit modulus) allows for extreme computational efficiency and algorithmic simplicity.

1. Formal Definition and Algorithmic Structure

The Walsh–Hadamard matrix of order $n=2^m$ is defined recursively as

$H_1 = [1], \quad H_{2n} = \begin{bmatrix} H_n & H_n \\ H_n & -H_n \end{bmatrix}, \quad H_{2^m} = H_2^{\otimes m}$

where $\otimes$ denotes the Kronecker product. The unnormalized FWHT of $x \in \mathbb{R}^n$ is $y = H_n x$. Normalizing by $1/\sqrt{n}$ yields an orthonormal transform, but for most randomized algorithms (e.g., sketching) the absolute scaling is immaterial.
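As a small worked instance of the recursion (the input vector is chosen here purely for illustration), the order-4 matrix and one transform are:

$H_4 = H_2 \otimes H_2 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}, \quad x = (1, 0, 1, 0)^T \;\Rightarrow\; y = H_4 x = (2, 2, 0, 0)^T$

Applying $H_4$ once more gives $H_4 y = (4, 0, 4, 0)^T = 4x$, consistent with $H_n H_n = nI$; dividing each application by $\sqrt{4} = 2$ would return $x$ exactly.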

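The canonical FWHT algorithm is a Cooley–Tukey style “butterfly” recursion operating in place. It makes $m = \log_2 n$ passes over the data; in the pass with stride $h = 2^s$ ($s = 0, \dots, m-1$), each pair of entries whose indices differ only in the bit of weight $h$ is replaced by its sum and difference, $(a, b) \mapsto (a + b,\ a - b)$.

This routine completes in $\Theta(n\log n)$ additions/subtractions with $O(1)$ extra space, using only real arithmetic and no multiplications. The transform is involutive up to scale: since $H_n H_n = nI$, applying the unnormalized FWHT twice returns $n x$ (Andersson et al., 14 Jan 2026).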
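The butterfly recursion can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any cited work; the function names `fwht_inplace` and `hadamard` are hypothetical, and the dense matrix is built only to check the result.

```python
import numpy as np

def fwht_inplace(x):
    """Unnormalized FWHT of a 1-D float array whose length is a power of two.

    Overwrites x and returns it; uses only additions and subtractions.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:                            # one pass per index bit
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]       # indices differ only in the bit of weight h
                x[i], x[i + h] = a + b, a - b
        h *= 2
    return x

def hadamard(m):
    """Dense H_{2^m} as the m-fold Kronecker power of H_2 (for checking only)."""
    H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
    H = np.array([[1.0]])
    for _ in range(m):
        H = np.kron(H2, H)
    return H

x = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
y = fwht_inplace(x.copy())                  # transform a copy so x stays available
```

Comparing `y` with `hadamard(3) @ x`, and `fwht_inplace(y.copy())` with `8 * x`, checks both the transform and its involution up to scale.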

2. Sparse and Sublinear-Complexity FWHT

For signals $x \in \mathbb{R}^N$ with $K$ ($K = O(N^\alpha)$, $0 < \alpha < 1$) nonzero Hadamard coefficients, the classical $O(N\log N)$ runtime is suboptimal. A rigorous sparse FWHT achieves $O(K\log K\log(N/K))$ runtime and $O(K\log(N/K))$ sample complexity under a random support model by leveraging hierarchical time-domain subsampling schemes. Subsampled windows induce spectrum aliasing, and multiple random permutations (“hashes”) organize spectral coefficients into bins. The resulting decoding is cast as a sparse-graph code peeling process: singleton bins reveal the support, and density evolution predicts successful recovery with high probability given enough hash repetitions (Scheibler et al., 2013).
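The aliasing and singleton-detection ideas can be illustrated with a deliberately simplified sketch. It uses a single hash (the low $b$ index bits) and assumes every nonzero coefficient sits alone in its bin, so no peeling is needed; the full algorithm of Scheibler et al. handles collisions with several hashes. The names `fwht` and `recover_singletons` are hypothetical.

```python
import numpy as np

def fwht(v):
    """Unnormalized Walsh-Hadamard transform; returns a new array."""
    v = np.array(v, dtype=float)
    h = 1
    while h < len(v):
        for s in range(0, len(v), 2 * h):
            a = v[s:s + h].copy()
            b = v[s + h:s + 2 * h].copy()
            v[s:s + h] = a + b
            v[s + h:s + 2 * h] = a - b
        h *= 2
    return v

def recover_singletons(x, b, tol=1e-9):
    """Recover {index: value} of a sparse spectrum X = H x from B = 2^b bins.

    Assumption: every nonzero coefficient is alone in its bin (k mod B distinct).
    """
    N = len(x)
    n = N.bit_length() - 1
    B = 1 << b
    scale = N / B
    # Bin l of the base window holds the sum of X[k] over all k with k mod B == l.
    base = scale * fwht(x[:B])
    # The window shifted by 2^t holds the same sums, with X[k] negated when bit t of k is set.
    shifted = {t: scale * fwht(x[(1 << t):(1 << t) + B]) for t in range(b, n)}
    spectrum = {}
    for l in range(B):
        if abs(base[l]) <= tol:
            continue                         # empty bin
        k = l                                # low b bits equal the bin index
        for t in range(b, n):
            if base[l] * shifted[t][l] < 0:
                k |= 1 << t                  # a sign flip reveals bit t of the support index
        spectrum[k] = base[l]
    return spectrum

N = 64
X = np.zeros(N)
X[[5, 18, 40, 63]] = [3.0, -2.0, 4.0, 1.5]   # these indices fall in distinct bins for B = 8
x = fwht(X) / N                               # time-domain signal whose spectrum is X
found = recover_singletons(x, b=3)
```

The sketch reads $B(\log_2(N/B) + 1)$ samples and runs one $B$-point FWHT per window, which mirrors the $O(K\log(N/K))$ sample and $O(K\log K\log(N/K))$ time scaling when $B$ is proportional to $K$.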

Further, robustification to noise (SPRIGHT) demands only a constant-factor increase in samples and leaves the order of the computational complexity unchanged. Because the kernel is binary-valued, noisy bin detection reduces to decoding over a binary symmetric channel, so no additional logarithmic factor is needed in the sample size. Noise-aware singleton detection via majority vote or LDPC-like codewords ensures support recovery and amplitude estimation in $O(K\log^2 N)$ time (Li et al., 2015).

3. Compressed Matrix Multiplication and Sketching

FWHT enables efficient sketching of matrix-matrix products in randomized numerical linear algebra, notably as a drop-in replacement for the FFT in Pagh’s compressed matrix multiplication framework. Given two input matrices whose product is to be estimated, one constructs several independent sketches, each mapping into a fixed number of buckets, using 2-wise independent hash and sign functions. The workflow is:

Hash and sign the rows and columns of the two input matrices into bucket vectors.
Apply the FWHT to each bucket vector.
Accumulate the outer products.
Recover entries via de-hashing, sign correction, and a median-of-means estimator.
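One way the FWHT can stand in for the FFT in this workflow is through the XOR-convolution theorem: if entry $(i, j)$ of the product is hashed to bucket $h_1(i) \oplus h_2(j)$, the sketch of each outer product becomes a pointwise product in the Hadamard domain. The sketch below is a single-sketch illustration under that assumption. The XOR-combined hash, the fully random (rather than 2-wise independent) hash tables, and all function names are illustrative choices, not details confirmed by the cited work, and the median step over several sketches is omitted.

```python
import numpy as np

def fwht(v):
    """Unnormalized Walsh-Hadamard transform; returns a new array."""
    v = np.array(v, dtype=float)
    h = 1
    while h < len(v):
        for s in range(0, len(v), 2 * h):
            a = v[s:s + h].copy()
            b = v[s + h:s + 2 * h].copy()
            v[s:s + h] = a + b
            v[s + h:s + 2 * h] = a - b
        h *= 2
    return v

def xor_sketch_product(A, B, h1, s1, h2, s2, b):
    """Sketch of A @ B in b buckets (b a power of two).

    Entry (i, j) of the product lands in bucket h1[i] ^ h2[j] with sign s1[i] * s2[j].
    """
    acc = np.zeros(b)
    for k in range(A.shape[1]):
        ca = np.zeros(b)
        cb = np.zeros(b)
        np.add.at(ca, h1, s1 * A[:, k])      # hashed and signed column k of A
        np.add.at(cb, h2, s2 * B[k, :])      # hashed and signed row k of B
        acc += fwht(ca) * fwht(cb)           # XOR-convolution as a pointwise product
    return fwht(acc) / b                     # inverse transform, since H_b^{-1} = H_b / b

def estimate_entry(sketch, i, j, h1, s1, h2, s2):
    """De-hash and sign-correct to estimate entry (i, j) of A @ B."""
    return s1[i] * s2[j] * sketch[h1[i] ^ h2[j]]

rng = np.random.default_rng(0)
n, b = 6, 16
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
h1, h2 = rng.integers(0, b, n), rng.integers(0, b, n)
s1, s2 = rng.choice([-1.0, 1.0], n), rng.choice([-1.0, 1.0], n)
sketch = xor_sketch_product(A, B, h1, s1, h2, s2, b)
```

The result equals the count sketch one would obtain by hashing the entries of `A @ B` directly, but it is computed without ever forming the product.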

Variance and concentration guarantees are inherited from FFT-based sketching. Empirical results on 64-core CPUs show FWHT-based sketching outpacing its FFT-based counterpart and, under heavy sparsity and magnitude skew, outperforming MKL DGEMM (Andersson et al., 14 Jan 2026).

4. Advanced Architectures: Hardware Acceleration and Algorithmic Improvements

Several recent advances focus on exploiting hardware capabilities and lowering operation counts:

Tensor Core Acceleration: HadaCore leverages NVIDIA tensor cores by implementing radix-16 Hadamard blocks as fused small matrix multiplications, reducing thread synchronizations, and optimizing memory layouts. Despite doubling the nominal FLOP count, it is reported to run faster than standard CUDA FWHT kernels on A100/H100 GPUs, with no loss in end-to-end accuracy for LLM inference in BF16/FP16, and to achieve sub-30 $\mu$s transforms on 8M elements for a range of transform sizes. The architecture preserves $O(n\log n)$ complexity (Agarwal et al., 2024).
Bit-Complexity Reduction via Lookup Tables: Over constant-size finite fields $\mathbb{F}_q$, precomputing all transforms of short fixed-length vectors and deploying Yates’s Kronecker-power algorithm reduces bit-complexity to $O(N\log N/\log\log N)$. This “blockwise table” strategy collapses multiple butterfly levels per lookup, achieving a superconstant (albeit subpolynomial) speedup in models where lookups are cheap (Alman, 2022).
Non-Rigidity-based Algorithmic Improvement: By decomposing a small base-case Hadamard matrix into a rank-1 matrix plus a sparse matrix, Alman–Rao’s FWHT achieves a $\frac{23}{24}N\log N + O(N)$ operation count, the first constant-factor improvement over the folklore $N\log N$ (Alman et al., 2022).
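The algebraic identity that lets matrix-multiply hardware compute Hadamard blocks can be shown on a length-256 vector: because $H_{256} = H_{16} \otimes H_{16}$, the transform equals two $16 \times 16$ matrix products on the reshaped input. The sketch below shows only this identity in NumPy; it is not HadaCore's kernel, and the function names are hypothetical.

```python
import numpy as np

def hadamard(m):
    """Dense H_{2^m} as the m-fold Kronecker power of H_2."""
    H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
    H = np.array([[1.0]])
    for _ in range(m):
        H = np.kron(H2, H)
    return H

def fwht_256_via_matmul(x):
    """H_256 x computed as two 16x16 matrix products."""
    H16 = hadamard(4)
    X = np.asarray(x, dtype=float).reshape(16, 16)   # row-major: flat index = 16 * r + c
    # (H16 kron H16) vec(X) = vec(H16 X H16^T), and H16 is symmetric.
    return (H16 @ X @ H16).reshape(256)

rng = np.random.default_rng(1)
x = rng.standard_normal(256)
y = fwht_256_via_matmul(x)
```

A dense $16 \times 16$ product spends more arithmetic than four butterfly stages, which is the kind of trade a matrix-multiply unit can absorb when its throughput is high enough.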

5. Applications in Signal Processing, Quantum Information, and Deep Learning

FWHT arises in diverse application verticals:

MIMO-OFDM Equalization: In multi-carrier modulation systems, FWHT replaces FFT for block diagonalization and frequency-domain equalization, especially in combination with Banded Matrix Approximation (BMA) strategies. Compared with full matrix compensation, block diagonalization via FWHT reduces complexity while keeping BER within a reported dB margin of MMSE-SIC. Robustness to co-CFO and frequency-selective fading is also observed (Ramadan, 2023).
Quantum Pauli Decomposition: For $2^n \times 2^n$ matrices ($N = 2^n$), FWHT enables computation of all $4^n$ Pauli expansion coefficients in $O(N^2\log N)$ time with a small additional memory footprint. The key steps are an XOR-permutation of the matrix entries, followed by a per-row FWHT and a diagonal phase correction. This procedure outperforms previous methods in explicit decomposition workflows (Georges et al., 2024).
Stabilizer Rényi Entropy (Quantum Magic): The XOR-convolution property of the FWHT (over the group $\mathbb{Z}_2^n$) replaces brute-force evaluation of the second-order Rényi entropy with a batch of FWHTs at lower asymptotic cost. All transforms and accumulations are in-place and parallelizable, enabling medium-scale exact calculations in quantum simulation (Huang et al., 31 Dec 2025).
Deep Neural Networks: FWHT-based layers (both 1D and 2D) replace $1\times 1$ and $3\times 3$ convolutional layers. “Block” FWHT with trainable smooth-thresholding in the transform domain is reported to give faster inference, lower RAM usage, and a substantial reduction in parameters at some cost in accuracy on embedded devices. Residual 2D FWHT blocks can also improve final model accuracy (Pan et al., 2022).
Image Registration and Local Structure Encoding: Patchwise FWHT basis coefficients compactly encode local edge and corner information, contributing to improved registration metrics (4%–4.4% better MI/CC) and faster medical image registration (Sasikala et al., 2010).
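The three steps named in the Pauli decomposition item (XOR-permutation, per-row FWHT, phase correction) can be sketched as follows. This is a minimal illustration assuming the convention $P(x, z) = i^{|x \wedge z|} X^x Z^z$, with the leftmost tensor factor mapped to the most significant bit; the function names are hypothetical and the code is not taken from the cited work.

```python
import numpy as np

def fwht_rows(M):
    """Unnormalized FWHT applied independently to every row of M (returns a copy)."""
    M = np.array(M, dtype=complex)
    n = M.shape[1]
    h = 1
    while h < n:
        for s in range(0, n, 2 * h):
            a = M[:, s:s + h].copy()
            b = M[:, s + h:s + 2 * h].copy()
            M[:, s:s + h] = a + b
            M[:, s + h:s + 2 * h] = a - b
        h *= 2
    return M

def pauli_coefficients(A):
    """Return c[x, z] such that A = sum over (x, z) of c[x, z] * P(x, z)."""
    N = A.shape[0]
    idx = np.arange(N)
    # XOR-permutation: row x collects the entries A[k, k ^ x] for k = 0..N-1.
    permuted = A[idx[None, :], idx[None, :] ^ idx[:, None]]
    W = fwht_rows(permuted)                  # W[x, z] = sum_k (-1)^{z.k} A[k, k ^ x]
    # Phase i^{popcount(x & z)}: one factor of i for every Y = i X Z in the Pauli string.
    pop = np.array([[bin(x & z).count("1") for z in range(N)] for x in range(N)])
    phase = np.array([1, 1j, -1, -1j])[pop % 4]
    return phase * W / N

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)
A = 0.5 * np.kron(X, Z) + 2.0 * np.kron(Y, I2)
c = pauli_coefficients(A)                    # c[0b10, 0b01] is the XZ term, c[0b10, 0b10] the YI term
```

The cost is dominated by $N$ row transforms of length $N$, which gives the $N^2\log N$ scaling implied by the per-row FWHT.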

6. Hybrid and Quantum-Accelerated Algorithms

Hybrid classical-quantum FWHTs achieve reduced asymptotic complexity under assumptions on state preparation cost. The method encodes the input vector as a quantum state, applies a layer of Hadamard gates, and exploits measurement statistics to recover the spectrum, with one measurement required per output. Applied to 2D polar image representations, the total complexity can be lowered relative to the classical transform, provided that state preparation and measurement are efficient (Rohida et al., 2024).

7. Implementation, Parallelism, and Practical Notes

Pragmatically, FWHT’s key implementation strengths are in-place operation, perfect SIMD-compatibility (all operations are addition/subtraction), avoidance of complex-valued arithmetic, and amenability to both coarse- and fine-grained threading. Integration into high-level languages (NumPy, pybind11), efficient cache-blocked memory access, and lock-minimization on parallel sketches are routine thanks to the absence of multiplies and constant stride memory access (Andersson et al., 14 Jan 2026, Agarwal et al., 2024). On modern hardware, kernel-level optimization (e.g., Tensor Core fusion, lookup-table prefetching, tiling) is central to attaining peak throughput.
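The SIMD-friendliness noted above can be seen in a vectorized NumPy sketch: each stage is a single whole-array add and subtract over a reshaped view, applied to a batch of vectors at once. This version is not in-place (NumPy allocates temporaries), and the names `fwht_batched` and `hadamard` are hypothetical.

```python
import numpy as np

def fwht_batched(x):
    """Unnormalized FWHT along the last axis of x (last-axis length a power of two)."""
    x = np.array(x, dtype=float)
    shape = x.shape
    n = shape[-1]
    h = 1
    while h < n:
        # Split the last axis into (blocks, 2, h): the two halves of every butterfly block.
        y = x.reshape(*shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        x = np.stack((a + b, a - b), axis=-2).reshape(shape)
        h *= 2
    return x

def hadamard(m):
    """Dense H_{2^m} as the m-fold Kronecker power of H_2 (for checking only)."""
    H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
    H = np.array([[1.0]])
    for _ in range(m):
        H = np.kron(H2, H)
    return H

rng = np.random.default_rng(2)
batch = rng.standard_normal((5, 32))         # five independent length-32 vectors
out = fwht_batched(batch)
```

Each row of `out` matches the dense product with `hadamard(5)`, and the same function accepts any leading batch shape.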

In summary, the Fast Walsh–Hadamard Transform is a structurally simple, computationally efficient $O(N\log N)$ real-linear transform with broad relevance across theory and practice. Advanced variants exploit sparsity, hardware parallelism, lookup acceleration, quantum subroutines, and algebraic properties to improve constant factors, memory use, and sampling efficiency. The recent literature establishes FWHT as a pivotal primitive for compressed computation, scalable quantum-classical algorithms, and resource-efficient machine learning (Andersson et al., 14 Jan 2026, Georges et al., 2024, Scheibler et al., 2013, Li et al., 2015, Agarwal et al., 2024, Pan et al., 2022, Alman, 2022, Alman et al., 2022, Sasikala et al., 2010, Rohida et al., 2024, Huang et al., 31 Dec 2025, Ramadan, 2023).