Wavelet-Based Seq & Graph Transformers

Updated 8 April 2026

Wavelet-based Sequence/Graph Transformers are advanced architectures that fuse multiscale wavelet transforms with transformer models to enable precise, localized, and efficient feature extraction.
They replace or augment standard self-attention mechanisms with wavelet-domain operations, reducing computational complexity and capturing hierarchical, multiresolution structures.
Empirical studies in text, graph, biomedical, and robotics tasks demonstrate improved performance and interpretability, despite challenges in scale selection and spectral approximation.

Wavelet-based Sequence and Graph Transformers constitute a family of architectures that integrate multiscale wavelet transforms—either in the classical discrete or spectral graph domain—into transformer models for improved structural representation, multi-resolution analysis, and linear or near-linear complexity. These models aim to address the limitations of standard self-attention, such as quadratic scaling and lack of explicit localization, by replacing, augmenting, or fusing attention with wavelet-domain operations. Their theoretical underpinning is the use of wavelet transforms to obtain bases that jointly achieve sparsity in both space (or node index) and frequency, yielding representations that are naturally suited to the hierarchical, multiscale nature of sequences and graphs.

1. Mathematical Foundations of Wavelet-Based Transforms

Wavelet transforms, in the sequence (classical) case, decompose signals into coefficients at multiple scales and locations using translated and dilated copies of a mother wavelet function $\psi$. For a 1D signal $x\in\mathbb R^n$, the discrete wavelet transform (DWT) can be expressed as successive convolutions with low- and high-pass filters, each followed by downsampling, yielding approximation and detail coefficients at each scale. For graphs, wavelet transforms are defined through spectral filtering of the graph Laplacian $L$ (for undirected graphs) or appropriate generalizations based on the random-walk operator for directed graphs. Specifically, the spectral graph wavelet operator at scale $s$ is constructed as $g_s(L)=U\,g_s(\Lambda)\,U^\top$, where $L=U\Lambda U^\top$ is the eigendecomposition and $g_s(\cdot)$ is a scale-dependent filter kernel (e.g., the heat kernel $g_s(\lambda) = e^{-s\lambda}$, which is low-pass; wavelet kernels proper are band-pass and vanish at $\lambda = 0$).
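The sequence-case decomposition described above can be sketched with the Haar wavelet, the simplest orthonormal choice. This is an illustrative NumPy sketch under stated assumptions, not code from any of the cited papers: the function names (`haar_dwt`, `haar_idwt`, `haar_wavedec`, `haar_waverec`) are hypothetical, the signal length is assumed to be divisible by two at every level, and the example signal is arbitrary.

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar DWT for an even-length 1D signal."""
    x = np.asarray(x, dtype=float)
    if x.size % 2 != 0:
        raise ValueError("signal length must be even")
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)  # low-pass filter, then downsample by 2
    detail = (even - odd) / np.sqrt(2.0)  # high-pass filter, then downsample by 2
    return approx, detail

def haar_idwt(approx, detail):
    """Exact inverse of haar_dwt."""
    x = np.empty(2 * approx.size)
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

def haar_wavedec(x, levels):
    """Multi-level decomposition, returned as [approx_L, detail_L, ..., detail_1]."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    return [approx] + details[::-1]

def haar_waverec(coeffs):
    """Rebuild the signal from the output of haar_wavedec."""
    approx = coeffs[0]
    for detail in coeffs[1:]:
        approx = haar_idwt(approx, detail)
    return approx

# Arbitrary example signal of length 8, decomposed over three scales.
x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
coeffs = haar_wavedec(x, levels=3)
x_rec = haar_waverec(coeffs)
```

Because the Haar filters are orthonormal, the reconstruction is exact up to floating-point error and the coefficient energy equals the signal energy. Longer filters such as Daubechies-2 follow the same filter-and-downsample pattern but need boundary handling that is omitted here.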

This framework allows both pointwise multiplication in the spectral domain (yielding wavelet filters) and efficient inverse transforms for perfect or approximate reconstruction. For directed graphs, harmonic analysis via the random-walk operator supports redundant (frame-based) and decimated (diffusion) wavelet transforms, each supporting perfect reconstruction with varying trade-offs in localization and redundancy. These transforms generalize classical wavelets to structured, non-Euclidean domains and support the formulation of multi-scale feature extraction and mixing layers in deep learning architectures (Kiruluta et al., 9 May 2025, Sevi et al., 2018, Ngo et al., 2023).
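As a rough illustration of pointwise multiplication in the spectral domain, the following NumPy sketch applies $g_s(L)=U\,g_s(\Lambda)\,U^\top$ to node features on a small path graph. Everything beyond that formula is an assumption made for the example: the helper names, the six-node graph, the use of the combinatorial Laplacian $D - A$, and the band-pass kernel $t\,e^{-t}$ (chosen only because it vanishes at zero). It uses a dense eigendecomposition, so it is suitable for small graphs only.

```python
import numpy as np

def graph_laplacian(adj):
    """Combinatorial Laplacian L = D - A of an undirected graph."""
    adj = np.asarray(adj, dtype=float)
    return np.diag(adj.sum(axis=1)) - adj

def heat_kernel(t):
    """Low-pass kernel exp(-t)."""
    return np.exp(-t)

def band_kernel(t):
    """Illustrative band-pass kernel t * exp(-t); it is zero at t = 0."""
    return t * np.exp(-t)

def spectral_filter(lap, features, kernel, scale):
    """Apply g_s(L) = U g_s(Lambda) U^T to an (N, d) feature matrix."""
    eigvals, eigvecs = np.linalg.eigh(lap)   # L = U diag(eigvals) U^T
    response = kernel(scale * eigvals)       # filter response at each eigenvalue
    spectrum = eigvecs.T @ features          # graph Fourier transform
    return eigvecs @ (response[:, None] * spectrum)  # filter, then transform back

# Path graph on six nodes.
n = 6
adj = np.zeros((n, n))
idx = np.arange(n - 1)
adj[idx, idx + 1] = 1.0
adj[idx + 1, idx] = 1.0
lap = graph_laplacian(adj)

# A unit impulse on node 0, filtered at scale s = 1.
delta = np.zeros((n, 1))
delta[0, 0] = 1.0
smoothed = spectral_filter(lap, delta, heat_kernel, scale=1.0)
wavelet = spectral_filter(lap, delta, band_kernel, scale=1.0)
```

The column `wavelet` is the band-pass response centered on node 0. It sums to zero because the kernel suppresses the constant eigenvector, whereas the heat-kernel output `smoothed` keeps the total mass of the impulse.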

2. Architectural Integration in Sequence and Graph Transformers

Wavelet-based transformers exploit the above mathematical tools in several ways:

Direct replacement of self-attention: The Graph Laplacian Wavelet Transformer (GWT) replaces the $O(N^2)$ dot-product attention block with $K\ll N$ learnable graph-wavelet filter banks. Each filter is realized as a spectral filter parametrized by a small MLP and applied to node features, and the filtered outputs are spectrally mixed via learned scale-specific mixing vectors. The final representation is a structured, multi-scale fusion of global and local graph modes, with position-wise MLP and residual updates (Kiruluta et al., 9 May 2025).
Wavelet space attention for sequences: In the WavSpA model, the forward DWT projects sequence token embeddings into multiresolution coefficient space, where queries, keys, and values are computed, and attention is performed directly on the wavelet coefficients. Afterward, the inverse DWT reconstructs token embeddings in the time domain. Both fixed and adaptive wavelet bases are supported, including parameterizations by direct scaling coefficients, lattice/orthogonal filter factorization, or lifting schemes (Zhuang et al., 2022).
Wavelet-based positional encoding: Multiresolution Graph Transformers (MGT) and dedicated Positional Encoding modules (WavePE, DyWPE) leverage the multi-scale coefficients from graph or sequence DWTs to encode spatial or temporal localization information. These features are concatenated or added to standard token inputs, modulating the attention mechanism and enhancing modeling of nonstationary, hierarchical, or structured data (Ngo et al., 2023, Irani et al., 12 Feb 2026, Irani et al., 18 Sep 2025).
Mixed time/frequency attention in vision and robotics: In architectures such as the Frequency-Enhanced Wavelet-based Transformer (FEWT) and Multiscale Wavelet Attention (MWA), wavelet-domain features are used in parallel or fused with time-domain features (e.g., via adaptive dynamic weighting). Modules operate on image feature maps, time series, or both; often, 2D DWT decomposes spatial features for computer vision, while 1D DWT handles time series in robotics pipelines (Huang et al., 14 Sep 2025, Nekoozadeh et al., 2023).
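The wavelet-space attention pattern in the list above (forward DWT, attention on the coefficients, inverse DWT) can be sketched as follows. This is a minimal single-head NumPy illustration under simplifying assumptions, not the WavSpA implementation: it uses one fixed Haar level, random untrained projection matrices, and hypothetical function names, and it leaves out adaptive wavelets, multiple heads, and masking.

```python
import numpy as np

def haar_forward(x):
    """One-level orthonormal Haar DWT along the sequence axis of an (n, d) array.

    Returns an (n, d) array: approximation rows first, then detail rows.
    """
    even, odd = x[0::2], x[1::2]
    return np.concatenate([even + odd, even - odd], axis=0) / np.sqrt(2.0)

def haar_inverse(c):
    """Exact inverse of haar_forward."""
    half = c.shape[0] // 2
    approx, detail = c[:half], c[half:]
    x = np.empty_like(c)
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

def softmax(z):
    """Row-wise softmax with the usual max subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def wavelet_space_attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention computed on wavelet coefficients."""
    coeffs = haar_forward(x)                          # token space -> coefficient space
    q, k, v = coeffs @ w_q, coeffs @ w_k, coeffs @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    mixed = weights @ v                               # attention over coefficients
    return haar_inverse(mixed)                        # coefficient space -> token space

rng = np.random.default_rng(0)
n, d = 8, 4                                           # even sequence length assumed
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
y = wavelet_space_attention(x, w_q, w_k, w_v)
```

Note that this sketch still forms a full attention matrix over the coefficients, so it shows where the transform sits in the layer rather than any complexity saving.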

3. Multiscale Representation, Localization, and Expressiveness

Wavelet-based transformers endow models with explicit multi-resolution capabilities. Each scale in the wavelet domain corresponds to a particular frequency band, from fine scales that capture local, high-frequency detail to coarse scales that capture global, low-frequency structure. Both empirically and theoretically, wavelet transforms are reported to outperform Fourier-based alternatives when dynamic, hierarchical, or transient localized correlations are present. For example:

Text and sequence modeling: Wavelet-based attention modules capture token relationships at various levels (characters, words, sentences, paragraphs) and exhibit improved robustness in long-range reasoning and chain-of-reasoning tasks. In Long Range Arena, fixed Daubechies-2 WavSpA achieved 74.8% on the Text task compared to 64.3% for vanilla and 56.4% for Fourier-based attention (Zhuang et al., 2022).
Graph learning: Spectral and diffusion wavelets provide strong spatial and spectral localization, enabling modulation and aggregation of information at different topological scales. The WavePE positional encoding in MGT achieves superior or competitive results compared to Laplacian Eigenvector PE (LapPE) and Random-Walk PE, particularly on macromolecular datasets that require recognition of both local and long-range structures (Ngo et al., 2023).
Biomedical signals and robotics: Frequency-aware embedding (e.g., via multi-channel DWT, as used in WaveFormer) and adaptive positional encoding (e.g., DyWPE) capture temporal nonstationarity, high-frequency transients, and latent structure, leading to accuracy gains on tasks such as EEG/ECG classification and human activity recognition. FEWT raises the success rate on Cube Transfer (MuJoCo) from 40% for the ACT baseline to 70%, and in real-world tasks shows a 6–12% improvement over the state of the art (Irani et al., 12 Feb 2026, Huang et al., 14 Sep 2025, Irani et al., 18 Sep 2025).

The explicit ability to inspect and manipulate scale-localized features endows these models with high interpretability. Learned wavelet filters or mixing coefficients can be visualized as functions of frequency (Laplacian eigenvalue), revealing which structural modes drive predictions (Kiruluta et al., 9 May 2025).

4. Computational Complexity and Scalability

One of the motivations for wavelet-based architectures is the reduction of computational complexity versus quadratic attention. The primary computational bottlenecks and their resolutions include:

Quadratic attention scaling: Baseline self-attention scales as $O(N^2 d)$ per layer for $N$ tokens (or nodes) of dimension $d$, with $O(N^2)$ memory for the attention matrix.
Wavelet filtering: Naive spectral filtering requires a full eigendecomposition ($O(N^3)$ for a dense Laplacian), but with a truncated eigenspace (retaining far fewer than $N$ eigenvectors), Chebyshev or other polynomial approximations (which need only repeated sparse matrix-vector products), or localized sparse convolutions, the per-layer cost can be reduced to near-linear in $N$ (Kiruluta et al., 9 May 2025, Nekoozadeh et al., 2023, Sevi et al., 2018).
Fast DWT/IDWT: Both forward and inverse DWT for sequences and 2D images incur cost linear in the number of samples, $O(N)$, and, when combined with grouped convolution or pointwise operations, the overall cost stays near-linear for practical dimensions (Zhuang et al., 2022, Nekoozadeh et al., 2023).
Memory: Wavelet-based positional encodings and multi-scale fusion features use memory comparable to or better than learned absolute/relative embeddings, particularly for long sequences. Parameter count is sublinear in sequence length for reasonable numbers of scales or clusters (Irani et al., 18 Sep 2025).
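The polynomial route around eigendecomposition mentioned in the list above can be sketched with a Chebyshev expansion of the filter kernel, which needs only repeated products with the Laplacian. This NumPy sketch is an assumption-laden illustration rather than any cited paper's implementation: the function names, the 20-node ring graph, the heat kernel, the polynomial order, and the eigenvalue bound `lam_max` (twice the maximum degree) are all choices made for the example. The exact eigendecomposition is computed only to check the approximation.

```python
import numpy as np

def chebyshev_coefficients(kernel, order, lam_max, num_points=64):
    """Chebyshev coefficients of `kernel` on [0, lam_max] by Gauss-Chebyshev quadrature."""
    theta = np.pi * (np.arange(num_points) + 0.5) / num_points
    lam = 0.5 * lam_max * (np.cos(theta) + 1.0)   # map [-1, 1] onto [0, lam_max]
    values = kernel(lam)
    return np.array([2.0 / num_points * np.sum(values * np.cos(k * theta))
                     for k in range(order + 1)])

def chebyshev_filter(lap, x, kernel, order, lam_max):
    """Approximate kernel(L) @ x using only matrix products with L (order >= 1)."""
    if order < 1:
        raise ValueError("order must be at least 1")
    coeffs = chebyshev_coefficients(kernel, order, lam_max)

    def shifted(v):
        # Apply (2 / lam_max) * L - I, whose spectrum lies in [-1, 1].
        return (2.0 / lam_max) * (lap @ v) - v

    t_prev, t_curr = x, shifted(x)                # T_0(L~) x and T_1(L~) x
    out = 0.5 * coeffs[0] * t_prev + coeffs[1] * t_curr
    for k in range(2, order + 1):
        # Three-term recurrence: T_k = 2 L~ T_{k-1} - T_{k-2}.
        t_prev, t_curr = t_curr, 2.0 * shifted(t_curr) - t_prev
        out = out + coeffs[k] * t_curr
    return out

# Ring graph on 20 nodes; every node has degree 2, so eigenvalues lie in [0, 4].
n = 20
idx = np.arange(n)
adj = np.zeros((n, n))
adj[idx, (idx + 1) % n] = 1.0
adj[(idx + 1) % n, idx] = 1.0
lap = np.diag(adj.sum(axis=1)) - adj

def heat(lam):
    return np.exp(-0.5 * lam)                     # heat kernel at scale s = 0.5

x = np.random.default_rng(0).normal(size=(n, 3))
approx = chebyshev_filter(lap, x, heat, order=12, lam_max=4.0)

# Reference result via full eigendecomposition, for comparison only.
eigvals, eigvecs = np.linalg.eigh(lap)
exact = eigvecs @ (heat(eigvals)[:, None] * (eigvecs.T @ x))
```

With a sparse Laplacian, each product `lap @ v` costs on the order of the number of edges, so the total cost grows with the number of edges times the polynomial order. The dense matrix here is used only to keep the sketch dependency-free.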

5. Empirical Performance and Benchmarking

Wavelet-based transformers have shown empirical superiority or parity across a variety of domains:

WMT14 En–De translation: GWT is reported to outperform standard Graph Transformer, FNet, and Linformer (27.3, 26.8, and 27.1 BLEU, respectively), with lower parameter and memory requirements (Kiruluta et al., 9 May 2025).
Long-range sequence tasks: On Long Range Arena, WavSpA and its adaptive variants deliver large accuracy improvements over Fourier-based and vanilla attention schemes, with particular gains in reasoning generalization beyond training lengths (Zhuang et al., 2022).
Macromolecular property prediction: MGT with WavePE achieves chemical accuracy on polymers, outperforming LapPE and RWPE, while yielding interpretable atom and substructure representations (Ngo et al., 2023).
Biomedical time series: WaveFormer and DyWPE yield state-of-the-art accuracy on EEG/ECG and human activity tasks, consistently winning or placing second across benchmarks. Ablations confirm that both wavelet-enhanced embeddings and dynamic, multiscale positional encodings are critical (Irani et al., 12 Feb 2026, Irani et al., 18 Sep 2025).
Robotic imitation learning: FEWT surpasses baselines like ACT and diffusion policy in simulation and real-world control benchmarks. Ablations show both multi-scale spatial (FE-EMA) and temporal (TS-DWT) wavelet modules contribute complementary gains (Huang et al., 14 Sep 2025).

Empirical gains are most pronounced for tasks involving long sequences, multi-scale structure, or strongly nonstationary dynamics.

6. Limitations, Trade-offs, and Open Challenges

While wavelet-based architectures provide interpretability, scalability, and strong empirical results, several limitations persist:

Spectral decomposition cost: Accurate Laplacian eigendecomposition remains $O(N^3)$; mitigations via truncation or polynomial approximations introduce spectral approximation errors and potential trade-offs between localization, expressiveness, and speed (Kiruluta et al., 9 May 2025).
Selection of scales: The optimal number of wavelet channels or decomposition levels is model- and task-dependent. Too few filters underrepresent structural content; too many increase overhead. Adaptive per-sample scaling or scale learning is an open direction (Kiruluta et al., 9 May 2025, Irani et al., 18 Sep 2025).
Data preprocessing: For graph models, reliability depends on the quality of the underlying graph structure (e.g., dependency parses); noisy graphs dilute the spectral inductive bias (Kiruluta et al., 9 May 2025).
Extension to dynamic or streaming settings: Efficient, accurate wavelet decomposition for time-varying graphs or online tasks is an unsolved challenge, with approximate methods a partial remedy (Kiruluta et al., 9 May 2025).
Integration with or replacement of attention: For some applications, fusing wavelet-based features rather than replacing attention may yield better flexibility, at the cost of architectural complexity (Huang et al., 14 Sep 2025, Nekoozadeh et al., 2023).

7. Distinctions from Fourier and Other Spectral Techniques

Wavelet-domain methods offer advantages over Fourier-based attention and spectral positional encoding by virtue of their joint localization in the time (or graph node) and frequency domains. Their localized support enables effective modeling of localized edges, discontinuities, and variable-granularity relationships, which global sines and cosines either capture poorly or spread across all coefficients. Empirically, adaptive, invertible wavelet transforms preserve the spatial/temporal coherence that is crucial for sequence-to-sequence, graph, and vision tasks, and avoid the destructive loss of positional information seen in pure spectral architectures (Zhuang et al., 2022, Nekoozadeh et al., 2023, Ngo et al., 2023).
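The localization argument can be made concrete with a toy comparison: a single spike spreads over every Fourier coefficient but touches only one Haar coefficient per scale. The sketch below is an illustrative NumPy example, not an experiment from the cited works; the signal length, spike position, threshold, and the helper name `haar_coefficients` are assumptions.

```python
import numpy as np

def haar_coefficients(x):
    """Full orthonormal Haar decomposition of a signal whose length is a power of two."""
    approx = np.asarray(x, dtype=float)
    details = []
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # detail at the current scale
        approx = (even + odd) / np.sqrt(2.0)         # coarser approximation
    return np.concatenate([approx] + details[::-1])

# A length-64 signal that is zero except for one spike.
n = 64
spike = np.zeros(n)
spike[20] = 1.0

# Count coefficients that are meaningfully non-zero in each basis.
fourier_nonzero = int(np.sum(np.abs(np.fft.fft(spike)) > 1e-9))
haar_nonzero = int(np.sum(np.abs(haar_coefficients(spike)) > 1e-9))
```

Here `fourier_nonzero` is 64 and `haar_nonzero` is 7 (one detail coefficient at each of the six scales plus the final approximation). The comparison reverses for a pure sinusoid, which is sparse in the Fourier basis, so this shows the localization property rather than a general ranking of the two transforms.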

References:

"Graph Laplacian Wavelet Transformer via Learnable Spectral Decomposition" (Kiruluta et al., 9 May 2025)
"WavSpA: Wavelet Space Attention for Boosting Transformers' Long Sequence Learning Ability" (Zhuang et al., 2022)
"Multiresolution Graph Transformers and Wavelet Positional Encoding for Learning Hierarchical Structures" (Ngo et al., 2023)
"Harmonic analysis on directed graphs and applications: from Fourier analysis to wavelets" (Sevi et al., 2018)
"WaveFormer: Wavelet Embedding Transformer for Biomedical Signals" (Irani et al., 12 Feb 2026)
"FEWT: Improving Humanoid Robot Perception with Frequency-Enhanced Wavelet-based Transformers" (Huang et al., 14 Sep 2025)
"DyWPE: Signal-Aware Dynamic Wavelet Positional Encoding for Time Series Transformers" (Irani et al., 18 Sep 2025)
"Multiscale Attention via Wavelet Neural Operators for Vision Transformers" (Nekoozadeh et al., 2023)