Wavelet-Token Embeddings for Transformers

Updated 26 April 2026

The paper introduces a learnable wavelet-token embedding approach that decomposes input sequences into multiresolution approximation and detail coefficients using recursive discrete wavelet transforms.
It integrates wavelet transforms into transformers by replacing or augmenting self-attention with scale-specific processing, thereby maintaining global context while capturing local details.
Empirical results show improved computational efficiency and modeling fidelity across tasks such as sequence modeling, vision, and time series forecasting, with reductions in training time and memory usage.

Wavelet-token embedding for transformers is a class of techniques that replaces or augments standard tokenization and attention by incorporating multiscale wavelet transforms within the transformer pipeline. The central idea is to perform explicit, often learnable, discrete wavelet transform (DWT) decompositions on input sequences or intermediate representations, thereby constructing token embeddings that reflect both temporally localized (or spatially localized for images) and multiresolution content. These embeddings can then serve as direct transformer inputs or as the computational substrate for efficient attention-like mixing, often yielding substantial algorithmic and empirical advantages in sequence modeling, spectral fidelity, and computational efficiency.

1. Mathematical Principles of Wavelet-Token Embedding

Wavelet-token embedding leverages the signal-processing foundations of multiresolution analysis. For a signal $x$, the one-level (Haar) forward transform is

$a_i = \frac{x_{2i} + x_{2i+1}}{\sqrt{2}}, \qquad d_i = \frac{x_{2i} - x_{2i+1}}{\sqrt{2}}$

with corresponding synthesis equations. For embedding vectors, this generalizes to learnable, per-dimension transformations. The Learnable Multi-Scale Wavelet Transformer (LMWT) constructs, for each pair of adjacent token embeddings $(\mathbf{x}_{2i}, \mathbf{x}_{2i+1})$, scale-specific approximation and detail vectors:

$\mathbf{a}_i = \boldsymbol\alpha \odot \mathbf{x}_{2i} + \boldsymbol\beta \odot \mathbf{x}_{2i+1}, \quad \mathbf{d}_i = \boldsymbol\gamma \odot \mathbf{x}_{2i} + \boldsymbol\delta \odot \mathbf{x}_{2i+1}$

where $\odot$ denotes elementwise multiplication and $\boldsymbol\alpha, \boldsymbol\beta, \boldsymbol\gamma, \boldsymbol\delta \in \mathbb{R}^d$ are learned (Kiruluta et al., 8 Apr 2025). Multiple decomposition levels are realized via recursion, successively feeding the approximation vectors into further stages:

$\mathbf{x}^{(l+1)} \gets \text{sequence of }\mathbf{a}_i^{(l)}$

Wavelet decompositions on multidimensional (image) data operate via separable 2D filters (e.g., Haar, Daubechies, biorthogonal), producing subbands (LL, LH, HL, HH), each encoding different frequency and directionality content (Yao et al., 2022, Esteves et al., 2024).
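The one-dimensional transforms above can be sketched in plain Python. This is a minimal illustration, not the LMWT reference implementation: the function names (`haar_forward`, `haar_inverse`, `learnable_level`, `decompose`, `haar_init`) are hypothetical, token embeddings are represented as lists of floats, and sequence lengths are assumed to be divisible by two at every level.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(x):
    """One-level Haar analysis of an even-length list of scalars."""
    half = len(x) // 2
    a = [(x[2 * i] + x[2 * i + 1]) / SQRT2 for i in range(half)]
    d = [(x[2 * i] - x[2 * i + 1]) / SQRT2 for i in range(half)]
    return a, d


def haar_inverse(a, d):
    """Haar synthesis: recover the original samples from (a, d)."""
    x = []
    for ai, di in zip(a, d):
        x.append((ai + di) / SQRT2)  # x_{2i}
        x.append((ai - di) / SQRT2)  # x_{2i+1}
    return x


def learnable_level(tokens, alpha, beta, gamma, delta):
    """Pairwise transform on token embeddings with per-dimension weights."""
    approx, detail = [], []
    for i in range(len(tokens) // 2):
        x0, x1 = tokens[2 * i], tokens[2 * i + 1]
        approx.append([p * u + q * v for p, q, u, v in zip(alpha, beta, x0, x1)])
        detail.append([p * u + q * v for p, q, u, v in zip(gamma, delta, x0, x1)])
    return approx, detail


def decompose(tokens, params):
    """Recursive decomposition; params holds one (alpha, beta, gamma, delta) per level."""
    details, current = [], tokens
    for alpha, beta, gamma, delta in params:
        current, d = learnable_level(current, alpha, beta, gamma, delta)
        details.append(d)  # detail vectors, finest level first
    return details, current  # current is the final approximation


def haar_init(dim):
    """Weights that make learnable_level reproduce the Haar transform."""
    h = 1.0 / SQRT2
    return [h] * dim, [h] * dim, [h] * dim, [-h] * dim
```

With `haar_init`, each level reduces to the fixed Haar transform applied independently to every embedding dimension. In a trained model the four weight vectors would instead be updated by gradient descent.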

2. Integration Within Transformer Architectures

The integration of wavelet-token embedding into transformer blocks is modality-dependent and admits multiple designs:

Replacement of Self-Attention: In LMWT, the multi-scale wavelet module replaces self-attention entirely within each encoder (or decoder) block. The transformed and aggregated wavelet token representations $\mathbf{Y}_{\text{wavelet}}$ return to the standard transformer pipeline after addition and normalization (Kiruluta et al., 8 Apr 2025).
Preprocessing/Patchification: For vision applications, images are first patchified and then subjected to a 2D DWT, generating multiresolution tokens for transformer processing. In models such as Wave-ViT, wavelet-based downsampling is applied to the key/value branches, reducing the sequence length invertibly and retaining high-frequency details that pooling would discard (Yao et al., 2022, Zhu et al., 2024).
Intermediate Representation Augmentation: Architectures like Wavelet GPT inject wavelet-mixed versions of intermediate activations at each layer, fusing fine- and coarse-scale content into the transformer’s internal state (Verma, 2024).
Attention in Wavelet Domain: Several approaches (e.g., WavSpA, Multiscale Wavelet Attention) reformulate the attention operation itself in the wavelet domain: tokens are first decomposed, attention is performed on the concatenated coefficient streams, and then the result is reconstructed to the original domain (Zhuang et al., 2022, Nekoozadeh et al., 2023).

Integration does not require significant modification of the rest of the transformer pipeline: feed-forward, normalization, and residual connections are preserved.
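As a rough sketch of the "replacement of self-attention" design, the block below swaps the attention sublayer for a pluggable token mixer while keeping the residual connections, normalization, and feed-forward sublayer. Everything here is an illustrative assumption rather than any cited model's code: `haar_token_mixer` is a hypothetical one-level mixer with fixed scalar gains per band, layer normalization has no learned scale or shift, and the block uses a post-norm layout.

```python
import math


def layer_norm(v, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(v) / len(v)
    var = sum((u - mean) ** 2 for u in v) / len(v)
    return [(u - mean) / math.sqrt(var + eps) for u in v]


def haar_token_mixer(tokens, approx_gain=1.0, detail_gain=1.0):
    """Haar analysis along the token axis, per-band scaling, then synthesis."""
    s = math.sqrt(2.0)
    out = []
    for i in range(len(tokens) // 2):
        x0, x1 = tokens[2 * i], tokens[2 * i + 1]
        a = [approx_gain * (u + v) / s for u, v in zip(x0, x1)]
        d = [detail_gain * (u - v) / s for u, v in zip(x0, x1)]
        out.append([(p + q) / s for p, q in zip(a, d)])
        out.append([(p - q) / s for p, q in zip(a, d)])
    return out


def feed_forward(v, w1, w2):
    """Position-wise two-layer MLP with ReLU (w1: hidden x d, w2: d x hidden)."""
    hidden = [max(0.0, sum(w * u for w, u in zip(row, v))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def encoder_block(tokens, mixer, w1, w2):
    """Post-norm block: the mixer sits where self-attention would normally be."""
    mixed = mixer(tokens)
    normed = [layer_norm([u + m for u, m in zip(x, y)])
              for x, y in zip(tokens, mixed)]
    out = []
    for x in normed:
        f = feed_forward(x, w1, w2)
        out.append(layer_norm([u + g for u, g in zip(x, f)]))
    return out
```

With both gains equal to 1 the mixer is the identity, and with `detail_gain=0.0` each token pair is replaced by its average, so the gains control how much fine-scale content passes through the block.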

3. Wavelet Embedding Construction and Aggregation

The process of building wavelet-token embeddings proceeds as follows:

Multi-Level Decomposition: Multiple recursive levels of DWT produce sets of approximation (low-frequency) and detail (high-frequency) coefficients per scale.
Aggregation: The collection of detail and final approximation coefficients is upsampled/tiled to the original sequence length, concatenated or combined via weighting, and then linearly projected to the desired embedding space. For LMWT,

$\mathbf{Y}_{\text{wavelet}} = \text{Combine}\bigl(\{\mathbf{d}^{(l)}\}, \mathbf{a}^{(L-1)}\bigr)\;\mathbf{W}_{\text{out}}$

where $\mathbf{W}_{\text{out}}$ is trainable (Kiruluta et al., 8 Apr 2025).

Quantized Token Streams (Discrete Tokenization): For generative modeling, DWT coefficients can be quantized into discrete token vocabularies (often with VQ-VAE codebooks), and concatenated in coarse-to-fine scale order for input to autoregressive transformers (Masserano et al., 2024, Esteves et al., 2024). This allows conditional generation, interactive preview, and scale-causal decoding.
Fusion with Raw Features: Signal models (e.g., WaveFormer) may concatenate wavelet-domain features with raw time-domain features at the patch or token level before transformer encoding, maximizing information available to downstream layers (Irani et al., 12 Feb 2026).
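The aggregation step can be illustrated with a short sketch. It assumes one particular choice of Combine, namely nearest-neighbor tiling of each coefficient stream back to the original sequence length followed by per-position concatenation. The source leaves the combination rule open (concatenation or weighting), so the helper names `tile_to_length`, `combine`, and `project` and the tiling scheme are assumptions made for illustration.

```python
def tile_to_length(coeffs, length):
    """Repeat each coefficient vector so the stream has `length` rows.

    Assumes `length` is an exact multiple of len(coeffs).
    """
    repeat = length // len(coeffs)
    return [list(c) for c in coeffs for _ in range(repeat)]


def combine(details, final_approx, length):
    """Concatenate, per position, the tiled details of every level
    and the tiled final approximation."""
    streams = [tile_to_length(d, length) for d in details]
    streams.append(tile_to_length(final_approx, length))
    rows = []
    for t in range(length):
        row = []
        for stream in streams:
            row.extend(stream[t])
        rows.append(row)
    return rows


def project(rows, w_out):
    """Right-multiply each row by W_out (shape: in_dim x out_dim)."""
    out_dim = len(w_out[0])
    return [[sum(r[k] * w_out[k][j] for k in range(len(r)))
             for j in range(out_dim)]
            for r in rows]
```

For a two-level decomposition of four tokens with embedding dimension 1, `combine` yields four rows of width 3 (two detail levels plus the final approximation), which `project` then maps to the target embedding width.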

4. Computational Complexity and Efficiency

Wavelet-token embedding mechanisms are engineered for computational scalability:

Linear Complexity: A length-$N$ sequence (or $N$ tokens for images) admits $O(N)$ cost for multi-level wavelet transformation, as opposed to $O(N^2)$ for self-attention (Kiruluta et al., 8 Apr 2025).
Token Length Reduction: Vision models can aggressively reduce spatial resolution via hierarchical DWT without irreversible loss of detail, significantly curtailing attention bottlenecks (Yao et al., 2022, Zhu et al., 2024).
Multiresolution Architectures: Hierarchical Resolution Transformers (HRT) utilize wavelet-inspired filter banks and exponential sequence reduction to achieve $O(N \log N)$ complexity, with multi-stage processing at variable resolutions (Sar et al., 24 Sep 2025). Memory and latency reductions of up to 42% and 37% respectively are reported, with accuracy gains over baselines.
Spectrum-Preserving Attention: Competitive models such as Multiscale Wavelet Transformers for operator learning maintain high-frequency tokens explicitly, reducing spectral bias and stabilizing long-horizon dynamics (Wang et al., 1 Feb 2026).
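The linear-versus-quadratic contrast follows from a geometric series: a recursive pairwise decomposition touches $N/2 + N/4 + \dots$ pairs, which is fewer than $N$ in total, whereas self-attention scores every token against every other. The sketch below simply counts these operations. The function names are hypothetical, and each counted operation is assumed to cost work proportional to the embedding dimension in both cases.

```python
def wavelet_pair_ops(n, levels=None):
    """Count pairwise combine steps in a recursive decomposition of n tokens.

    If levels is None, recurse until a single approximation vector remains.
    """
    ops, length, level = 0, n, 0
    while length >= 2 and (levels is None or level < levels):
        ops += length // 2  # one approximation/detail pair per two tokens
        length //= 2        # only the approximations feed the next level
        level += 1
    return ops


def attention_score_ops(n):
    """Count query-key score evaluations in full self-attention."""
    return n * n
```

For `n = 1024`, the full decomposition needs 1023 pair operations, against 1,048,576 attention scores.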

5. Learnable and Adaptive Wavelet Parameters

Learnability is a critical extension over classical, fixed wavelet transforms:

Trainable Wavelet Coefficients: In LMWT, all transform coefficients ($\boldsymbol\alpha$, $\boldsymbol\beta$, $\boldsymbol\gamma$, $\boldsymbol\delta$ per scale and dimension) are fully trainable, initialized near Haar values and adapted via backpropagation to optimize the decomposition for the data and task (Kiruluta et al., 8 Apr 2025).
Learned vs. Fixed Filter Banks: Adaptive strategies for wavelet filters include direct parameterization with quadrature-mirror constraints, orthogonal parameterizations (e.g., via Givens rotations), and learning via lifting schemes (Zhuang et al., 2022).
Position and Scale Embedding: Positional encodings can be generalized by using discrete wavelet transforms over position indices with learnable scales and shifts, providing length-extrapolating, multi-scale-aware bias terms in attention (Oka et al., 4 Feb 2025).

Learned wavelet parameters enable models to flexibly trade off between localized and global features, modifying the energy compaction and interpretability of the resulting representations.
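One way to see how an orthogonal parameterization keeps a learned two-tap filter pair well behaved is to drive both filters from a single angle. The sketch below is an assumption-laden illustration in the spirit of the rotation-based parameterizations mentioned above, not code from any cited work: it uses a 2x2 orthogonal matrix in reflection form so that $\theta = \pi/4$ reproduces the Haar signs exactly, and the names `orthogonal_pair`, `analyze`, and `energy` are hypothetical.

```python
import math


def orthogonal_pair(theta):
    """Low-pass and high-pass two-tap filters forming an orthogonal 2x2 matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return (c, s), (s, -c)


def analyze(x, theta):
    """One-level analysis of an even-length signal with the angle-driven filters."""
    (l0, l1), (h0, h1) = orthogonal_pair(theta)
    half = len(x) // 2
    a = [l0 * x[2 * i] + l1 * x[2 * i + 1] for i in range(half)]
    d = [h0 * x[2 * i] + h1 * x[2 * i + 1] for i in range(half)]
    return a, d


def energy(v):
    """Sum of squares of a coefficient list."""
    return sum(u * u for u in v)
```

Because the matrix stays orthogonal for every `theta`, the transform preserves signal energy throughout training, and only the split between approximation and detail coefficients changes as the angle is learned.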

6. Empirical Results and Applications

Wavelet-token embedding has shown broad empirical advantages:

Sequence Modeling: In machine translation (WMT16 En→De), LMWT achieved BLEU scores close to the baseline (27.2 vs 27.8) at 30–50% higher training speed (Kiruluta et al., 8 Apr 2025). Interpretability is demonstrated through learned coefficient heatmaps exhibiting structured multi-scale patterns.
Time Series Forecasting: Discrete, wavelet-token approaches such as WaveToken outperform foundation models and deep learning baselines across 42 datasets, reflecting improved quantile losses, point error, and spectral metrics (Masserano et al., 2024). W-Transformers (MODWT plus local transformers) also rank first on classical forecasting metrics on seven benchmarks (Sasal et al., 2022).
Vision: On ImageNet-1K, wavelet-based tokenizers increase throughput and improve top-1 accuracy relative to patch-based ViTs, especially for large images and small token counts (Zhu et al., 2024, Yao et al., 2022). Spectral Image Tokenizer methods facilitate partial decoding, multiscale upsampling, and interactive editing (Esteves et al., 2024).
Dynamical Systems and Operator Learning: Multi-scale Wavelet Transformers demonstrate reduced error and spectral bias in multi-step forecasts on Navier–Stokes and climate data (Wang et al., 1 Feb 2026).
Language: HRTs surpass BERT/GPT-style baselines on GLUE, SuperGLUE, and LRA by 3–6 points, with substantial efficiency gains. Wavelet GPT achieves up to 85% reduction in pretraining steps required for equivalent performance across modalities, with no increase in parameter count for fixed kernels (Verma, 2024, Sar et al., 24 Sep 2025).

7. Interpretability and Theoretical Implications

Hierarchically organized wavelet coefficients yield interpretable, physically and linguistically meaningful representations:

Visualization: Learned wavelet coefficients reveal scale-specific activation patterns. Fine scales correspond to localized, high-frequency regions (e.g., token-level syntactic features, image textures), while coarse scales encode long-range dependencies and global context (Kiruluta et al., 8 Apr 2025, Esteves et al., 2024).
Decomposition-Driven Generalization: The energy-compacting property of wavelets enables compact vocabularies (e.g., 1024 tokens) sufficient for high-fidelity modeling and generalization across in- and out-of-domain tasks (Masserano et al., 2024).
Extrapolative Capacity: Multi-scale, wavelet-inspired positional encodings allow transformers to extrapolate to sequences much longer than the training regime without truncating the receptive field or introducing bias decay (Oka et al., 4 Feb 2025).
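The energy-compaction property referred to above can be checked numerically on a toy signal. The sketch below uses a fixed Haar transform on a power-of-two-length list and reports how the signal energy is distributed across scales. The function names are hypothetical, and the example is a classical signal-processing illustration, not a reproduction of any cited experiment.

```python
import math


def haar_multilevel(x):
    """Full Haar decomposition of a power-of-two-length signal.

    Returns (details ordered fine to coarse, final approximation).
    """
    s = math.sqrt(2.0)
    details, current = [], list(x)
    while len(current) > 1:
        half = len(current) // 2
        a = [(current[2 * i] + current[2 * i + 1]) / s for i in range(half)]
        d = [(current[2 * i] - current[2 * i + 1]) / s for i in range(half)]
        details.append(d)
        current = a
    return details, current


def energy_shares(x):
    """Fraction of total signal energy held by each detail level and by the
    final approximation."""
    details, approx = haar_multilevel(x)
    total = sum(v * v for v in x)
    detail_shares = [sum(v * v for v in d) / total for d in details]
    approx_share = sum(v * v for v in approx) / total
    return detail_shares, approx_share
```

For the ramp `[1, 2, ..., 8]`, the single final approximation coefficient carries 162/204 (about 79%) of the energy, while the four finest detail coefficients together carry only 2/204. This is why a small number of coarse coefficients can represent smooth signals well.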

The wavelet-token embedding paradigm systematically aligns transformer computations with the inherent multi-resolution structure observed in language, vision, time series, and dynamical systems, providing both computational efficiency and modeling fidelity through explicit, learnable multiscale representations.