Dynamic Wavelet Positional Encoding

Updated 2 March 2026

Dynamic Wavelet Positional Encoding (DyWPE) is a signal-aware method that embeds multi-scale temporal–spectral dynamics into transformer token representations using wavelet transforms.
It applies multi-level discrete wavelet decomposition and dynamic gating to extract and modulate local and global signal features for improved encoding.
Empirical results indicate that DyWPE boosts performance on non-stationary signals and long-context tasks while maintaining computational efficiency.

Dynamic Wavelet Positional Encoding (DyWPE) is a framework for signal-aware positional encoding in transformer models that leverages wavelet transforms to generate multi-scale, dynamically adapted embeddings. By deriving position information directly from the input signal content, rather than from token indices alone, DyWPE incorporates local temporal–spectral dynamics into token representations. This approach is particularly useful in domains with non-stationary and multi-scale signals, such as time series, biomedical data, and long-context natural language processing.

1. Motivation and Context

Standard positional encodings (PEs) for transformers—including sinusoidal, learned absolute, and relative schemes such as RoPE and ALiBi—are fundamentally signal-agnostic; they encode position as a function of the token index, assigning identical vectors to points at the same index regardless of actual signal dynamics. In non-stationary time series, this is suboptimal: two segments at the same timestep but with fundamentally different local content (e.g., a transient spike versus a low-frequency drift) are indistinguishable to the transformer at the embedding stage. The model is thereby forced to learn time–frequency associations from scratch in upper layers, which is inefficient and can limit both performance and generalization (Irani et al., 18 Sep 2025, Irani et al., 12 Feb 2026).

2. Wavelet Theory and Signal-Aware Encoding

Wavelets provide a principled means of multi-scale, time–frequency decomposition, well suited to signals with both localized transients and long-range structure. A $J$-level discrete wavelet transform (DWT) decomposes the input into approximation coefficients $cA_J$ (capturing coarse, low-frequency trends) and detail coefficients $cD_j$ for $j = 1, \ldots, J$ (capturing progressively finer oscillatory components as $j$ decreases). By extracting these coefficients directly from the observed input, DyWPE endows positional encodings with information about both local rhythms and global structure. This inductive bias aligns with classical signal processing, promoting representations that are sensitive to genuine signal features, including non-stationary or multi-band phenomena (Irani et al., 18 Sep 2025, Irani et al., 12 Feb 2026).
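
The decomposition described above can be illustrated with a minimal NumPy sketch. It assumes the Haar (db1) wavelet and a signal length divisible by $2^J$; the function name `haar_dwt` and the test signal are illustrative choices, not taken from the cited work.

```python
import numpy as np

def haar_dwt(x, levels):
    """Multi-level Haar DWT. Returns (cA_J, [cD_1, ..., cD_J])."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        if approx.size % 2:
            raise ValueError("length must be divisible by 2**levels")
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # cD_j: local differences
        approx = (even + odd) / np.sqrt(2.0)         # cA_j: local averages
    return approx, details

# Slow drift plus a single transient spike at t = 40.
t = np.arange(64)
signal = 0.05 * t
signal[40] += 5.0

cA, cDs = haar_dwt(signal, levels=3)
print("cA_3 length:", cA.size)                       # 8
print("cD lengths:", [d.size for d in cDs])          # [32, 16, 8]
print("largest |cD_1| at index:", int(np.argmax(np.abs(cDs[0]))))  # 20
```

The spike at $t = 40$ dominates the finest detail coefficients at index 20 (the pair covering samples 40 and 41), while the drift is carried mainly by $cA_3$. This is the sense in which the coefficients encode both where and at what scale something happens.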

3. Mathematical Formulation

DyWPE operates in several key stages:

Mono-channel Projection: For multivariate input $X \in \mathbb{R}^{L \times C}$, a learnable projection $w_{channel}$ is applied to obtain a single sequence $x_{mono}[t] = \sum_{c=1}^{C} X[t, c] \cdot w_{channel}[c]$.
Multi-level DWT: A $J$-level wavelet decomposition of $x_{mono}$ is performed, yielding $cA_J$ and $\{ cD_j \}_{j=1}^J$.
Dynamic Gating: Each scale’s coefficients are modulated by a learned, signal-dependent gating function:

$\text{gate}(e, c) = [\sigma(W_g e) \odot \tanh(W_v e)] \otimes c'$

where $e$ is an embedding derived from the coefficients, $W_g$ and $W_v$ are learnable weights, $\sigma$ denotes the sigmoid nonlinearity, $\odot$ denotes element-wise multiplication, and $\otimes$ applies the resulting gate to $c'$, the coefficients $c$ broadcast or up/downsampled appropriately (Irani et al., 12 Feb 2026).

Inverse DWT: The modulated coefficients are recombined by perfect-reconstruction inverse DWT, yielding the positional encoding at the original sequence length $L$.
Integration: The positional encoding is added to (or concatenated with) the transformer's token or patch embeddings ahead of the encoder stack (Irani et al., 18 Sep 2025).
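
The stages listed above can be sketched end to end in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the Haar wavelet, requires $L$ divisible by $2^J$, uses randomly initialized stand-ins for the learnable parameters, and builds the embedding $e$ from two summary statistics of each scale's coefficients (a hypothetical choice). The names `dywpe_sketch`, `haar_dwt`, and `haar_idwt` are illustrative.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_dwt(x, levels):
    """Haar analysis along axis 0. Returns [cA_J, cD_J, ..., cD_1]."""
    approx, details = np.asarray(x, dtype=float), []
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / SQRT2)
        approx = (even + odd) / SQRT2
    return [approx] + details[::-1]

def haar_idwt(coeffs):
    """Haar synthesis along axis 0; exact inverse of haar_dwt."""
    approx = coeffs[0]
    for detail in coeffs[1:]:
        up = np.empty((2 * approx.shape[0],) + approx.shape[1:])
        up[0::2] = (approx + detail) / SQRT2
        up[1::2] = (approx - detail) / SQRT2
        approx = up
    return approx

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dywpe_sketch(X, w_channel, W_g, W_v, levels):
    """Return a signal-dependent encoding of shape (L, d_model)."""
    x_mono = X @ w_channel                          # channel projection -> (L,)
    gated = []
    for c in haar_dwt(x_mono, levels):              # J + 1 coefficient sets
        # Hypothetical choice of e: two summary statistics of this scale.
        e = np.array([np.mean(np.abs(c)), np.std(c)])
        g = sigmoid(W_g @ e) * np.tanh(W_v @ e)     # gate vector, (d_model,)
        gated.append(c[:, None] * g[None, :])       # broadcast -> (len(c), d_model)
    return haar_idwt(gated)                         # inverse DWT -> (L, d_model)

rng = np.random.default_rng(0)
L, C, d_model, J = 64, 3, 8, 3
X = rng.standard_normal((L, C))
w_channel = rng.standard_normal(C)
W_g = rng.standard_normal((d_model, 2))
W_v = rng.standard_normal((d_model, 2))

P = dywpe_sketch(X, w_channel, W_g, W_v, levels=J)
tokens = rng.standard_normal((L, d_model))          # stand-in token embeddings
encoder_input = tokens + P                          # additive integration
print(P.shape)                                      # (64, 8)
```

Because the gate depends on the coefficients themselves, two inputs of the same length generally receive different encodings, which is the property that distinguishes this scheme from index-based encodings.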

This process is summarized by the following pipeline:

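Stage	Operation	Output Shape
Channel Projection	$x_{mono} = X w_{channel}$	length-$L$ sequence
Multi-level DWT	$J$-level DWT of $x_{mono}$	$J+1$ coefficient sets
Dynamic Gating	$\text{gate}(e, c)$ per scale	gated coefficient sets
Inverse DWT	inverse DWT of the gated coefficients	positional encoding at sequence length $L$
Patchwise Summing	Sample at patch centers/aggregate	one encoding per patch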
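
The final row of the table can be sketched as follows. This is a minimal illustration assuming non-overlapping patches and a per-time-step encoding of shape `(L, d)`; the function name `patchify_encoding` and its `mode` argument are hypothetical.

```python
import numpy as np

def patchify_encoding(P, patch_len, mode="center"):
    """Reduce a per-time-step encoding (L, d) to one row per patch."""
    L, d = P.shape
    n_patches = L // patch_len              # any trailing remainder is dropped
    patches = P[: n_patches * patch_len].reshape(n_patches, patch_len, d)
    if mode == "center":
        return patches[:, patch_len // 2]   # sample at each patch center
    if mode == "mean":
        return patches.mean(axis=1)         # aggregate over the patch
    raise ValueError(f"unknown mode: {mode}")

P = np.arange(24, dtype=float).reshape(12, 2)
print(patchify_encoding(P, patch_len=4, mode="center"))
# [[ 4.  5.]
#  [12. 13.]
#  [20. 21.]]
print(patchify_encoding(P, patch_len=4, mode="mean"))
# [[ 3.  4.]
#  [11. 12.]
#  [19. 20.]]
```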

4. Comparison to Traditional and Multi-Scale Position Encodings

Classical PEs such as sinusoidal or learned absolute schemes generate static vectors for each integer index, lacking sensitivity to signal characteristics (Irani et al., 18 Sep 2025). Relative schemes like RoPE encode relative distances but are limited by fixed scale and lack spectral diversity—RoPE, for instance, can be interpreted as a single-scale, Haar-like wavelet and exhibits poor extrapolation to out-of-domain sequence lengths (Oka et al., 4 Feb 2025).

ALiBi introduces attention bias windows at several fixed slopes, restricting attention span rather than encoding multi-scale structure. DyWPE overcomes these deficiencies by:

Adapting PEs to the actual observed signal, resulting in input-specific encodings.
Incorporating multiple temporal scales via wavelet decomposition, with both fine and coarse frequency content.
Providing smooth, non-zero tails in position bias, so attention fields are not artificially limited and extrapolation to long contexts is supported (Oka et al., 4 Feb 2025).
Explicitly modulating the impact of each scale’s coefficients via dynamic gating, allowing adaptation to local signal characteristics (Irani et al., 12 Feb 2026).
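
The contrast between index-based and signal-aware encodings can be made concrete with a small sketch. It compares the standard sinusoidal encoding with a single level of Haar detail coefficients for two different signals of equal length; the embedding width of 8 and the two test signals are arbitrary illustrative choices.

```python
import numpy as np

def sinusoidal_pe(length, d_model):
    """Standard sinusoidal PE; depends only on position index (d_model even)."""
    pos = np.arange(length)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000.0 ** (i / d_model))
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def haar_detail(x):
    """Finest-scale Haar detail coefficients (one DWT level)."""
    return (x[0::2] - x[1::2]) / np.sqrt(2.0)

t = np.arange(32)
drift = 0.1 * t                      # slow trend
spike = np.zeros(32)
spike[16] = 1.0                      # isolated transient

# Index-based encoding: identical for both signals.
print(np.array_equal(sinusoidal_pe(len(drift), 8), sinusoidal_pe(len(spike), 8)))  # True
# Wavelet coefficients: differ because the signals differ.
print(np.allclose(haar_detail(drift), haar_detail(spike)))                         # False
```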

Empirical results show DyWPE yields superior performance in long-context and non-stationary settings, outperforming RoPE and ALiBi on perplexity and classification accuracy benchmarks (Irani et al., 18 Sep 2025, Oka et al., 4 Feb 2025).

5. Implementation Variants and Hyperparameterization

DyWPE supports a range of wavelet families: Daubechies (db1/Haar, db2, db4), Symlet-4, Ricker, and Gaussian. Experiments indicate that db4 (eight-tap, orthogonal) offers a good trade-off between time localization and frequency resolution for biomedical signals (Irani et al., 12 Feb 2026). The number of decomposition levels $J$ is typically set between 1 and 4, with accuracy usually peaking within this range before returns diminish.
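
When choosing $J$, it helps to know the deepest decomposition a given sequence length can support for a given filter. The sketch below uses a common rule of thumb, $\lfloor \log_2(L / (K - 1)) \rfloor$ for filter length $K$; the helper name `max_dwt_level` is hypothetical, and the length 1152 is borrowed from the SelfRegulationSCP2 example in this article.

```python
import math

def max_dwt_level(signal_len, filter_len):
    """Rule-of-thumb upper bound on J: floor(log2(signal_len / (filter_len - 1)))."""
    if filter_len < 2 or signal_len < filter_len - 1:
        return 0
    return int(math.floor(math.log2(signal_len / (filter_len - 1))))

# Filter lengths: a Daubechies-N wavelet has 2N taps.
FILTER_LEN = {"db1": 2, "db2": 4, "db4": 8}

for name, taps in FILTER_LEN.items():
    print(name, max_dwt_level(1152, taps))
# db1 10
# db2 8
# db4 7
```

For sequences of this length, the typical range of 1 to 4 levels sits well below the bound for all three filters, so the choice of $J$ is driven by accuracy rather than by signal length.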

For relative-position DyWPE in language modeling, the base wavelet (often Ricker) and the scale/shift grid are selected to maximize coverage and spectral diversity across embedding dimensions. Key points include:

The scale/shift grid is laid out across the embedding dimensions: a set of scales is chosen, each scale is assigned a set of shifts, and the total number of basis wavelets equals the number of (scale, shift) pairs.
Amplitude normalization and z-scoring are used to stabilize the magnitude of the encoding across sequences, controlled by a learnable scaling factor.
Ablation studies indicate that both multi-scale decomposition and signal-aware gating are essential; disabling either leads to 1–2% average drops in accuracy, with the largest declines on long or highly non-stationary series (Irani et al., 18 Sep 2025, Irani et al., 12 Feb 2026).

Parameter sweeps show that performance plateaus at moderate scale diversity; too many scales can introduce redundancy, while too few degrade fine-detail sensitivity (Oka et al., 4 Feb 2025).
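
A scale/shift grid of Ricker wavelets, followed by z-scoring, might be sketched as below. Several details are assumptions made for illustration only: the dyadic scales, the shift grid (integer multiples of each scale), normalizing each wavelet row, and the constant `gamma` standing in for the learnable scaling factor. The function names are hypothetical.

```python
import numpy as np

def ricker(t, scale):
    """Ricker (Mexican hat) wavelet of width `scale`."""
    u = t / scale
    norm = 2.0 / (np.sqrt(3.0 * scale) * np.pi ** 0.25)
    return norm * (1.0 - u**2) * np.exp(-0.5 * u**2)

def wavelet_bank(rel_pos, scales, shifts_per_scale):
    """One row per (scale, shift) pair, evaluated at the given relative positions."""
    rows = []
    for a in scales:
        for k in range(shifts_per_scale):
            b = k * a                       # assumed shift grid: multiples of the scale
            rows.append(ricker(rel_pos - b, a))
    return np.stack(rows)

def zscore(v, eps=1e-8):
    """Standardize each row to zero mean and (approximately) unit variance."""
    mean = v.mean(axis=-1, keepdims=True)
    std = v.std(axis=-1, keepdims=True)
    return (v - mean) / (std + eps)

rel_pos = np.arange(-128, 129, dtype=float)
scales = [1.0, 2.0, 4.0, 8.0]
bank = wavelet_bank(rel_pos, scales, shifts_per_scale=4)
gamma = 1.0                                 # placeholder for the learnable scale
normalized = gamma * zscore(bank)
print(bank.shape)                           # (16, 257): 4 scales x 4 shifts
```

Adding scales widens the range of distances the bank can resolve, but neighbouring scales overlap heavily, which is one plausible reading of the redundancy reported in the parameter sweeps.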

6. Empirical Results and Computational Efficiency

Across diverse datasets—including EEG, human activity recognition, and device/occupancy sensors—DyWPE achieves top-1 or top-2 accuracy ranks against eight state-of-the-art PE baselines. On Sleep EEG, DyWPE improves top-1 accuracy from 84.1% (sinusoidal) to 88.2%, and for SelfRegulationSCP2 (length 1152) from 51.2% to 61.2%, with average relative improvements of 9.1% on biomedical signals (Irani et al., 18 Sep 2025, Irani et al., 12 Feb 2026).

In language modeling, DyWPE exhibits strong extrapolation and robustness to long contexts, maintaining low perplexity beyond the training sequence lengths where RoPE or fixed-scale methods fail (Oka et al., 4 Feb 2025).

Computationally, DyWPE is efficient: its time and space costs scale favorably with sequence length, and the training overhead is modest (1.48× that of sinusoidal PE). Efficient implementations utilize precomputation and scatter tricks to keep memory growth bounded, even for long contexts (Oka et al., 4 Feb 2025, Irani et al., 12 Feb 2026).
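
As a point of reference for the cost of the wavelet stage alone, the standard pyramid algorithm for a $J$-level DWT is linear in the sequence length. The short derivation below assumes a filter of length $K$, an input of length $L$ divisible by $2^J$, and no boundary padding; it does not account for the gating or integration steps.

```latex
% Level j produces L/2^j approximation and L/2^j detail outputs,
% each costing K multiply-adds.
\text{ops} \;=\; \sum_{j=1}^{J} 2K \cdot \frac{L}{2^{j}}
          \;=\; 2KL \left(1 - 2^{-J}\right) \;<\; 2KL

% Number of stored coefficients: cA_J plus all detail sets.
\frac{L}{2^{J}} + \sum_{j=1}^{J} \frac{L}{2^{j}} \;=\; L
```

Both the operation count and the coefficient storage are therefore bounded by a constant multiple of $L$, independent of the number of levels $J$.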

7. Extensions and Future Prospects

DyWPE generalizes across both absolute and relative-position encoding paradigms:

As an absolute encoding, it provides signal-aware offsets to token embeddings in time series transformers.
As a relative encoding, it supplies a bank of multi-scale, smoothly decaying biases for self-attention, enabling unlimited receptive field and effective extrapolation (Oka et al., 4 Feb 2025).
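
How a bank of smoothly decaying, multi-scale biases could enter self-attention is sketched below. This is an illustrative construction rather than the formulation used in the cited work: the bias is a weighted sum of unnormalized Ricker shapes evaluated at the query-key offset, and the scales and weights are arbitrary placeholders for learned values.

```python
import numpy as np

def ricker(t, scale):
    """Unnormalized Ricker shape; decays smoothly with |t| / scale."""
    u = t / scale
    return (1.0 - u**2) * np.exp(-0.5 * u**2)

def multiscale_bias(length, scales, weights):
    """bias[i, j] depends only on the offset i - j (a Toeplitz matrix)."""
    idx = np.arange(length, dtype=float)
    offsets = idx[:, None] - idx[None, :]
    return sum(w * ricker(offsets, a) for w, a in zip(weights, scales))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_bias(Q, K, V, bias):
    """Scaled dot-product attention with an additive relative-position bias."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1]) + bias
    return softmax(logits) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 4)) for _ in range(3))
scales, weights = [1.0, 4.0, 16.0], [1.0, 0.5, 0.25]

bias_short = multiscale_bias(8, scales, weights)
bias_long = multiscale_bias(16, scales, weights)
out = attention_with_bias(Q, K, V, bias_long)
print(out.shape)                                    # (16, 4)
print(np.allclose(bias_long[:8, :8], bias_short))   # True
```

Because the bias is a function of the offset only, the same kernel can be evaluated at sequence lengths not seen in training, and the longer matrix agrees with the shorter one wherever they overlap.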

Further research directions include:

Learning custom or adaptive wavelet bases beyond standard families.
Data-driven selection of decomposition level $J$.
Integration with or augmentation of rotary or hybrid relative schemes.
Extension to regression/forecasting, irregularly-sampled time series, and cross-modal applications (audio, vibration, ECG) (Irani et al., 18 Sep 2025).
Robustness to adversarial or distributional shift by leveraging classical signal-processing priors.

By directly coupling positional encoding with observed signal content and local temporal–spectral structure, DyWPE establishes a new class of positional representations that unify data-driven and inductive-bias-driven modeling for time series and long-context transformers (Irani et al., 18 Sep 2025, Irani et al., 12 Feb 2026, Oka et al., 4 Feb 2025).