
Dynamic Wavelet Positional Encoding

Updated 2 March 2026
  • Dynamic Wavelet Positional Encoding (DyWPE) is a signal-aware method that embeds multi-scale temporal–spectral dynamics into transformer token representations using wavelet transforms.
  • It applies multi-level discrete wavelet decomposition and dynamic gating to extract and modulate local and global signal features for improved encoding.
  • Empirical results indicate that DyWPE boosts performance on non-stationary signals and long-context tasks while maintaining computational efficiency.

Dynamic Wavelet Positional Encoding (DyWPE) is a framework for signal-aware positional encoding in transformer models that uses wavelet transforms to generate multi-scale, dynamically adapted embeddings. By deriving position information directly from the input signal content, rather than from token indices alone, DyWPE incorporates local temporal–spectral dynamics into token representations. This approach is most relevant in domains with non-stationary and multi-scale signals, such as time series, biomedical data, and long-context natural language processing.

1. Motivation and Context

Standard positional encodings (PEs) for transformers, including sinusoidal, learned absolute, and relative schemes such as RoPE and ALiBi, are fundamentally signal-agnostic: they encode position as a function of the token index alone and assign identical vectors to points at the same index regardless of the actual signal dynamics. In non-stationary time series this is suboptimal: two segments at the same timestep in different sequences, with fundamentally different local content (e.g., a transient spike versus a low-frequency drift), are indistinguishable to the transformer at the embedding stage. The model is thereby forced to learn time–frequency associations from scratch in later layers, which is inefficient and can limit both performance and generalization (Irani et al., 18 Sep 2025, Irani et al., 12 Feb 2026).
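
To make the signal-agnostic point concrete, the following minimal sketch computes the standard sinusoidal encoding for one index. The two example signals (`spike`, `drift`) are hypothetical and exist only to show that the encoding never reads them.

```python
import math

def sinusoidal_pe(position, d_model):
    """Standard sinusoidal positional encoding for a single token index."""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# Two hypothetical signals with very different local content at t = 5.
spike = [0.0] * 5 + [9.0] + [0.0] * 4   # transient spike
drift = [0.1 * t for t in range(10)]    # slow low-frequency drift

# The encoding depends only on the index, so both signals receive
# exactly the same vector at t = 5.
pe_spike = sinusoidal_pe(5, d_model=8)
pe_drift = sinusoidal_pe(5, d_model=8)
```

Because `sinusoidal_pe` takes no signal argument at all, any distinction between the spike and the drift has to be learned downstream.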

2. Wavelet Theory and Signal-Aware Encoding

Wavelets provide a principled means of multi-scale, time–frequency decomposition, ideal for analyzing signals with both localized transients and long-range structure. A discrete wavelet transform (DWT) over levels $j = 1 \ldots J$ decomposes the input into approximation coefficients $cA_J$ (capturing coarse, low-frequency trends) and detail coefficients $cD_j$ (capturing progressively finer oscillatory components). By extracting these coefficients directly from the observed input, DyWPE endows positional encodings with information about both local rhythms and global structure. This inductive bias aligns with classical signal processing, promoting representations that are sensitive to genuine signal features, including non-stationary or multi-band phenomena (Irani et al., 18 Sep 2025, Irani et al., 12 Feb 2026).
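
As an illustration of the decomposition described above, here is a minimal pure-Python sketch of a multi-level DWT using the Haar (db1) wavelet. The helper names `haar_dwt` and `haar_wavedec` and the test signal are assumptions made for this example; a practical implementation would more likely rely on a wavelet library and a longer filter.

```python
import math

def haar_dwt(x):
    """One level of the orthonormal Haar DWT on an even-length sequence."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_wavedec(x, levels):
    """Return (cA_J, [cD_1, ..., cD_J]); len(x) must be divisible by 2**levels."""
    details = []
    approx = list(x)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    return approx, details

# Hypothetical signal: a slow ramp plus one transient spike at t = 5.
signal = [0.5 * t for t in range(8)]
signal[5] += 4.0

cA, cDs = haar_wavedec(signal, levels=2)
print(len(cA), [len(d) for d in cDs])  # 2 [4, 2]
```

In this example the spike shows up as the largest-magnitude entry of the finest detail band `cDs[0]`, while the ramp is carried mostly by `cA`, which is the separation of local and global structure the section describes.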

3. Mathematical Formulation

DyWPE operates in several key stages:

  1. Mono-channel Projection: For multivariate input $X \in \mathbb{R}^{L \times C}$, a learnable projection $w_{channel}$ is applied to obtain a single sequence $x_{mono}[t] = \sum_{c=1}^{C} X[t, c] \cdot w_{channel}[c]$.
  2. Multi-level DWT: $J$-level wavelet decomposition is performed, yielding $cA_J$ and $\{ cD_j \}_{j=1}^{J}$.
  3. Dynamic Gating: Each scale’s coefficients are modulated by a learned, signal-dependent gating function:

$$\text{gate}(e, c) = [\sigma(W_g e) \odot \tanh(W_v e)] \otimes c'$$

where $e$ is an embedding derived from the coefficients, $W_g$ and $W_v$ are learnable weights, $\sigma$ denotes the sigmoid nonlinearity, $\odot$ denotes element-wise multiplication, and $\otimes\, c'$ denotes multiplication by $c'$, the coefficients $c$ after broadcasting or up/downsampling to a compatible shape (Irani et al., 12 Feb 2026).

  4. Inverse DWT: The modulated coefficients are recombined by perfect-reconstruction inverse DWT, yielding the positional encoding $P_{DyWPE}$ of shape $(B, L, d_{model})$.
  5. Integration: $P_{DyWPE}$ is added to (or concatenated with) transformer token/patched embeddings ahead of the encoder stack (Irani et al., 18 Sep 2025).
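
The stages above can be sketched end to end in a few lines. This is a simplified, scalar illustration under stated assumptions, not the authors' implementation: it uses the Haar wavelet, treats $W_g$ and $W_v$ as one scalar pair per scale, and takes the embedding $e$ to be the mean absolute coefficient of that scale (a hypothetical choice). The expansion to $d_{model}$ dimensions and the final addition to token embeddings are omitted.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_dwt(x):
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return out

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate(coeffs, w_g, w_v):
    """Scalar stand-in for [sigma(W_g e) * tanh(W_v e)] applied to c.
    Assumption: e is the mean absolute coefficient of this scale."""
    e = sum(abs(c) for c in coeffs) / len(coeffs)
    g = sigmoid(w_g * e) * math.tanh(w_v * e)
    return [g * c for c in coeffs]

def dywpe_sketch(X, w_channel, levels, w_g, w_v):
    # 1. Mono-channel projection: x_mono[t] = sum_c X[t][c] * w_channel[c]
    x_mono = [sum(v * w for v, w in zip(row, w_channel)) for row in X]
    # 2. Multi-level DWT (sequence length must be divisible by 2**levels)
    approx, details = x_mono, []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    # 3. Dynamic gating: index 0 gates cA_J, index j + 1 gates details[j]
    approx = gate(approx, w_g[0], w_v[0])
    details = [gate(d, w_g[j + 1], w_v[j + 1]) for j, d in enumerate(details)]
    # 4. Inverse DWT, starting from the coarsest level
    for detail in reversed(details):
        approx = haar_idwt(approx, detail)
    return approx  # one positional value per time step, shape (L,)

# Hypothetical input with L = 8 time steps and C = 2 channels.
X = [[math.sin(0.3 * t), 0.1 * t] for t in range(8)]
P = dywpe_sketch(X, w_channel=[0.7, 0.3], levels=2, w_g=[1.0] * 3, w_v=[1.0] * 3)
```

Because the gate values depend on the coefficients themselves, two inputs of the same length generally produce different `P`, unlike an index-based encoding.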

This process is summarized by the following pipeline:

| Stage | Operation | Output Shape |
| --- | --- | --- |
| Channel Projection | $x_{mono} = X @ w_{channel}$ | $(L,)$ |
| Multi-level DWT | $(cA_J, \{cD_j\}_{j=1}^{J}) = DWT_J(x_{mono})$ | $(J+1)$ coefficient sets |
| Dynamic Gating | $\hat{Y}_l, \hat{Y}_h^{(j)} = \text{gate}(\cdots)$ | $(L)$ |
| Inverse DWT | $P_{DyWPE} = IDWT_J(\hat{Y}_l, \{\hat{Y}_h^{(j)}\})$ | $(L,)$ |
| Patchwise Summing | Sample at patch centers/aggregate | $(N,)$ |

4. Comparison to Traditional and Multi-Scale Position Encodings

Classical PEs such as sinusoidal or learned absolute schemes generate static vectors for each integer index, lacking sensitivity to signal characteristics (Irani et al., 18 Sep 2025). Relative schemes like RoPE encode relative distances but are limited by fixed scale and lack spectral diversity—RoPE, for instance, can be interpreted as a single-scale, Haar-like wavelet and exhibits poor extrapolation to out-of-domain sequence lengths (Oka et al., 4 Feb 2025).

ALiBi adds linear attention biases with fixed, head-specific slopes, which effectively restricts attention span rather than encoding multi-scale structure. DyWPE addresses these limitations by:

  • Adapting PEs to the actual observed signal, resulting in input-specific encodings.
  • Incorporating multiple temporal scales via wavelet decomposition, with both fine and coarse frequency content.
  • Providing smooth, non-zero tails in position bias, so attention fields are not artificially limited and extrapolation to long contexts is supported (Oka et al., 4 Feb 2025).
  • Explicitly modulating the impact of each scale’s coefficients via dynamic gating, allowing adaptation to local signal characteristics (Irani et al., 12 Feb 2026).

Empirical results show DyWPE yields superior performance in long-context and non-stationary settings, outperforming RoPE and ALiBi on perplexity and classification accuracy benchmarks (Irani et al., 18 Sep 2025, Oka et al., 4 Feb 2025).

5. Implementation Variants and Hyperparameterization

DyWPE supports a range of wavelet families: Daubechies (db1/Haar, db2, db4), Symlet-4, Ricker, and Gaussian. Experiments indicate db4 (eight-tap orthogonal) offers a trade-off between localization and resolution for biomedical signals (Irani et al., 12 Feb 2026). The number of decomposition levels $J$ is typically set between 1 and 4; $J=3$ often provides optimal accuracy before diminishing returns.
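
As a rough sanity check on the choice of $J$, the sketch below computes the largest useful decomposition level for a given sequence length and filter length, using the rule $\lfloor \log_2(L / (K - 1)) \rfloor$ that PyWavelets documents for `dwt_max_level`. The function name `max_dwt_level` is hypothetical, and the filter lengths are the standard ones for Daubechies wavelets ($2N$ taps for dbN).

```python
import math

def max_dwt_level(data_len, filter_len):
    """Largest level at which at least one coefficient is still unaffected
    by boundary effects: floor(log2(data_len / (filter_len - 1)))."""
    if filter_len < 2 or data_len < filter_len - 1:
        return 0
    return int(math.floor(math.log2(data_len / (filter_len - 1))))

# Filter lengths for the Daubechies wavelets named in the text.
FILTER_LEN = {"db1": 2, "db2": 4, "db4": 8}

# Sequence length 1152 is the SelfRegulationSCP2 length quoted in Section 6.
print(max_dwt_level(1152, FILTER_LEN["db4"]))  # 7
```

Under this rule, the typical range of $J$ between 1 and 4 sits well below the ceiling for a length-1152 series with db4, so the reported plateau at $J=3$ is not a boundary artifact of the sequence length.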

For relative-position DyWPE in language modeling, the base wavelet (often Ricker) and the scale/shift grid are selected to maximize coverage and spectral diversity across embedding dimensions. Key points include:

  • For dimension $d$ and $s$ scales, $d/s$ shifts are assigned, resulting in $s \cdot (d/s) = d$ total basis wavelets.
  • Amplitude normalization and z-scoring are used to stabilize the magnitude of $P_{DyWPE}$ across sequences, controlled by a learnable scaling factor.
  • Ablation studies indicate that both multi-scale decomposition and signal-aware gating are essential; disabling either leads to 1–2% average drops in accuracy, with the largest declines on long or highly non-stationary series (Irani et al., 18 Sep 2025, Irani et al., 12 Feb 2026).
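
The scale/shift bookkeeping in the first two points can be sketched as follows. The dyadic scales, integer shifts, and the choice to z-score across the basis dimension are assumptions made for illustration; the source does not specify the exact grid. The Ricker wavelet is used in its unnormalized form $(1 - t^2)\,e^{-t^2/2}$.

```python
import math

def ricker(t):
    """Ricker (Mexican hat) wavelet, unnormalized: (1 - t^2) * exp(-t^2 / 2)."""
    return (1.0 - t * t) * math.exp(-0.5 * t * t)

def wavelet_grid(d, s):
    """Assign d/s shifts to each of s scales: s * (d/s) = d (scale, shift) pairs.
    Assumption: dyadic scales 1, 2, 4, ... and integer shifts 0, 1, 2, ..."""
    assert d % s == 0, "d must be divisible by s"
    shifts_per_scale = d // s
    return [(2.0 ** k, float(b)) for k in range(s) for b in range(shifts_per_scale)]

def basis_values(rel_pos, grid):
    """Evaluate every basis wavelet psi((r - b) / a) at relative position r."""
    return [ricker((rel_pos - b) / a) for a, b in grid]

def zscore(values):
    """Zero-mean, unit-variance normalization of one feature vector."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # guard against a constant vector
    return [(v - mean) / std for v in values]

grid = wavelet_grid(d=8, s=4)
features = zscore(basis_values(rel_pos=3, grid=grid))
print(len(grid))  # 8
```

Each embedding dimension is thus tied to one (scale, shift) pair, so increasing `s` trades shift resolution for scale diversity at a fixed `d`.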

Parameter sweeps show that performance plateaus at moderate scale diversity; too many scales can introduce redundancy, while too few degrade fine-detail sensitivity (Oka et al., 4 Feb 2025).

6. Empirical Results and Computational Efficiency

Across diverse datasets—including EEG, human activity recognition, and device/occupancy sensors—DyWPE achieves top-1 or top-2 accuracy ranks against eight state-of-the-art PE baselines. On Sleep EEG, DyWPE improves top-1 accuracy from 84.1% (sinusoidal) to 88.2%, and for SelfRegulationSCP2 (length 1152) from 51.2% to 61.2%, with average relative improvements of 9.1% on biomedical signals (Irani et al., 18 Sep 2025, Irani et al., 12 Feb 2026).
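
Since the paragraph mixes absolute accuracies with a relative average, a short calculation helps keep the two apart. The sketch below converts the two accuracy pairs quoted above into relative improvements; the 9.1% average covers further datasets not listed here, so it cannot be reproduced from these two numbers alone.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline

sleep_eeg = relative_improvement(84.1, 88.2)   # Sleep EEG
scp2 = relative_improvement(51.2, 61.2)        # SelfRegulationSCP2
print(f"{sleep_eeg:.2f}% {scp2:.2f}%")  # 4.88% 19.53%
```

The 4.1-point gain on Sleep EEG is therefore a relative gain of about 4.9%, while the 10-point gain on SelfRegulationSCP2 is about 19.5% relative to its lower baseline.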

In language modeling, DyWPE exhibits strong extrapolation and robustness to long contexts, maintaining low perplexity beyond the training sequence lengths where RoPE or fixed-scale methods fail (Oka et al., 4 Feb 2025).

Computationally, DyWPE is efficient, introducing $O(L \cdot d_{model})$ time and space complexity and a modest training overhead (1.48× that of sinusoidal PE). Efficient implementations utilize precomputation and scatter tricks to ensure memory scales as $O(dL)$, even for long contexts (Oka et al., 4 Feb 2025, Irani et al., 12 Feb 2026).
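
The linear dependence on $L$ is consistent with the usual cost of the pyramid (filter-bank) DWT. As a sketch, assume a filter of length $K$ and dyadic downsampling, so that level $j$ consumes $L / 2^{j-1}$ samples and spends about $K$ multiply-adds per input sample (the symbol $K$ and this cost model are assumptions for the derivation, not taken from the cited papers):

```latex
\sum_{j=1}^{J} K \cdot \frac{L}{2^{\,j-1}}
  = K L \sum_{j=1}^{J} 2^{-(j-1)}
  = K L \left(2 - 2^{\,1-J}\right)
  < 2 K L
```

The total is bounded by $2KL$ regardless of $J$, so the transform is $O(L)$ for a fixed filter. If the per-sample work is then repeated across $d_{model}$ output dimensions, the overall cost is $O(L \cdot d_{model})$, matching the figure quoted above.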

7. Extensions and Future Prospects

DyWPE generalizes across both absolute and relative-position encoding paradigms:

  • As an absolute encoding, it provides signal-aware offsets to token embeddings in time series transformers.
  • As a relative encoding, it supplies a bank of multi-scale, smoothly decaying biases for self-attention, enabling unlimited receptive field and effective extrapolation (Oka et al., 4 Feb 2025).
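
The difference between a decaying bias and a linear penalty can be illustrated with a toy attention row. This is a hypothetical sketch, not the formulation of Oka et al.: `ricker_bias` simply sums Ricker wavelets at three assumed scales, `alibi_bias` uses a single assumed slope, and the content scores are set to zero so that only the positional bias shapes the weights.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def alibi_bias(distance, slope=0.5):
    """ALiBi-style linear penalty: grows without bound with distance."""
    return -slope * distance

def ricker_bias(distance, scales=(1.0, 4.0, 16.0)):
    """Hypothetical multi-scale bias: sum of Ricker wavelets, decaying toward 0."""
    total = 0.0
    for a in scales:
        t = distance / a
        total += (1.0 - t * t) * math.exp(-0.5 * t * t)
    return total

def attend(scores, query_pos, bias_fn):
    """Softmax over content scores plus a bias on |query_pos - key_pos|."""
    logits = [s + bias_fn(abs(query_pos - j)) for j, s in enumerate(scores)]
    return softmax(logits)

scores = [0.0] * 64  # uniform content scores isolate the effect of the bias
w_alibi = attend(scores, query_pos=63, bias_fn=alibi_bias)
w_ricker = attend(scores, query_pos=63, bias_fn=ricker_bias)
```

With the linear penalty, the weight on the most distant key (`w_alibi[0]`) is vanishingly small, whereas the decaying bias leaves `w_ricker[0]` at a usable magnitude, which is the sense in which the receptive field is not artificially limited.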

Further research directions include:

  • Learning custom or adaptive wavelet bases beyond standard families.
  • Data-driven selection of decomposition level $J$.
  • Integration with or augmentation of rotary or hybrid relative schemes.
  • Extension to regression/forecasting, irregularly-sampled time series, and cross-modal applications (audio, vibration, ECG) (Irani et al., 18 Sep 2025).
  • Robustness to adversarial or distributional shift by leveraging classical signal-processing priors.

By directly coupling positional encoding with observed signal content and local temporal–spectral structure, DyWPE establishes a new class of positional representations that unify data-driven and inductive-bias-driven modeling for time series and long-context transformers (Irani et al., 18 Sep 2025, Irani et al., 12 Feb 2026, Oka et al., 4 Feb 2025).
