Non-Stationary Transformers

Updated 5 March 2026

Non-Stationary Transformers are specialized architectures designed to adapt Transformer models for data with evolving statistical properties over time or space.
They integrate methods like memory-based meta-learning, de-stationary attention, and wavelet-based decompositions to counteract over-stationarization and capture dynamic patterns.
Empirical benchmarks show improvements in time series forecasting, geospatial modeling, and adaptive reinforcement learning, highlighting their practical benefits in non-stationary regimes.

Non-stationary Transformers constitute a class of architectural variants, training regimes, and theoretical perspectives designed to enable Transformer models to process, predict, or adapt in the presence of non-stationary data, i.e., settings where underlying data distributions, temporal or covariate relationships, or statistical properties change over time or space. Motivated by challenges in time series forecasting, geostatistics, meta-learning, and reinforcement learning, they address the inadequacy of naively applied Transformers in regimes with drift, changepoints, or distributional heterogeneity. Recent research characterizes, extends, and benchmarks Transformer-based solutions to non-stationarity, establishing both algorithmic principles and empirical best practices.

1. Defining Non-Stationarity and Transformer Limitations

Non-stationarity manifests when the probability law governing a sequence or spatial field evolves, resulting in changing joint or marginal statistics over time or space. For sequences, this typically involves drift in mean, variance, autocorrelation, or hidden parameters, as in piecewise stationary processes with latent changepoints (Genewein et al., 2023); for spatial fields, it appears as locally varying trends or correlation scales (Liu et al., 2022).

Standard Transformer models, originally devised for stationary, homogeneous data (e.g., language modeling, vision tasks), rely on weight sharing and fixed attention mechanisms that do not natively adapt to such distributional shifts. Directly training Transformers on raw non-stationary inputs may result in overfitting to transient, local phenomena or failure to capture global behaviors, yielding degraded generalization (Liu et al., 2022). Empirically, two failure modes appear: erratic predictions when models are trained on raw data, and “over-stationarization” when excessive normalization strips informative non-stationarity, so that attention patterns and forecasts become indiscriminately smooth and bursty or abrupt events are attenuated.

2. Algorithmic Frameworks for Non-Stationary Sequence Modeling

Recent advances formalize the adaptation of Transformers to non-stationary scenarios by integrating meta-learning, explicit data transformations, and auxiliary mechanisms.

2.1 Memory-Based Meta-Learning

Memory-based meta-learning (MBML) highlights the capacity of sequence models, notably Transformers, to approximate Bayes-optimal prediction by amortizing the posterior over hidden “tasks”, here segments in a piecewise stationary process (Genewein et al., 2023). Formally, given a discrete task prior $\Psi$ over segmentations $\tau$, the Bayes mixture for sequential prediction is

$\xi(x_t|x_{<t}) = \sum_{\tau} p(\tau|x_{<t})\,\tau(x_t|x_{<t})$

where the posterior weights $p(\tau|x_{<t})$ are updated from the entire sequence history. Transformers trained by minimizing sequential log-loss over streams sampled from realistic priors (e.g., Partition-Tree Weighting) are empirically shown to learn internal representations that embed sufficient statistics of hypothesized segments. This enables approximate Bayesian inference over changepoints and within-segment parameters purely through self-attention memory, without explicit weight updates or gating (Genewein et al., 2023).
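The mixture update can be illustrated on a toy problem. The following is a minimal sketch, assuming a small finite class of Bernoulli predictors in place of the segmentation prior described above; the function name `bayes_mixture_predict` and the parameter grid are illustrative assumptions, not taken from the cited work.

```python
import math

def bayes_mixture_predict(history, thetas, prior):
    """Return (xi, posterior) for a finite class of Bernoulli hypotheses.

    xi is the mixture probability that the next bit equals 1, and
    posterior holds the weights p(hypothesis | history).
    """
    # Unnormalized log-posterior: log prior plus log-likelihood of the history.
    log_w = []
    for theta, p in zip(thetas, prior):
        ll = sum(math.log(theta if x == 1 else 1.0 - theta) for x in history)
        log_w.append(math.log(p) + ll)
    # Normalize in a numerically stable way.
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    z = sum(w)
    posterior = [v / z for v in w]
    # Mixture prediction: posterior-weighted average of per-hypothesis predictions.
    xi = sum(wk * theta for wk, theta in zip(posterior, thetas))
    return xi, posterior

# Hypothetical hypothesis class and uniform prior (assumptions for illustration).
thetas = [0.1, 0.5, 0.9]
prior = [1 / 3, 1 / 3, 1 / 3]

xi_empty, _ = bayes_mixture_predict([], thetas, prior)
xi_ones, post_ones = bayes_mixture_predict([1] * 10, thetas, prior)
```

With an empty history the prediction is the prior average; after a run of ones, the posterior concentrates on the hypothesis with the largest parameter. A meta-trained Transformer is claimed to approximate this kind of update implicitly, over a far richer hypothesis class.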

2.2 Time Series Forecasting: Stationarization and De-stationary Attention

The “Non-stationary Transformer” framework for time-series forecasting comprises two tightly coupled modules: Series Stationarization and De-stationary Attention (Liu et al., 2022). Series Stationarization normalizes the input along the temporal dimension, unifying mean and variance, and later restores the original statistics in the forecast output. However, this procedure alone may yield “over-stationarization,” suppressing fundamentally informative non-stationary patterns.
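The normalize-then-restore step can be sketched in a few lines. This is a minimal illustration, assuming a window shaped (time, variables) and NumPy; the helper names `stationarize` and `destationarize` and the `eps` constant are assumptions for this sketch, not the reference implementation.

```python
import numpy as np

def stationarize(x, eps=1e-5):
    """Normalize each variable of a (time, variables) window along the time axis."""
    mu = x.mean(axis=0, keepdims=True)
    sigma = np.sqrt(x.var(axis=0, keepdims=True) + eps)
    return (x - mu) / sigma, mu, sigma

def destationarize(y_norm, mu, sigma):
    """Restore the stored per-variable statistics on a normalized output."""
    return y_norm * sigma + mu

# Synthetic drifting series (a random walk), used only as example data.
rng = np.random.default_rng(0)
window = 5.0 + 3.0 * rng.standard_normal((96, 2)).cumsum(axis=0)

x_norm, mu, sigma = stationarize(window)
# A forecasting model would act on x_norm here; the identity stands in for it.
restored = destationarize(x_norm, mu, sigma)
```

The statistics `mu` and `sigma` are computed per input window, which is what makes the downstream model see inputs with a unified mean and variance regardless of regime.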

To counter this, De-stationary Attention introduces trainable mechanisms that re-inject lost scale and shift information into each attention layer. For normalized query-key matrices $(Q', K')$, a learned scalar rescaling $\tau$ and shift vector $\Delta$ are computed by lightweight multi-layer perceptrons (MLPs) conditioned on the raw input’s statistics:

$\mathrm{DeAttn}(Q',K',V';\tau, \Delta) = \mathrm{Softmax}\left( \frac{\tau Q'K'^\top + \mathbf{1}\Delta^\top}{\sqrt{d_k}} \right)V'$

This enables the model to regain distinguishable, burst-sensitive attention dynamics across series and regimes, preserving crucial information for forecastability in highly non-stationary settings (Liu et al., 2022).
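The attention formula can be written out directly. The sketch below is a single-head NumPy illustration under simplifying assumptions: `tau` and `delta` are passed in as plain inputs rather than produced by MLPs, and the function names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def de_stationary_attention(q, k, v, tau, delta):
    """Softmax((tau * Q K^T + 1 delta^T) / sqrt(d_k)) V.

    q: (S_q, d_k), k: (S_k, d_k), v: (S_k, d_v)
    tau: positive scalar rescaling; delta: shift vector of shape (S_k,)
    """
    d_k = q.shape[-1]
    # delta[None, :] broadcasts the same shift row to every query, i.e. 1 delta^T.
    scores = (tau * (q @ k.T) + delta[None, :]) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ v

# Example inputs; shapes and values are arbitrary assumptions.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((6, 8))
v = rng.standard_normal((6, 3))

out = de_stationary_attention(q, k, v, tau=1.5, delta=rng.standard_normal(6))
# With tau = 1 and delta = 0 the formula reduces to standard scaled dot-product attention.
plain = de_stationary_attention(q, k, v, tau=1.0, delta=np.zeros(6))
```

Because `tau` multiplies the logits before the softmax, larger values sharpen the attention distribution, while `delta` biases attention toward particular key positions.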

2.3 Wavelet-based Decomposition

W-Transformers leverage time-frequency decomposition, applying the maximal overlap discrete wavelet transform (MODWT) prior to Transformer modeling (Sasal et al., 2022). Each observed univariate time series is decomposed into multiple “bandwise” components: high-frequency details ($D_j$) and a low-frequency smooth part ($S_J$). Separate Transformer encoder–decoder modules process each band, and final predictions are obtained by inverse MODWT aggregation:

$\widehat{Y}_{N+1} = \sum_{j=1}^J \widehat{D}_{j,N+1} + \widehat{S}_{J,N+1}$

This modular approach isolates local, transient, or rapidly drifting features in high-frequency bands, while long-term trends are handled by the smooth band, improving robustness and adaptability to non-stationarity (Sasal et al., 2022).
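The decompose, forecast per band, and sum pipeline can be sketched as follows. This is a simplified stand-in, assuming the Haar filter with circular boundary handling, for which the detail and smooth coefficients themselves add back to the series; other wavelets generally require the multiresolution components for an additive split. The `persistence` forecaster is a placeholder for the per-band Transformer, and all names are hypothetical.

```python
def haar_modwt_bands(x, levels):
    """Return ([D_1, ..., D_J], S_J) such that x[t] == sum_j D_j[t] + S_J[t]."""
    n = len(x)
    smooth = list(x)
    details = []
    for j in range(1, levels + 1):
        shift = 2 ** (j - 1)
        # Circular Haar scaling filter (1/2, 1/2), dilated by 2^(j-1).
        next_smooth = [0.5 * (smooth[t] + smooth[(t - shift) % n]) for t in range(n)]
        # The difference of consecutive smooths equals the output of the
        # Haar wavelet filter (1/2, -1/2) at the same dilation.
        details.append([s - ns for s, ns in zip(smooth, next_smooth)])
        smooth = next_smooth
    return details, smooth

def persistence(band):
    """Placeholder per-band model: repeat the last observed value."""
    return band[-1]

def forecast_next(x, levels, band_forecaster):
    """Forecast each band separately, then aggregate by summation."""
    details, smooth = haar_modwt_bands(x, levels)
    return sum(band_forecaster(d) for d in details) + band_forecaster(smooth)

# Example series with a trend and a periodic component (arbitrary choice).
series = [float(t % 7) + 0.1 * t for t in range(64)]
details, smooth = haar_modwt_bands(series, levels=3)
recon = [sum(d[t] for d in details) + smooth[t] for t in range(len(series))]
y_hat = forecast_next(series, 3, persistence)
```

Replacing `persistence` with a trained model per band gives the modular structure described above: each band gets its own parameters, and the final forecast is the sum of band forecasts.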

3. Handling Non-Stationarity in Spatial and Reinforcement Learning Settings

3.1 Spatial Non-stationarity in Vision

In geospatial and computer vision applications, spatial non-stationarity arises from domain-dependent processes (e.g., compaction, sedimentation), violating standard stationarity assumptions required by classical geostatistics (Liu et al., 2022). Vision Transformers (ViT, SwinT), by virtue of global self-attention, adaptively capture both global trends (Type II nonstationarity) and long-range correlations (Type I), outperforming traditional convolutional neural networks (CNNs) whose localized receptive fields are insufficient for such heterogeneity. Notably, no explicit architectural modification is employed; the self-attention pattern, aided by learnable positional embeddings and data-driven finetuning, autonomously learns to mitigate non-stationarity (Liu et al., 2022).

3.2 Non-Stationary Reinforcement Learning

Transformers have been analytically and empirically demonstrated to attain minimax-optimal dynamic regret in non-stationary reinforcement learning, where the optimal policy may shift due to environment drift or abrupt changes (Chen et al., 22 Aug 2025). A Transformer trained via in-context learning on trajectories labeled by sliding-window or reset-based expert algorithms matches optimal regret rates, such as $\widetilde{O}\left(\min\{\sqrt{JT},\,\Delta^{1/3}T^{2/3}+\sqrt{T}\}\right)$, where $J$ is the number of change-points, $T$ the horizon, and $\Delta$ the total drift (not the shift vector of Section 2.2), without requiring any test-time weight update. This is achieved by leveraging depth and attention heads to encode flexible time windows and an internal “forgetting” mechanism, demonstrating that Transformers can approximate, within their architecture, the sliding-window scheduler and restart machinery of expert adaptive algorithms (Chen et al., 22 Aug 2025).
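To make the rate and the role of forgetting concrete, the sketch below evaluates the bound with constants and logarithmic factors dropped, and contrasts a sliding-window reward estimate with a full-history average after an abrupt change. It is an illustration only; the function and class names, window size, and reward sequence are assumptions.

```python
import math
from collections import deque

def dynamic_regret_rate(T, J, Delta):
    """min{sqrt(J*T), Delta^(1/3) * T^(2/3) + sqrt(T)}, ignoring constants and logs."""
    switching = math.sqrt(J * T)
    drifting = Delta ** (1.0 / 3.0) * T ** (2.0 / 3.0) + math.sqrt(T)
    return min(switching, drifting)

class SlidingWindowMean:
    """Mean of the most recent `window` rewards; older rewards are forgotten."""

    def __init__(self, window):
        self.buf = deque(maxlen=window)

    def update(self, reward):
        self.buf.append(reward)
        return sum(self.buf) / len(self.buf)

# With few change-points, the sqrt(J*T) term is the smaller one here.
rate = dynamic_regret_rate(T=10_000, J=4, Delta=1.0)

# Hypothetical arm whose mean reward jumps from 0 to 1 halfway through.
rewards = [0.0] * 50 + [1.0] * 50
sw = SlidingWindowMean(window=10)
sw_estimate = [sw.update(r) for r in rewards][-1]
full_estimate = sum(rewards) / len(rewards)
```

After the change, the windowed estimate tracks the new mean while the full-history average remains stuck halfway, which is the behavior that sliding-window and restart-based experts rely on.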

4. Empirical Benchmarks and Comparative Performance

Multiple studies benchmark non-stationary Transformers across regimes and domains:

On piecewise stationary sequence tasks, Transformers approach Bayes-optimality: e.g., mean cumulative regret within 0.5–1 bits of analytic baselines such as PTW or linear model averaging, matching the oracle when the true prior is known and generalizing robustly under moderate distribution shifts (Genewein et al., 2023).
In time series forecasting, Non-stationary Transformers yield mean squared error (MSE) reductions of up to 49.43% over the vanilla Transformer, with similar margins for Informer and Reformer, setting a new state of the art on challenging multivariate datasets (Liu et al., 2022).
W-Transformers significantly improve short- and long-term performance (average RMSE reduction ≃15–20%) over a wide array of statistical and deep learning baselines, particularly on datasets exhibiting long-range dependence and pronounced non-stationarity (Sasal et al., 2022).
Vision Transformers, particularly SwinT with data augmentation and transfer learning, demonstrate up to 20% relative error reduction in variogram range prediction for spatially nonstationary test cases, outperforming CNN-based models especially in Type I and Type II nonstationary regimes (Liu et al., 2022).
In reinforcement learning, Transformer policies trained on diversified non-stationary regimes replicate or outperform adaptive expert baselines (MASTER+LinUCB/TS), achieving dynamic regret scaling commensurate with non-stationarity complexity (Chen et al., 22 Aug 2025).
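Two of the metrics quoted above are simple to compute. The sketch below shows cumulative log-loss regret in bits relative to a reference predictor, and a relative error reduction in percent; the function names and all numeric inputs are hypothetical and do not reproduce any reported result.

```python
import math

def cumulative_regret_bits(model_probs, reference_probs):
    """Cumulative log-loss regret in bits.

    Each list holds the probability the predictor assigned to the symbol
    actually observed at each step.
    """
    model_loss = -sum(math.log2(p) for p in model_probs)
    reference_loss = -sum(math.log2(p) for p in reference_probs)
    return model_loss - reference_loss

def relative_reduction_percent(baseline_error, model_error):
    """Relative reduction of an error metric (e.g., MSE) versus a baseline."""
    return 100.0 * (baseline_error - model_error) / baseline_error

# Hypothetical per-step probabilities and error values.
regret = cumulative_regret_bits([0.25, 0.25, 0.5], [0.5, 0.5, 0.5])
reduction = relative_reduction_percent(baseline_error=0.8, model_error=0.4)
```

A regret near zero means the model's sequential predictions are nearly as good as the reference, which is the sense in which a learned predictor is said to approach a Bayes-optimal baseline.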

5. Architectural Design Principles and Best Practices

Key guidelines emerge across studies:

Self-attention sufficiency: Global attention, as realized in contemporary Transformer architectures, naturally enables the implementation of history-dependent, distributionally adaptive inference without auxiliary gating or memory modules (Genewein et al., 2023).
Positional encoding criticality: Relative or learned positional encodings outperform absolute or sinusoidal encodings, particularly for changepoint detection and adaptation in arbitrary sequence positions (Genewein et al., 2023).
Wavelet/time-frequency decomposition: Preprocessing with MODWT or equivalent transforms enhances model robustness by stratifying non-stationarity into separate spectral bands, each handled by dedicated model parameters (Sasal et al., 2022).
Data augmentation and transfer learning: For spatial non-stationarity, aggressive augmentation and transfer from large-scale sources (ImageNet) are essential to prevent overfitting and ensure generalization when pure Transformer-based models are employed on small or domain-specific datasets (Liu et al., 2022).
Scale/shift re-injection: Beyond input normalization, explicit mechanisms to re-inject non-stationary statistics into attention computations (i.e., De-stationary Attention) are necessary to prevent over-smoothing and indistinguishable attention responses (Liu et al., 2022).
Capacity allocation: Depth, width, and attention head count should scale with the complexity and “degree” of non-stationarity; insufficient capacity leads to increased outcome variance and inferior adaptation to distributional shifts (Genewein et al., 2023, Chen et al., 22 Aug 2025).
Test-time adaptation: Mainstream approaches avoid explicit test-time adaptation (i.e., no parameter updates at inference), instead relying on in-context learning and memory-based encoding of changing statistics (Genewein et al., 2023, Liu et al., 2022, Chen et al., 22 Aug 2025).
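The positional-encoding guideline can be illustrated with a generic relative bias on causal attention logits. This is a sketch of one common form of relative encoding, not necessarily the scheme used in the cited studies; the function name and bias values are assumptions.

```python
import numpy as np

def relative_bias_scores(q, k, bias):
    """Causal attention logits with a bias indexed by the offset i - j.

    q, k: (S, d_k); bias: (S,), where bias[d] is added whenever query i
    attends to key j = i - d. Future positions (j > i) are masked out.
    """
    S, d_k = q.shape
    scores = (q @ k.T) / np.sqrt(d_k)
    offsets = np.arange(S)[:, None] - np.arange(S)[None, :]  # i - j
    causal = offsets >= 0
    # Clip so masked (negative-offset) entries still index validly.
    scores = scores + bias[np.clip(offsets, 0, S - 1)]
    return np.where(causal, scores, -np.inf)

# Zero queries and keys isolate the positional term; bias values are arbitrary.
q = np.zeros((6, 4))
k = np.zeros((6, 4))
bias = np.array([0.0, -0.1, -0.2, -0.3, -0.4, -0.5])
scores = relative_bias_scores(q, k, bias)
```

Since the bias depends only on the distance between query and key, the same pattern applies wherever in the sequence a changepoint occurs, which is one plausible reason relative schemes adapt better at arbitrary positions.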

6. Limitations, Failure Modes, and Open Problems

Despite robust performance, certain regimes present challenges:

Input normalization can itself suppress signal through over-stationarization, while on series that are already highly stationary the de-stationary factors add complexity for minor or even negative benefit (Liu et al., 2022).
Distributional shift outside the pretraining support (e.g., environmental drift rates substantially above training range) degrades performance, revealing the need for comprehensive coverage in simulated data regimes (Chen et al., 22 Aug 2025).
CNNs may retain a slight edge in domains dominated by very fine-scale structure, where self-attention’s globality is less advantageous (Liu et al., 2022).
Current theoretical and empirical results predominantly cover autoregressive, piecewise stationary, or linear-reward settings; extensions to general MDPs, higher-dimensional changepoint processes, and complex non-stationarities (e.g., transition kernel drift in RL) remain open (Chen et al., 22 Aug 2025).

7. Reproducibility and Implementation Resources

Codebases and reproducibility artifacts are available:

Non-stationary Transformer for forecasting: https://github.com/thuml/Nonstationary_Transformers (Liu et al., 2022)
W-Transformer for univariate time series: https://github.com/CapWidow/W-Transformer (Sasal et al., 2022)

Key implementation practices include input normalization, modular bandwise processing, relative positional encodings, data augmentation strategies, and transfer learning setups. For MODWT-based decompositions, modwt and imodwt routines are used (Sasal et al., 2022); note that these are not part of the core PyWavelets API, which provides the closely related stationary wavelet transform as pywt.swt and pywt.iswt, so they are typically supplied by a separate implementation built on PyWavelets filters. Training is typically based on standard Adam optimization and cross-validated learning rates, facilitated by frameworks such as PyTorch and Darts (Liu et al., 2022, Sasal et al., 2022).

References

(Genewein et al., 2023) – Memory-Based Meta-Learning on Non-Stationary Distributions
(Liu et al., 2022) – Non-stationary Transformers: Exploring the Stationarity in Time Series Forecasting
(Sasal et al., 2022) – W-Transformers: A Wavelet-based Transformer Framework for Univariate Time Series Forecasting
(Liu et al., 2022) – Mitigation of Spatial Nonstationarity with Vision Transformers
(Chen et al., 22 Aug 2025) – Optimal Dynamic Regret by Transformers for Non-Stationary Reinforcement Learning