De-Stationary Attention in Sequence Models
- De-stationary attention is a family of mechanisms that preserves and reintroduces non-stationary signal characteristics for robust modeling of dynamic phenomena.
- Architectural designs such as trend-seasonality decomposition, adaptive reintroduction, and frequency-domain cross attention provide practical strategies for handling non-stationary data.
- Empirical results on financial, health, traffic, and speech datasets report gains over standard baselines, including up to 67% MSE reduction in forecasting and 2–3 dB SDR improvement in speech enhancement, with negligible parameter overhead (<0.2%) for the forecasting models.
De-stationary attention refers to a family of attention mechanisms designed to directly model, preserve, or reintroduce non-stationarity—time-dependent drift, scale changes, or abrupt events—into neural attention maps, particularly within the context of sequence modeling, time series analysis, and dynamic signal processing. In contrast to conventional approaches that rely on stationarizing transformations to simplify data distributions (e.g., normalization, differencing), de-stationary attention either operates explicitly on non-stationary signals or injects measures of non-stationarity back into the computation of attention weights, enabling robust and adaptive modeling of phenomena with evolving statistics.
1. Core Motivation and Theoretical Underpinnings
The principal rationale for de-stationary attention arises from the inadequacy of traditional attention or self-attention mechanisms when confronted with non-stationary data. Simple stationarization strategies, such as instance normalization or z-scoring, equalize mean and variance across samples, improving predictability but at the cost of erasing level shifts, transient spikes, or regime changes—precisely the features that are often most diagnostically informative. Empirical analyses show that Transformers trained solely on stationarized inputs tend to generate attention patterns that become nearly indistinguishable across diverse input series, which reduces their capability to capture bursty or event-driven temporal dependencies (Liu et al., 2022).
De-stationary attention explicitly addresses the "over-stationarization" problem by designing mechanisms to recover or maintain non-stationarity within the attention computations. Theoretical analysis reveals that, under linear kernels, attention in time, Fourier (frequency), and even wavelet domains is equivalent. However, the softmax nonlinearity introduces significant differences in the behavior of attention when applied to non-stationary signals, motivating architecture designs that decompose non-stationary signals prior to attention or adaptively modulate attention maps using non-stationary statistics (Zhang et al., 2022).
2. Architectural Designs and Mechanistic Variants
Time Series: Trend-Seasonality Decomposition
"TDformer" exemplifies a de-stationary attention strategy in long-term time series forecasting by applying an additive decomposition of the input $x$ into trend $x_{\text{trend}}$, seasonality $x_{\text{seas}}$, and noise $\varepsilon$, i.e., $x = x_{\text{trend}} + x_{\text{seas}} + \varepsilon$. Multiple moving-average filters of varying scales extract the trend, with learned weights combining these to form $x_{\text{trend}}$. The residual $x - x_{\text{trend}}$ isolates the stationary, seasonal component. Forecasting is then split: an MLP module extrapolates the trend, and a Fourier-domain attention module models the seasonality. This division capitalizes on the strengths of each method: MLPs excel at non-stationary extrapolation, while Fourier attention is highly sample-efficient for capturing stationary seasonal dynamics (Zhang et al., 2022).
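Algorithmic pseudocode (abridged):
- Apply moving-average filters at several scales to the input and combine their outputs with learned weights to obtain the trend.
- Subtract the trend from the input to obtain the seasonal residual.
- Extrapolate the trend with an MLP.
- Forecast the seasonal residual with Fourier-domain attention.
- Sum the trend and seasonal forecasts to produce the final prediction.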
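The decomposition step described above can be sketched in NumPy. This is a minimal illustration, not the TDformer implementation: the names `moving_average` and `decompose`, the kernel sizes, and the fixed `logits` standing in for learned combination weights are all assumptions, and the MLP and Fourier-attention forecasters are omitted.

```python
import numpy as np

def moving_average(x, kernel_size):
    """Centered moving average with edge padding; output has the same length as x."""
    pad_left = (kernel_size - 1) // 2
    pad_right = kernel_size - 1 - pad_left
    padded = np.pad(x, (pad_left, pad_right), mode="edge")
    kernel = np.ones(kernel_size) / kernel_size
    return np.convolve(padded, kernel, mode="valid")

def decompose(x, kernel_sizes=(5, 13, 25), logits=None):
    """Split x into a trend (softmax-weighted mix of moving averages) and the residual."""
    if logits is None:
        # Hypothetical stand-in for learned weights: equal weighting of all scales.
        logits = np.zeros(len(kernel_sizes))
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    trends = np.stack([moving_average(x, k) for k in kernel_sizes])
    trend = weights @ trends
    return trend, x - trend

# Toy series: linear trend plus a period-20 seasonal component.
t = np.arange(200, dtype=float)
x = 0.05 * t + np.sin(2 * np.pi * t / 20)
trend, seasonal = decompose(x)
```

By construction the two parts add back to the input exactly, so the downstream trend and seasonal forecasters see complementary views of the same series.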
Adaptive Reintroduction of Non-Stationarity
In the "Non-stationary Transformers" framework, the de-stationary attention module operates after a series-stationarization step. Standard self-attention is replaced by a parametrization that rescales and shifts the normalized attention logits using learnable functions of the raw series’ mean $\mu_x$ and standard deviation $\sigma_x$:
$$\mathrm{Attn}(Q', K', V', \tau, \Delta) = \mathrm{Softmax}\!\left(\frac{\tau\, Q' K'^{\top} + \mathbf{1}\Delta^{\top}}{\sqrt{d_k}}\right) V'$$
Here, $Q'$ and $K'$ are projections of the normalized input, and $\tau$ and $\Delta$ are produced by small MLPs applied to $\sigma_x$ and $\mu_x$, respectively. This mechanism approximates the transformation required to map attention maps computed over normalized data back to those that would have been obtained from raw, non-stationary inputs, preserving sensitivity to level shifts and bursts (Liu et al., 2022).
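The rescale-and-shift of attention logits described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `de_stationary_attention` is hypothetical, the scale `tau` and shift `delta` are passed in directly instead of being produced by MLPs, and the projections of the normalized input are replaced by random matrices.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def de_stationary_attention(q, k, v, tau, delta):
    """Compute softmax((tau * q k^T + 1 delta^T) / sqrt(d_k)) v.

    q, k, v: arrays of shape (L, d) from the normalized series.
    tau:     positive scalar rescaling the logits (assumed given here).
    delta:   array of shape (L,), one shift per key position (assumed given here).
    """
    d_k = q.shape[-1]
    logits = (tau * (q @ k.T) + delta[None, :]) / np.sqrt(d_k)
    return softmax(logits, axis=-1) @ v

rng = np.random.default_rng(0)
L, d = 6, 4
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))

# tau = 1 and delta = 0 recover ordinary scaled dot-product attention.
plain = de_stationary_attention(q, k, v, tau=1.0, delta=np.zeros(L))
# A larger tau sharpens the map; a non-constant delta re-weights key positions.
rescaled = de_stationary_attention(q, k, v, tau=3.0, delta=rng.standard_normal(L))
```

Note that a constant shift leaves the output unchanged because softmax is shift-invariant along each row, so only the variation of `delta` across key positions matters.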
Stable/Unstable Decomposition and Cross-Domain Attention
"AEFIN" introduces a frequency-domain decomposition to separate stable (stationary, low-frequency) and unstable (non-stationary, high-frequency) components using the DFT and top-K frequency masking. Cross-attention is then used, with the unstable part (queries) attending to the stable part (keys/values). This cross-domain interaction allows information from temporally stable regions to condition predictions on temporally unstable regimes, breaking the assumption of fixed stationarity and allowing adaptive weighting of contexts (Xiong et al., 2025).
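A minimal NumPy sketch of this split-then-cross-attend pattern is given below. It makes several assumptions that are not from the source: "top-K" is taken to mean the K largest-amplitude DFT bins, tokens are formed by cutting each component into fixed-length patches, and no learned projections are used. The names `split_stable_unstable` and `cross_attention` are illustrative.

```python
import numpy as np

def split_stable_unstable(x, top_k):
    """Keep the top_k largest-amplitude rFFT bins as 'stable'; the remainder is 'unstable'."""
    spec = np.fft.rfft(x)
    keep = np.argsort(np.abs(spec))[-top_k:]
    mask = np.zeros(spec.shape, dtype=bool)
    mask[keep] = True
    stable = np.fft.irfft(np.where(mask, spec, 0), n=len(x))
    return stable, x - stable

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention; returns outputs and the attention weights."""
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ values, w

# Toy series: a daily-like cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(96, dtype=float)
x = np.sin(2 * np.pi * t / 24) + 0.3 * rng.standard_normal(96)

stable, unstable = split_stable_unstable(x, top_k=3)

# Patch each component into 12 tokens of length 8 (an assumed tokenization).
patch = 8
q_tokens = unstable.reshape(-1, patch)   # queries from the unstable part
kv_tokens = stable.reshape(-1, patch)    # keys and values from the stable part
out, weights = cross_attention(q_tokens, kv_tokens, kv_tokens)
```

Each row of `weights` shows which stable patches a given unstable patch draws on, which is the conditioning effect the paragraph describes.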
3. Empirical Results and Comparative Analyses
The efficacy of de-stationary attention is demonstrated through both synthetic and real-world data:
- In time series with fixed seasonality, Fourier-domain softmax attention rapidly concentrates on the dominant frequency, outperforming time-domain attention in convergence speed and sample efficiency. When the trend is linear (strictly non-stationary), only the MLP trend extrapolator achieves zero error—all attention variants effectively interpolate within the training window, supporting the modular decomposition (Zhang et al., 2022).
- For series exhibiting regime shifts or varying periodicity, methods with adaptive localization (e.g., cross-domain attention, wavelet attention) perform best, as they can attend to localized and dynamic changes.
- On financial, health, and traffic datasets, non-stationary Transformers with de-stationary attention attain up to 67% MSE reduction relative to vanilla Transformer models, with consistent improvements across multiple variants (Informer, Reformer, Autoformer) and negligible additional parameter overhead (<0.2%) (Liu et al., 2022).
- In speech enhancement with moving sources, attention-driven methods that adapt spatial covariance estimation based on the non-stationarity of the audio stream achieve 2–3 dB SDR (signal-to-distortion ratio) improvements over stationary filter baselines, and suffer less performance degradation under dynamic source scenarios (Wang et al., 2023).
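The first observation in the list above, that attention interpolates while a trend extrapolator can follow a linear trend, has a simple explanation that can be checked numerically. A softmax attention output is a convex combination of its values, so it cannot leave the range of values seen in the window. The sketch below is a simplified illustration and not the experiment from the paper: it uses random logits, the raw observations as values with no output projection, and a least-squares line as a stand-in for the MLP extrapolator.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Training window: a purely linear trend.
train_t = np.arange(50, dtype=float)
train_y = 0.5 * train_t
values = train_y[:, None]

# Arbitrary attention logits for 10 future query positions.
logits = 5.0 * rng.standard_normal((10, 50))
attn_out = (softmax(logits) @ values).ravel()

# Stand-in trend extrapolator: fit a line and evaluate it beyond the window.
slope, intercept = np.polyfit(train_t, train_y, deg=1)
future_t = np.arange(50, 60, dtype=float)
linear_out = slope * future_t + intercept
future_y = 0.5 * future_t
```

Whatever the logits are, `attn_out` stays at or below the largest training value, while the true future values keep growing; the fitted line tracks them.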
4. Mathematical Equivalences and Theoretical Insights
Under purely linear kernels (i.e., removing the softmax nonlinearity), time-domain, Fourier-domain, and wavelet-domain attention are mathematically equivalent. For Fourier attention with real-valued queries $q$, keys $k$, and values $v$, the chain of equalities is
$$\mathcal{F}^{-1}\!\left(\mathcal{F}(q)\,\overline{\mathcal{F}(k)}^{\top}\mathcal{F}(v)\right) = W^{H}(Wq)(\overline{Wk})^{\top}(Wv) = W^{H}W\,q\,k^{\top}\,W^{H}W\,v = q\,k^{\top}v$$
where $W$ is the discrete Fourier matrix (unitary, so $W^{H}W = I$), and $\mathcal{F}$, $\mathcal{F}^{-1}$ denote the Fourier transform and its inverse. The equivalence underlines that attention's power to model non-stationary or stationary dependencies is contingent upon the nonlinearity (softmax) and the domain in which it is applied. With softmax, the operation selectively amplifies dominant modes (e.g., principal frequencies), leading to polarization effects beneficial for periodic structure, and may filter out non-stationary noise (Zhang et al., 2022).
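The linear-kernel equivalence stated above can be verified numerically. The sketch below assumes real-valued queries, keys, and values and a unitary (orthonormally scaled) DFT matrix, here called `W`; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 16, 4
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))

# Unitary DFT matrix: the FFT of the identity with orthonormal scaling.
W = np.fft.fft(np.eye(L), norm="ortho")

# Linear attention in the time domain (no softmax).
time_out = q @ k.T @ v

# Same operation in the Fourier domain, then transformed back with W^H.
fourier_out = W.conj().T @ ((W @ q) @ (W @ k).conj().T @ (W @ v))

print(np.allclose(fourier_out, time_out))
```

The two results agree up to floating-point error because the inner factor `W^H W` collapses to the identity. Once a softmax is applied to the logits this cancellation no longer holds, which is why the choice of domain starts to matter.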
5. Loss Functions and Training Objectives
De-stationary attention frameworks employ composite losses to encourage both predictive accuracy and fidelity to signal structure:
- Time-domain losses: MSE for stable components, MAE for unstable components;
- Frequency-domain (spectral) losses: spectral MSE between the predicted and target magnitude spectra for components labeled as stable;
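- The total loss is a weighted sum (e.g., $\mathcal{L} = \lambda_{1}\mathcal{L}_{\text{MSE}} + \lambda_{2}\mathcal{L}_{\text{MAE}} + \lambda_{3}\mathcal{L}_{\text{spec}}$) with weights chosen empirically (Xiong et al., 2025).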
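The three terms listed above can be combined as in the following sketch. The function name `composite_loss`, the weight parameters `w_mse`, `w_mae`, `w_spec`, and their default values are assumptions made for illustration; the source states only that the weights are chosen empirically.

```python
import numpy as np

def composite_loss(pred_stable, true_stable, pred_unstable, true_unstable,
                   w_mse=1.0, w_mae=1.0, w_spec=1.0):
    """Weighted sum of time-domain MSE/MAE and a spectral MSE on magnitude spectra."""
    # Time-domain MSE on the stable component.
    mse = np.mean((pred_stable - true_stable) ** 2)
    # Time-domain MAE on the unstable component.
    mae = np.mean(np.abs(pred_unstable - true_unstable))
    # Spectral MSE between magnitude spectra of the stable component.
    pred_mag = np.abs(np.fft.rfft(pred_stable))
    true_mag = np.abs(np.fft.rfft(true_stable))
    spec = np.mean((pred_mag - true_mag) ** 2)
    return w_mse * mse + w_mae * mae + w_spec * spec

# Tiny worked case: MSE = 1, MAE = 2, spectral MSE = 16/3.
loss = composite_loss(np.ones(4), np.zeros(4), np.full(4, 2.0), np.zeros(4))
```

In the worked case the stable prediction is a constant offset, so all of its spectral error sits in the zero-frequency bin.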
Supervised tasks such as speech enhancement exploit utterance-level SNR as the optimization target, with attention modules trained end-to-end for adaptive weighting and spatial filtering (Wang et al., 2023).
6. Relations to Language Modeling and Broader Sequence Models
Investigations into the structural role of stationarity in language modeling (e.g., "Deconstructing Attention") show that strict sequence-dependence (non-stationarity) in attention is not essential in every layer. Models that alternate stationary attention (fixed attention maps) with standard, input-dependent attention in every other layer retain near-optimal performance, provided robust token mixing is present throughout. Uniformly stationary attention degrades performance, whereas hybrid designs lose only 4–8%, suggesting that de-stationary attention is most critical in select layers or in data regimes with pronounced non-stationarity (Xue et al., 2025).
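The alternating layout can be illustrated with a toy stack in which even layers use an input-dependent attention map and odd layers use a fixed one. This is only a sketch under strong assumptions: the fixed map is a uniform causal average rather than anything learned, there are no projections, feed-forward blocks, or normalization, and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_mask(L):
    return np.tril(np.ones((L, L), dtype=bool))

def standard_attention_map(x):
    """Input-dependent causal map (queries = keys = x, no learned projections)."""
    L, d = x.shape
    logits = x @ x.T / np.sqrt(d)
    logits = np.where(causal_mask(L), logits, -np.inf)
    return softmax(logits)

def stationary_attention_map(L):
    """Input-independent causal map: uniform average over current and past positions."""
    m = causal_mask(L).astype(float)
    return m / m.sum(axis=-1, keepdims=True)

def hybrid_stack(x, num_layers=4):
    """Alternate input-dependent and fixed attention maps, with residual mixing."""
    for layer in range(num_layers):
        if layer % 2 == 0:
            a = standard_attention_map(x)
        else:
            a = stationary_attention_map(len(x))
        x = x + a @ x
    return x

rng = np.random.default_rng(0)
x1 = rng.standard_normal((5, 8))
x2 = rng.standard_normal((5, 8))
y1 = hybrid_stack(x1)
```

The fixed map is identical for `x1` and `x2`, while the standard map changes with the input; both still mix tokens, which is the property the paragraph identifies as essential.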
7. Practical Considerations and Future Directions
Implementing de-stationary attention typically incurs minimal computational overhead (e.g., two small MLPs for scaling and shifting attention logits; additional moving-average or DFT operations). For Non-stationary Transformers, standard training practices suffice: no auxiliary losses or datasets are required, and hyperparameters such as hidden size, learning rate, and batch size follow canonical Transformer settings (Liu et al., 2022).
Extensions of de-stationary attention include handling multidimensional non-stationarity (multivariate time series, spatial-temporal signals), integrating hybrid attention along temporal, frequency, or spatial axes, and exploring adaptive mechanisms in multi-agent and multi-modal contexts. Potential limitations include increased architectural complexity and sensitivity to the decomposition and parametrization choices. Further research into self-supervised pretraining for de-stationary attention and its integration with classical statistical filtering (e.g., Kalman, particle filters) remains open (Wang et al., 2023).
References
- "First De-Trend then Attend: Rethinking Attention for Time-Series Forecasting" (Zhang et al., 2022)
- "Non-Stationary Time Series Forecasting Based on Fourier Analysis and Cross Attention Mechanism" (Xiong et al., 2025)
- "Non-stationary Transformers: Exploring the Stationarity in Time Series Forecasting" (Liu et al., 2022)
- "Attention-Driven Multichannel Speech Enhancement in Moving Sound Source Scenarios" (Wang et al., 2023)
- "Deconstructing Attention: Investigating Design Principles for Effective Language Modeling" (Xue et al., 2025)