
Autoformer Model for Long-Horizon Forecasting

Updated 5 August 2025
  • Autoformer is a neural architecture that embeds trend-seasonal decomposition within a Transformer encoder-decoder, enhancing forecast accuracy and interpretability.
  • It replaces conventional self-attention with an FFT-accelerated auto-correlation mechanism, efficiently capturing periodic patterns with O(L log L) complexity.
  • Empirical studies show that Autoformer significantly outperforms traditional models in multivariate time series tasks across domains like energy, traffic, and epidemiology.

Autoformer refers to a family of neural architectures with two separate threads of development: (1) a decomposition-based Transformer framework for long-term time series forecasting ("Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting" (Wu et al., 2021)) and its derivatives, which have had a prominent impact in forecasting and biomedical domains, and (2) an automated architecture search methodology for Vision Transformers, termed AutoFormer (Chen et al., 2021). This article focuses on the time series forecasting Autoformer and its direct extensions.

Autoformer fundamentally advances time series forecasting by embedding a trend–seasonal decomposition mechanism inside a Transformer-style encoder–decoder, and by replacing conventional point-wise self-attention with a computationally efficient auto-correlation mechanism. The result is a model that achieves state-of-the-art long-horizon forecasting accuracy, high computational efficiency, and clear interpretability in multivariate real-world tasks.

1. Model Architecture and Core Principles

Autoformer implements a progressive series decomposition architecture in which each encoder and decoder layer contains a decomposition block. Specifically, given an input sequence $X$, at every layer the signal is split into a trend component $X_t$ and a seasonal component $X_s$ via a moving average operation: $X_t = \operatorname{AvgPool}(\operatorname{Padding}(X)), \quad X_s = X - X_t$. Within the encoder, the long-term trend $X_t$ is progressively filtered out, allowing the encoder to focus on extracting and modeling the fluctuating seasonal dynamics $X_s$. In the decoder, after each processing step (auto-correlation or feed-forward network), another decomposition is applied, resulting in progressive trend accumulation and refined seasonality modeling across the decoding pathway.
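As a concrete illustration, a minimal PyTorch sketch of such a decomposition block is shown below, assuming replication padding at both ends and an odd moving-average kernel (the class name and default kernel size are illustrative, not taken verbatim from the official code):

```python
import torch
import torch.nn as nn

class SeriesDecomp(nn.Module):
    """Split a series into seasonal and trend parts via a moving average."""

    def __init__(self, kernel_size: int = 25):  # kernel_size assumed odd
        super().__init__()
        self.kernel_size = kernel_size
        # Average over the time axis; stride 1 keeps the sequence length
        # once the input has been padded.
        self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=1, padding=0)

    def forward(self, x: torch.Tensor):
        # x: (batch, length, channels)
        pad = (self.kernel_size - 1) // 2
        # Replicate boundary values so the moving average is length-preserving.
        front = x[:, :1, :].repeat(1, pad, 1)
        end = x[:, -1:, :].repeat(1, pad, 1)
        padded = torch.cat([front, x, end], dim=1)
        trend = self.avg(padded.permute(0, 2, 1)).permute(0, 2, 1)  # X_t
        seasonal = x - trend                                        # X_s = X - X_t
        return seasonal, trend
```

Each encoder and decoder layer applies this block repeatedly, which is what makes the decomposition progressive rather than a one-off pre-processing step.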

The architectural design differs from vanilla Transformers in three key aspects:

  • Series decomposition as an inner operator (not merely pre-processing);
  • A dual-pathway decoder that separately accumulates and refines trend and seasonal components;
  • Replacement of standard self-attention with an auto-correlation mechanism optimized for periodic structures.

This architecture is instantiated as an encoder–decoder stack, where the encoder inputs are past observations and the decoder outputs constitute the forecasted horizon.

2. Auto-Correlation Mechanism

The Autoformer auto-correlation module replaces quadratic-complexity dot-product self-attention with an FFT-accelerated operator that discovers and aggregates dependencies at the sub-series level, exploiting periodicity. Formally, the autocorrelation of a discrete-time process $\{X_t\}$ at lag $\tau$ is

$$\mathcal{R}_{xx}(\tau) = \lim_{L \to \infty} \frac{1}{L} \sum_{t=1}^{L} X_t X_{t-\tau}$$

In practice, for each attention head, the mechanism computes the $k = \lfloor c\,\log L\rfloor$ delays $\{\tau_i\}$ with the highest autocorrelation, then uses softmax-normalized autocorrelation scores to aggregate the value tensor $V$ via a "roll" operation: $\text{AutoCorrelation}(Q, K, V) = \sum_{i=1}^k \operatorname{Roll}(V, \tau_i) \cdot \hat{\mathcal{R}}_{Q,K}(\tau_i)$. Here, the FFT is used to compute the autocorrelation efficiently, yielding overall attention complexity $O(L \log L)$ and robust performance even for long sequences. This mechanism directly models temporal dependencies at the sub-series level rather than the pointwise level, leveraging the periodicity present in domains like energy, traffic, and weather.
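A simplified, single-head sketch of this computation is shown below; the FFT-based score estimation and the top-$k$ delay aggregation follow the description above, while the function names, the averaging over channels, and the roll sign convention are illustrative assumptions rather than the official implementation:

```python
import math
import torch

def autocorrelation_scores(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # q, k: (batch, length, channels). Scores for every candidate lag tau,
    # obtained in O(L log L) via FFT-based correlation.
    q_fft = torch.fft.rfft(q, dim=1)
    k_fft = torch.fft.rfft(k, dim=1)
    return torch.fft.irfft(q_fft * torch.conj(k_fft), n=q.size(1), dim=1)

def auto_correlation(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, c: int = 2) -> torch.Tensor:
    # Aggregate rolled copies of V at the k = floor(c * log L) most correlated delays.
    batch, length, _ = v.shape
    scores = autocorrelation_scores(q, k).mean(dim=-1)       # (batch, length)
    top_k = max(1, int(c * math.log(length)))
    weights, delays = torch.topk(scores, top_k, dim=-1)      # (batch, top_k)
    weights = torch.softmax(weights, dim=-1)
    rolled = torch.stack(
        [torch.stack([torch.roll(v[b], -int(delays[b, i]), dims=0)
                      for i in range(top_k)], dim=0)
         for b in range(batch)], dim=0)                      # (batch, top_k, length, channels)
    return (weights[:, :, None, None] * rolled).sum(dim=1)   # (batch, length, channels)
```

Because only $O(\log L)$ delays are aggregated and the scores come from the FFT, the whole operator stays sub-quadratic in the sequence length.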

3. Series Decomposition in Deep Forecasting

At the heart of Autoformer is the embedding of series decomposition as a basic and repeated operation throughout the network, not relegated to data pre-processing. The decomposition block,

$$\text{SeriesDecomp}(X) \to (X_s, X_t)$$

is uniformly applied in both encoder and decoder. This design progressively filters and accumulates trends and refines seasonality hierarchically across the network, enabling more stable and interpretable separation of short- and long-scale signal components. The scheme is tightly interleaved with the auto-correlation and feed-forward modules, ensuring that the model always operates on "detrended" (high-frequency) signals, with trend components incrementally built up throughout the decoding process.
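The data flow can be made concrete with an illustrative decoder-layer skeleton (reusing the SeriesDecomp sketch from Section 1); the attention sub-blocks are stand-ins, and only the decompose-and-accumulate wiring reflects the design described here:

```python
import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    """Skeleton of one decoder layer: each sub-block is followed by a
    decomposition, and the trend parts are summed into a running trend."""

    def __init__(self, d_model: int, kernel_size: int = 25):
        super().__init__()
        self.decomp1 = SeriesDecomp(kernel_size)
        self.decomp2 = SeriesDecomp(kernel_size)
        self.decomp3 = SeriesDecomp(kernel_size)
        self.self_block = nn.Identity()    # stand-in for self auto-correlation
        self.cross_block = nn.Identity()   # stand-in for cross auto-correlation
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, x_seasonal: torch.Tensor, enc_out: torch.Tensor):
        s1, t1 = self.decomp1(x_seasonal + self.self_block(x_seasonal))
        # The real model attends to enc_out here; the stand-in ignores it.
        s2, t2 = self.decomp2(s1 + self.cross_block(s1))
        s3, t3 = self.decomp3(s2 + self.ffn(s2))
        return s3, t1 + t2 + t3  # refined seasonal part, accumulated trend
```

Stacking such layers yields the progressive behavior described above: the seasonal path is repeatedly refined while the trend path only grows by addition.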

4. Empirical Performance and Computational Efficiency

Autoformer has been benchmarked across six real-world multivariate datasets (ETT, Electricity, Exchange, Traffic, Weather, Disease/ILI), where it consistently outperforms prior attention-based models (Informer, LogTrans, Reformer) in both accuracy and efficiency:

  • In the ETT "input-96, predict-336" setting, Autoformer achieves an MSE of 0.339 versus roughly 1.33 for prior models, a relative improvement of about 74% (since $(1.33 - 0.339)/1.33 \approx 0.74$).
  • Across datasets and prediction lengths, average reductions of 38% in MSE are reported.
  • Memory and runtime benchmarks show Autoformer achieves these gains with substantially lower resource usage, attributable to the FFT-based auto-correlation operator.

The following table summarizes some of these architectural and performance differences:

Model | Attention Type | Decomposition | Time Complexity | Relative MSE (ETT)
Transformer | Pointwise self-attention | No | $O(L^2)$ | ~1.33
Informer | ProbSparse attention | No | $O(L \log L)$ | ~1.33
Autoformer | Auto-correlation | Yes (in-network) | $O(L \log L)$ | 0.339

5. Downstream Applications

Autoformer has demonstrated utility across a breadth of temporal prediction tasks:

  • Energy: High-frequency transformer load and temperature forecasting (ETT).
  • Traffic: Occupancy rate prediction using sensor networks, with both short- and long-term horizons.
  • Economics: Forecasting exchange rates across major currencies despite aperiodic and heteroscedastic structures.
  • Meteorology: Predicting weather signals with varying seasonal and trend structure.
  • Epidemiology: Forecasting influenza-like illness (ILI) cases and COVID-19 dynamics at multi-step horizons.

These applications reflect the model's ability to exploit periodic patterns and multi-scale dependencies, areas where earlier Transformer architectures were limited.

6. Methodological and Theoretical Context

The Autoformer contrasts with several contemporary developments:

  • Compared to "Informer" and similar sparsity-biased models, Autoformer favors sub-series aggregation via auto-correlation to leverage repetitive structures, rather than relying on selective sparsity in pointwise attention.
  • The model's in-network series decomposition is notably more robust and progressive than static pre-processing or external trend removal, aligning learned representations with physically meaningful components.
  • Critically, while Autoformer advances performance, subsequent work ("Are Transformers Effective for Time Series Forecasting?" (Zeng et al., 2022)) shows that, in certain direct multi-step settings and on some datasets, even simple one-layer linear models (LTSF-Linear) can outperform deep Transformers; this suggests that much of the benefit stems from trend/seasonal separation and underscores the importance of inductive biases. A minimal sketch of such a linear baseline is given below.
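For reference, here is a minimal sketch of a linear baseline in the spirit of LTSF-Linear; the class name and the shared-across-channels projection are illustrative assumptions, and the published variants (e.g., DLinear) differ in details such as an explicit decomposition step:

```python
import torch
import torch.nn as nn

class LinearBaselineSketch(nn.Module):
    """One-layer linear forecaster: each channel's next pred_len values are a
    single linear map of its last seq_len values (weights shared across channels)."""

    def __init__(self, seq_len: int, pred_len: int):
        super().__init__()
        self.proj = nn.Linear(seq_len, pred_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, channels) -> (batch, pred_len, channels)
        return self.proj(x.permute(0, 2, 1)).permute(0, 2, 1)
```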

7. Software, Implementations, and Extensions

The official PyTorch implementation is provided at https://github.com/thuml/Autoformer, with configurations for training (Adam optimizer, learning rate $10^{-4}$, early stopping) and standard preprocessing (concatenation of the past series with masked placeholders for the forecast horizon). The codebase supports reproducibility and adaptation to varied forecasting tasks.
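The decoder-input construction mentioned above can be sketched as follows; the zero-valued placeholders and the tensor names are illustrative assumptions rather than the exact preprocessing of the official code:

```python
import torch

def build_decoder_input(x_enc: torch.Tensor, label_len: int, pred_len: int) -> torch.Tensor:
    # x_enc: (batch, seq_len, channels) of past observations.
    # Decoder input = last `label_len` observed steps + placeholders for the
    # `pred_len` future steps to be forecast.
    batch, _, channels = x_enc.shape
    placeholders = torch.zeros(batch, pred_len, channels, dtype=x_enc.dtype)
    return torch.cat([x_enc[:, -label_len:, :], placeholders], dim=1)

# e.g. 96 observed steps, a 48-step label segment, and a 336-step horizon
dec_in = build_decoder_input(torch.randn(8, 96, 7), label_len=48, pred_len=336)
print(dec_in.shape)  # torch.Size([8, 384, 7])
```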

Extensions and variants of Autoformer have addressed model interpretability (via concept bottlenecks (Sprang et al., 2024)), enhanced channel attention with frequency-domain representations (FECAM (Jiang et al., 2022)), handling of non-stationary data via normalization and de-stationary attention (Liu et al., 2022), and iterative multi-scale refinement (Scaleformer (Shabani et al., 2022)). Autoformer has also influenced applications in spatio-temporal contexts (graph forecasting), behavioral biometrics (gait recognition (Delgado-Santos et al., 2022)), wind speed forecasting with graph neural updates (Bentsen et al., 2022), and biomedical signal decomposition (e.g., EDA in mental health (Tsirmpas et al., 2025)), among others.

8. Summary and Outlook

Autoformer integrates series decomposition and auto-correlation in a unified Transformer-based architecture optimized for long-horizon time series forecasting. Its explicit modeling of trend and seasonality, sub-series dependency aggregation, and computational scalability contribute to state-of-the-art empirical results in diverse domains. The architecture’s design has catalyzed further advances in interpretability, hybrid attention schemes, and real-world applications. At the same time, critical studies caution that simple linear baselines may remain competitive or even superior in specific regimes; thus, the future direction of time series modeling may lie in combining strong inductive biases, automated search and adaptation, and problem-specific hybrid modeling strategies.