The paper "Are Transformers Effective for Time Series Forecasting?" (Zeng et al., 2022 ) challenges the recent trend of applying Transformer-based models to the task of long-term time series forecasting (LTSF). The authors question whether the core mechanism of Transformers, the permutation-invariant self-attention, which is highly effective for semantic correlations in domains like NLP and Vision, is suitable for time series where temporal order is paramount and semantic meaning is often absent in the raw numerical data.
To investigate this, the paper proposes a set of "embarrassingly simple" one-layer linear models, collectively named LTSF-Linear, as a baseline for comparison. Unlike many traditional time series methods or earlier deep learning approaches that use iterated multi-step (IMS) forecasting (which can suffer from error accumulation), LTSF-Linear employs a direct multi-step (DMS) strategy, predicting the entire forecast horizon in one shot. The set includes the following variants (a minimal code sketch follows the list):
- Vanilla Linear: A single linear layer that maps the look-back window directly to the forecast horizon, applied to each variate independently.
- DLinear: Incorporates a decomposition scheme (similar to Autoformer and FEDformer) to separate trend and seasonal components using a moving average kernel, applying separate linear layers to each component before summing the results. This is designed to handle data with clear trends.
- NLinear: Applies sequence normalization by subtracting the last value of the look-back window before the linear layer and adding it back to the prediction. This is intended to mitigate issues arising from distribution shifts between training and testing data.
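To make the three variants concrete, here is a minimal PyTorch sketch of the idea. It is not the authors' code: the tensor layout, the moving-average kernel size, and the choice to share weights across variates are simplifying assumptions. Each model maps a look-back window of length `seq_len` to a forecast of length `pred_len` in a single forward pass, which is the DMS strategy described above.

```python
import torch
import torch.nn as nn


class Linear(nn.Module):
    """A single linear layer mapping the look-back window to the forecast horizon.
    The same weights are applied to every variate (channel) independently."""
    def __init__(self, seq_len, pred_len):
        super().__init__()
        self.proj = nn.Linear(seq_len, pred_len)

    def forward(self, x):  # x: (batch, seq_len, n_variates)
        return self.proj(x.transpose(1, 2)).transpose(1, 2)  # (batch, pred_len, n_variates)


class DLinear(nn.Module):
    """Split the input into trend (moving average) and seasonal (residual) parts,
    forecast each with its own linear layer, and sum the two forecasts."""
    def __init__(self, seq_len, pred_len, kernel_size=25):  # kernel size is an assumed hyperparameter
        super().__init__()
        # Simplified decomposition: zero-padded average pooling rather than the
        # edge-replicating moving average used in decomposition-based Transformers.
        self.avg = nn.AvgPool1d(kernel_size, stride=1, padding=kernel_size // 2,
                                count_include_pad=False)
        self.trend = nn.Linear(seq_len, pred_len)
        self.seasonal = nn.Linear(seq_len, pred_len)

    def forward(self, x):  # x: (batch, seq_len, n_variates)
        x = x.transpose(1, 2)        # (batch, n_variates, seq_len)
        trend = self.avg(x)          # smooth trend estimate
        seasonal = x - trend         # remainder treated as the seasonal part
        out = self.trend(trend) + self.seasonal(seasonal)
        return out.transpose(1, 2)   # (batch, pred_len, n_variates)


class NLinear(nn.Module):
    """Subtract the last value of the look-back window, apply a linear layer,
    then add the value back, normalizing away level shifts between train and test."""
    def __init__(self, seq_len, pred_len):
        super().__init__()
        self.proj = nn.Linear(seq_len, pred_len)

    def forward(self, x):  # x: (batch, seq_len, n_variates)
        last = x[:, -1:, :]          # last observed value of each variate
        out = self.proj((x - last).transpose(1, 2)).transpose(1, 2)
        return out + last            # broadcast the offset back onto the forecast
```

In this form the whole family amounts to a handful of weight matrices, which is what makes the comparison against multi-layer Transformer stacks so striking.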
The authors conduct extensive experiments comparing LTSF-Linear variants with state-of-the-art Transformer-based LTSF models (FEDformer, Autoformer, Informer, Pyraformer, LogTrans) on nine real-world datasets covering various domains (traffic, energy, economics, weather, disease). The evaluation uses Mean Squared Error (MSE) and Mean Absolute Error (MAE).
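For reference, both metrics average the pointwise error over the forecast horizon and the variates; a minimal sketch (function and variable names are mine, not the paper's):

```python
import numpy as np

def mse(pred, true):
    """Mean squared error over all forecast points and variates."""
    return np.mean((pred - true) ** 2)

def mae(pred, true):
    """Mean absolute error over all forecast points and variates."""
    return np.mean(np.abs(pred - true))
```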
The key findings are surprising and challenge the prevailing narrative about Transformers for LTSF:
- Superior Performance of Linear Models: LTSF-Linear models consistently and significantly outperform existing complex Transformer-based LTSF solutions on most multivariate and univariate forecasting tasks on the tested benchmarks, often achieving 20% to 50% lower errors. This suggests that for these datasets, a simple linear model with a DMS strategy is more effective than sophisticated Transformer architectures.
- Limited Benefit from Longer Inputs: Contrary to expectations that powerful models should leverage more historical data, the performance of existing Transformer models often deteriorates or plateaus as the look-back window size increases. LTSF-Linear, on the other hand, shows improved performance with larger input lengths, indicating a better ability to extract relevant temporal information from longer sequences.
- Poor Temporal Order Preservation: Experiments where the input sequences are shuffled demonstrate that Transformer models are surprisingly insensitive to the order of input data. Their performance does not drop significantly when the sequence is randomly shuffled or its halves are swapped, unlike LTSF-Linear, whose performance degrades substantially. This supports the authors' claim that the permutation-invariant nature of self-attention limits temporal modeling capability, even with positional embeddings.
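The two perturbations used in this experiment are easy to state in code; the sketch below is an assumed reconstruction (names and tensor layout are mine), not the paper's implementation:

```python
import torch

def random_shuffle(x: torch.Tensor) -> torch.Tensor:
    """Randomly permute the time steps of the look-back window.
    x: (batch, seq_len, n_variates)"""
    perm = torch.randperm(x.size(1))
    return x[:, perm, :]

def half_exchange(x: torch.Tensor) -> torch.Tensor:
    """Swap the first and second halves of the look-back window, breaking the
    global order while preserving local order within each half."""
    half = x.size(1) // 2
    return torch.cat([x[:, half:, :], x[:, :half, :]], dim=1)
```

A model that truly exploits temporal order should suffer under either perturbation at test time, which is what the paper observes for LTSF-Linear but, by and large, not for the Transformer baselines.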
- Questionable Necessity of Complex Designs: Ablation studies show that complex components in Transformers, such as specialized self-attention mechanisms or intricate embedding layers, are not always beneficial and sometimes removing them or replacing them with simpler linear layers can improve performance on these tasks.
- Dataset Size and Efficiency: The authors also examine two common defenses of Transformer models: that the benchmark datasets are too small, and that efficiency-oriented variants pay off in practice. Training on a shortened dataset shows that reducing the training data does not necessarily harm performance for some Transformers, suggesting dataset scale is not the primary limitation. Practical efficiency comparisons show that although several Transformer variants lower the theoretical complexity, their actual inference time and parameter counts are often similar to or worse than those of a vanilla Transformer modified with a DMS decoder, calling into question the practical impact of these efficiency innovations on these benchmarks.
In conclusion, the paper argues that the effectiveness of Transformer-based solutions for LTSF is exaggerated on existing benchmarks. The simplicity and strong performance of LTSF-Linear highlight that capturing basic temporal properties like trends and periodicities might be sufficient for good performance on these datasets, and that the complex relational modeling of self-attention is not well-suited or necessary. The authors position LTSF-Linear as a simple, interpretable, and competitive baseline and call for future research to develop new models, data processing techniques, and benchmarks that genuinely tackle the complexities of LTSF beyond what simple linear models can achieve.