The paper "Are Transformers Effective for Time Series Forecasting?" (Zeng et al., 2022 ) challenges the recent trend of applying Transformer-based models to the task of long-term time series forecasting (LTSF). The authors question whether the core mechanism of Transformers, the permutation-invariant self-attention, which is highly effective for semantic correlations in domains like NLP and Vision, is suitable for time series where temporal order is paramount and semantic meaning is often absent in the raw numerical data.
To investigate this, the paper proposes a set of "embarrassingly simple" one-layer linear models, collectively named LTSF-Linear, as a baseline for comparison. Unlike many traditional time series methods or earlier deep learning approaches that use iterated multi-step (IMS) forecasting (which can suffer from error accumulation), LTSF-Linear employs a direct multi-step (DMS) strategy, predicting the entire forecast horizon in one shot. The set includes the following variants (a minimal code sketch follows the list):
- Vanilla Linear: A single linear layer that maps the look-back window directly to the forecast horizon, applied to each variate independently.
- DLinear: Incorporates a decomposition scheme (similar to Autoformer and FEDformer) to separate trend and seasonal components using a moving average kernel, applying separate linear layers to each component before summing the results. This is designed to handle data with clear trends.
- NLinear: Applies sequence normalization by subtracting the last value of the look-back window before the linear layer and adding it back to the prediction. This is intended to mitigate issues arising from distribution shifts between training and testing data.
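To make the three variants concrete, here is a minimal PyTorch sketch of the idea. It is not the authors' code: the tensor layout, the moving-average kernel size, and the choice to share weights across variates are simplifying assumptions. Each model maps a look-back window of length `seq_len` to a forecast of length `pred_len` in a single forward pass, which is the DMS strategy described above.

```python
import torch
import torch.nn as nn


class Linear(nn.Module):
    """A single linear layer mapping the look-back window to the forecast horizon.
    The same weights are applied to every variate (channel) independently."""
    def __init__(self, seq_len, pred_len):
        super().__init__()
        self.proj = nn.Linear(seq_len, pred_len)

    def forward(self, x):  # x: (batch, seq_len, n_variates)
        return self.proj(x.transpose(1, 2)).transpose(1, 2)  # (batch, pred_len, n_variates)


class DLinear(nn.Module):
    """Split the input into trend (moving average) and seasonal (residual) parts,
    forecast each with its own linear layer, and sum the two forecasts."""
    def __init__(self, seq_len, pred_len, kernel_size=25):  # kernel size is an assumed hyperparameter
        super().__init__()
        # Simplified decomposition: zero-padded average pooling rather than the
        # edge-replicating moving average used in decomposition-based Transformers.
        self.avg = nn.AvgPool1d(kernel_size, stride=1, padding=kernel_size // 2,
                                count_include_pad=False)
        self.trend = nn.Linear(seq_len, pred_len)
        self.seasonal = nn.Linear(seq_len, pred_len)

    def forward(self, x):  # x: (batch, seq_len, n_variates)
        x = x.transpose(1, 2)        # (batch, n_variates, seq_len)
        trend = self.avg(x)          # smooth trend estimate
        seasonal = x - trend         # remainder treated as the seasonal part
        out = self.trend(trend) + self.seasonal(seasonal)
        return out.transpose(1, 2)   # (batch, pred_len, n_variates)


class NLinear(nn.Module):
    """Subtract the last value of the look-back window, apply a linear layer,
    then add the value back, normalizing away level shifts between train and test."""
    def __init__(self, seq_len, pred_len):
        super().__init__()
        self.proj = nn.Linear(seq_len, pred_len)

    def forward(self, x):  # x: (batch, seq_len, n_variates)
        last = x[:, -1:, :]          # last observed value of each variate
        out = self.proj((x - last).transpose(1, 2)).transpose(1, 2)
        return out + last            # broadcast the offset back onto the forecast
```

In this form the whole family amounts to a handful of weight matrices, which is what makes the comparison against multi-layer Transformer stacks so striking.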
The authors conduct extensive experiments comparing LTSF-Linear variants with state-of-the-art Transformer-based LTSF models (FEDformer, Autoformer, Informer, Pyraformer, LogTrans) on nine real-world datasets covering various domains (traffic, energy, economics, weather, disease). The evaluation uses Mean Squared Error (MSE) and Mean Absolute Error (MAE).
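For reference, both metrics average the pointwise error over the forecast horizon and the variates; a minimal sketch (function and variable names are mine, not the paper's):

```python
import numpy as np

def mse(pred, true):
    """Mean squared error over all forecast points and variates."""
    return np.mean((pred - true) ** 2)

def mae(pred, true):
    """Mean absolute error over all forecast points and variates."""
    return np.mean(np.abs(pred - true))
```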
The key findings are surprising and challenge the prevailing narrative about Transformers for LTSF:
- Superior Performance of Linear Models: LTSF-Linear models consistently and significantly outperform existing complex Transformer-based LTSF solutions on most multivariate and univariate forecasting tasks on the tested benchmarks, often achieving 20% to 50% lower errors. This suggests that for these datasets, a simple linear model with a DMS strategy is more effective than sophisticated Transformer architectures.
- Limited Benefit from Longer Inputs: Contrary to expectations that powerful models should leverage more historical data, the performance of existing Transformer models often deteriorates or plateaus as the look-back window size increases. LTSF-Linear, on the other hand, shows improved performance with larger input lengths, indicating a better ability to extract relevant temporal information from longer sequences.
- Poor Temporal Order Preservation: Experiments where the input sequences are shuffled demonstrate that Transformer models are surprisingly insensitive to the order of input data. Their performance does not drop significantly when the sequence is randomly shuffled or its halves are swapped, unlike LTSF-Linear, whose performance degrades substantially. This supports the authors' claim that the permutation-invariant nature of self-attention limits temporal modeling capability, even with positional embeddings.
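The two perturbations used in this experiment are easy to state in code; the sketch below is an assumed reconstruction (names and tensor layout are mine), not the paper's implementation:

```python
import torch

def random_shuffle(x: torch.Tensor) -> torch.Tensor:
    """Randomly permute the time steps of the look-back window.
    x: (batch, seq_len, n_variates)"""
    perm = torch.randperm(x.size(1))
    return x[:, perm, :]

def half_exchange(x: torch.Tensor) -> torch.Tensor:
    """Swap the first and second halves of the look-back window, breaking the
    global order while preserving local order within each half."""
    half = x.size(1) // 2
    return torch.cat([x[:, half:, :], x[:, :half, :]], dim=1)
```

A model that truly exploits temporal order should suffer under either perturbation at test time, which is what the paper observes for LTSF-Linear but, by and large, not for the Transformer baselines.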
- Questionable Necessity of Complex Designs: Ablation studies show that complex components in Transformers, such as specialized self-attention mechanisms or intricate embedding layers, are not always beneficial and sometimes removing them or replacing them with simpler linear layers can improve performance on these tasks.
- Dataset Size and Efficiency: The authors also examine two common defenses of Transformer models: that the benchmark datasets are too small, and that efficiency-oriented variants pay off in practice. Training on a shortened dataset shows that reducing the training data does not necessarily harm performance for some Transformers, suggesting dataset scale is not the primary limitation. Practical efficiency comparisons show that although several Transformer variants lower the theoretical complexity, their actual inference time and parameter counts are often similar to or worse than those of a vanilla Transformer modified with a DMS decoder, calling into question the practical impact of these efficiency innovations on these benchmarks.
In conclusion, the paper argues that the effectiveness of Transformer-based solutions for LTSF is exaggerated on existing benchmarks. The simplicity and strong performance of LTSF-Linear highlight that capturing basic temporal properties like trends and periodicities might be sufficient for good performance on these datasets, and that the complex relational modeling of self-attention is not well-suited or necessary. The authors position LTSF-Linear as a simple, interpretable, and competitive baseline and call for future research to develop new models, data processing techniques, and benchmarks that genuinely tackle the complexities of LTSF beyond what simple linear models can achieve.