Long-term Forecasting with Transformers: PatchTST
The paper "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers" introduces PatchTST, a novel Transformer-based model designed for multivariate time series forecasting. This paper provides a comprehensive approach that focuses on enhancing the Transformer model through the integration of segmentation and channel-independence mechanisms. The empirical results demonstrate significant improvements over state-of-the-art (SOTA) models, both in forecasting accuracy and computation efficiency.
Key Innovations
The authors propose two main advancements in Transformer-based time series models:
- Patching: The time series is segmented into subseries-level patches, which are fed into the Transformer as input tokens. This retains local semantic information within each patch, significantly reduces computation and memory usage because far fewer tokens are processed, and lets the model attend to a longer history for the same token budget, effectively extending its receptive field.
- Channel-Independence: Each channel, i.e. a single univariate series of the multivariate input, is passed through the backbone independently, while the embedding and Transformer weights are shared across all channels. This avoids the complications of channel-mixing, in which the attention network must learn from a vector that combines all series, and it allows each channel to develop its own attention patterns, improving adaptability and reducing the risk of overfitting. Both mechanisms are sketched in code after this list.
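To make the two mechanisms concrete, here is a minimal sketch of the forward path, assuming a PyTorch-style pipeline; the shapes, hyperparameters (patch length 16, stride 8, look-back 336), and layer choices are illustrative placeholders rather than the authors' reference implementation.

```python
import torch

# Minimal sketch of patching + channel-independence (positional encodings,
# normalization, and dropout omitted for brevity).
batch, n_channels, lookback = 32, 7, 336   # e.g. 7 variates with look-back L = 336
patch_len, stride = 16, 8                  # patch length P and stride S
d_model, horizon = 128, 96                 # latent width and forecast horizon

x = torch.randn(batch, n_channels, lookback)              # (B, M, L)

# Channel-independence: fold every univariate series into the batch dimension,
# so all channels pass through the same shared embedding and Transformer weights.
x = x.reshape(batch * n_channels, lookback)               # (B*M, L)

# Patching: slice each series into overlapping subseries-level patches.
patches = x.unfold(-1, patch_len, stride)                 # (B*M, N, P)
n_patches = patches.shape[1]                              # N = (L - P) // S + 1

# Each patch becomes one input token via a shared linear projection P -> d_model.
embed = torch.nn.Linear(patch_len, d_model)
tokens = embed(patches)                                   # (B*M, N, d_model)

# A standard Transformer encoder now operates on N ~ L/S tokens instead of L points.
layer = torch.nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = torch.nn.TransformerEncoder(layer, num_layers=3)
z = encoder(tokens)                                       # (B*M, N, d_model)

# Flatten the encoded patches, map them to the forecast horizon,
# and restore the (batch, channel) layout.
head = torch.nn.Linear(n_patches * d_model, horizon)
y_hat = head(z.flatten(start_dim=1)).reshape(batch, n_channels, horizon)
print(y_hat.shape)                                        # torch.Size([32, 7, 96])
```

Folding the channels into the batch dimension is what makes the weights shared across channels while keeping each channel's computation, and hence its attention patterns, independent.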
Empirical Results
The paper reports extensive experiments on eight popular datasets: Weather, Traffic, Electricity, Influenza-like Illness (ILI), and four ETT datasets (ETTh1, ETTh2, ETTm1, ETTm2). PatchTST is benchmarked against SOTA Transformer-based models such as Informer, Autoformer, FEDformer, and Pyraformer, as well as the linear model DLinear. Key findings from the experiments include:
- Forecasting Accuracy: PatchTST/64 achieves an overall 21.0% reduction in Mean Squared Error (MSE) relative to the best results from existing Transformer-based models, and PatchTST/42 achieves a 20.2% reduction (the suffix denotes the number of input patches, corresponding to look-back windows of 512 and 336 respectively).
- Efficiency: The patching technique dramatically decreases the time and space complexity of the model, with reported training times reduced by up to 22 times on large datasets (see the note after this list).
- Look-back Window: PatchTST consistently benefits from longer look-back windows, evidenced by the reduction in MSE as the look-back window increases from 96 to 720.
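A rough back-of-the-envelope calculation shows where these efficiency gains come from: self-attention cost grows quadratically with the number of input tokens, and patching with stride S cuts the token count from L points to roughly L/S patches, so attention memory and compute shrink by a factor of about S^2. With a look-back of L = 336 and stride S = 8, for instance, the encoder attends over roughly 42 patch tokens instead of 336 point tokens, making the attention map about 64 times smaller.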
Self-Supervised Representation Learning
The authors also explore the efficacy of PatchTST in a self-supervised setting, using a masked autoencoder-style objective in which randomly masked patches are reconstructed. The model's ability to learn high-level abstract representations of time series data is validated through fine-tuning and transfer learning tasks. Key insights include (a minimal sketch of the masking step follows this list):
- Fine-tuning Performance: Fine-tuning a self-supervised pre-trained PatchTST outperforms supervised training from scratch in most cases.
- Transfer Learning: The model pre-trained on the Electricity dataset achieves commendable forecasting accuracy when transferred to other datasets, showcasing robustness and generalizability.
- Comparison with Other Self-supervised Methods: PatchTST demonstrates substantial improvements over existing self-supervised learning methods such as BTSF, TS2Vec, TNC, and TS-TCC.
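As a concrete illustration of the masked-autoencoder idea, the sketch below shows one pretraining step under assumed shapes; the masked_pretraining_step helper, the 40% masking ratio, and the reconstructor head (e.g. a linear map from the model dimension back to the patch length) are hypothetical placeholders rather than the paper's exact recipe.

```python
import torch

def masked_pretraining_step(tokens: torch.Tensor, patches: torch.Tensor,
                            encoder: torch.nn.Module, reconstructor: torch.nn.Module,
                            mask_ratio: float = 0.4) -> torch.Tensor:
    """One masked-autoencoder pretraining step (hypothetical sketch).

    tokens:  (B*M, N, d_model) embedded patch tokens
    patches: (B*M, N, P) raw patch values used as the reconstruction target
    """
    # Randomly choose a subset of patch positions and zero out their tokens.
    mask = torch.rand(tokens.shape[:2], device=tokens.device) < mask_ratio    # (B*M, N)
    corrupted = tokens.masked_fill(mask.unsqueeze(-1), 0.0)

    # Encode the corrupted sequence and reconstruct the raw patch values,
    # e.g. with reconstructor = torch.nn.Linear(d_model, patch_len).
    z = encoder(corrupted)              # (B*M, N, d_model)
    recon = reconstructor(z)            # (B*M, N, P)

    # The MSE loss is computed only on the masked positions.
    return ((recon - patches) ** 2)[mask].mean()
```

After pretraining, the reconstruction head is discarded and replaced by a forecasting head for fine-tuning on the downstream horizon.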
Channel-Independence Analysis
To better understand the effectiveness of channel-independence, the paper presents several analyses that highlight its impact:
- Adaptability: Channel-independent models allow each time series to learn distinct attention patterns, which is crucial for datasets with series exhibiting different behaviors.
- Training Efficiency and Overfitting: Channel-independent setups converge faster and are less prone to overfitting compared to channel-mixing approaches, particularly when the training data is limited.
Future Directions
The paper concludes that PatchTST can serve as a foundation for future research on Transformer-based time series forecasting. The authors underscore the simplicity and effectiveness of patching and point to extending the channel-independent design to capture inter-channel relationships as a promising direction.
Conclusion
PatchTST presents a significant step forward in the domain of time series forecasting using Transformers. By addressing the limitations of channel-mixing and leveraging the power of patches, it sets a new benchmark for accuracy and efficiency. The model's strong performance in both supervised and self-supervised settings, along with its robustness in transfer learning tasks, marks it as a critical contribution to the field of time series analysis. As research continues, the principles established by PatchTST will likely inspire further advancements and optimizations in Transformer architectures.