
A Time Series is Worth 64 Words: Long-term Forecasting with Transformers (2211.14730v2)

Published 27 Nov 2022 in cs.LG and cs.AI

Abstract: We propose an efficient design of Transformer-based models for multivariate time series forecasting and self-supervised representation learning. It is based on two key components: (i) segmentation of time series into subseries-level patches which are served as input tokens to Transformer; (ii) channel-independence where each channel contains a single univariate time series that shares the same embedding and Transformer weights across all the series. Patching design naturally has three-fold benefit: local semantic information is retained in the embedding; computation and memory usage of the attention maps are quadratically reduced given the same look-back window; and the model can attend longer history. Our channel-independent patch time series Transformer (PatchTST) can improve the long-term forecasting accuracy significantly when compared with that of SOTA Transformer-based models. We also apply our model to self-supervised pre-training tasks and attain excellent fine-tuning performance, which outperforms supervised training on large datasets. Transferring of masked pre-trained representation on one dataset to others also produces SOTA forecasting accuracy. Code is available at: https://github.com/yuqinie98/PatchTST.

Long-term Forecasting with Transformers: PatchTST

The paper "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers" introduces PatchTST, a novel Transformer-based model designed for multivariate time series forecasting. This paper provides a comprehensive approach that focuses on enhancing the Transformer model through the integration of segmentation and channel-independence mechanisms. The empirical results demonstrate significant improvements over state-of-the-art (SOTA) models, both in forecasting accuracy and computation efficiency.

Key Innovations

The authors propose two main advancements in Transformer-based time series models:

  1. Patching: Time series are segmented into subseries-level patches that serve as input tokens to the Transformer. This retains local semantic information in the embedding, significantly reduces computation and memory usage by shrinking the number of input tokens, and effectively extends the model's receptive field (both innovations are sketched in code after this list).
  2. Channel-Independence: Each channel, representing a single univariate time series, shares the same embedding and Transformer weights across all series. This approach circumvents the complications associated with channel-mixing, wherein the attention network must learn from a combined input vector of all time series. Channel-independence allows each channel to have its own attention patterns, enhancing adaptability and reducing the risk of overfitting.
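
Both innovations can be captured in a few lines of PyTorch. The following is a minimal sketch, not the authors' implementation; class and variable names such as `PatchEmbedding`, `patch_len`, and `stride` are illustrative. Each channel is folded into the batch dimension so that all channels share one embedding, the series is split into overlapping patches, and each patch is linearly projected into a Transformer token.

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters: a 512-step look-back window with patch length 16
# and stride 8 gives 64 patches per channel -- the "64 words" of the title.
seq_len, patch_len, stride, d_model = 512, 16, 8, 128

class PatchEmbedding(nn.Module):
    """Patching + channel-independence sketch; names are illustrative."""
    def __init__(self, patch_len, stride, d_model):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        self.proj = nn.Linear(patch_len, d_model)    # one projection shared by all channels

    def forward(self, x):                            # x: (batch, n_channels, seq_len)
        b, m, l = x.shape
        x = x.reshape(b * m, 1, l)                   # channel-independence: fold channels into the batch
        x = nn.functional.pad(x, (0, self.stride), mode='replicate')    # repeat the last value
        patches = x.squeeze(1).unfold(-1, self.patch_len, self.stride)  # (b*m, n_patches, patch_len)
        return self.proj(patches)                    # (b*m, n_patches, d_model): Transformer tokens

x = torch.randn(32, 7, seq_len)                      # e.g. 7 channels, as in the ETT datasets
tokens = PatchEmbedding(patch_len, stride, d_model)(x)
print(tokens.shape)                                  # torch.Size([224, 64, 128])
```

The flattened channels are processed by a single shared Transformer backbone; its outputs are reshaped back to per-channel forecasts, so no cross-channel mixing ever enters the attention layers.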

Empirical Results

The paper reports extensive experiments on eight popular datasets: Weather, Traffic, Electricity, Influenza-like Illness (ILI), and four ETT datasets. PatchTST is benchmarked against SOTA models like Informer, Autoformer, FEDformer, Pyraformer, and DLinear. Key findings from the experiments include:

  • Forecasting Accuracy: PatchTST/64 reduces Mean Squared Error (MSE) by 21.0% overall relative to the best results from existing Transformer-based models, and PatchTST/42 achieves a 20.2% reduction.
  • Efficiency: The patching technique sharply reduces the model's time and memory complexity, with training time cut by up to a factor of 22 on large datasets (a short token-count calculation follows this list).
  • Look-back Window: PatchTST consistently benefits from longer look-back windows, evidenced by the reduction in MSE as the look-back window increases from 96 to 720.
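
To make the efficiency claim concrete, here is a back-of-the-envelope calculation of how patching shrinks the attention map. The numbers assume the PatchTST/64 configuration (look-back 512, patch length 16, stride 8); the variable names are illustrative.

```python
# Token counts with and without patching (PatchTST/64 configuration).
L, P, S = 512, 16, 8                      # look-back window, patch length, stride
n_pointwise = L                           # one token per time step: 512
n_patched = (L - P) // S + 2              # patches after end-padding: 64
print(n_pointwise ** 2, n_patched ** 2)   # attention-map entries: 262144 vs 4096
```

Since attention cost grows with the square of the number of tokens, cutting the token count from L to roughly L/S shrinks the attention map by about a factor of S squared, which is where the large time and memory savings come from.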

Self-Supervised Representation Learning

The authors also explore the efficacy of PatchTST in self-supervised learning, using a masked-autoencoder scheme in which randomly selected patches are masked and then reconstructed. The model's ability to learn high-level abstract representations of time series data is validated through fine-tuning and transfer learning tasks; a minimal sketch of the masked-patch objective follows the list below. Key insights include:

  • Fine-tuning Performance: Self-supervised pre-training followed by fine-tuning outperforms training PatchTST from scratch in most cases.
  • Transfer Learning: The model pre-trained on the Electricity dataset achieves commendable forecasting accuracy when transferred to other datasets, showcasing robustness and generalizability.
  • Comparison with Other Self-supervised Methods: PatchTST demonstrates substantial improvements over existing self-supervised learning methods such as BTSF, TS2Vec, TNC, and TS-TCC.
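
Below is a minimal sketch of the masked-patch reconstruction objective summarized above. The 40% mask ratio and the zero-filling of masked patches follow the paper's description, while `encoder` and `head` are illustrative stand-ins for the shared Transformer backbone and its linear reconstruction head, not the authors' implementation.

```python
import torch
import torch.nn as nn

def masked_patch_loss(encoder, head, patches, mask_ratio=0.4):
    """Mask a random subset of patches, reconstruct them, and score with MSE.

    patches: (batch, n_patches, patch_len). `encoder` maps patches to latent
    representations; `head` projects them back to patch values.
    """
    b, n, _ = patches.shape
    mask = torch.rand(b, n, device=patches.device) < mask_ratio   # True = masked
    corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)      # zero out masked patches
    recon = head(encoder(corrupted))                              # (batch, n_patches, patch_len)
    return ((recon - patches) ** 2)[mask].mean()                  # MSE on masked patches only

# Toy usage with stand-in modules (a per-patch MLP "encoder" and a linear head).
enc = nn.Sequential(nn.Linear(16, 128), nn.ReLU())
head = nn.Linear(128, 16)
loss = masked_patch_loss(enc, head, torch.randn(32, 64, 16))
```

After pre-training with this objective, the reconstruction head is discarded and a forecasting head is fine-tuned on top of the frozen or fully trainable encoder.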

Channel-Independence Analysis

To better understand the effectiveness of channel-independence, several experiments highlight its impact:

  • Adaptability: Channel-independent models allow each time series to learn distinct attention patterns, which is crucial for datasets with series exhibiting different behaviors.
  • Training Efficiency and Overfitting: Channel-independent setups converge faster and are less prone to overfitting compared to channel-mixing approaches, particularly when the training data is limited.

Future Directions

The paper concludes with the potential for PatchTST to serve as a foundational model for future research in Transformer-based time series forecasting. The simplicity and effectiveness of patching are underscored, along with the possibility of extending channel-independence to incorporate inter-channel relationships.

Conclusion

PatchTST presents a significant step forward in the domain of time series forecasting using Transformers. By addressing the limitations of channel-mixing and leveraging the power of patches, it sets a new benchmark for accuracy and efficiency. The model's strong performance in both supervised and self-supervised settings, along with its robustness in transfer learning tasks, marks it as a critical contribution to the field of time series analysis. As research continues, the principles established by PatchTST will likely inspire further advancements and optimizations in Transformer architectures.

Authors (4)
  1. Yuqi Nie
  2. Nam H. Nguyen
  3. Phanwadee Sinthong
  4. Jayant Kalagnanam
Citations (830)