Understanding Different Design Choices in Training Large Time Series Models
Introduction
Time series forecasting (TSF) remains a fundamental task in time series analysis, focusing on predicting future data points from historical values. Over the years, TSF methodologies have evolved from traditional statistical techniques to machine learning and, more recently, deep learning. The advent of transformers, with their strength in sequence modeling, has led to their application in TSF, especially for long-term forecasting.
Drawing inspiration from the capabilities of large language models (LLMs), researchers are now exploring Large Time Series Models (LTSMs): transformer-based architectures trained for TSF. However, training LTSMs presents unique challenges due to the heterogeneity of time series data, including variations in frequency, dimensionality, and patterns, which make it difficult to train models that generalize across diverse datasets.
This paper provides a comprehensive analysis of various design choices in training LTSMs, spanning pre-processing techniques, model configurations, and dataset configurations. Additionally, the authors propose a novel statistical prompting strategy called the "Time Series Prompt" and introduce an optimal combination of design choices termed the "LTSM-bundle". The empirical results demonstrate the superior performance of LTSM-bundle in zero-shot and few-shot settings compared to state-of-the-art LTSMs.
Methodology
Pre-processing: Instruction Prompts
The pre-processing step aims to enable LTSMs to better adapt to time series datasets. Two types of prompts are studied:
- Text Prompts: Task-specific information formatted into text.
- Time Series Prompts: A novel approach introduced in this paper. These prompts are generated by extracting statistical features from the training dataset, providing a robust statistical description of each dataset.
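To illustrate the idea, the snippet below is a minimal sketch of building a statistical prompt for a single variate, assuming the prompt is simply a vector of global statistics computed on the training split. The `time_series_prompt` helper and the exact feature set are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def time_series_prompt(series: np.ndarray) -> np.ndarray:
    """Build a statistical prompt vector for one variate of the training split.

    Minimal sketch: summarize the series with global statistics that are later
    prepended to the model input. The paper's exact feature set may differ.
    """
    diffs = np.diff(series)
    return np.array([
        series.mean(),               # central tendency
        series.std(),                # dispersion
        series.min(),
        series.max(),
        np.median(series),
        np.percentile(series, 25),
        np.percentile(series, 75),
        diffs.mean(),                # average trend of first differences
        diffs.std(),                 # volatility of changes
    ])

# Example: prompt for a toy sine-wave variate
prompt = time_series_prompt(np.sin(np.linspace(0, 10, 500)))
print(prompt.shape)  # (9,)
```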
Results
Empirical results indicate that time series prompts outperform text prompts, yielding up to 8% lower MAE scores. Additionally, the use of time series prompts results in up to 3% lower MSE scores when compared to scenarios without prompts.
Pre-processing: Tokenizations
This section evaluates linear tokenization and time series tokenization approaches:
- Linear Tokenization: Involves using a trainable linear layer to convert time series numbers into tokens.
- Time Series Tokenization: Converts continuous time series data into discrete tokens using a trainable function.
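As a rough illustration, the following PyTorch sketch shows what linear tokenization can look like: a trainable linear layer projects fixed-length windows of raw values into token embeddings. The `LinearTokenizer` class, patch length, and embedding size are assumptions for illustration rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class LinearTokenizer(nn.Module):
    """Minimal sketch of linear tokenization: a trainable linear layer maps
    fixed-length windows (patches) of raw values to token embeddings that the
    transformer backbone consumes."""

    def __init__(self, patch_len: int = 16, d_model: int = 768):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len), with seq_len divisible by patch_len
        batch, seq_len = x.shape
        patches = x.reshape(batch, seq_len // self.patch_len, self.patch_len)
        return self.proj(patches)  # (batch, num_tokens, d_model)

tokens = LinearTokenizer()(torch.randn(4, 256))
print(tokens.shape)  # torch.Size([4, 16, 768])
```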
Results
Linear tokenization proved more effective than time series tokenization in training LTSMs, leading to better forecasting performance across diverse datasets.
Model Configuration: Training Paradigm
Three distinct training paradigms are compared:
- Full Fine-tuning: Fine-tuning all model parameters, starting from pre-trained weights.
- Training from Scratch: Initializing all model parameters randomly, without pre-trained weights.
- LoRA Fine-tuning: Using low-rank adapters to fine-tune a limited number of parameters.
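The snippet below sketches how the three paradigms could be set up on a GPT-2 backbone using Hugging Face transformers and peft; the hyperparameters (e.g., the LoRA rank and target modules) are illustrative assumptions, not the paper's configuration.

```python
from transformers import GPT2Model, GPT2Config
from peft import LoraConfig, get_peft_model

# 1) Full fine-tuning: start from pre-trained weights, update every parameter.
full_ft = GPT2Model.from_pretrained("gpt2")

# 2) Training from scratch: same architecture, randomly initialized weights.
scratch = GPT2Model(GPT2Config())

# 3) LoRA fine-tuning: freeze the backbone, train low-rank adapters only.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["c_attn"])
lora_ft = get_peft_model(GPT2Model.from_pretrained("gpt2"), lora_cfg)
lora_ft.print_trainable_parameters()  # small fraction of total parameters
```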
Results
Full fine-tuning emerged as the most effective strategy, yielding significantly lower MSE and MAE than training from scratch or LoRA fine-tuning.
Model Configuration: Base Model Selection
Four pre-trained models were evaluated as potential backbones for LTSMs:
- GPT-2-Small
- GPT-2-Medium
- GPT-2-Large
- Phi-2
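For orientation, the sketch below shows how such backbones can be swapped via the Hugging Face Hub; the `load_backbone` helper is hypothetical, and only the checkpoint identifiers and rough parameter counts are taken from the public hub.

```python
from transformers import AutoModel

BACKBONES = {
    "GPT-2-Small": "gpt2",           # ~124M parameters
    "GPT-2-Medium": "gpt2-medium",   # ~355M parameters
    "GPT-2-Large": "gpt2-large",     # ~774M parameters
    "Phi-2": "microsoft/phi-2",      # ~2.7B parameters
}

def load_backbone(name: str):
    # Phi-2 may require trust_remote_code=True on older transformers versions.
    return AutoModel.from_pretrained(BACKBONES[name])

backbone = load_backbone("GPT-2-Small")
print(backbone.config.hidden_size)  # 768 for GPT-2-Small
```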
Results
GPT-2-Medium and GPT-2-Small outperformed GPT-2-Large, in short-term and long-term forecasting scenarios respectively, suggesting that the smaller backbones are less prone to overfitting.
Dataset Configuration: Quantity and Diversity
The impact of data quantity and diversity on model performance was also examined:
- Data Quantity: Various down-sampling rates of the training data (10%, 5%, 2.5%) were compared (a sketch of this step follows the list).
- Diversity: Models were trained on an increasing number of datasets to evaluate performance improvements.
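The following sketch illustrates the data-quantity ablation, assuming down-sampling means keeping a random fraction of the training windows; the paper may subsample differently (e.g., periodically), so treat this as a minimal example under that assumption.

```python
import numpy as np

def downsample_training_windows(windows: np.ndarray, rate: float, seed: int = 0):
    """Keep a random fraction of the training windows (e.g., rate=0.10, 0.05,
    or 0.025). Random subsampling is an assumption for illustration."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(len(windows) * rate))
    idx = rng.choice(len(windows), size=n_keep, replace=False)
    return windows[np.sort(idx)]

windows = np.random.randn(10_000, 336)   # 10k training windows of length 336
subset = downsample_training_windows(windows, rate=0.05)
print(subset.shape)  # (500, 336)
```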
Results
Down-sampling to 5% of the training data generally offered the best trade-off between data quantity and forecasting performance. Increasing dataset diversity consistently improved the model's generalizability.
Comparison with State-of-the-Art Methods
The LTSM-bundle demonstrated superior performance across various benchmarks in both zero-shot and few-shot settings, outperforming numerous state-of-the-art models such as PatchTST and DLinear.
Conclusion and Future Directions
This paper provides an in-depth analysis of critical design choices in training LTSMs, yielding insights that culminate in the LTSM-bundle. This framework exhibits strong performance with enhanced generalizability and efficiency.
Future work might involve developing more nuanced prompting strategies and exploring synthetic datasets to further enhance LTSMs. Additionally, investigating variate-specific prompts and the integration of more complex statistical descriptions could yield further improvements.
Overall, this work lays substantial groundwork for advancing the field of time series forecasting using large-scale, transformer-based models.