- The paper introduces Time-MoE, a scalable Mixture of Experts model that reduces computational overhead by activating only a fraction of its components per prediction.
- The model employs sparse MoE layers and multi-resolution forecasting heads, pre-trained on the Time-300B dataset, to achieve a 23% average MSE reduction in zero-shot settings.
- Empirical results reveal that minimal fine-tuning yields a 25% average MSE reduction, underscoring the power of large-scale pre-training for robust time series forecasting.
Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
"Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts" addresses fundamental challenges in the domain of time series forecasting by proposing a scalable framework for pre-training large models effectively. The authors introduce Time-MoE, a heterogeneous Mixture of Experts (MoE) architecture that activates only a fraction of its components per prediction, establishing a combination of high model capacity and computational efficiency.
The core advancements span three dimensions: architectural innovation, dataset construction, and empirical validation. The mixture-of-experts design keeps computational overhead in check despite the model's large scale: by activating only the necessary sub-networks for each input, Time-MoE delivers strong performance without the dense-computation cost typically associated with models of this size.
Architecture and Methodology
Time-MoE is transformer-based, built on a decoder-only structure that processes temporal data auto-regressively. This design choice strikes a balance between predictive accuracy and computational efficiency. Key architectural features include:
- Sparse MoE Layers: Instead of dense feed-forward networks (FFNs), Time-MoE employs sparsely activated MoE layers, enabling scaling up to 2.4 billion parameters while keeping per-prediction compute manageable. Empirical ablations confirm that activating only a small number of experts per layer yields significant efficiency gains.
- Multi-resolution forecasting heads: To make forecasting more flexible and general, Time-MoE includes multiple output projections corresponding to different forecast horizons. This lets the model serve varying forecast lengths dynamically and remain robust across diverse temporal scales (a minimal sketch of both components follows this list).
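To make these two components concrete, here is a minimal PyTorch sketch. It is not the authors' implementation: the class names (`SparseMoEFFN`, `MultiResolutionHeads`), the expert count, the top-k value, and the horizon set are illustrative assumptions, and the routing loop is written for readability rather than efficiency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFN(nn.Module):
    """Sparsely activated feed-forward block: a learned gate routes each
    token to its top-k experts, so only a fraction of parameters fire."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (batch, seq, d_model)
        scores = self.gate(x)                                     # (batch, seq, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)                  # renormalize over chosen experts
        out = torch.zeros_like(x)
        # For clarity, every expert runs on all tokens and a mask keeps only the
        # routed ones; a real implementation would dispatch only routed tokens.
        for e, expert in enumerate(self.experts):
            expert_out = expert(x)
            for slot in range(self.top_k):
                mask = (topk_idx[..., slot] == e).unsqueeze(-1).to(x.dtype)
                out = out + mask * weights[..., slot].unsqueeze(-1) * expert_out
        return out

class MultiResolutionHeads(nn.Module):
    """One output projection per forecast horizon; at inference the smallest
    head that covers the requested horizon produces the forecast."""
    def __init__(self, d_model: int, horizons=(1, 8, 32, 64)):
        super().__init__()
        self.horizons = sorted(horizons)
        self.heads = nn.ModuleDict({str(h): nn.Linear(d_model, h) for h in self.horizons})

    def forward(self, last_hidden: torch.Tensor, horizon: int) -> torch.Tensor:
        h = min(hz for hz in self.horizons if hz >= horizon)      # smallest covering head
        return self.heads[str(h)](last_hidden)[..., :horizon]
```

In a full decoder block, such a sparse MoE FFN would sit in place of the dense FFN after self-attention, and the forecasting heads would read the hidden state of the last input token.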
Data-driven Scaling and Model Efficiency
The critical foundation of this work is the construction and use of Time-300B, a large-scale time series corpus comprising over 300 billion time points. The dataset spans numerous domains, enabling extensive pre-training and fine-tuning of Time-MoE. Compared with the corpora used by peer models such as Moirai and Chronos, Time-300B is unmatched in scale.
To ensure high data quality, the authors implement an elaborate data-cleaning pipeline that mitigates missing values and filters out invalid observations before training. This preprocessing improves the quality and usability of the input data and, in turn, overall model performance.
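The exact rules of the Time-300B pipeline are not reproduced here, so the NumPy sketch below only illustrates the kind of filtering and imputation described; the `max_missing_ratio` threshold, the forward-fill strategy, and the zero-variance check are assumptions chosen for the example.

```python
import numpy as np

def clean_series(values, max_missing_ratio: float = 0.2):
    """Illustrative cleaning step: mask invalid observations, discard series
    that are too incomplete or degenerate, and impute the remaining gaps."""
    x = np.array(values, dtype=np.float64)          # copy so the caller's data is untouched

    x[~np.isfinite(x)] = np.nan                     # treat NaN/inf as missing
    valid = ~np.isnan(x)
    if not valid.any() or (~valid).mean() > max_missing_ratio:
        return None                                 # too sparse: drop the series

    # Forward-fill gaps with the last valid observation.
    idx = np.where(valid, np.arange(len(x)), 0)
    np.maximum.accumulate(idx, out=idx)
    x = x[idx]
    x[np.isnan(x)] = x[valid][0]                    # leading gap: back-fill first valid value

    if x.std() == 0.0:                              # constant series carry no signal
        return None
    return x
```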
Empirical Results
The paper evaluates Time-MoE on six time series benchmarks in both zero-shot and fine-tuning scenarios. Comparisons against strong baselines, including state-of-the-art models such as iTransformer, TimesNet, and PatchTST, contextualize Time-MoE's effectiveness.
- Zero-shot Forecasting: Time-MoE consistently outperforms the baselines, reducing MSE by over 23% on average across benchmarks such as ETTh1, ETTh2, and Electricity. This suggests that the MoE design confers strong generalization, supporting the premise that large, well-trained foundation models can forecast unseen datasets without task-specific tuning (a sketch of this evaluation follows the list).
- In-domain Forecasting: Fine-tuning results confirm Time-MoE's strong in-distribution performance. A 25% average MSE reduction achieved with only a few fine-tuning epochs indicates that extensive pre-training on diverse datasets imparts strong initial predictive capability.
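For reference, the snippet below sketches how a zero-shot MSE of this kind is typically computed with a rolling evaluation window; `model_forecast`, the context length, and the horizon are hypothetical stand-ins rather than the paper's exact protocol.

```python
import numpy as np

def rolling_zero_shot_mse(model_forecast, series: np.ndarray,
                          context_len: int = 512, horizon: int = 96) -> float:
    """Slide a context window over a held-out series, forecast the next
    `horizon` points with the pre-trained model, and average the squared error."""
    errors = []
    for start in range(0, len(series) - context_len - horizon + 1, horizon):
        context = series[start:start + context_len]
        target = series[start + context_len:start + context_len + horizon]
        pred = model_forecast(context, horizon)     # hypothetical forecasting call
        errors.append(np.mean((np.asarray(pred) - target) ** 2))
    return float(np.mean(errors))
```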
Future Prospects and Implications
Time-MoE sets the stage for future advances in time series forecasting by establishing a roadmap for large-scale model pre-training and application. The empirical success of MoE layers and multi-resolution forecasting heads within Time-MoE highlights several forward-looking avenues:
- Extending Sparsity Techniques: Further exploration of sparsity, including alternative MoE configurations and dynamic routing strategies, can push efficiency boundaries while scaling models even further.
- Enhanced Data Utilization: Expanding datasets like Time-300B across different temporal, spatial, and domain-specific dimensions can improve the robustness and applicability of foundation models.
- Adaptive Forecasting: Future models may implement adaptive mechanisms to dynamically adjust their computational footprint based on predictive uncertainty, ensuring real-time efficiency for practical applications.
Conclusion
Time-MoE represents a significant advance in time series forecasting, built on an efficient, scalable MoE architecture pre-trained on an unprecedentedly large corpus. It sets new benchmarks in both zero-shot and fine-tuned scenarios, offering a framework that balances computational efficiency with high capacity. Its empirical validation supports its applicability to a wide spectrum of real-world temporal prediction challenges, and future work will likely build on this foundation, refining the interplay between large-scale models, sparse computation, and diverse data utilization.