Scaling-laws for Large Time-series Models (2405.13867v1)

Published 22 May 2024 in cs.LG and cs.AI

Abstract: Scaling laws for LLMs have provided useful guidance on how to train ever larger models for predictable performance gains. Time series forecasting shares a similar sequential structure to language, and is amenable to large-scale transformer architectures. Here we show that foundational decoder-only time series transformer models exhibit analogous scaling-behavior to LLMs, while architectural details (aspect ratio and number of heads) have a minimal effect over broad ranges. We assemble a large corpus of heterogenous time series data on which to train, and establish, for the first time, power-law scaling relations with respect to parameter count, dataset size, and training compute, spanning five orders of magnitude.

Overview of the Paper: Scaling Laws for Large Time-series Models

The paper "Scaling-laws for Large Time-series Models," authored by Thomas D. P. Edwards et al., addresses the subject of large-scale models for time-series forecasting. It aims to extend the scaling laws known from LLMs to foundational time-series models. The authors present a detailed investigation into the scaling behaviors concerning model parameters, dataset size, and computation resources in relation to the test performance of large time-series transformers.

Key Insights and Methodology

A crucial insight from this work is that time-series models based on decoder-only transformer architectures exhibit power-law scaling behaviors similar to those of LLMs. The researchers assemble a large, heterogeneous corpus of time-series data drawn from many domains on which to conduct their experiments. This corpus, comprising about 8 billion data points across more than 30 million individual time series, allows a comprehensive investigation of scaling behavior across five orders of magnitude.
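The reported scaling relations take a power-law form. As a minimal sketch (not the authors' code), the snippet below fits a power law of the form L(N) = a * N^(-alpha) to hypothetical (parameter count, test loss) pairs; all numbers are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical measurements: model sizes (parameters) and converged test MSE.
N = np.array([1e5, 1e6, 1e7, 1e8])
mse = np.array([0.42, 0.31, 0.23, 0.17])

# Fit log(MSE) = log(a) - alpha * log(N), i.e. MSE = a * N**(-alpha).
# Fitting in log space weights each order of magnitude equally, which matters
# when the sizes span several decades.
(log_a, alpha), _ = curve_fit(
    lambda logn, log_a, alpha: log_a - alpha * logn,
    np.log(N), np.log(mse))
print(f"MSE ~ {np.exp(log_a):.3f} * N^(-{alpha:.3f})")
```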

Key variables in their analysis are the number of model parameters, the compute used for training, and the size of the dataset. Their findings show consistent power-law scaling of the performance metrics (Mean Squared Error (MSE), Continuous Ranked Probability Score (CRPS), and log-likelihood) with each of these factors. The work highlights how models improve predictably with increased size and compute, mirroring the behavior of LLMs.
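To make the three metrics concrete, here is an illustrative sketch (assumptions, not the paper's evaluation code) that computes them for a single forecast step, taking the predictive distribution to be a Student's-t as in the paper's output head; the observation and distribution parameters are made up.

```python
import numpy as np
from scipy import stats

y_true = 1.3                            # hypothetical observed value
df, loc, scale = 4.0, 1.0, 0.5          # hypothetical predicted Student's-t parameters

# Log-likelihood of the observation under the predictive distribution.
log_lik = stats.t.logpdf(y_true, df, loc=loc, scale=scale)

# MSE of the point forecast (here the predictive mean, which equals loc for df > 1).
mse = (y_true - loc) ** 2

# Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|.
rng = np.random.default_rng(0)
samples = stats.t.rvs(df, loc=loc, scale=scale, size=1024, random_state=rng)
crps = np.mean(np.abs(samples - y_true)) - 0.5 * np.mean(
    np.abs(samples[:, None] - samples[None, :]))

print(f"MSE={mse:.3f}  CRPS={crps:.3f}  log-likelihood={log_lik:.3f}")
```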

Experimental Framework

The authors use a decoder-only transformer with a learned positional encoding and a Student's-t distribution head, designed for probabilistic forecasting, and train it with a negative log-likelihood loss. What stands out is their systematic investigation of architectural settings such as aspect ratio and number of attention heads, which they find to have minimal impact on performance compared to the total parameter count.
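As a rough illustration of such an output head, the following is a minimal sketch assuming PyTorch; the class name, layer sizes, and parameterization (softplus constraints on the degrees of freedom and scale) are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentTHead(nn.Module):
    """Maps transformer hidden states to (df, loc, scale) of a Student's-t."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 3)

    def forward(self, h: torch.Tensor) -> torch.distributions.StudentT:
        df_raw, loc, scale_raw = self.proj(h).unbind(-1)
        df = 2.0 + F.softplus(df_raw)          # keep df > 2 so the variance exists
        scale = F.softplus(scale_raw) + 1e-6   # strictly positive scale
        return torch.distributions.StudentT(df, loc, scale)

# Negative log-likelihood loss for a batch of hidden states and next-step targets.
head = StudentTHead(d_model=64)
h = torch.randn(8, 128, 64)                    # (batch, sequence, d_model)
y = torch.randn(8, 128)                        # next-step targets
loss = -head(h).log_prob(y).mean()
loss.backward()
```

Training then amounts to minimizing this negative log-likelihood over next-step targets across the corpus.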

The empirical results demonstrated in the paper include detailed plots of the scaling behavior with respect to model parameters, compute resources, and dataset size. Particularly notable is the observation that the performance metrics follow power-law behavior, albeit with minor deviations at lower scales.

Practical and Theoretical Implications

From a practical standpoint, these scaling laws serve as guidelines for allocating resources when developing large-scale time-series models. Foundation models capable of zero-shot prediction across varied domains also have the potential to replace traditional statistical or domain-specific models in certain scenarios.
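As a hedged illustration of how a fitted law could inform such decisions, the toy snippet below uses invented coefficients for loss versus training compute to predict the loss at a larger budget and, conversely, the compute needed to reach a target loss.

```python
# Assumed fit (numbers invented for illustration): loss ~ a * C**(-alpha),
# where C is training compute in FLOPs.
a, alpha = 2.4, 0.08

def predicted_loss(compute_flops: float) -> float:
    """Loss expected at a given training-compute budget under the fitted law."""
    return a * compute_flops ** (-alpha)

def compute_for_loss(target_loss: float) -> float:
    """Training compute implied by the fitted law to reach a target loss."""
    return (a / target_loss) ** (1.0 / alpha)

print(predicted_loss(1e20))      # expected loss at a 1e20-FLOP budget
print(compute_for_loss(0.05))    # FLOPs required to reach a loss of 0.05
```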

Theoretically, this research contributes to the broader understanding of neural scaling laws beyond natural language processing. It paves the way for further explorations into how foundational time-series models can be optimized for performance and scalability.

Future Directions

The paper acknowledges the need to extend this research to multivariate time-series prediction and longer context lengths, the latter to better capture low-frequency variations. The authors also intend to explore alternative distribution heads and context-length scaling to further improve model performance.

A notable prospective research avenue is the development of a robust framework for assessing data diversity, which the authors highlight as a critical factor influencing the efficacy of large-scale training.

Conclusion

This work by Edwards et al. provides a thorough examination of the scaling laws relevant to large time-series models, paralleling those observed in LLMs. It emphasizes the viability of employing foundational models in time-series forecasting, fostering advancements in AI-driven decision-making across diverse fields like climate science, healthcare, and finance. As the field progresses, the findings in this paper will likely guide subsequent efforts to refine and implement large-scale time-series forecasting models.

Authors (5)
  1. Thomas D. P. Edwards (25 papers)
  2. James Alvey (20 papers)
  3. Justin Alsing (37 papers)
  4. Nam H. Nguyen (21 papers)
  5. Benjamin D. Wandelt (144 papers)