Transformers in Time-series Analysis: A Tutorial (2205.01138v2)
Abstract: The Transformer architecture has widespread applications, particularly in Natural Language Processing and computer vision. Recently, Transformers have been employed in various aspects of time-series analysis. This tutorial provides an overview of the Transformer architecture, its applications, and a collection of examples from recent research papers in time-series analysis. We explain the core components of the Transformer, including the self-attention mechanism, positional encoding, multi-head attention, and the encoder/decoder structure. Several enhancements to the original Transformer architecture that target time-series tasks are highlighted. The tutorial also provides best practices and techniques for overcoming the challenge of effectively training Transformers for time-series analysis.
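To make two of the components named above concrete, the sketch below is a minimal, illustrative NumPy implementation (not code from the tutorial) of sinusoidal positional encoding and single-head scaled dot-product self-attention, applied to a toy univariate time series. The dimensions and weight names (`d_model`, `Wq`, `Wk`, `Wv`) are assumptions chosen for illustration only.

```python
# Minimal sketch, assuming the standard formulation from "Attention Is All You Need":
# sinusoidal positional encoding plus single-head scaled dot-product self-attention.
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[t, 2i] = sin(t / 10000^(2i/d_model)), PE[t, 2i+1] = cos(t / 10000^(2i/d_model))."""
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

def self_attention(x: np.ndarray, Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """Single-head self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                      # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)               # row-wise softmax
    return weights @ V

# Toy usage: embed a univariate series into d_model dimensions, add positions, attend.
seq_len, d_model = 16, 8
rng = np.random.default_rng(0)
series = rng.normal(size=(seq_len, 1))                           # hypothetical input series
x = series @ rng.normal(size=(1, d_model)) + positional_encoding(seq_len, d_model)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)                       # (16, 8)
```

Multi-head attention repeats this computation with several independent projection triples and concatenates the results; the tutorial's examples build on that same primitive.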