Easy attention: A simple attention mechanism for temporal predictions with transformers (2308.12874v3)
Abstract: To improve the robustness of transformer neural networks used for temporal-dynamics prediction of chaotic systems, we propose a novel attention mechanism called easy attention, which we demonstrate in time-series reconstruction and prediction. While standard self-attention relies on the inner product of queries and keys, we show that the queries, keys, and softmax are not necessary for obtaining the attention scores required to capture long-term dependencies in temporal sequences. Through a singular-value decomposition (SVD) of the softmax attention scores, we further observe that self-attention compresses the contributions from both queries and keys in the space spanned by the attention scores. Therefore, our proposed easy-attention method treats the attention scores directly as learnable parameters. This approach yields excellent results when reconstructing and predicting the temporal dynamics of chaotic systems, exhibiting greater robustness and lower complexity than self-attention or the widely used long short-term memory (LSTM) network. We demonstrate the improved performance of easy attention on the Lorenz system, a turbulent shear flow, and a model of a nuclear reactor.
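The core idea, replacing the query-key softmax product with a directly learnable score matrix, can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the authors' reference implementation: the class name `EasyAttention`, the parameter names `alpha` and `w_v`, and the initialization scheme are all hypothetical choices made here for clarity.

```python
import torch
import torch.nn as nn


class EasyAttention(nn.Module):
    """Minimal sketch of the easy-attention idea: the attention-score
    matrix is a learnable parameter, so no queries, keys, or softmax
    are computed. Only a value projection remains. Names and
    initialization are illustrative assumptions."""

    def __init__(self, seq_len: int, d_model: int):
        super().__init__()
        # Learnable attention scores over a fixed sequence length
        # (hypothetical initialization; the paper may use another scheme).
        self.alpha = nn.Parameter(torch.randn(seq_len, seq_len) / seq_len)
        # Value projection, as in standard attention.
        self.w_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        v = self.w_v(x)
        # Standard self-attention would instead compute
        #   alpha = softmax(Q K^T / sqrt(d)), with Q = x W_q, K = x W_k;
        # easy attention replaces that entire product with learned alpha.
        return torch.einsum("ts,bsd->btd", self.alpha, v)


if __name__ == "__main__":
    layer = EasyAttention(seq_len=16, d_model=8)
    x = torch.randn(2, 16, 8)
    print(layer(x).shape)  # torch.Size([2, 16, 8])
```

Because `alpha` does not depend on the input, this sketch trades the input-adaptive scores of self-attention for fewer computations per step, which is consistent with the reduced complexity the abstract reports.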
- Marcial Sanchis-Agudo
- Yuning Wang
- Roger Arnau
- Luca Guastoni
- Jasmin Lim
- Karthik Duraisamy
- Ricardo Vinuesa