Transformers for Supervised Online Continual Learning (2403.01554v1)
Abstract: Transformers have become the dominant architecture for sequence modeling tasks such as natural language and audio processing, and they are now even considered for tasks that are not naturally sequential, such as image classification. Their ability to attend to and process a set of tokens as context enables them to develop in-context few-shot learning abilities. However, their potential for online continual learning remains relatively unexplored. In online continual learning, a model must adapt to a non-stationary stream of data, minimizing the cumulative next-step prediction loss. We focus on the supervised online continual learning setting, where we learn a predictor $x_t \rightarrow y_t$ for a sequence of examples $(x_t, y_t)$. Inspired by the in-context learning capabilities of transformers and their connection to meta-learning, we propose a method that leverages these strengths for online continual learning. Our approach explicitly conditions a transformer on recent observations, while simultaneously training it online with stochastic gradient descent, following the procedure introduced with Transformer-XL. We incorporate replay to maintain the benefits of multi-epoch training while adhering to the sequential protocol. We hypothesize that this combination enables fast adaptation through in-context learning and sustained long-term improvement via parametric learning. Our method demonstrates significant improvements over previous state-of-the-art results on CLOC, a challenging large-scale real-world benchmark for image geo-localization.
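To make the training protocol described in the abstract concrete, the sketch below combines the three ingredients it names: conditioning a causal transformer on recent (x, y) observations as in-context examples, updating it online with SGD on the next-step prediction loss, and replaying windows of past data to approximate multi-epoch training while keeping evaluation prequential. This is a minimal illustration under stated assumptions, not the authors' implementation: the names (`ContextualPredictor`, `online_step`), the label-masking scheme for the query token, and the use of raw recent examples instead of cached Transformer-XL hidden states (and of toy features instead of a vision backbone) are all simplifications.

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextualPredictor(nn.Module):
    """Causal transformer that predicts y_t from x_t plus recent (x, y) pairs."""

    def __init__(self, feat_dim, num_classes, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.x_proj = nn.Linear(feat_dim, d_model)
        # One extra label index serves as the "unknown" label of the query token.
        self.y_embed = nn.Embedding(num_classes + 1, d_model)
        self.unknown = num_classes
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, ctx_x, ctx_y, query_x):
        # Each token is an x-embedding plus a y-embedding; the query's label is masked.
        xs = torch.cat([ctx_x, query_x.unsqueeze(1)], dim=1)        # (B, T+1, feat)
        ys = torch.cat([ctx_y, torch.full_like(ctx_y[:, :1], self.unknown)], dim=1)
        tokens = self.x_proj(xs) + self.y_embed(ys)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.encoder(tokens, mask=mask)
        return self.head(h[:, -1])                                  # logits for the query


def online_step(model, optimizer, buffer, x_t, y_t, context_len=32, replay_steps=1):
    """One step of the sequential protocol: predict y_t, update, replay, then store."""
    if not buffer:                  # nothing to condition on yet
        buffer.append((x_t, y_t))
        return float("nan")

    # 1) Condition on the most recent observations and incur the next-step loss.
    recent = list(buffer)[-context_len:]
    ctx_x = torch.stack([x for x, _ in recent]).unsqueeze(0)
    ctx_y = torch.stack([y for _, y in recent]).unsqueeze(0)
    loss = F.cross_entropy(model(ctx_x, ctx_y, x_t.unsqueeze(0)), y_t.view(1))

    # 2) Online SGD update on the incoming example.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 3) Replay: extra updates on past windows, approximating multi-epoch training.
    for _ in range(replay_steps):
        if len(buffer) > context_len + 1:
            start = random.randrange(len(buffer) - context_len - 1)
            window = list(buffer)[start:start + context_len + 1]
            rx = torch.stack([x for x, _ in window[:-1]]).unsqueeze(0)
            ry = torch.stack([y for _, y in window[:-1]]).unsqueeze(0)
            qx, qy = window[-1]
            r_loss = F.cross_entropy(model(rx, ry, qx.unsqueeze(0)), qy.view(1))
            optimizer.zero_grad()
            r_loss.backward()
            optimizer.step()

    # 4) Only after prediction does (x_t, y_t) enter the buffer (prequential protocol).
    buffer.append((x_t, y_t))
    return loss.item()


# Example wiring with toy dimensions; a bounded deque acts as the replay/context buffer.
model = ContextualPredictor(feat_dim=64, num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
buffer = deque(maxlen=100_000)
for _ in range(100):
    x_t, y_t = torch.randn(64), torch.randint(0, 10, ())
    online_step(model, optimizer, buffer, x_t, y_t)
```

Note that the per-step loss is computed before the new example is added to the buffer, so summing the returned losses yields the cumulative next-step prediction loss the abstract refers to.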
- The description length of deep learning models. Advances in Neural Information Processing Systems, 31, 2018.
- Sequential learning of neural networks for prequential MDL. In The Eleventh International Conference on Learning Representations, 2022.
- High-performance large-scale image recognition without normalization. In International Conference on Machine Learning, pp. 1059–1071. PMLR, 2021.
- Online continual learning with natural distribution shifts: An empirical study with visual data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8281–8290, 2021.
- Chaitin, G. J. On the intelligibility of the universe and the notions of simplicity, complexity and irreducibility. arXiv preprint math/0210035, 2002.
- Data distributional properties drive emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 35:18878–18891, 2022.
- Continual learning with tiny episodic memories. In Workshop on Multi-Task and Lifelong Reinforcement Learning, 2019.
- EMNIST: Extending MNIST to handwritten letters. 2017 International Joint Conference on Neural Networks (IJCNN), 2017. doi: 10.1109/ijcnn.2017.7966217.
- Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, 2019.
- Prospective learning: Principled extrapolation to the future. In Conference on Lifelong Learning Agents, pp. 347–357. PMLR, 2023.
- Memory-based meta-learning on non-stationary distributions. In International Conference on Machine Learning, 2023.
- Shannon information and Kolmogorov complexity. October 2004.
- Grünwald, P. D. The Minimum Description Length Principle. The MIT Press, Cambridge, 2007.
- Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009, 2022.
- Recasting continual learning as sequence modeling. October 2023.
- Challenging common assumptions about catastrophic forgetting. 2022.
- A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986, 2022.
- Understanding plasticity in neural networks. International Conference on Machine Learning, 2023.
- Meta-trained agents implement Bayes-optimal agents. Advances in Neural Information Processing Systems, 33:18691–18703, 2020.
- In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
- Meta-learning of sequential strategies. arXiv preprint arXiv:1905.03030, 2019.
- Online continual learning without the storage constraint. arXiv preprint arXiv:2305.09253, 2023.
- A philosophical treatise of universal induction. Entropy, 13(6):1076–1136, 2011. ISSN 1099-4300. doi: 10.3390/e13061076.
- Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning, pp. 9355–9366. PMLR, 2021.
- Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.
- The transient nature of emergent in-context learning in transformers. November 2023.
- Kalman filter for online classification of non-stationary data. arXiv preprint arXiv:2306.08448, 2023.
- Vapnik, V. Principles of risk minimization for learning theory. In Proceedings of the 4th International Conference on Neural Information Processing Systems, NIPS’91, pp. 831–838, San Francisco, CA, USA, December 1991. Morgan Kaufmann Publishers Inc.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Wallace, C. S. Statistical and Inductive Inference by Minimum Message Length. Springer Science & Business Media, December 2005.
- Tuning large neural networks via zero-shot hyperparameter transfer. Advances in Neural Information Processing Systems, 34:17084–17097, 2021.
- Authors: Jorg Bornschein, Yazhe Li, Amal Rannen-Triki