Understanding Forgetting in Continual Learning with Linear Regression (2405.17583v1)
Abstract: Continual learning, which focuses on sequentially learning multiple tasks, has gained significant attention recently. Despite substantial empirical progress, the theoretical understanding of continual learning, especially of the factors contributing to catastrophic forgetting, remains limited. In this paper, we provide a general theoretical analysis of forgetting in the linear regression model trained with Stochastic Gradient Descent (SGD), applicable to both the underparameterized and overparameterized regimes. Our theoretical framework reveals insights into the intricate relationship between task sequence and algorithmic parameters, an aspect not fully captured in previous studies due to their restrictive assumptions. Specifically, we demonstrate that, given a sufficiently large data size, ordering tasks so that those whose population data covariance matrices have larger eigenvalues are trained later tends to increase forgetting. Additionally, our findings highlight that an appropriate choice of step size helps mitigate forgetting in both underparameterized and overparameterized settings. To validate the theoretical analysis, we conduct simulation experiments on both linear regression models and Deep Neural Networks (DNNs); the results substantiate our theoretical findings.
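The abstract's claims about task ordering and step size can be illustrated with a small simulation. Below is a minimal sketch, assuming Gaussian covariates with diagonal per-task population covariance matrices and single-pass SGD on the squared loss; the forgetting measure used here (each earlier task's final risk minus its risk immediately after being trained) follows a common definition in the continual-learning literature and may differ from the paper's exact metric. Helper names such as `make_task`, `sgd`, and `risk`, as well as the dimension, sample size, and step size, are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, step = 20, 2000, 0.01          # dimension, samples per task, SGD step size

def make_task(scale):
    """Gaussian covariates whose diagonal population covariance has eigenvalues
    controlled by `scale`, with a task-specific ground-truth weight vector."""
    eigvals = scale * np.linspace(1.0, 2.0, d)
    w_star = rng.normal(size=d)
    X = rng.normal(size=(n, d)) * np.sqrt(eigvals)   # column j has variance eigvals[j]
    y = X @ w_star + 0.1 * rng.normal(size=n)
    return X, y, w_star, eigvals

def sgd(w, X, y, step):
    """One pass of plain SGD on the squared loss, starting from w."""
    for x_i, y_i in zip(X, y):
        w = w - step * (x_i @ w - y_i) * x_i
    return w

def risk(w, w_star, eigvals):
    """Population excess risk for Gaussian data: (w - w*)^T Sigma (w - w*)."""
    diff = w - w_star
    return diff @ (eigvals * diff)

# Same two tasks, trained in opposite orders: small-eigenvalue task last vs. first.
tasks = [make_task(0.5), make_task(2.0)]
for order, seq in [("large-eigenvalue task last", tasks),
                   ("large-eigenvalue task first", tasks[::-1])]:
    w = np.zeros(d)
    risk_after_own_training = []
    for X, y, w_star, eig in seq:
        w = sgd(w, X, y, step)
        risk_after_own_training.append(risk(w, w_star, eig))
    # Forgetting: increase in each earlier task's risk once all training ends.
    final_risk = [risk(w, w_star, eig) for _, _, w_star, eig in seq]
    forgetting = np.mean([f - r for f, r in zip(final_risk[:-1],
                                                risk_after_own_training[:-1])])
    print(f"{order}: forgetting = {forgetting:.4f}")
```

Reversing the task order in the sketch changes only which covariance spectrum is seen last, so any difference in the printed forgetting values comes from ordering alone, which is the kind of controlled comparison the abstract describes; varying `step` likewise lets one probe the claimed effect of step size.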