Last Iterate Convergence of Incremental Methods and Applications in Continual Learning (2403.06873v2)

Published 11 Mar 2024 in math.OC and cs.LG

Abstract: Incremental gradient and incremental proximal methods are a fundamental class of optimization algorithms used for solving finite sum problems, broadly studied in the literature. Yet, without strong convexity, their convergence guarantees have primarily been established for the ergodic (average) iterate. Motivated by applications in continual learning, we obtain the first convergence guarantees for the last iterate of both incremental gradient and incremental proximal methods, in general convex smooth (for both) and convex Lipschitz (for the proximal variants) settings. Our oracle complexity bounds for the last iterate nearly match (i.e., match up to a square-root-log or a log factor) the best known oracle complexity bounds for the average iterate, for both classes of methods. We further obtain generalizations of our results to weighted averaging of the iterates with increasing weights and to randomly permuted orderings of updates. We study incremental proximal methods as a model of continual learning with generalization and argue that a large amount of regularization is crucial to preventing catastrophic forgetting. Our results generalize last iterate guarantees for incremental methods relative to the state of the art, as such results were previously known only for overparameterized linear models, which correspond to convex quadratic problems with infinitely many solutions.
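
To make the two update rules discussed in the abstract concrete, the sketch below (not code from the paper) runs one cyclic pass per epoch of the incremental gradient step x ← x − η∇f_i(x) and of the incremental proximal step x ← argmin_y { f_i(y) + ||y − x||²/(2η) } on a toy least-squares finite sum, reporting the objective value at the last iterate. The problem data A_i, b_i, the step size / regularization parameter eta, and the epoch counts are illustrative assumptions, not quantities from the paper.

```python
# Minimal sketch of incremental gradient vs. incremental proximal methods on a
# toy convex finite-sum problem F(x) = sum_i 0.5 * ||A_i x - b_i||^2.
# In the continual-learning reading of the paper, each component f_i is one "task"
# processed in a fixed cyclic order; all parameter choices here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 20, 10                      # number of components (tasks), samples per task, dimension
A = [rng.standard_normal((n, d)) for _ in range(m)]
b = [rng.standard_normal(n) for _ in range(m)]

def grad_fi(x, i):
    """Gradient of the i-th component f_i(x) = 0.5 * ||A_i x - b_i||^2."""
    return A[i].T @ (A[i] @ x - b[i])

def prox_fi(x, i, eta):
    """Proximal step on f_i: argmin_y f_i(y) + ||y - x||^2 / (2 * eta).
    For least squares this has the closed form
    (A_i^T A_i + I/eta)^{-1} (A_i^T b_i + x/eta),
    i.e. a ridge-regularized fit of task i anchored at the incoming iterate x."""
    H = A[i].T @ A[i] + np.eye(d) / eta
    return np.linalg.solve(H, A[i].T @ b[i] + x / eta)

def incremental_gradient(eta=1e-2, epochs=50):
    """One cyclic pass over the components per epoch; returns the LAST iterate."""
    x = np.zeros(d)
    for _ in range(epochs):
        for i in range(m):               # fixed cyclic ordering of the updates
            x = x - eta * grad_fi(x, i)
    return x

def incremental_proximal(eta=1e-1, epochs=50):
    """Same cyclic scheme, with each component handled by a proximal step.
    Smaller eta means stronger regularization toward the previous iterate,
    which the abstract ties to limiting catastrophic forgetting."""
    x = np.zeros(d)
    for _ in range(epochs):
        for i in range(m):
            x = prox_fi(x, i, eta)
    return x

def objective(x):
    """Finite-sum objective F(x) = sum_i f_i(x)."""
    return sum(0.5 * np.linalg.norm(A[i] @ x - b[i]) ** 2 for i in range(m))

print("last iterate, incremental gradient:", objective(incremental_gradient()))
print("last iterate, incremental proximal:", objective(incremental_proximal()))
```

The guarantees in the paper concern exactly the point returned after the final cyclic pass (the last iterate), rather than a running average of the iterates; the proximal variant also illustrates the regularization-toward-the-previous-iterate mechanism that the abstract connects to preventing catastrophic forgetting.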
