A Unified Theory of Stochastic Proximal Point Methods without Smoothness (2405.15941v1)

Published 24 May 2024 in math.OC and cs.LG

Abstract: This paper presents a comprehensive analysis of a broad range of variants of the stochastic proximal point method (SPPM). Proximal point methods have attracted considerable interest owing to their numerical stability and robustness to imperfect tuning, a trait not shared by the dominant stochastic gradient descent (SGD) algorithm. The framework of assumptions we introduce encompasses methods employing techniques such as variance reduction and arbitrary sampling. A cornerstone of our general theoretical approach is a parametric assumption on the iterates, correction and control vectors. We establish a single theorem that guarantees linear convergence under this assumption and the $\mu$-strong convexity of the loss function, without invoking smoothness. This theorem recovers the best-known complexity and convergence guarantees for several existing methods, demonstrating the robustness of our approach. We expand our study by developing three new variants of SPPM, and through numerical experiments we elucidate various properties inherent to them.
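
As a concrete illustration (not part of the paper's contributions), the sketch below implements the vanilla SPPM update $x_{k+1} = \operatorname{prox}_{\gamma f_{\xi_k}}(x_k)$ with uniform sampling and a constant step size $\gamma$. It assumes ridge-regularized quadratic losses so that the proximal subproblem has a closed-form solution via the Sherman–Morrison formula; all names, data shapes, and parameters are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch (not from the paper) of vanilla SPPM on ridge-regularized
# quadratic losses f_i(x) = 0.5*(a_i @ x - b_i)**2 + 0.5*mu*||x||^2, chosen only
# because the proximal subproblem then has a closed form. All names are illustrative.
import numpy as np


def sppm(A, b, mu, gamma, x0, num_iters, seed=None):
    """Iterate x_{k+1} = prox_{gamma * f_i}(x_k), with i drawn uniformly at each step."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    n, _ = A.shape
    c = 1.0 / gamma + mu  # coefficient of the identity term in the prox linear system
    for _ in range(num_iters):
        i = rng.integers(n)
        a, bi = A[i], b[i]
        # prox_{gamma f_i}(x) solves (c*I + a a^T) z = x/gamma + bi*a;
        # the rank-one system is inverted with the Sherman-Morrison formula.
        rhs = x / gamma + bi * a
        x = rhs / c - a * (a @ rhs) / (c * (c + a @ a))
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 10))
    x_star = rng.standard_normal(10)
    b = A @ x_star  # consistent linear system, so the regularized optimum lies close to x_star
    x = sppm(A, b, mu=0.1, gamma=1.0, x0=np.zeros(10), num_iters=5_000, seed=1)
    print("distance to x_star:", np.linalg.norm(x - x_star))
```

With a constant step size, vanilla SPPM of this form converges linearly to a neighborhood of the minimizer; the paper's unified theorem recovers guarantees of this type and extends them to the variance-reduced and arbitrary-sampling variants mentioned in the abstract, without requiring smoothness.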
