Parallel Momentum Methods Under Biased Gradient Estimations (2403.00853v2)

Published 29 Feb 2024 in cs.LG

Abstract: Parallel stochastic gradient methods are gaining prominence in solving large-scale machine learning problems that involve data distributed across multiple nodes. However, obtaining unbiased stochastic gradients, which have been the focus of most theoretical research, is challenging in many distributed machine learning applications. Gradient estimates easily become biased, for example, when gradients are compressed or clipped, when data is shuffled, and in meta-learning and reinforcement learning. In this work, we establish worst-case bounds on parallel momentum methods under biased gradient estimation for both general non-convex and $\mu$-PL problems. Our analysis covers general distributed optimization problems, and we work out the implications for special cases where gradient estimates are biased, e.g., in meta-learning and when gradients are compressed or clipped. Our numerical experiments verify the theoretical findings and show that momentum methods converge faster than traditional biased gradient descent.
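
To make the setting concrete, below is a minimal sketch (not the authors' implementation) of one parallel heavy-ball update in which every worker communicates a biased gradient estimate, here produced by top-k sparsification. The helper names (`top_k`, `parallel_momentum_step`), the least-squares toy problem, and all hyperparameters are illustrative assumptions rather than choices taken from the paper.

```python
# Sketch of parallel momentum (heavy-ball) SGD with biased worker gradients.
# Bias source here: top-k gradient compression; clipping would be an
# alternative source of bias covered by the same framework.
import numpy as np

def top_k(g, k):
    """Keep only the k largest-magnitude entries of g (a biased compressor)."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def parallel_momentum_step(x, m, local_grads, lr=0.05, beta=0.9, k=2):
    """One heavy-ball update using the average of compressed worker gradients."""
    compressed = [top_k(g, k) for g in local_grads]   # each worker compresses
    g_avg = np.mean(compressed, axis=0)               # biased aggregate gradient
    m = beta * m + g_avg                              # momentum buffer
    x = x - lr * m                                    # parameter update
    return x, m

if __name__ == "__main__":
    # Toy distributed least-squares problem split across 4 workers.
    rng = np.random.default_rng(0)
    d, num_workers = 10, 4
    A = [rng.standard_normal((20, d)) for _ in range(num_workers)]
    b = [rng.standard_normal(20) for _ in range(num_workers)]
    x, m = np.zeros(d), np.zeros(d)
    for _ in range(200):
        grads = [Ai.T @ (Ai @ x - bi) / len(bi) for Ai, bi in zip(A, b)]
        x, m = parallel_momentum_step(x, m, grads)
    print("final avg. squared residual:",
          np.mean([np.linalg.norm(Ai @ x - bi) ** 2 for Ai, bi in zip(A, b)]))
```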
