
High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise (2310.18784v7)

Published 28 Oct 2023 in cs.LG, math.OC, math.ST, stat.ML, and stat.TH

Abstract: We study high-probability convergence guarantees of learning on streaming data in the presence of heavy-tailed noise. In the proposed scenario, the model is updated in an online fashion, as new information is observed, without storing any additional data. To combat the heavy-tailed noise, we consider a general framework of nonlinear stochastic gradient descent (SGD), providing several strong results. First, for non-convex costs and component-wise nonlinearities, we establish a convergence rate arbitrarily close to $\mathcal{O}\left(t^{-\frac{1}{4}}\right)$, whose exponent is independent of noise and problem parameters. Second, for strongly convex costs and component-wise nonlinearities, we establish a rate arbitrarily close to $\mathcal{O}\left(t^{-\frac{1}{2}}\right)$ for the weighted average of iterates, with exponent again independent of noise and problem parameters. Finally, for strongly convex costs and a broader class of nonlinearities, we establish convergence of the last iterate, with a rate $\mathcal{O}\left(t^{-\zeta}\right)$, where $\zeta \in (0,1)$ depends on problem parameters, noise and nonlinearity. As we show analytically and numerically, $\zeta$ can be used to inform the preferred choice of nonlinearity for given problem settings. Compared to state-of-the-art works, which only consider clipping, require bounded noise moments of order $\eta \in (1,2]$, and establish convergence rates whose exponents go to zero as $\eta \rightarrow 1$, we provide high-probability guarantees for a much broader class of nonlinearities and symmetric density noise, with convergence rates whose exponents are bounded away from zero, even when the noise has only a finite first moment. Moreover, in the case of strongly convex functions, we demonstrate analytically and numerically that clipping is not always the optimal nonlinearity, further underlining the value of our general framework.
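
The nonlinear SGD framework described in the abstract applies a nonlinearity $\Psi$ (e.g., component-wise clipping, sign, or joint norm clipping) to the stochastic gradient before taking the step. The sketch below is a minimal illustration of that update rule on a toy strongly convex quadratic; the step-size schedule, the Student-t noise model, the clipping thresholds, and all function names are assumptions made for this example, not the paper's exact experimental setup.

```python
import numpy as np

def component_clip(g, tau=1.0):
    # Component-wise clipping: truncate each coordinate to [-tau, tau].
    return np.clip(g, -tau, tau)

def component_sign(g):
    # Component-wise sign nonlinearity.
    return np.sign(g)

def joint_clip(g, tau=1.0):
    # Joint (norm) clipping: rescale the whole gradient if its norm exceeds tau.
    norm = np.linalg.norm(g)
    return g if norm <= tau else (tau / norm) * g

def nonlinear_sgd(grad_fn, x0, nonlinearity, steps=20_000, a=1.0, delta=0.75, seed=0):
    """Online nonlinear SGD: x_{t+1} = x_t - alpha_t * Psi(grad f(x_t) + noise)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for t in range(1, steps + 1):
        alpha_t = a / t**delta  # polynomially decaying step size (assumed schedule)
        # Symmetric heavy-tailed noise: Student-t with 1 < df < 2 has a finite
        # first moment but infinite variance, matching the regime in the abstract.
        noise = rng.standard_t(df=1.2, size=x.shape)
        g = grad_fn(x) + noise
        x = x - alpha_t * nonlinearity(g)
    return x  # last iterate

if __name__ == "__main__":
    # Toy strongly convex problem: f(x) = 0.5 * ||x - x_star||^2.
    x_star = np.ones(10)
    grad_fn = lambda x: x - x_star
    for name, psi in [("component clip", component_clip),
                      ("sign", component_sign),
                      ("joint clip", joint_clip)]:
        x_hat = nonlinear_sgd(grad_fn, np.zeros(10), psi)
        print(f"{name:15s} final error {np.linalg.norm(x_hat - x_star):.4f}")
```

Comparing the final errors across the three nonlinearities mirrors, in a very rough way, the paper's point that clipping is not always the best choice for a given problem and noise regime.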

Authors (6)
  1. Aleksandar Armacki (7 papers)
  2. Pranay Sharma (26 papers)
  3. Gauri Joshi (73 papers)
  4. Soummya Kar (147 papers)
  5. Dragana Bajovic (31 papers)
  6. Dusan Jakovetic (47 papers)
Citations (3)
