Smoothed Gradient Clipping and Error Feedback for Decentralized Optimization under Symmetric Heavy-Tailed Noise (2310.16920v3)

Published 25 Oct 2023 in math.OC and cs.DC

Abstract: Motivated by the understanding and analysis of large-scale machine learning under heavy-tailed gradient noise, we study decentralized optimization with gradient clipping, in which clipping operators are applied to the gradients or gradient estimates computed at local nodes before further processing. While vanilla gradient clipping has proven effective in mitigating the impact of heavy-tailed gradient noise in non-distributed setups, it incurs a bias that causes convergence issues in heterogeneous distributed settings. To address this bias, we develop a smoothed clipping operator and propose a decentralized gradient method equipped with an error feedback mechanism, i.e., the clipping operator is applied to the difference between a local gradient estimator and the local stochastic gradient. We consider strongly convex and smooth local functions under symmetric heavy-tailed gradient noise that may not have finite moments of order greater than one. We show that the proposed decentralized gradient clipping method achieves a mean-square error (MSE) convergence rate of $O(1/t^{\delta})$, $\delta \in (0, 2/5)$, where the exponent $\delta$ is independent of the existence of gradient noise moments of order $\alpha > 1$ and is lower bounded by a constant that depends on the condition number. To the best of our knowledge, this is the first MSE convergence result for decentralized gradient clipping under heavy-tailed noise without assuming bounded gradients. Numerical experiments validate our theoretical findings.
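To make the error-feedback clipping mechanism described in the abstract concrete, here is a minimal sketch, not the paper's implementation: it assumes a standard consensus-plus-gradient decentralized update, and the specific smoothed operator $\tau v /(\tau + \|v\|)$, the names `smoothed_clip` and `decentralized_step`, the mixing matrix, and the diminishing step size in the toy run below are all illustrative assumptions inferred from the abstract's description (clipping applied to the difference between a local gradient estimator and the local stochastic gradient).

```python
import numpy as np

def smoothed_clip(v, tau):
    # Smooth surrogate for hard clipping: scale v by tau / (tau + ||v||), so the
    # output norm stays below tau and the map is continuous in v. This specific
    # form is an illustrative assumption, not necessarily the paper's operator.
    return (tau / (tau + np.linalg.norm(v))) * v

def decentralized_step(x, grad_est, stoch_grads, W, step_size, tau):
    # x:           (n, d) array of local iterates, one row per node
    # grad_est:    (n, d) array of local gradient estimators (error-feedback state)
    # stoch_grads: (n, d) array of freshly sampled local stochastic gradients
    # W:           (n, n) doubly stochastic mixing matrix of the network
    n = x.shape[0]
    # Error feedback: clip the *difference* between the stochastic gradient and
    # the current estimator, then accumulate it into the estimator.
    new_est = np.stack([
        grad_est[i] + smoothed_clip(stoch_grads[i] - grad_est[i], tau)
        for i in range(n)
    ])
    # Consensus mixing with neighbors plus a local descent step using the estimator.
    new_x = W @ x - step_size * new_est
    return new_x, new_est

# Toy run: 3 nodes, quadratic local losses, symmetric heavy-tailed (Cauchy) noise.
rng = np.random.default_rng(0)
n, d = 3, 2
W = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]])
targets = rng.normal(size=(n, d))        # node i minimizes ||x - targets[i]||^2 / 2
x = np.zeros((n, d))
g = np.zeros((n, d))
for t in range(1, 501):
    noise = rng.standard_cauchy(size=(n, d))   # infinite-variance gradient noise
    stoch = (x - targets) + noise
    x, g = decentralized_step(x, g, stoch, W, step_size=1.0 / t, tau=1.0)
```

The point of the sketch is the state `g`: because only the increment (the clipped difference) is bounded, the estimator itself can still converge to the true local gradient, which is how error feedback avoids the bias that clipping the raw heterogeneous gradients would introduce.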
