SGD with Clipping is Secretly Estimating the Median Gradient (2402.12828v1)
Abstract: Several applications of stochastic optimization benefit from a robust estimate of the gradient: for example, distributed learning with corrupted nodes, training data containing large outliers, learning under privacy constraints, or heavy-tailed noise arising from the dynamics of the algorithm itself. Here we study SGD with robust gradient estimators based on estimating the median. We first consider computing the median gradient across samples and show that the resulting method can converge even under heavy-tailed, state-dependent noise. We then derive iterative methods, based on the stochastic proximal point method, for computing the geometric median and generalizations thereof. Finally, we propose an algorithm that estimates the median gradient across iterations, and find that several well-known methods, in particular different forms of clipping, are special cases of this framework.
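As a rough illustration of the two estimators sketched in the abstract, here is a minimal NumPy example. It is not the authors' code: the function names (`geometric_median`, `clipped_median_sgd`), the clipping radius `tau`, and the toy least-squares problem with Cauchy noise are all illustrative assumptions. The sketch shows (a) the geometric median of per-sample gradients computed by Weiszfeld iteration, and (b) a running gradient estimate updated by a clipped difference, which can be read as a stochastic proximal point step toward the geometric median of the gradient distribution.

```python
# Minimal sketch (not the paper's implementation) of median-based gradient
# estimation; problem setup and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)


def geometric_median(points, iters=100, eps=1e-8):
    """Weiszfeld iteration for the geometric median of the rows of `points`
    (the 'median gradient across samples' estimator)."""
    m = points.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(points - m, axis=1) + eps
        w = 1.0 / d
        m = (w[:, None] * points).sum(axis=0) / w.sum()
    return m


def clip(v, tau):
    """Rescale `v` to norm at most `tau` (standard gradient clipping)."""
    n = np.linalg.norm(v)
    return v if n <= tau else (tau / n) * v


def clipped_median_sgd(grad_fn, x0, lr=0.1, tau=1.0, steps=500):
    """SGD driven by a running estimate m of the median gradient.
    The update m <- m + clip(g - m, tau) can be read as a stochastic
    proximal point step on E||m - g||, i.e. it tracks the geometric
    median of the gradient distribution (the 'clipping as median
    estimation' view)."""
    x, m = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        g = grad_fn(x)              # one (possibly heavy-tailed) stochastic gradient
        m = m + clip(g - m, tau)    # move the estimate toward the sample, capped by tau
        x = x - lr * m              # descend along the robust gradient estimate
    return x


# Toy heavy-tailed problem (assumed for the demo): least squares whose
# stochastic gradients are corrupted by Cauchy noise (infinite variance).
A = rng.normal(size=(200, 5))
b = A @ np.ones(5)


def noisy_grad(x):
    i = rng.integers(len(b))
    return A[i] * (A[i] @ x - b[i]) + rng.standard_cauchy(5)


x_hat = clipped_median_sgd(noisy_grad, x0=np.zeros(5))
print("distance to solution:", np.linalg.norm(x_hat - np.ones(5)))

# Per-batch alternative: aggregate a batch of per-sample gradients with the
# geometric median instead of the mean.
batch = np.stack([noisy_grad(np.zeros(5)) for _ in range(32)])
print("geometric-median gradient:", geometric_median(batch))
```

In this reading, the clipping radius `tau` plays the role of the proximal step size: a smaller radius yields a more conservative, more robust update of the gradient estimate, at the cost of tracking changes in the true gradient more slowly.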