Singular-limit analysis of gradient descent with noise injection (2404.12293v1)

Published 18 Apr 2024 in cs.LG and math.PR

Abstract: We study the limiting dynamics of a large class of noisy gradient descent systems in the overparameterized regime. In this regime the set of global minimizers of the loss is large, and when initialized in a neighbourhood of this zero-loss set a noisy gradient descent algorithm slowly evolves along this set. In some cases this slow evolution has been related to better generalisation properties. We characterize this evolution for the broad class of noisy gradient descent systems in the limit of small step size. Our results show that the structure of the noise affects not just the form of the limiting process, but also the time scale at which the evolution takes place. We apply the theory to Dropout, label noise and classical SGD (minibatching) noise, and show that these evolve on two different time scales. Classical SGD even yields a trivial evolution on both time scales, implying that additional noise is required for regularization. The results are inspired by the training of neural networks, but the theorems apply to noisy gradient descent of any loss that has a non-trivial zero-loss set.
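
To make the setting concrete, here is a minimal numerical sketch, not taken from the paper: noisy gradient descent on the toy overparameterized loss L(a, b) = 0.5(ab - 1)^2, whose zero-loss set {ab = 1} is a curve rather than an isolated point. The step size eta, noise level sigma, iteration count, and the label-noise injection scheme are all illustrative assumptions chosen so that the slow motion along the zero-loss set is visible.

```python
import numpy as np

# Minimal sketch (not the paper's construction): gradient descent on the
# overparameterized loss L(a, b) = 0.5 * (a*b - 1)^2. Its zero-loss set
# {a*b = 1} is a curve, so there is room to move while staying at zero loss.
# Label noise perturbs the target at every step; all parameter values are
# illustrative choices, not quantities from the paper.

rng = np.random.default_rng(0)
eta, sigma, steps = 0.02, 0.5, 100_000

for label_noise in (False, True):
    a, b = 2.0, 0.5                                 # start exactly on the zero-loss set (a*b = 1)
    for _ in range(steps):
        target = 1.0 + sigma * rng.standard_normal() if label_noise else 1.0
        r = a * b - target                          # residual of the "model" a*b on the (noisy) target
        a, b = a - eta * r * b, b - eta * r * a     # one gradient step on 0.5 * r^2
    print(f"label noise {label_noise}:  a = {a:.3f}, b = {b:.3f}, a*b = {a*b:.3f}")

# Expected behaviour: without noise the iterate never moves, because the gradient
# vanishes on the zero-loss set. With label noise it fluctuates around the set and
# slowly drifts along it toward the balanced point a = b = 1, which minimizes the
# Hessian trace (the "flattest" zero-loss point). Plain minibatch (classical SGD)
# noise would also vanish at interpolating minima, matching the abstract's remark
# that it yields only a trivial evolution.
```

The drift toward a = b = 1 emerges on a much slower time scale than the noisy fluctuations around {ab = 1}, which is the kind of time-scale separation the abstract refers to; the parameters above are deliberately large so that the effect shows up within a short run.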
