Beyond Implicit Bias: The Insignificance of SGD Noise in Online Learning
Abstract: The success of SGD in deep learning has been ascribed by prior work to the implicit bias induced by finite batch sizes ("SGD noise"). While that work focused on offline learning (i.e., multiple-epoch training), we study the impact of SGD noise on online (i.e., single-epoch) learning. Through an extensive empirical analysis of image and language data, we demonstrate that small batch sizes do not confer any implicit-bias advantage in online learning. In contrast to offline learning, the benefits of SGD noise in online learning are strictly computational, facilitating more cost-effective gradient steps. This suggests that SGD in the online regime can be construed as taking noisy steps along the "golden path" of the noiseless gradient descent algorithm. We study this hypothesis and provide supporting evidence in loss and function space. Our findings challenge the prevailing understanding of SGD and offer novel insights into its role in online learning.
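The "golden path" claim can be made concrete with a toy experiment (a minimal, hypothetical sketch, not from the paper: synthetic linear regression with NumPy, where every batch is freshly sampled so no example is ever revisited, mimicking the single-epoch regime; all names and hyperparameters are illustrative). If small-batch noise carries no implicit bias in this regime, a noisy small-batch run and a near-noiseless large-batch run given the same number of steps should land near the same solution:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w_true = rng.normal(size=d)  # ground-truth regression weights

def stream_batch(n):
    """Draw a fresh batch of (X, y) pairs; data is never reused (online regime)."""
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

def online_sgd(batch_size, n_steps, lr=0.05):
    """Plain SGD on streaming data; returns the trajectory of iterates."""
    w = np.zeros(d)
    traj = [w.copy()]
    for _ in range(n_steps):
        X, y = stream_batch(batch_size)
        grad = X.T @ (X @ w - y) / batch_size  # minibatch least-squares gradient
        w -= lr * grad
        traj.append(w.copy())
    return np.array(traj)

# Same number of gradient steps, very different noise levels.
noisy = online_sgd(batch_size=8, n_steps=500)     # high SGD noise
golden = online_sgd(batch_size=512, n_steps=500)  # near-noiseless "golden path"

# Both endpoints should sit close to w_true: the noise perturbs the path,
# not the destination, consistent with the no-implicit-bias claim online.
print(np.linalg.norm(noisy[-1] - w_true))
print(np.linalg.norm(golden[-1] - w_true))
```

In this linear setting the expected gradient is the same at every batch size, so the small-batch run follows the large-batch trajectory plus zero-mean fluctuations; the interesting empirical question the paper addresses is whether the same picture holds for deep networks on real data.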