Phase transitions in the mini-batch size for sparse and dense two-layer neural networks
Abstract: The use of mini-batches of data in training artificial neural networks is nowadays very common. Despite its broad usage, quantitative theories of how large or small the optimal mini-batch size should be are still missing. This work presents a systematic attempt at understanding the role of the mini-batch size in training two-layer neural networks. Working in the teacher-student scenario, with a sparse teacher, and focusing on tasks of different complexity, we quantify the effects of changing the mini-batch size $m$. We find that the generalization performance of the student often depends strongly on $m$ and may undergo sharp phase transitions at a critical value $m_c$, such that for $m<m_c$ the training process fails, while for $m>m_c$ the student learns the teacher perfectly or generalizes very well. Phase transitions are induced by collective phenomena first discovered in statistical mechanics and later observed in many fields of science. Observing a phase transition by varying the mini-batch size across different architectures raises several questions about the role of this hyperparameter in the neural network learning process.
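To make the experimental protocol concrete, below is a minimal sketch of the kind of sweep described in the abstract: a two-layer (committee-machine) teacher with sparse weights generates labels, a student of the same architecture is trained by plain mini-batch SGD, and the generalization error is recorded as a function of the mini-batch size $m$. All choices here (tanh activation, ±1 sparse teacher weights, quadratic loss, the sizes, sparsity level, and learning rate) are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Illustrative sizes (assumed, not taken from the paper) ---
N, K = 100, 3          # input dimension, number of hidden units
P = 20_000             # number of training examples
sparsity = 0.2         # fraction of nonzero teacher weights (assumed)

def act(x):
    return np.tanh(x)  # hidden activation (the paper's choice may differ)

def forward(W, X):
    # Two-layer committee machine: y(x) = sum_k act(w_k . x / sqrt(N))
    return act(W @ X.T / np.sqrt(N)).sum(axis=0)

# Sparse teacher: random +-1 weights, most entries set to zero
W_teacher = rng.choice([-1.0, 1.0], size=(K, N))
W_teacher *= (rng.random((K, N)) < sparsity)

X = rng.standard_normal((P, N))
y = forward(W_teacher, X)

def train_student(m, lr=0.05, epochs=50):
    """Plain mini-batch SGD on the quadratic loss with batch size m."""
    W = rng.standard_normal((K, N))
    for _ in range(epochs):
        perm = rng.permutation(P)
        for start in range(0, P, m):
            idx = perm[start:start + m]
            xb, yb = X[idx], y[idx]
            pre = W @ xb.T / np.sqrt(N)           # K x batch pre-activations
            err = act(pre).sum(axis=0) - yb       # residuals on the batch
            # gradient of 0.5 * mean(err^2) with respect to W
            grad = ((1 - act(pre) ** 2) * err) @ xb / (np.sqrt(N) * len(idx))
            W -= lr * grad
    return W

def gen_error(W, n_test=5_000):
    Xt = rng.standard_normal((n_test, N))
    return np.mean((forward(W, Xt) - forward(W_teacher, Xt)) ** 2)

# Sweep the mini-batch size m and look for a sharp change in generalization
for m in [1, 2, 5, 10, 50, 200, 1000]:
    W_student = train_student(m)
    print(f"m = {m:5d}   generalization error = {gen_error(W_student):.4f}")
```

In this toy version the phase transition, if present, would show up as an abrupt drop of the generalization error once $m$ exceeds some critical value; locating it precisely would require the larger sizes and averaging over realizations used in the paper.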