Solution space and storage capacity of fully connected two-layer neural networks with generic activation functions (2404.13404v2)
Abstract: The storage capacity of a binary classification model is the maximum number of random input-output pairs per parameter that the model can learn. It is one of the indicators of the expressive power of machine learning models and is important for comparing the performance of various models. In this study, we analyze the structure of the solution space and the storage capacity of fully connected two-layer neural networks with general activation functions using the replica method from statistical physics. Our results demonstrate that the storage capacity per parameter remains finite even with infinite width and that the weights of the network exhibit negative correlations, leading to a 'division of labor'. In addition, we find that increasing the dataset size triggers a phase transition at a certain transition point where the permutation symmetry of weights is broken, resulting in the solution space splitting into disjoint regions. We identify the dependence of this transition point and the storage capacity on the choice of activation function. These findings contribute to understanding the influence of activation functions and the number of parameters on the structure of the solution space, potentially offering insights for selecting appropriate architectures based on specific objectives.
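For orientation, the following is a minimal sketch of the setup described in the abstract; the precise normalizations, and whether the second-layer weights $a_k$ are trained, are assumptions made here for illustration rather than details taken from the paper. A fully connected two-layer binary classifier with $N$ inputs, $K$ hidden units, and a generic activation $\varphi$ computes

$$
\hat{y}(\boldsymbol{x}) \;=\; \operatorname{sgn}\!\Bigl(\sum_{k=1}^{K} a_k\,\varphi\bigl(\boldsymbol{w}_k\cdot\boldsymbol{x}/\sqrt{N}\bigr)\Bigr),
\qquad
\alpha_c \;=\; \frac{P_{\max}}{NK},
$$

where $P_{\max}$ is the largest number of random input-output pairs $(\boldsymbol{x}^{\mu}, y^{\mu})$, $\mu = 1,\dots,P$, that the weights can fit with zero classification error, and $\alpha_c$ is the storage capacity per parameter referred to in the abstract (counting the $NK$ first-layer weights and neglecting the $K$ second-layer weights, which are subleading at large $N$). The abstract's statement that $\alpha_c$ stays finite as $K \to \infty$ is expressed in this per-parameter normalization.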