The Underlying Scaling Laws and Universal Statistical Structure of Complex Datasets (2306.14975v3)
Abstract: We study universal traits which emerge both in real-world complex datasets and in artificially generated ones. Our approach is to analogize data to a physical system and employ tools from statistical physics and Random Matrix Theory (RMT) to reveal their underlying structure. We focus on the feature-feature covariance matrix, analyzing both its local and global eigenvalue statistics. Our main observations are: (i) the power-law scalings that the bulk of its eigenvalues exhibit are vastly different for uncorrelated normally distributed data than for real-world data; (ii) this scaling behavior can be completely modeled by generating Gaussian data with long-range correlations; (iii) both generated and real-world datasets lie in the same universality class from the RMT perspective, as chaotic rather than integrable systems; (iv) the expected RMT statistical behavior already manifests for empirical covariance matrices at dataset sizes significantly smaller than those conventionally used for real-world training, and can be related to the number of samples required to approximate the population power-law scaling behavior; (v) the Shannon entropy is correlated with the local RMT structure and the eigenvalue scaling, is substantially smaller in strongly correlated datasets than in uncorrelated ones, and requires fewer samples to reach the distribution entropy. These findings show that, with sufficient sample size, the Gram matrix of natural image datasets can be well approximated by a Wishart random matrix with a simple covariance structure, opening the door to rigorous studies of neural network dynamics and generalization which rely on the data Gram matrix.
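The abstract compresses a concrete pipeline: form the feature-feature covariance, read off its global (bulk power-law) and local (level-spacing) eigenvalue statistics, and compare against Gaussian data with long-range correlations. The sketch below is a minimal illustration of that pipeline, not the paper's code; the Toeplitz correlation kernel, the exponent `alpha`, and the sizes `d`, `n` are assumptions chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 4096  # feature dimension, number of samples (both illustrative)

# Population covariance with long-range, power-law-decaying feature
# correlations: a simple Toeplitz choice C_ij = (1 + |i - j|)^(-alpha).
alpha = 0.7  # illustrative correlation-decay exponent
idx = np.arange(d)
C = (1.0 + np.abs(idx[:, None] - idx[None, :])) ** (-alpha)

# Draw X ~ N(0, C) through a symmetric square root of C (eigenvalues are
# clipped at zero so the construction is robust to tiny negative roundoff).
w, V = np.linalg.eigh(C)
C_half = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
X = rng.standard_normal((n, d)) @ C_half

# Empirical feature-feature covariance (a Wishart-type matrix) and spectrum.
Sigma = X.T @ X / n
eigs = np.sort(np.linalg.eigvalsh(Sigma))

# Global statistics: log-log slope of the bulk of the spectrum, i.e. the
# exponent of a power law lambda_k ~ k^(-s), with the spectrum edges
# cropped, where finite-size effects dominate.
desc = eigs[::-1]
k = np.arange(1, d + 1)
bulk = slice(10, d // 2)
slope, _ = np.polyfit(np.log(k[bulk]), np.log(desc[bulk]), 1)
print(f"bulk power-law exponent: {slope:.2f}")

# Local statistics: consecutive level-spacing ratios
# r_n = min(s_n, s_{n+1}) / max(s_n, s_{n+1}); <r> is approximately 0.536
# for the GOE (chaotic) class and 0.386 for Poisson (integrable) statistics.
s = np.diff(eigs)
r = np.minimum(s[:-1], s[1:]) / np.maximum(s[:-1], s[1:])
print(f"mean spacing ratio <r>: {r.mean():.3f}")

# Shannon entropy of the normalized spectrum, p_k = lambda_k / sum(lambda).
p = eigs / eigs.sum()
H = -np.sum(p * np.log(p))
print(f"spectral Shannon entropy: {H:.3f}  (log d = {np.log(d):.3f})")
```

Spacing ratios are used here because, unlike raw spacing distributions, they require no unfolding of the spectrum; replacing C with the identity reproduces the uncorrelated Gaussian baseline that the abstract contrasts against.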