Learning from higher-order statistics, efficiently: hypothesis tests, random features, and neural networks
Abstract: Neural networks excel at discovering statistical patterns in high-dimensional data sets. In practice, higher-order cumulants, which quantify the non-Gaussian correlations between three or more variables, are particularly important for the performance of neural networks. But how efficient are neural networks at extracting features from higher-order cumulants? We study this question in the spiked cumulant model, where the statistician needs to recover a privileged direction or "spike" from the order-$p \ge 4$ cumulants of $d$-dimensional inputs. Existing literature established the presence of a wide statistical-to-computational gap in this problem. We deepen this line of work by deriving an exact formula for the norm of the likelihood ratio, which proves that statistical distinguishability requires $n \gtrsim d$ samples, while distinguishing the two distributions in polynomial time requires $n \gtrsim d^2$ samples for a wide class of algorithms, namely those covered by the low-degree conjecture. Numerical experiments show that neural networks do indeed learn to distinguish the two distributions with quadratic sample complexity, while "lazy" methods like random features are no better than random guessing in this regime. Our results show that neural networks extract information from higher-order correlations in the spiked cumulant model efficiently, and reveal a large gap between the amount of data required by neural networks and by random features to learn from higher-order cumulants.
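To make the setup concrete, below is a minimal numerical sketch of the experiment described in the abstract, assuming one convenient variant of the spiked cumulant model: spiked samples take the form $x = z + \beta g u$ with $z \sim \mathcal{N}(0, I_d)$, a Rademacher latent variable $g$, and a fixed unit-norm spike $u$, and are then whitened so that their covariance matches the null distribution $\mathcal{N}(0, I_d)$. The specific choices here (the Rademacher latent, the whitening convention, and the values of `beta`, `d`, `n`, and the width `k`) are illustrative assumptions, not the paper's exact protocol. Because both classes have identity covariance after whitening, any above-chance accuracy must come from cumulants of order four or higher.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def sample_spiked_cumulant(n, d, u, beta, rng):
    """Draw n whitened samples x = z + beta*g*u with z ~ N(0, I_d) and g
    Rademacher. After whitening, Cov(x) = I_d, so the spike u is visible
    only through cumulants of order >= 4 (here a negative excess kurtosis
    along u, since the Rademacher fourth cumulant is -2)."""
    z = rng.standard_normal((n, d))
    g = rng.choice([-1.0, 1.0], size=(n, 1))
    x = z + beta * g * u
    # Cov(x) = I + beta^2 uu^T, so shrink the u-component to restore identity covariance
    x += np.outer((x @ u) * (1.0 / np.sqrt(1.0 + beta**2) - 1.0), u)
    return x

def make_dataset(n, d, u, beta, rng):
    """n spiked samples (label 1) and n pure Gaussian samples (label 0)."""
    X = np.vstack([sample_spiked_cumulant(n, d, u, beta, rng),
                   rng.standard_normal((n, d))])
    y = np.r_[np.ones(n), np.zeros(n)]
    return X, y

rng = np.random.default_rng(0)
d, n, k, beta = 32, 4000, 1024, 2.0           # illustrative sizes, not the paper's
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
Xtr, ytr = make_dataset(n, d, u, beta, rng)
Xte, yte = make_dataset(n, d, u, beta, rng)

# "Lazy" baseline: fixed random ReLU features followed by logistic regression
W = rng.standard_normal((d, k)) / np.sqrt(d)
phi = lambda X: np.maximum(X @ W, 0.0)
rf = LogisticRegression(max_iter=2000).fit(phi(Xtr), ytr)
print(f"random-features test accuracy:  {rf.score(phi(Xte), yte):.3f}")

# Feature-learning counterpart: a trained two-layer network of the same width
mlp = MLPClassifier(hidden_layer_sizes=(k,), max_iter=500).fit(Xtr, ytr)
print(f"two-layer network test accuracy: {mlp.score(Xte, yte):.3f}")
```

At these toy sizes the asymptotic separation between $n \gtrsim d$ and $n \gtrsim d^2$ is only suggestive; sweeping $n$ against $d$ and $d^2$ while comparing the trained network to the fixed-feature baseline is the natural way to probe the gap numerically.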