
Learning from higher-order statistics, efficiently: hypothesis tests, random features, and neural networks

Published 22 Dec 2023 in stat.ML, cond-mat.stat-mech, and cs.LG | arXiv:2312.14922v4

Abstract: Neural networks excel at discovering statistical patterns in high-dimensional data sets. In practice, higher-order cumulants, which quantify the non-Gaussian correlations between three or more variables, are particularly important for the performance of neural networks. But how efficient are neural networks at extracting features from higher-order cumulants? We study this question in the spiked cumulant model, where the statistician needs to recover a privileged direction or "spike" from the order-$p\ge 4$ cumulants of $d$-dimensional inputs. Existing literature established the presence of a wide statistical-to-computational gap in this problem. We deepen this line of work by deriving an exact formula for the norm of the likelihood ratio, which proves that statistical distinguishability requires $n\gtrsim d$ samples, while distinguishing the two distributions in polynomial time requires $n \gtrsim d^2$ samples for a wide class of algorithms, i.e. those covered by the low-degree conjecture. Numerical experiments show that neural networks do indeed learn to distinguish the two distributions with quadratic sample complexity, while "lazy" methods like random features are no better than random guessing in this regime. Our results show that neural networks extract information from higher-order correlations in the spiked cumulant model efficiently, and reveal a large gap in the amount of data required by neural networks and random features to learn from higher-order cumulants.
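To make "information hidden in higher-order cumulants" concrete, the sketch below constructs a simple distribution of this flavor (an illustrative variant, not necessarily the paper's exact model): along a planted unit direction $u$, the Gaussian component of the input is replaced by a Rademacher variable. The mean and covariance then match an isotropic Gaussian exactly, so second-order methods see nothing, while the fourth cumulant along $u$ differs (excess kurtosis $-2$ vs. $0$). All variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 20000

# Planted "spike" direction u (unit norm).
u = rng.standard_normal(d)
u /= np.linalg.norm(u)

def sample_null(n):
    """Isotropic Gaussian inputs: all cumulants of order >= 3 vanish."""
    return rng.standard_normal((n, d))

def sample_spiked(n):
    """Swap the Gaussian component along u for a Rademacher variable.
    Mean and covariance match the null exactly; the two laws differ
    only in cumulants of order >= 4 along u."""
    g = rng.standard_normal((n, d))
    g -= np.outer(g @ u, u)              # remove the component along u
    s = rng.choice([-1.0, 1.0], size=n)  # Rademacher spike, variance 1
    return g + np.outer(s, u)

x0, x1 = sample_null(n), sample_spiked(n)

def excess_kurtosis(v):
    """Empirical excess kurtosis: 0 for Gaussian, -2 for Rademacher."""
    return np.mean(v**4) / np.mean(v**2) ** 2 - 3.0

# Covariances agree up to sampling noise: PCA/linear methods see nothing.
print(np.abs(np.cov(x0.T) - np.cov(x1.T)).max())

# A fourth-order statistic along u cleanly separates the two laws.
print(excess_kurtosis(x0 @ u))   # near 0 (Gaussian)
print(excess_kurtosis(x1 @ u))   # near -2 (Rademacher)
```

Of course, this oracle statistic assumes $u$ is known; the hardness results above concern how many samples are needed to find such a direction, which is where the $n \gtrsim d$ vs. $n \gtrsim d^2$ gap between statistical and efficient algorithms appears.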
