A Geometric Modeling of Occam's Razor in Deep Learning (1905.11027v7)
Abstract: Why do deep neural networks (DNNs) benefit from very high-dimensional parameter spaces? Their huge parameter complexity versus their stunning performance in practice is all the more intriguing, and cannot be explained by the standard theory of model selection for regular models. In this work, we propose a geometrically flavored information-theoretic approach to study this phenomenon. Namely, we introduce the locally varying dimensionality of the parameter space of neural network models by considering the number of significant dimensions of the Fisher information matrix, and we model the parameter space as a manifold using the framework of singular semi-Riemannian geometry. Based on this singularity analysis, we derive model complexity measures that yield short description lengths for deep neural network models, thus explaining the good performance of DNNs despite their large number of parameters.
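The abstract's central quantity is the number of significant dimensions of the Fisher information matrix (FIM), used as a locally varying dimensionality of the parameter space. Below is a minimal sketch of that idea, not the paper's exact procedure: it estimates the FIM of a hypothetical toy PyTorch classifier and counts the eigenvalues above a small cutoff as a proxy for the local dimensionality. The model, data, and the `threshold` value are illustrative assumptions.

```python
# Sketch: count "significant" eigenvalues of the Fisher information matrix
# of a toy classifier (hypothetical model and threshold, for illustration only).
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy model and inputs; any differentiable classifier would do.
model = torch.nn.Sequential(
    torch.nn.Linear(5, 8), torch.nn.ReLU(), torch.nn.Linear(8, 3)
)
x = torch.randn(256, 5)

params = list(model.parameters())
n_params = sum(p.numel() for p in params)

# F(theta) = E_x E_{y ~ p(y|x,theta)} [ grad log p(y|x,theta) grad log p(y|x,theta)^T ]
fisher = torch.zeros(n_params, n_params)
for xi in x:
    log_probs = F.log_softmax(model(xi.unsqueeze(0)), dim=-1).squeeze(0)
    probs = log_probs.exp().detach()
    for c in range(log_probs.numel()):
        # Score vector: gradient of the log-likelihood of class c.
        grads = torch.autograd.grad(log_probs[c], params, retain_graph=True)
        g = torch.cat([gr.reshape(-1) for gr in grads])
        fisher += probs[c] * torch.outer(g, g)
fisher /= x.shape[0]

# Locally varying dimensionality: eigenvalues above a (hypothetical) cutoff.
eigvals = torch.linalg.eigvalsh(fisher)
threshold = 1e-6
local_dim = int((eigvals > threshold).sum())
print(f"total parameters: {n_params}, significant Fisher dimensions: {local_dim}")
```

For over-parameterized networks one typically observes `local_dim` far below `n_params`, which is the kind of gap the paper's complexity measures exploit to obtain short description lengths.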