
A Geometric Modeling of Occam's Razor in Deep Learning (1905.11027v7)

Published 27 May 2019 in cs.LG and stat.ML

Abstract: Why do deep neural networks (DNNs) benefit from very high dimensional parameter spaces? The contrast between their huge parameter complexity and their stunning performance in practice is all the more intriguing, and cannot be explained by the standard theory of model selection for regular models. In this work, we propose a geometrically flavored information-theoretic approach to study this phenomenon. Namely, we introduce the locally varying dimensionality of the parameter space of neural network models by considering the number of significant dimensions of the Fisher information matrix, and model the parameter space as a manifold using the framework of singular semi-Riemannian geometry. We derive model complexity measures which yield short description lengths for deep neural network models based on their singularity analysis, thus explaining the good performance of DNNs despite their large number of parameters.


Summary

  • The paper introduces the concept of local dimensionality, measured via the Fisher Information Matrix, to explain the low description lengths in high-capacity deep neural networks.
  • It applies singular semi-Riemannian geometry to model DNN parameter spaces as lightlike neuromanifolds, revealing locally varying effective complexity.
  • The research derives a novel Minimum Description Length formulation that demonstrates how certain parameter directions lower model complexity, challenging traditional criteria.

A Geometric Modeling of Occam's Razor in Deep Learning

This paper presents a novel approach to understanding the performance of deep neural networks (DNNs) using an information-theoretic framework inspired by singular semi-Riemannian geometry. The paper addresses the puzzling question of why DNNs, despite their extensive parameter spaces, can achieve superior performance that defies traditional model complexity penalties such as the Bayesian Information Criterion. The authors introduce the concept of locally varying dimensionality of DNN parameter spaces, evaluated through the significant dimensions of the Fisher Information Matrix (FIM), and utilize this to explain the low description lengths of complex DNN models.
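
To fix notation (these are standard definitions; the cutoff ε is an expository device rather than the paper's exact criterion), the Fisher information matrix and the resulting local dimensionality can be written as

```latex
F(\theta) \;=\; \mathbb{E}_{x \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^{\top} \right],
\qquad
d(\theta) \;=\; \#\bigl\{\, i : \lambda_i\bigl(F(\theta)\bigr) > \varepsilon \bigr\},
```

where the λ_i are the eigenvalues of F(θ). For a regular statistical model, d(θ) equals the total number of parameters everywhere; for DNNs the FIM is typically rank-deficient, so d(θ) varies from point to point, which is exactly what the lightlike-neuromanifold picture is meant to capture.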

Key Contributions

  1. Singular Semi-Riemannian Geometry in DNNs: The paper applies singular semi-Riemannian geometry to the study of DNNs, suggesting that neural networks can be represented as "lightlike neuromanifolds." This perspective allows for an analysis in which the dimensionality is not constant but varies locally with the rank of the FIM.
  2. Local Dimensionality and Model Complexity: A key concept introduced is the 'local dimensionality,' which quantifies the rank of the FIM at a given parameter point. This allows for a nuanced understanding of model complexity that accounts for singularities in the manifold of neural network parameters.
  3. Model Complexity Measures: The paper derives a new Minimum Description Length (MDL) formulation for DNNs. It contrasts with traditional model selection criteria by asserting that high-dimensional DNNs have low effective complexity: certain parameter directions contribute "negative complexity," allowing models to generalize well without the heavy penalty that traditional criteria would assign to their many parameters.
  4. Spectral Analysis of FIM: The paper examines the spectral properties of the FIM, indicating that singularities (zero eigenvalues) and very small eigenvalues supply additional modeling capacity without increasing complexity; a small numerical sketch follows this list.
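
The sketch below makes the local-dimensionality idea concrete under stated assumptions: a tiny PyTorch classifier, the empirical FIM assembled from per-example score vectors, an arbitrary eigenvalue cutoff, and an illustrative per-direction description-length term. It is a minimal illustration of the mechanism, not the paper's exact construction.

```python
# Minimal sketch: empirical FIM spectrum, local dimensionality, and illustrative
# per-direction description-length terms for a tiny over-parameterized classifier.
# The network size, the cutoff eps, and the 0.5*log(n*lambda/(2*pi)) term are
# expository choices, not the paper's exact quantities.
import numpy as np
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_in, d_hidden, n_classes = 200, 5, 8, 3

# Toy data and a small network with k = 75 nominal parameters.
X = torch.randn(n, d_in)
y = torch.randint(0, n_classes, (n,))
model = torch.nn.Sequential(
    torch.nn.Linear(d_in, d_hidden), torch.nn.ReLU(),
    torch.nn.Linear(d_hidden, n_classes),
)
params = list(model.parameters())
k = sum(p.numel() for p in params)

# Empirical FIM: average outer product of per-example score vectors.
fim = torch.zeros(k, k)
for i in range(n):
    model.zero_grad()
    log_probs = F.log_softmax(model(X[i:i + 1]), dim=1)
    nll = -log_probs[0, y[i]]          # negative log-likelihood of one example
    nll.backward()
    g = torch.cat([p.grad.flatten() for p in params])
    fim += torch.outer(g, g) / n

eigvals = torch.linalg.eigvalsh(fim).clamp(min=0.0).numpy()

eps = 1e-6                             # cutoff for "significant" directions
local_dim = int((eigvals > eps).sum())
print(f"nominal parameters k = {k}, local dimensionality d(theta) = {local_dim}")

# Illustrative per-direction terms: directions with tiny eigenvalues contribute
# little or even negatively, which is the intuition behind "negative complexity".
terms = 0.5 * np.log(n * eigvals[eigvals > eps] / (2 * np.pi))
print(f"sum of per-direction terms = {terms.sum():.2f}; "
      f"{int((terms < 0).sum())} of {local_dim} significant directions are negative")
```

For networks of realistic size the dense k × k FIM is never formed explicitly; block-diagonal, Kronecker-factored, or sketched estimates of its spectrum are used instead, but the qualitative picture of many near-zero eigenvalues is the same one the paper's spectral discussion rests on.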

Theoretical and Practical Implications

  • Theoretical Insights: This work offers a fresh viewpoint for understanding DNNs by considering the intrinsic geometry of their parameter manifolds, with potential consequences for model selection and complexity theory in machine learning. The notion of negative complexity offers an explanation for the effectiveness of DNNs with very many parameters: not all parameters contribute equally to complexity.
  • Practical Significance: For practitioners, these findings imply that the design and evaluation of DNNs should account for the geometric and spectral properties of the parameter space: rather than focusing solely on minimizing parameter count, one should consider how parameters interact and how many directions actually contribute to the model's predictions (see the illustrative comparison below).
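
As an illustrative contrast in the spirit of the paper (a simplification, not its exact complexity measure): a BIC-style penalty charges every one of the k parameters, whereas a rank-aware penalty would charge only the d(θ̂) significant directions of the FIM at the fitted parameters, so a heavily over-parameterized but highly singular model can still admit a short description length.

```latex
\underbrace{\tfrac{k}{2}\log n}_{\text{BIC-style penalty: counts all parameters}}
\qquad\text{vs.}\qquad
\underbrace{\tfrac{d(\hat{\theta})}{2}\log n}_{\text{rank-aware penalty: counts only significant directions}}
```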

Speculations for Future AI Developments

The paper lays the groundwork for further exploring geometric and information-theoretic approaches to machine learning. Future research could extend these findings, exploring the manifold structures of other machine learning models and their implications for model complexity. Additionally, the integration of these geometric insights with algorithmic strategies, such as gradient descent methods, could enhance optimization techniques in high-dimensional parameter spaces.

In conclusion, "A Geometric Modeling of Occam's Razor in Deep Learning" provides a sophisticated perspective on DNN model complexity, challenging traditional views and opening new avenues for both theoretical and applied machine learning research. The paper's integration of singular semi-Riemannian geometry with DNN analysis offers a potentially transformative framework for understanding and optimizing complex models.
