How Deep Networks Learn Sparse and Hierarchical Data: the Sparse Random Hierarchy Model (2404.10727v2)

Published 16 Apr 2024 in stat.ML, cond-mat.dis-nn, and cs.LG

Abstract: Understanding what makes high-dimensional data learnable is a fundamental question in machine learning. On the one hand, it is believed that the success of deep learning lies in its ability to build a hierarchy of representations that become increasingly more abstract with depth, going from simple features like edges to more complex concepts. On the other hand, learning to be insensitive to invariances of the task, such as smooth transformations for image datasets, has been argued to be important for deep networks and it strongly correlates with their performance. In this work, we aim to explain this correlation and unify these two viewpoints. We show that by introducing sparsity to generative hierarchical models of data, the task acquires insensitivity to spatial transformations that are discrete versions of smooth transformations. In particular, we introduce the Sparse Random Hierarchy Model (SRHM), where we observe and rationalize that a hierarchical representation mirroring the hierarchical model is learnt precisely when such insensitivity is learnt, thereby explaining the strong correlation between the latter and performance. Moreover, we quantify how the sample complexity of CNNs learning the SRHM depends on both the sparsity and hierarchical structure of the task.

Authors (2)
  1. Umberto Tomasini (1 paper)
  2. Matthieu Wyart (89 papers)
Citations (5)

Summary

  • The paper shows that incorporating sparsity into generative hierarchical models makes the task naturally insensitive to discrete analogues of smooth transformations.
  • It demonstrates that weight sharing lets CNNs learn the task with a sample complexity that grows only quadratically with the task's sparsity, whereas locally connected networks without weight sharing pay a cost that grows exponentially with the depth of the hierarchy.
  • The findings offer practical insights into neural architecture design, emphasizing efficiency in processing high-dimensional, sparse data.

Exploring the Intersection of Sparsity, Hierarchy, and Invariance in Deep Learning

Introduction to Sparse Random Hierarchy Model (SRHM)

The ability of deep networks to learn high-dimensional data is a foundational question in modern machine learning. The success of such networks is often attributed to their capacity to build hierarchical representations, with each layer capturing increasingly abstract features. In parallel, the performance of deep models has been closely linked to their insensitivity to certain transformations of the input, such as smooth deformations of images. The Sparse Random Hierarchy Model (SRHM) connects these two threads. By incorporating sparsity into generative hierarchical models of data, it provides an analytical framework for explaining the correlation between a network's ability to ignore irrelevant variations of the input and its performance on the task.
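
To make the setup concrete, the following is a minimal sketch of how one might generate samples from a sparse hierarchical model in the spirit of the SRHM. The vocabulary size, branching factor, number of production rules per symbol, and the way uninformative "filler" positions are interleaved are illustrative assumptions, not the paper's exact construction.

```python
import random

# Illustrative sketch of a sparse hierarchical generative model in the spirit of
# the SRHM. All constants below are assumptions chosen for illustration.
VOCAB = 8      # informative symbols per level (also the number of classes here)
BRANCH = 2     # informative sub-features produced by each expansion
N_RULES = 4    # production rules available per symbol and level
S0 = 2         # uninformative "filler" slots added per patch (controls sparsity)
DEPTH = 3      # number of hierarchical levels
FILLER = "_"   # filler symbol carrying no label information

def make_grammar(seed=0):
    """Draw, once and for all, random production rules for every level and symbol."""
    rng = random.Random(seed)
    return [
        {sym: [tuple(rng.randrange(VOCAB) for _ in range(BRANCH))
               for _ in range(N_RULES)]
         for sym in range(VOCAB)}
        for _ in range(DEPTH)
    ]

def expand(symbols, rules, rng):
    """Replace each symbol by one of its rules, then scatter the informative
    sub-features (order preserved) among filler positions inside the patch."""
    patch_size = BRANCH + S0
    out = []
    for sym in symbols:
        if sym == FILLER:
            out.extend([FILLER] * patch_size)  # uninformative regions stay uninformative
            continue
        informative = rng.choice(rules[sym])
        slots = sorted(rng.sample(range(patch_size), BRANCH))
        patch = [FILLER] * patch_size
        for pos, feat in zip(slots, informative):
            patch[pos] = feat
        out.extend(patch)
    return out

def sample(grammar, label, seed=None):
    """Generate one input whose class is `label` (a top-level symbol)."""
    rng = random.Random(seed)
    symbols = [label]
    for rules in grammar:
        symbols = expand(symbols, rules, rng)
    return symbols

grammar = make_grammar()
x = sample(grammar, label=3, seed=42)
print(len(x), x[:8])   # (BRANCH + S0)**DEPTH = 64 positions, mostly fillers
```

Under this construction the class label is determined solely by which informative features appear in each patch and in what relative order; where they sit among the fillers is irrelevant, which is the source of the model's built-in insensitivity.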

Contributions and Findings

  • Integration of Sparsity and Hierarchical Invariance: The paper shows that introducing sparsity in hierarchical models leads to a natural insensitivity to discretized smooth transformations (a code sketch of such a label-preserving transformation follows this list). This aligns with the intuition that only a subset of features in the data is relevant for classification, while the rest can vary without affecting the outcome.
  • The Sparse Random Hierarchy Model (SRHM): A new model is introduced, demonstrating that hierarchical representations learned by networks coincide with the attainment of invariance to spatial transformations. This provides a quantitative basis for understanding the correlation between performance and invariance.
  • Quantification of Sample Complexity: The paper rigorously quantifies how the sample complexity of Convolutional Neural Networks (CNNs) learning the task depends on both the task's sparsity and its hierarchical structure. A notable outcome is that for Locally Connected Networks (LCNs), which lack weight sharing, the sample complexity grows exponentially with the depth of the hierarchy, whereas for CNNs the dependence on the sparsity level is only quadratic, pointing to a significant advantage of weight sharing in exploiting hierarchical sparsity.
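
Continuing the sketch above (and reusing its assumed names grammar, sample, BRANCH, S0, FILLER), the discrete analogue of a smooth deformation can be illustrated as a re-positioning of the informative features among the fillers within each lowest-level patch. By construction this never changes the label, so a network that solves the task must become insensitive to it.

```python
def reposition_fillers(x, rng=None):
    """Discrete analogue of a small smooth deformation: within every lowest-level
    patch, keep the informative symbols in order but move them to new positions
    among the fillers. Under the model sketched above this is label-preserving."""
    rng = rng or random.Random()
    patch_size = BRANCH + S0
    out = []
    for i in range(0, len(x), patch_size):
        patch = x[i:i + patch_size]
        informative = [s for s in patch if s != FILLER]
        new_patch = [FILLER] * patch_size
        slots = sorted(rng.sample(range(patch_size), len(informative)))
        for pos, s in zip(slots, informative):
            new_patch[pos] = s
        out.extend(new_patch)
    return out

x = sample(grammar, label=3, seed=42)
x_deformed = reposition_fillers(x, random.Random(7))
# x and x_deformed differ position by position yet carry the same class label;
# measuring a trained network's sensitivity to such moves is the kind of probe
# whose correlation with test performance the paper seeks to explain.
```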

Implications and Speculations

The findings provide pivotal insight into the inner workings of deep learning models, specifically CNNs, when faced with sparse, hierarchical data. The SRHM explains why models that abstract away irrelevant variations of the input tend to perform better, casting insensitivity to transformations not merely as a correlate of performance but as a signature of having learned the hierarchical structure of the data.

The clear separation in sample complexity between CNNs and LCNs underscores the inherent efficiency of weight sharing in handling sparse, hierarchical data—a principle that might inspire the design of future neural architectures optimized for such tasks.

Moving Forward

Looking ahead, the implications of the SRHM extend beyond theory and may influence architectural choices in deep learning. A finer understanding of how sparsity and hierarchy interact opens new pathways for designing models that are inherently more efficient and interpretable, and it prompts a reevaluation of how the architectural elements of neural networks handle invariance and hierarchy.

As the field moves forward, an intriguing avenue of exploration would be extending these concepts to unsupervised learning paradigms, investigating how models might discover and exploit hierarchical sparsity without explicit supervision. Additionally, the SRHM framework could further bridge the gap between how artificial models and biological systems process high-dimensional data, potentially informing more biologically plausible models of deep learning.

In conclusion, the Sparse Random Hierarchy Model (SRHM) contributes a significant piece to the puzzle of understanding deep learning, intertwining the principles of hierarchy, sparsity, and transformation invariance in a model that offers both theoretical and practical insights.