Theoretical Analysis of Inductive Biases in Deep Convolutional Networks (2305.08404v2)
Abstract: In this paper, we provide a theoretical analysis of the inductive biases in convolutional neural networks (CNNs). We start by examining the universality of CNNs, i.e., the ability to approximate any continuous function. We prove that a depth of $\mathcal{O}(\log d)$ suffices for deep CNNs to achieve this universality, where $d$ is the input dimension. Additionally, we establish that learning sparse functions with CNNs requires only $\widetilde{\mathcal{O}}(\log^2 d)$ samples, indicating that deep CNNs can efficiently capture {\em long-range} sparse correlations. These results are made possible through a novel combination of multichanneling and downsampling as the network depth increases. We also delve into the distinct roles of weight sharing and locality in CNNs. To this end, we compare the performance of CNNs, locally-connected networks (LCNs), and fully-connected networks (FCNs) on a simple regression task, where LCNs can be viewed as CNNs without weight sharing. On the one hand, we prove that LCNs require ${\Omega}(d)$ samples while CNNs need only $\widetilde{\mathcal{O}}(\log^2 d)$ samples, highlighting the critical role of weight sharing. On the other hand, we prove that FCNs require $\Omega(d^2)$ samples, whereas LCNs need only $\widetilde{\mathcal{O}}(d)$ samples, underscoring the importance of locality. These provable separations quantify the difference between the two biases, and the key observation behind our proof is that weight sharing and locality break different symmetries in the learning process.
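The architectural distinctions in the abstract can be made concrete with a short sketch. Below is a minimal PyTorch illustration, not the paper's construction: the kernel size 2, stride-2 downsampling, channel doubling, and layer widths are illustrative assumptions. It builds a depth-$\mathcal{O}(\log d)$ CNN that combines downsampling with multichanneling, the corresponding LCN (same topology but with unshared filters), and an FCN baseline with neither locality nor weight sharing.

```python
# Minimal sketch (assumed hyperparameters, not the authors' code) contrasting
# a depth-O(log d) CNN with downsampling + multichanneling, an LCN
# (CNN without weight sharing), and an FCN.
import torch
import torch.nn as nn


class LocallyConnected1d(nn.Module):
    """1-D locally connected layer: same receptive fields as a kernel-2,
    stride-2 convolution, but an independent filter per output position
    (no weight sharing)."""

    def __init__(self, in_channels, out_channels, in_length):
        super().__init__()
        out_length = in_length // 2
        self.weight = nn.Parameter(
            torch.randn(out_length, out_channels, in_channels * 2) * 0.1
        )

    def forward(self, x):                       # x: (batch, C_in, L)
        patches = x.unfold(2, 2, 2)             # (batch, C_in, L//2, 2)
        patches = patches.permute(0, 2, 1, 3).flatten(2)  # (batch, L//2, C_in*2)
        return torch.einsum("blk,lok->bol", patches, self.weight)


def deep_cnn(d, base_channels=4):
    """Depth-O(log d) CNN: each stride-2 conv halves the length and doubles
    the channel count, so about log2(d) layers reduce the input to length 1."""
    layers, length, c_in = [], d, 1
    while length > 1:
        c_out = base_channels if c_in == 1 else 2 * c_in
        layers += [nn.Conv1d(c_in, c_out, kernel_size=2, stride=2), nn.ReLU()]
        length //= 2
        c_in = c_out
    layers += [nn.Flatten(), nn.Linear(c_in, 1)]
    return nn.Sequential(*layers)


def deep_lcn(d, base_channels=4):
    """Same topology as deep_cnn, but with unshared (locally connected) filters."""
    layers, length, c_in = [], d, 1
    while length > 1:
        c_out = base_channels if c_in == 1 else 2 * c_in
        layers += [LocallyConnected1d(c_in, c_out, length), nn.ReLU()]
        length //= 2
        c_in = c_out
    layers += [nn.Flatten(), nn.Linear(c_in, 1)]
    return nn.Sequential(*layers)


def fcn(d, width=256):
    """Fully connected baseline: no locality and no weight sharing."""
    return nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))


if __name__ == "__main__":
    d = 64                                      # input dimension (a power of 2 here)
    x = torch.randn(8, 1, d)
    print(deep_cnn(d)(x).shape, deep_lcn(d)(x).shape, fcn(d)(x.flatten(1)).shape)
```

With $d = 64$, the CNN and LCN stacks above contain $\log_2 d = 6$ downsampling layers, matching the $\mathcal{O}(\log d)$ depth in the universality result; the CNN reuses one filter per layer while the LCN carries a separate filter at every position, which is exactly the weight-sharing distinction behind the sample-complexity separation stated in the abstract.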