Bottleneck Structure in Learned Features: Low-Dimension vs Regularity Tradeoff (2305.19008v3)
Abstract: Previous work has shown that DNNs with large depth $L$ and $L_{2}$-regularization are biased towards learning low-dimensional representations of the inputs, which can be interpreted as minimizing a notion of rank $R^{(0)}(f)$ of the learned function $f$, conjectured to be the Bottleneck rank. We compute finite-depth corrections to this result, revealing a measure $R^{(1)}$ of regularity which bounds the pseudo-determinant of the Jacobian $\left|Jf(x)\right|_{+}$ and is subadditive under composition and addition. This formalizes a balance between learning low-dimensional representations and minimizing complexity/irregularity in the feature maps, allowing the network to learn the 'right' inner dimension. Finally, we prove the conjectured bottleneck structure in the learned features as $L\to\infty$: for large depths, almost all hidden representations are approximately $R^{(0)}(f)$-dimensional, and almost all weight matrices $W_{\ell}$ have $R^{(0)}(f)$ singular values close to 1 while the others are $O(L^{-\frac{1}{2}})$. Interestingly, the use of large learning rates is required to guarantee an order $O(L)$ NTK, which in turn guarantees infinite-depth convergence of the representations of almost all layers.
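The bottleneck structure described above can be probed empirically. The sketch below is not the paper's code: it trains a deep, $L_{2}$-regularized (weight-decayed) MLP on a synthetic task whose target factors through a 2-dimensional map, then prints the singular values of each weight matrix, where one would expect a few values near 1 and the rest small. The widths, depth, learning rate, and weight decay are illustrative assumptions and may need tuning.

```python
# Minimal sketch (assumed setup, not the paper's experiments): inspect the
# singular spectrum of the weight matrices of a deep MLP trained with
# weight decay, to look for the bottleneck structure described above.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_in, d_hidden, d_out, depth = 10, 64, 10, 12
n_samples = 512

# Synthetic target with inner dimension 2: x -> g(h(x)) with h: R^10 -> R^2.
A = torch.randn(d_in, 2)
B = torch.randn(2, d_out)
X = torch.randn(n_samples, d_in)
Y = torch.tanh(X @ A) @ B

layers = [nn.Linear(d_in, d_hidden), nn.ReLU()]
for _ in range(depth - 2):
    layers += [nn.Linear(d_hidden, d_hidden), nn.ReLU()]
layers += [nn.Linear(d_hidden, d_out)]
net = nn.Sequential(*layers)

# Weight decay stands in for the L2-regularization; a fairly large learning
# rate is used, in the spirit of the abstract's remark (values are guesses).
opt = torch.optim.SGD(net.parameters(), lr=0.05, weight_decay=1e-3)
loss_fn = nn.MSELoss()

for step in range(5000):
    opt.zero_grad()
    loss = loss_fn(net(X), Y)
    loss.backward()
    opt.step()

# Under the bottleneck picture, each hidden weight matrix should show a
# handful of singular values close to 1 and the remainder much smaller.
for i, m in enumerate(net):
    if isinstance(m, nn.Linear):
        s = torch.linalg.svdvals(m.weight.detach())
        print(f"layer {i}: top singular values {s[:4].round(decimals=2).tolist()}")
```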