Residual Alignment: Uncovering the Mechanisms of Residual Networks (2401.09018v1)
Abstract: The ResNet architecture has been widely adopted in deep learning due to its significant boost to performance through the use of simple skip connections, yet the underlying mechanisms leading to its success remain largely unknown. In this paper, we conduct a thorough empirical study of the ResNet architecture in classification tasks by linearizing its constituent residual blocks using Residual Jacobians and measuring their singular value decompositions. Our measurements reveal a process called Residual Alignment (RA) characterized by four properties: (RA1) intermediate representations of a given input are equispaced on a line, embedded in high dimensional space, as observed by Gai and Zhang [2021]; (RA2) top left and right singular vectors of Residual Jacobians align with each other and across different depths; (RA3) Residual Jacobians are at most rank C for fully-connected ResNets, where C is the number of classes; and (RA4) top singular values of Residual Jacobians scale inversely with depth. RA consistently occurs in models that generalize well, in both fully-connected and convolutional architectures, across various depths and widths, for varying numbers of classes, on all tested benchmark datasets, but ceases to occur once the skip connections are removed. It also provably occurs in a novel mathematical model we propose. This phenomenon reveals a strong alignment between residual branches of a ResNet (RA2+4), imparting a highly rigid geometric structure to the intermediate representations as they progress linearly through the network (RA1) up to the final layer, where they undergo Neural Collapse.
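As a concrete illustration of the measurement pipeline described in the abstract, the sketch below linearizes the residual branch of each block of a toy fully-connected ResNet by computing its Jacobian at a given input, then takes the singular value decomposition of that Residual Jacobian. This is a minimal sketch in PyTorch under stated assumptions: the `ResidualBlock` module, the `residual_jacobian` helper, and the randomly initialized stack of blocks are illustrative stand-ins, not the authors' code or their trained models.

```python
import torch
import torch.nn as nn

# Minimal sketch (an assumption, not the authors' code): linearize each
# residual block of a fully-connected ResNet by computing the Jacobian of
# its residual branch at a given input, then inspect its SVD.

class ResidualBlock(nn.Module):
    """Toy block computing x + f(x)."""
    def __init__(self, width: int):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(width, width),
            nn.ReLU(),
            nn.Linear(width, width),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.branch(x)  # skip connection


def residual_jacobian(block: ResidualBlock, x: torch.Tensor) -> torch.Tensor:
    # Jacobian of the residual branch f alone (the identity contributed by
    # the skip connection is excluded), evaluated at representation x.
    return torch.autograd.functional.jacobian(block.branch, x)


width, num_classes = 64, 10
blocks = [ResidualBlock(width) for _ in range(8)]  # stand-in for a trained ResNet

x = torch.randn(width)
svds = []
for block in blocks:
    J = residual_jacobian(block, x)      # (width, width) matrix
    U, S, Vh = torch.linalg.svd(J)
    svds.append((U, S, Vh))
    x = block(x).detach()                # propagate to the next block's input

# RA2 would be probed by comparing the top-C left/right singular subspaces
# across blocks (e.g. via principal angles); RA3 by checking that singular
# values beyond index num_classes are negligible; RA4 by how the leading
# singular values scale with the number of blocks.
top_singular_values = torch.stack([S[:num_classes] for _, S, _ in svds])
print(top_singular_values)
```

In the paper's setting, the same measurements are taken on trained classifiers, where the RA1 to RA4 properties are reported to emerge; the randomly initialized blocks above only demonstrate the mechanics of the computation.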
- Deep equilibrium models. Advances in Neural Information Processing Systems, 32, 2019.
- Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.
- Scaling properties of deep residual networks. In International Conference on Machine Learning, pages 2039–2048. PMLR, 2021.
- Asymptotic analysis of deep residual networks, 2023.
- On the emergence of tetrahedral symmetry in the final and penultimate layers of neural network classifiers. arXiv preprint arXiv:2012.05420, 2020.
- Residual connections encourage iterative inference. In International Conference on Learning Representations, 2018.
- Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556, 2019.
- A mathematical principle of deep learning: Learn the geodesic curve in the Wasserstein space. arXiv preprint arXiv:2102.09235, 2021.
- On the implicit bias towards minimal depth of deep neural networks, 2022.
- Highway and residual networks learn unrolled iterative estimation. arXiv preprint arXiv:1612.07771, 2016.
- Neural collapse under MSE loss: Proximity to and dynamics on the central path. In International Conference on Learning Representations, 2021.
- Soufiane Hayou. On the infinite-depth limit of finite-width neural networks. arXiv preprint arXiv:2210.00688, 2022.
- A law of data separation in deep learning, 2022.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016a.
- Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016b.
- Jeremy Howard. Imagenette: A smaller subset of 10 easily classified classes from ImageNet, and a little more French. https://github.com/fastai/imagenette. [Accessed 14-May-2023].
- Deep networks with stochastic depth, 2016.
- Neural collapse: A review on modelling principles and generalization. arXiv preprint arXiv:2206.04041, 2022.
- Algorithm 971: An implementation of a randomized algorithm for principal component analysis. ACM Transactions on Mathematical Software, 43(3), January 2017. doi: 10.1145/3004053.
- The future is log-gaussian: Resnets and their infinite-depth-and-width limit at initialization. Advances in Neural Information Processing Systems, 34:7852–7864, 2021.
- The neural covariance SDE: Shaped infinite depth-and-width networks at initialization. arXiv preprint arXiv:2206.02768, 2022.
- Demystifying ResNet. arXiv preprint arXiv:1611.01186, 2016.
- Principled and efficient transfer learning of deep models via neural collapse, 2023.
- Ensemble of one model: Creating model variations for transformer with layer permutation. In 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 1026–1030. IEEE, 2021.
- SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Neural collapse with cross-entropy loss. arXiv preprint arXiv:2012.08465, 2020.
- A mean field analysis of deep ResNet and beyond: Towards provably optimization via overparameterization from depth. In International Conference on Machine Learning, pages 6426–6436. PMLR, 2020.
- A trace inequality with a subtracted term. Linear algebra and its applications, 185:165–172, 1993.
- Leon Mirsky. A trace inequality of John von Neumann. Monatshefte für Mathematik, 79(4):303–306, 1975.
- Neural collapse with unconstrained features. arXiv preprint arXiv:2011.11619, 2020.
- Unique properties of flat minima in deep networks. In International Conference on Machine Learning, pages 7108–7118. PMLR, 2020.
- Sensitivity and generalization in neural networks: an empirical study. arXiv preprint arXiv:1802.08760, 2018.
- Vardan Papyan. Traces of class/cross-class structure pervade deep learning spectra, 2020.
- Convolutional neural networks analyzed via convolutional sparse coding. The Journal of Machine Learning Research, 18(1):2887–2938, 2017.
- Prevalence of Neural Collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
- Explicit regularization and implicit bias in deep network classifiers trained with the square loss. arXiv preprint arXiv:2101.00072, 2020.
- U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
- Residual networks as nonlinear systems: Stability analysis using linearization. 2019.
- Do residual neural networks discretize neural ordinary differential equations? arXiv preprint arXiv:2205.14612, 2022.
- Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
- Highway networks. CoRR, abs/1505.00387, 2015. URL http://arxiv.org/abs/1505.00387.
- Extended unconstrained features model for exploring deep neural collapse. In International Conference on Machine Learning (ICML), 2022.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Residual networks behave like ensembles of relatively shallow networks. Advances in Neural Information Processing Systems, 29, 2016.
- Linear convergence analysis of neural collapse with unconstrained features. In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop).
- Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
- Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.
- A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34:29820–29834, 2021.