Residual Connections Harm Generative Representation Learning (2404.10947v4)
Abstract: We show that introducing a weighting factor to reduce the influence of identity shortcuts in residual networks significantly enhances semantic feature learning in generative representation learning frameworks, such as masked autoencoders (MAEs) and diffusion models. Our modification notably improves feature quality, raising ImageNet-1K K-Nearest Neighbor accuracy from 27.4% to 63.9% and linear probing accuracy from 67.8% to 72.7% for MAEs with a ViT-B/16 backbone, while also enhancing generation quality in diffusion models. This significant gap suggests that, while the residual connection structure serves an essential role in facilitating gradient propagation, it may have a harmful side effect of reducing the capacity for abstract learning by injecting an echo of shallower representations into deeper layers. We ameliorate this downside via a fixed formula that monotonically decreases the contribution of identity connections as layer depth increases. Our design promotes the gradual development of feature abstractions without impacting network trainability. Analyzing the representations learned by our modified residual networks, we find a correlation between low effective feature rank and downstream task performance.
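The abstract describes replacing the standard residual update x + f(x) with a version in which the identity shortcut is down-weighted by a fixed factor that shrinks with layer depth. The sketch below illustrates that idea in PyTorch; the block structure, the name `DepthWeightedResidualBlock`, and the linear decay schedule `alpha_l = 1 - l/L` are assumptions chosen for illustration and are not taken from the paper's exact formula.

```python
import torch
import torch.nn as nn


class DepthWeightedResidualBlock(nn.Module):
    """Transformer-style block whose identity shortcut is scaled by a fixed,
    depth-dependent weight alpha (hypothetical illustration; the paper's exact
    block structure and weighting formula may differ)."""

    def __init__(self, dim: int, alpha: float):
        super().__init__()
        self.alpha = alpha              # weight on the identity (shortcut) path
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(       # stand-in for the block's attention/MLP branch
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard residual update is x + f(x); here the shortcut is down-weighted.
        return self.alpha * x + self.mlp(self.norm(x))


def build_blocks(dim: int = 768, depth: int = 12) -> nn.ModuleList:
    # Monotonically decreasing shortcut weight with depth; linear decay is an
    # assumption for illustration, not necessarily the paper's schedule.
    alphas = [1.0 - layer / depth for layer in range(depth)]
    return nn.ModuleList([DepthWeightedResidualBlock(dim, a) for a in alphas])


if __name__ == "__main__":
    blocks = build_blocks()
    x = torch.randn(2, 197, 768)        # (batch, tokens, dim), a ViT-B/16-like shape
    for block in blocks:
        x = block(x)
    print(x.shape)                      # torch.Size([2, 197, 768])
```

Because the shortcut weight is fixed rather than learned, gradient flow through the scaled identity path is preserved early in the network while deeper layers are pushed to rely on their transformation branch, which is the mechanism the abstract credits for more abstract features.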