On the Generalization Ability of Unsupervised Pretraining (2403.06871v1)
Abstract: Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization. However, a rigorous understanding of how the representation function learned on an unlabeled dataset affects the generalization of the fine-tuned model is lacking. Existing theoretical research does not adequately account for the heterogeneity of the distributions and tasks across the pre-training and fine-tuning stages. To bridge this gap, this paper introduces a novel theoretical framework that illuminates the critical factors influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase, and ultimately the generalization capabilities of the fine-tuned model on downstream tasks. We apply our theoretical framework to analyze the generalization bounds of two distinct scenarios: Context Encoder pre-training with deep neural networks and Masked Autoencoder pre-training with deep transformers, each followed by fine-tuning on a binary classification task. Finally, inspired by our findings, we propose a novel regularization method for pre-training that further enhances the generalization of the fine-tuned model. Overall, our results contribute to a better understanding of the unsupervised pre-training and fine-tuning paradigm and can shed light on the design of more effective pre-training algorithms.
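To make the paradigm the abstract analyzes concrete, below is a minimal sketch of unsupervised pre-training followed by fine-tuning on a binary classification task. It assumes a small MLP encoder/decoder in place of a deep transformer, a simple masked-reconstruction objective standing in for Masked Autoencoder pre-training, and a norm penalty on the representation as an illustrative stand-in for the paper's proposed regularizer; none of these choices reflect the paper's exact construction.

```python
# Sketch of pre-train-then-fine-tune: masked reconstruction on unlabeled data,
# then a binary classification head on the learned representation.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Small MLP encoder standing in for a deep transformer."""
    def __init__(self, dim=32, rep_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, rep_dim))

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs the full input from the representation (MAE-style objective)."""
    def __init__(self, dim=32, rep_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(rep_dim, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, z):
        return self.net(z)

def pretrain(encoder, decoder, unlabeled, epochs=5, mask_ratio=0.5, reg_weight=0.1):
    """Unsupervised pre-training: reconstruct randomly masked inputs.
    The norm penalty on the representation is a hypothetical regularizer,
    not the one proposed in the paper."""
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    for _ in range(epochs):
        for (x,) in unlabeled:
            mask = (torch.rand_like(x) > mask_ratio).float()
            z = encoder(x * mask)                         # encode the masked input
            recon_loss = ((decoder(z) - x) ** 2).mean()   # reconstruct the unmasked input
            reg = reg_weight * z.norm(dim=1).mean()       # illustrative regularizer
            loss = recon_loss + reg
            opt.zero_grad(); loss.backward(); opt.step()

def finetune(encoder, labeled, rep_dim=16, epochs=5):
    """Fine-tune the pre-trained encoder plus a linear head for binary classification."""
    head = nn.Linear(rep_dim, 1)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for x, y in labeled:
            loss = bce(head(encoder(x)).squeeze(-1), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return head

if __name__ == "__main__":
    torch.manual_seed(0)
    dim = 32
    # Synthetic stand-ins for an unlabeled pre-training set and a small labeled set.
    unlabeled = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(torch.randn(512, dim)), batch_size=64)
    labeled = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(torch.randn(128, dim),
                                       torch.randint(0, 2, (128,)).float()),
        batch_size=32)
    enc, dec = Encoder(dim), Decoder(dim)
    pretrain(enc, dec, unlabeled)
    finetune(enc, labeled)
```

In this setup, how well the representation learned in `pretrain` transfers to `finetune` is exactly the quantity the paper's framework bounds.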
Authors: Yuyang Deng, Junyuan Hong, Jiayu Zhou, Mehrdad Mahdavi