On the Generalization Ability of Unsupervised Pretraining (2403.06871v1)

Published 11 Mar 2024 in cs.LG and stat.ML

Abstract: Recent advances in unsupervised learning have shown that unsupervised pre-training, followed by fine-tuning, can improve model generalization. However, a rigorous understanding of how the representation function learned on an unlabeled dataset affects the generalization of the fine-tuned model is lacking. Existing theoretical research does not adequately account for the heterogeneity of the distributions and tasks across the pre-training and fine-tuning stages. To bridge this gap, this paper introduces a novel theoretical framework that illuminates the critical factor influencing the transferability of knowledge acquired during unsupervised pre-training to the subsequent fine-tuning phase, which ultimately affects the generalization capabilities of the fine-tuned model on downstream tasks. We apply our theoretical framework to analyze the generalization bounds of two distinct scenarios: Context Encoder pre-training with deep neural networks and Masked Autoencoder pre-training with deep transformers, each followed by fine-tuning on a binary classification task. Finally, inspired by our findings, we propose a novel regularization method during pre-training that further enhances the generalization of the fine-tuned model. Overall, our results contribute to a better understanding of the unsupervised pre-training and fine-tuning paradigm and can shed light on the design of more effective pre-training algorithms.
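
To make the two-stage paradigm the abstract analyzes concrete, below is a minimal PyTorch sketch of masked-autoencoder-style pre-training on unlabeled data followed by fine-tuning on a binary classification task. All architecture sizes, the mask ratio, the optimizers, and the data shapes are illustrative assumptions, not the paper's settings, and the paper's proposed pre-training regularizer is not shown since the abstract does not specify it.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps inputs to a learned representation (the stage-1 output)."""
    def __init__(self, dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs inputs from the representation during pre-training."""
    def __init__(self, dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, z):
        return self.net(z)

def pretrain(encoder, decoder, unlabeled, epochs=5, mask_ratio=0.5):
    # Stage 1: learn the representation by reconstructing masked-out
    # coordinates from the visible ones (masked-autoencoder style).
    opt = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
    )
    for _ in range(epochs):
        for x in unlabeled:
            keep = (torch.rand_like(x) > mask_ratio).float()  # 1 = visible
            recon = decoder(encoder(x * keep))
            # Reconstruction loss measured only on the masked coordinates.
            loss = ((recon - x) ** 2 * (1.0 - keep)).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()

def finetune(encoder, labeled, epochs=5, hidden=64):
    # Stage 2: attach a linear head and fine-tune encoder + head on the
    # labeled binary classification task.
    head = nn.Linear(hidden, 1)
    opt = torch.optim.Adam(
        list(encoder.parameters()) + list(head.parameters()), lr=1e-4
    )
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for x, y in labeled:
            logits = head(encoder(x)).squeeze(-1)
            loss = bce(logits, y.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head

# Toy usage with synthetic data; batch size and input dimension are assumptions.
unlabeled = [torch.randn(16, 32) for _ in range(50)]
labeled = [(torch.randn(16, 32), torch.randint(0, 2, (16,))) for _ in range(10)]
enc, dec = Encoder(), Decoder()
pretrain(enc, dec, unlabeled)
head = finetune(enc, labeled)
```

The sketch makes the object of the paper's analysis explicit: the encoder learned in stage 1 is the "representation function," and the question is how its quality on the unlabeled distribution transfers to the generalization of the stage-2 classifier.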

Authors (4)
  1. Yuyang Deng (13 papers)
  2. Junyuan Hong (31 papers)
  3. Jiayu Zhou (70 papers)
  4. Mehrdad Mahdavi (50 papers)