Understanding Optimal Feature Transfer via a Fine-Grained Bias-Variance Analysis (2404.12481v1)

Published 18 Apr 2024 in stat.ML and cs.LG

Abstract: In the transfer learning paradigm, models learn useful representations (or features) during a data-rich pretraining stage and then use the pretrained representation to improve model performance on data-scarce downstream tasks. In this work, we explore transfer learning with the goal of optimizing downstream performance. We introduce a simple linear model that takes as input an arbitrary pretrained feature transform. We derive exact asymptotics of the downstream risk and its fine-grained bias-variance decomposition. Our findings suggest that using the ground-truth featurization can result in "double-divergence" of the asymptotic risk, indicating that it is not necessarily optimal for downstream performance. We then identify the optimal pretrained representation by minimizing the asymptotic downstream risk averaged over an ensemble of downstream tasks. Our analysis reveals the relative importance of learning the task-relevant features and structures in the data covariates and characterizes how each contributes to controlling the downstream risk from a bias-variance perspective. Moreover, we uncover a phase-transition phenomenon in which the optimal pretrained representation shifts from hard to soft selection of relevant features, and we discuss its connection to principal component regression.
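
The abstract describes a linear downstream model fit on top of a fixed pretrained feature transform, with the optimal representation moving from hard to soft selection of relevant features. As a rough illustration of that setting (not the paper's exact model), the sketch below simulates downstream ridge regression on features z = Fx for three choices of F: a hard top-k projection onto the leading covariance directions (akin to principal component regression), a soft shrinkage of all directions, and the identity. All names, dimensions, and parameter values here are illustrative assumptions, and the risk is estimated by simple Monte Carlo rather than the paper's exact asymptotics.

```python
# Minimal simulation sketch (assumed setup, not the paper's exact model):
# downstream ridge regression on features z = F x produced by a fixed linear
# "pretrained" transform F, comparing hard top-k selection, soft shrinkage,
# and the identity transform. Parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, noise_std, ridge_lam = 50, 40, 2000, 0.5, 1e-1

# Anisotropic covariate covariance with a few dominant directions.
eigvals = np.array([10.0] * 5 + [1.0] * (d - 5))
Q = np.linalg.qr(rng.standard_normal((d, d)))[0]
Sigma_sqrt = Q @ np.diag(np.sqrt(eigvals)) @ Q.T

def sample_task():
    """Draw a downstream task: x ~ N(0, Sigma), y = x @ beta + noise."""
    beta = Q[:, :5] @ rng.standard_normal(5)   # task signal lives in top directions
    X = rng.standard_normal((n_train, d)) @ Sigma_sqrt
    y = X @ beta + noise_std * rng.standard_normal(n_train)
    X_test = rng.standard_normal((n_test, d)) @ Sigma_sqrt
    return X, y, X_test, X_test @ beta

def ridge_fit_predict(F, X, y, X_test, lam=ridge_lam):
    """Ridge regression on pretrained features z = x @ F.T."""
    Z, Z_test = X @ F.T, X_test @ F.T
    k = Z.shape[1]
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ y)
    return Z_test @ w

def downstream_risk(F, n_tasks=200):
    """Monte Carlo estimate of the downstream test risk, averaged over tasks."""
    errs = []
    for _ in range(n_tasks):
        X, y, X_test, y_clean = sample_task()
        errs.append(np.mean((ridge_fit_predict(F, X, y, X_test) - y_clean) ** 2))
    return float(np.mean(errs))

k = 5
F_hard = Q[:, :k].T                                  # hard selection: top-k directions (PCR-like)
F_soft = np.diag(eigvals / (eigvals + 1.0)) @ Q.T    # soft selection: shrink every direction
F_id = np.eye(d)                                     # no feature selection

for name, F in [("hard top-k", F_hard), ("soft shrinkage", F_soft), ("identity", F_id)]:
    print(f"{name:>15s}: risk ~ {downstream_risk(F):.3f}")
```

In this toy regime (n = 40 downstream samples in d = 50 dimensions), the selective transforms typically outperform the identity featurization, loosely echoing the abstract's point that passing features through unchanged (or even using the ground-truth featurization) need not be optimal for data-scarce downstream tasks.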
