
Inductive biases of multi-task learning and finetuning: multiple regimes of feature reuse (2310.02396v4)

Published 3 Oct 2023 in cs.LG

Abstract: Neural networks are often trained on multiple tasks, either simultaneously (multi-task learning, MTL) or sequentially (pretraining and subsequent finetuning, PT+FT). In particular, it is common practice to pretrain neural networks on a large auxiliary task before finetuning on a downstream task with fewer samples. Despite the prevalence of this approach, the inductive biases that arise from learning multiple tasks are poorly characterized. In this work, we address this gap. We describe novel implicit regularization penalties associated with MTL and PT+FT in diagonal linear networks and single-hidden-layer ReLU networks. These penalties indicate that MTL and PT+FT induce the network to reuse features in different ways. 1) Both MTL and PT+FT exhibit biases towards feature reuse between tasks, and towards sparsity in the set of learned features. We show a "conservation law" that implies a direct tradeoff between these two biases. 2) PT+FT exhibits a novel "nested feature selection" regime, not described by either the "lazy" or "rich" regimes identified in prior work, which biases it to rely on a sparse subset of the features learned during pretraining. This regime is much narrower for MTL. 3) PT+FT (but not MTL) in ReLU networks benefits from features that are correlated between the auxiliary and main task. We confirm these findings empirically with teacher-student models, and introduce a technique -- weight rescaling following pretraining -- that can elicit the nested feature selection regime. Finally, we validate our theory in deep neural networks trained on image classification. We find that weight rescaling improves performance when it causes models to display signatures of nested feature selection. Our results suggest that nested feature selection may be an important inductive bias for finetuning neural networks.
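The pretraining-then-finetuning (PT+FT) pipeline with post-pretraining weight rescaling described in the abstract can be sketched concretely. The following is a minimal illustrative sketch, not the authors' code: it assumes a PyTorch implementation, a single-hidden-layer ReLU student, synthetic teacher-student regression data, and an arbitrary rescaling factor alpha. The paper's exact architectures, rescaling scheme, and training hyperparameters are not given in the abstract.

import torch
import torch.nn as nn

def make_teacher_data(n_samples, dim, teacher, noise_std=0.0):
    # Teacher-student data: Gaussian inputs, targets from a fixed teacher network.
    x = torch.randn(n_samples, dim)
    with torch.no_grad():
        y = teacher(x) + noise_std * torch.randn(n_samples, 1)
    return x, y

def train(model, x, y, steps=2000, lr=1e-2):
    # Plain full-batch gradient descent on squared error.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

dim, width = 20, 256
student = nn.Sequential(nn.Linear(dim, width), nn.ReLU(), nn.Linear(width, 1))

# 1) Pretrain on a large auxiliary task (PT).
aux_teacher = nn.Sequential(nn.Linear(dim, 4), nn.ReLU(), nn.Linear(4, 1))
x_aux, y_aux = make_teacher_data(4000, dim, aux_teacher)
train(student, x_aux, y_aux)

# 2) Rescale the pretrained weights before finetuning. Shrinking by alpha < 1
#    lowers the effective initialization scale for the finetuning phase; the
#    value 0.1 and the choice to rescale every parameter (including biases)
#    are illustrative assumptions, not the paper's prescription.
alpha = 0.1
with torch.no_grad():
    for p in student.parameters():
        p.mul_(alpha)

# 3) Finetune on a main task with far fewer samples (FT).
main_teacher = nn.Sequential(nn.Linear(dim, 4), nn.ReLU(), nn.Linear(4, 1))
x_main, y_main = make_teacher_data(200, dim, main_teacher)
train(student, x_main, y_main)

The intervention in step 2 is the kind the abstract describes as eliciting the nested feature selection regime: lowering the scale of the pretrained weights before finetuning encourages the downstream task to rely on a sparse subset of the pretrained features. The specific alpha value and whether biases are rescaled are choices made here only for illustration.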

