Fine-tuning with Very Large Dropout (2403.00946v3)

Published 1 Mar 2024 in cs.LG and cs.CV

Abstract: It is impossible today to pretend that the practice of machine learning is always compatible with the idea that training and testing data follow the same distribution. Several authors have recently used ensemble techniques to show how scenarios involving multiple data distributions are best served by representations that are both richer than those obtained by regularizing for the best in-distribution performance, and richer than those obtained under the influence of the implicit sparsity bias of common stochastic gradient procedures. This contribution investigates the use of very high dropout rates instead of ensembles to obtain such rich representations. Although training a deep network from scratch using such dropout rates is virtually impossible, fine-tuning a large pre-trained model under such conditions is not only possible but also achieves out-of-distribution performances that exceed those of both ensembles and weight averaging methods such as model soups. This result has practical significance because the importance of the fine-tuning scenario has considerably grown in recent years. This result also provides interesting insights on the nature of rich representations and on the intrinsically linear nature of fine-tuning a large network using a comparatively small dataset.

Summary

  • The paper demonstrates that fine-tuning with dropout rates as high as 90% improves OOD generalization beyond ensemble and weight-averaging methods.
  • It shows that fine-tuning operates in a near-linear regime that exploits the pre-trained network's existing features rather than creating new ones, which is why such large dropout rates remain trainable.
  • It argues that very large dropout during fine-tuning encourages feature diversity, offering a simple and robust recipe for handling distribution shifts.

Fine-tuning with Very Large Dropout

The paper "Fine-tuning with Very Large Dropout" addresses a significant challenge in machine learning, namely, the assumption that the distribution of training data matches the distribution of testing data. In practical scenarios, this assumption often falls short, necessitating techniques that continue to perform well when distributions shift. This work proposes using extremely high dropout rates during the fine-tuning phase of neural networks as a viable solution, challenging common practice and achieving superior out-of-distribution (OOD) generalization results compared to ensemble methods.

Methodological Advances

The paper fine-tunes pre-trained deep networks with dropout rates as high as 90%, a level generally considered infeasible when training models from scratch. The approach rests on the observation that, while training from scratch under such dropout stalls, fine-tuning operates in a near-linear regime: it mostly re-weights features the pre-trained network already computes rather than creating new ones. In this regime, very large dropout acts as a regularizer that prevents the classifier from relying on a handful of dominant features, yielding the "rich" representations that are advantageous under distribution shifts.
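
To make the setup concrete, the following is a minimal sketch of what high-dropout fine-tuning could look like, assuming a torchvision ResNet-50 backbone with dropout applied to the penultimate features before a fresh linear head. The hyperparameters, the dummy data loader, and the exact placement of the dropout layer are illustrative assumptions, not the authors' precise configuration.

    # Illustrative sketch (not the authors' exact recipe): fine-tune a pre-trained
    # backbone with very large dropout applied to the penultimate representation.
    import torch
    import torch.nn as nn
    from torchvision import models

    num_classes = 7                                   # e.g., PACS has 7 classes
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    num_features = backbone.fc.in_features            # 2048 for ResNet-50
    backbone.fc = nn.Identity()                       # expose penultimate features

    model = nn.Sequential(
        backbone,
        nn.Dropout(p=0.9),                            # very large dropout rate
        nn.Linear(num_features, num_classes),         # fresh linear head
    )

    # Stand-in for a real fine-tuning DataLoader.
    train_loader = [(torch.randn(8, 3, 224, 224), torch.randint(0, num_classes, (8,)))]

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

At evaluation time the dropout layer is disabled as usual (model.eval()), so the full feature vector is used for prediction.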

Empirical Results

Through experiments on the domain-generalization benchmarks PACS, VLCS, OfficeHome, and TerraIncognita, the paper shows that fine-tuning with very large dropout outperforms both ensembling and weight averaging in OOD accuracy; a sketch of the standard evaluation protocol for these benchmarks follows below. On several benchmarks, even the worst-performing high-dropout configuration surpasses the best ensemble result. Interestingly, while the in-distribution performance of the proposed method can lag behind ensembles, its advantage in OOD generalization underscores the value of retaining diverse, possibly redundant, features in the representation.
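
For reference, these benchmarks are typically evaluated with a leave-one-domain-out protocol: the model is fine-tuned on all domains except one, and OOD accuracy is measured on the held-out domain. The sketch below illustrates that loop; leave_one_domain_out, finetune_fn, and evaluate_fn are hypothetical names, not functions from the paper's code.

    # Hypothetical sketch of a leave-one-domain-out evaluation loop.
    def leave_one_domain_out(domains, finetune_fn, evaluate_fn):
        """domains maps a domain name to its dataset; the two callables are assumed:
        finetune_fn trains a model (e.g., with very large dropout) on a list of
        datasets, and evaluate_fn returns accuracy on a held-out dataset."""
        results = {}
        for held_out in domains:
            train_sets = [data for name, data in domains.items() if name != held_out]
            model = finetune_fn(train_sets)
            results[held_out] = evaluate_fn(model, domains[held_out])
        return results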

Theoretical Insights

The work also explains why fine-tuning with very large dropout is effective. Standard training with stochastic gradient descent carries an implicit sparsity bias: it tends to settle on a small set of features that suffice in-distribution, discarding features that might prove useful under other distributions. Very large dropout counteracts this bias: because no small subset of features can be relied upon, the fine-tuned classifier is forced to spread its weight across many of the pre-trained features, which improves robustness to distributional changes.
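
A rough way to see this, stated here as an informal sketch rather than the paper's own derivation, is to view fine-tuning in its near-linear regime as learning a linear readout w on approximately fixed features φ(x). With inverted dropout at rate p, the training-time prediction and its expectation over masks are

    f_m(x) = \frac{1}{1-p}\, w^\top \big( m \odot \phi(x) \big), \qquad m_i \sim \mathrm{Bernoulli}(1-p),
    \mathbb{E}_m\!\left[ f_m(x) \right] = w^\top \phi(x).

With p = 0.9, any given feature survives a forward pass only 10% of the time, so minimizing the expected loss penalizes solutions in which a few features carry most of the prediction and instead spreads weight over many features — the kind of rich, redundant representation that transfers better across distributions.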

Broader Implications and Future Directions

The paper opens up several intriguing directions for future research. Primarily, it suggests revisiting fine-tuning procedures to treat dropout not merely as a guard against overfitting but as a means of preserving feature diversity, which could inspire new recipes for domain generalization in neural networks.

Furthermore, the paper notes that the approach depends on the quality of the pre-trained model, suggesting that advances in pre-training (for example, richer and more diverse pre-training data) could further improve the efficacy of large-dropout fine-tuning. Investigating how dropout interacts with other regularization techniques could also yield hybrid strategies for stronger OOD performance.

In summary, the paper makes a compelling case for re-evaluating fine-tuning practice by showing that very large dropout rates can markedly improve generalization across varying data distributions. Dropout rates that would be untrainable from scratch are reframed here as a powerful ally in building more adaptable and resilient machine learning systems.