Fine-tuning with Very Large Dropout (2403.00946v3)
Abstract: It is impossible today to pretend that the practice of machine learning is always compatible with the idea that training and testing data follow the same distribution. Several authors have recently used ensemble techniques to show how scenarios involving multiple data distributions are best served by representations that are both richer than those obtained by regularizing for the best in-distribution performance, and richer than those obtained under the influence of the implicit sparsity bias of common stochastic gradient procedures. This contribution investigates the use of very high dropout rates instead of ensembles to obtain such rich representations. Although training a deep network from scratch using such dropout rates is virtually impossible, fine-tuning a large pre-trained model under such conditions is not only possible but also achieves out-of-distribution performance that exceeds that of both ensembles and weight-averaging methods such as model soups. This result has practical significance because the importance of the fine-tuning scenario has grown considerably in recent years. This result also provides interesting insights into the nature of rich representations and into the intrinsically linear nature of fine-tuning a large network using a comparatively small dataset.
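In practical terms, the recipe described in the abstract amounts to inserting a dropout layer with an unusually high rate on the penultimate representation of a pre-trained network and then fine-tuning on the target data. The sketch below is a minimal illustration of that idea, not the authors' code: the ResNet-50 backbone, the 0.9 dropout rate, the ten-class head, and the SGD settings are all illustrative assumptions.

```python
# Minimal sketch: fine-tuning a pre-trained backbone with very large dropout
# applied to the penultimate representation. Backbone, dropout rate, head size,
# and optimizer settings are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()          # expose the 2048-d penultimate features

num_classes = 10                     # hypothetical target task
classifier = nn.Linear(2048, num_classes)

model = nn.Sequential(
    backbone,
    nn.Dropout(p=0.9),               # "very large" dropout on the representation
    classifier,
)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def fine_tune_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One fine-tuning step; dropout is only active in training mode."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Since `nn.Dropout` is disabled in evaluation mode, inference still uses the full (rescaled) representation; the heavy dropout only shapes which features the fine-tuned head comes to rely on during training.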