An Empirical Study of Pre-trained Model Selection for Out-of-Distribution Generalization and Calibration (2307.08187v3)
Abstract: In out-of-distribution (OOD) generalization tasks, fine-tuning pre-trained models has become a prevalent strategy. Unlike most prior work, which has focused on advancing learning algorithms, we systematically examined how pre-trained model size, pre-training dataset size, and training strategies affect generalization and uncertainty calibration on downstream tasks. We evaluated 100 models spanning diverse pre-trained model sizes, five pre-training datasets, and five data augmentations through extensive experiments on four distribution-shift datasets, totaling over 120,000 GPU hours. Our results demonstrate the significant impact of pre-trained model selection: choosing the right model improves OOD accuracy substantially more than algorithmic improvements alone. We find that larger models and larger pre-training datasets improve both OOD performance and calibration, in contrast to some prior studies that found modern deep networks to calibrate worse than classical shallow models. Our work underscores the overlooked importance of pre-trained model selection for out-of-distribution generalization and calibration.
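For concreteness, below is a minimal NumPy sketch of expected calibration error (ECE), the standard calibration metric in this line of work: predictions are grouped into confidence bins, and the gap between per-bin accuracy and per-bin mean confidence is averaged, weighted by bin size. The 15 equal-width bins and the function name are illustrative assumptions, not the paper's exact evaluation code.

```python
# A minimal sketch of expected calibration error (ECE); binning scheme
# and bin count are illustrative assumptions, not the paper's exact code.
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """Confidence-binned ECE: size-weighted mean |accuracy - confidence| per bin.

    confidences: (N,) max softmax probability per example
    predictions: (N,) predicted class indices
    labels:      (N,) ground-truth class indices
    """
    correct = (predictions == labels).astype(float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # Assign each example to the bin containing its confidence
        # (half-open bins; softmax confidences are strictly positive).
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()       # empirical accuracy in the bin
            conf = confidences[in_bin].mean()  # mean confidence in the bin
            ece += in_bin.mean() * abs(acc - conf)
    return ece
```

In practice the inputs would come from a model's softmax outputs, e.g. `conf = probs.max(axis=1)` and `pred = probs.argmax(axis=1)`; a perfectly calibrated model yields an ECE of zero.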
Authors: Hiroki Naganuma, Ryuichiro Hataya, Ioannis Mitliagkas