A Survey on Evaluation of Out-of-Distribution Generalization (2403.01874v1)

Published 4 Mar 2024 in cs.LG

Abstract: Machine learning models, while progressively advanced, rely heavily on the IID assumption, which is often unfulfilled in practice due to inevitable distribution shifts. This renders them fragile and untrustworthy for deployment in risk-sensitive applications. This significant problem has consequently spawned various lines of work dedicated to developing algorithms capable of Out-of-Distribution (OOD) generalization. Despite these efforts, much less attention has been paid to the evaluation of OOD generalization, which is also a complex and fundamental problem. Its goal is not only to assess whether a model's OOD generalization capability is strong, but also to characterize where a model generalizes well or poorly. This entails identifying the types of distribution shifts that a model can effectively address, and the safe and risky input regions for a given model. This paper serves as the first effort to conduct a comprehensive review of OOD evaluation. We categorize existing research into three paradigms: OOD performance testing, OOD performance prediction, and OOD intrinsic property characterization, according to the availability of test data. Additionally, we briefly discuss OOD evaluation in the context of pretrained models. In closing, we propose several promising directions for future research in OOD evaluation.
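To make the second paradigm concrete: OOD performance prediction asks how well a model will do on a shifted test distribution *without* OOD labels. A common family of baselines estimates accuracy from the model's own softmax confidence on the unlabeled shifted data. The sketch below is a minimal, hypothetical illustration (not from the paper): a toy 2D binary task, a fixed linear scorer standing in for a trained model, and an average-confidence estimate of OOD accuracy compared against the ground truth, which here we only use to check the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, shift=0.0):
    """Toy binary task: class means at (-1, 0) and (+1, 0), optionally
    translated along x0 to simulate covariate shift."""
    y = rng.integers(0, 2, n)
    x = rng.normal(0.0, 0.7, (n, 2))
    x[:, 0] += np.where(y == 1, 1.0, -1.0) + shift
    return x, y

def predict_proba(x):
    """A fixed linear classifier (decision boundary at x0 = 0),
    standing in for a model trained on the unshifted distribution."""
    p1 = 1.0 / (1.0 + np.exp(-2.0 * x[:, 0]))
    return np.stack([1.0 - p1, p1], axis=1)

# Unlabeled covariate-shifted test set; labels kept only to verify.
x_ood, y_ood = sample(5000, shift=0.8)
proba = predict_proba(x_ood)
pred = proba.argmax(axis=1)

# Average-confidence estimate of OOD accuracy: no OOD labels used.
est_acc = proba.max(axis=1).mean()
true_acc = (pred == y_ood).mean()
print(f"estimated accuracy {est_acc:.3f} vs true accuracy {true_acc:.3f}")
```

On mild shifts the average confidence tracks accuracy reasonably; under stronger shifts the model tends to stay confidently wrong, which is exactly the failure mode that motivates more refined predictors (thresholded confidence, agreement between models, etc.) surveyed under this paradigm.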
