
Diagnosing Model Performance Under Distribution Shift (2303.02011v4)

Published 3 Mar 2023 in stat.ML and cs.LG

Abstract: Prediction models can perform poorly when deployed to target distributions different from the training distribution. To understand these operational failure modes, we develop a method, called DIstribution Shift DEcomposition (DISDE), to attribute a drop in performance to different types of distribution shifts. Our approach decomposes the performance drop into terms for 1) an increase in harder but frequently seen examples from training, 2) changes in the relationship between features and outcomes, and 3) poor performance on examples infrequent or unseen during training. These terms are defined by fixing a distribution on $X$ while varying the conditional distribution of $Y \mid X$ between training and target, or by fixing the conditional distribution of $Y \mid X$ while varying the distribution on $X$. In order to do this, we define a hypothetical distribution on $X$ consisting of values common in both training and target, over which it is easy to compare $Y \mid X$ and thus predictive performance. We estimate performance on this hypothetical distribution via reweighting methods. Empirically, we show how our method can 1) inform potential modeling improvements across distribution shifts for employment prediction on tabular census data, and 2) help to explain why certain domain adaptation methods fail to improve model performance for satellite image classification.
