Benchmarking Distribution Shift in Tabular Data with TableShift (2312.07577v3)
Abstract: Robustness to distribution shift has become a growing concern for text and image models as they transition from research subjects to deployment in the real world. However, high-quality benchmarks for distribution shift in tabular machine learning tasks are still lacking, despite the widespread real-world use of tabular data and the differences between the models used for tabular data and those used for text and images. As a consequence, the robustness of tabular models to distribution shift is poorly understood. To address this issue, we introduce TableShift, a distribution shift benchmark for tabular data. TableShift contains 15 binary classification tasks in total, each with an associated shift, and includes a diverse set of data sources, prediction targets, and distribution shifts. The benchmark covers domains including finance, education, public policy, healthcare, and civic participation, and is accessible using only a few lines of Python code via the TableShift API. We conduct a large-scale study comparing several state-of-the-art tabular data models alongside robust learning and domain generalization methods on the benchmark tasks. Our study demonstrates that (1) in-distribution (ID) and out-of-distribution (OOD) accuracy follow a linear trend; (2) domain robustness methods can reduce shift gaps, but at the cost of reduced ID accuracy; and (3) the shift gap (the difference between ID and OOD performance) is strongly related to shifts in the label distribution. The benchmark data, Python package, model implementations, and more information about TableShift are available at https://github.com/mlfoundations/tableshift and https://tableshift.org.
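The two quantities the abstract relies on, the shift gap and the linear ID/OOD trend, can be illustrated with a minimal sketch. It does not use the TableShift package or the paper's evaluation code. The class `EvalResult`, the helper `fit_line`, the model names, and all accuracy values are hypothetical placeholders chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical container for one model's accuracy on one task."""
    model: str
    id_accuracy: float   # accuracy on the in-distribution test split
    ood_accuracy: float  # accuracy on the out-of-distribution test split

    @property
    def shift_gap(self) -> float:
        # Shift gap as described in the abstract: ID minus OOD performance.
        return self.id_accuracy - self.ood_accuracy


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up accuracies for illustration; these are NOT results from the paper.
results = [
    EvalResult("model_a", 0.90, 0.80),
    EvalResult("model_b", 0.85, 0.75),
    EvalResult("model_c", 0.80, 0.70),
]

# Per-model shift gap, rounded to suppress floating-point noise.
gaps = {r.model: round(r.shift_gap, 4) for r in results}

# Linear trend of OOD accuracy as a function of ID accuracy across models.
slope, intercept = fit_line(
    [r.id_accuracy for r in results],
    [r.ood_accuracy for r in results],
)

print(gaps)
print(f"OOD accuracy ~ {slope:.2f} * ID accuracy + {intercept:.2f}")
```

In this toy setup every model loses the same amount of accuracy under shift, so the fitted line has a slope close to 1 and the gap is constant. With real benchmark results, the slope and intercept would be estimated from many models per task.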