
Do We Really Even Need Data? (2401.08702v2)

Published 14 Jan 2024 in stat.ME and cs.LG

Abstract: As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g. rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as outcome variables. Though appealing for financial and logistical reasons, using standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to this so-called "inference with predicted data" problem and elucidate three potential sources of error: (i) the relationship between predicted outcomes and their true, unobserved counterparts, (ii) robustness of the machine learning model to resampling or uncertainty about the training data, and (iii) appropriately propagating not just bias but also uncertainty from predictions into the ultimate inference procedure.
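The problem described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper; the data-generating process, the noisy proxy `w`, the shrinkage factor 0.5, and the function names are all assumptions chosen for illustration. It regresses a true outcome and a predicted outcome on the same covariate and compares the naive slopes and standard errors.

```python
import numpy as np


def ols_slope_and_se(x, y):
    """OLS slope of y on x (with intercept) and its classical standard error."""
    x_c = x - x.mean()
    sxx = np.dot(x_c, x_c)
    slope = np.dot(x_c, y) / sxx
    intercept = y.mean() - slope * x.mean()
    resid = y - intercept - slope * x
    sigma2 = np.dot(resid, resid) / (len(x) - 2)
    return slope, np.sqrt(sigma2 / sxx)


def simulate(n=20000, beta=1.0, seed=0):
    """Compare inference on a true outcome with inference on its prediction.

    Assumed setup (hypothetical, for illustration only):
      y     = beta * x + noise          true, normally unobserved outcome
      w     = x + noise                 noisy proxy available to the predictor
      y_hat = 0.5 * beta * w            equals E[y | w] when all variances are 1
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)
    w = x + rng.normal(size=n)
    y_hat = 0.5 * beta * w
    return ols_slope_and_se(x, y), ols_slope_and_se(x, y_hat)


if __name__ == "__main__":
    (b_true, se_true), (b_pred, se_pred) = simulate()
    print(f"true outcome:      slope={b_true:.3f}  se={se_true:.4f}")
    print(f"predicted outcome: slope={b_pred:.3f}  se={se_pred:.4f}")
```

Under these assumptions the slope estimated from the predicted outcome concentrates near half the true coefficient, and its reported standard error is smaller than the one obtained from the true outcome. The naive analysis is therefore both biased and overconfident, which corresponds to error sources (i) and (iii) above.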
