
Assumption-Lean and Data-Adaptive Post-Prediction Inference (2311.14220v4)

Published 23 Nov 2023 in stat.ME, cs.LG, and stat.ML

Abstract: A primary challenge facing modern scientific research is the limited availability of gold-standard data, which can be costly, labor-intensive, or invasive to obtain. With the rapid development of machine learning (ML), scientists can now employ ML algorithms to predict gold-standard outcomes from variables that are easier to obtain. However, these predicted outcomes are often used directly in subsequent statistical analyses, ignoring the imprecision and heterogeneity introduced by the prediction procedure. This will likely result in false positive findings and invalid scientific conclusions. In this work, we introduce PoSt-Prediction Adaptive inference (PSPA), which allows valid and powerful inference based on ML-predicted data. Its "assumption-lean" property guarantees reliable statistical inference without assumptions on the ML prediction. Its "data-adaptive" feature guarantees an efficiency gain over existing methods, regardless of the accuracy of the ML prediction. We demonstrate the statistical superiority and broad applicability of our method through simulations and real-data applications.
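To make the problem in the abstract concrete, the following is a minimal sketch for the simplest target, a population mean. It is not the paper's PSPA procedure, which the abstract does not specify. It assumes a small labeled sample with gold-standard outcomes, a large unlabeled sample with ML predictions only, and a weighted correction whose weight is chosen to minimize the estimator's variance. The function names, the simulated data, and the weight formula are all assumptions made for illustration.

```python
import random
import statistics


def weighted_post_prediction_mean(y_lab, f_lab, f_unlab):
    """Estimate E[Y] from a few gold-standard outcomes plus many ML predictions.

    y_lab   : gold-standard outcomes on the labeled sample (size n)
    f_lab   : ML predictions on the same labeled sample
    f_unlab : ML predictions on the unlabeled sample (size N)
    Illustrative only; this is not the PSPA estimator from the paper.
    """
    n, big_n = len(y_lab), len(f_unlab)
    y_bar = statistics.fmean(y_lab)
    f_bar_lab = statistics.fmean(f_lab)
    f_bar_unlab = statistics.fmean(f_unlab)

    # Assumed variance-minimizing weight: cov(Y, f) / ((1 + n/N) * var(f)).
    var_f = statistics.variance(list(f_lab) + list(f_unlab))
    cov_yf = sum((y - y_bar) * (f - f_bar_lab)
                 for y, f in zip(y_lab, f_lab)) / (n - 1)
    omega = cov_yf / ((1 + n / big_n) * var_f) if var_f > 0 else 0.0

    # omega = 0 falls back to the labeled-only mean, so useless predictions
    # cannot hurt; informative predictions shrink the variance.
    estimate = y_bar + omega * (f_bar_unlab - f_bar_lab)
    return estimate, omega


def simulate(seed=0, n=200, big_n=5000):
    """Hypothetical simulation: true mean is 2.0, predictions are biased."""
    rng = random.Random(seed)

    def draw():
        x = rng.gauss(0.0, 1.0)
        y = 2.0 + x + rng.gauss(0.0, 1.0)  # gold-standard outcome
        f = 1.0 + 0.5 * x                  # deliberately biased prediction
        return y, f

    labeled = [draw() for _ in range(n)]
    unlabeled = [draw() for _ in range(big_n)]
    y_lab = [y for y, _ in labeled]
    f_lab = [f for _, f in labeled]
    f_unlab = [f for _, f in unlabeled]

    estimate, omega = weighted_post_prediction_mean(y_lab, f_lab, f_unlab)
    naive = statistics.fmean(f_unlab)      # treats predictions as truth
    labeled_only = statistics.fmean(y_lab)
    return estimate, naive, labeled_only, omega


if __name__ == "__main__":
    est, naive, labeled_only, omega = simulate()
    print(f"naive={naive:.3f} labeled_only={labeled_only:.3f} "
          f"corrected={est:.3f} omega={omega:.3f}")
```

In this simulated setup the naive average of predictions inherits the prediction bias, while the corrected estimate stays centered on the true mean because the labeled sample anchors it. The weight is estimated from the data, which is one simple way to read "data-adaptive"; the paper's own construction may differ.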
