Cross-Prediction-Powered Inference (2309.16598v3)

Published 28 Sep 2023 in stat.ML, cs.LG, and stat.ME

Abstract: While reliable data-driven decision-making hinges on high-quality labeled data, the acquisition of quality labels often involves laborious human annotations or slow and expensive scientific measurements. Machine learning is becoming an appealing alternative as sophisticated predictive techniques are being used to quickly and cheaply produce large amounts of predicted labels; e.g., predicted protein structures are used to supplement experimentally derived structures, predictions of socioeconomic indicators from satellite imagery are used to supplement accurate survey data, and so on. Since predictions are imperfect and potentially biased, this practice brings into question the validity of downstream inferences. We introduce cross-prediction: a method for valid inference powered by machine learning. With a small labeled dataset and a large unlabeled dataset, cross-prediction imputes the missing labels via machine learning and applies a form of debiasing to remedy the prediction inaccuracies. The resulting inferences achieve the desired error probability and are more powerful than those that only leverage the labeled data. Closely related is the recent proposal of prediction-powered inference, which assumes that a good pre-trained model is already available. We show that cross-prediction is consistently more powerful than an adaptation of prediction-powered inference in which a fraction of the labeled data is split off and used to train the model. Finally, we observe that cross-prediction gives more stable conclusions than its competitors; its confidence intervals typically have significantly lower variability.
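The estimator described in the abstract can be sketched for the simplest target, the mean of Y. The following is a minimal illustration, not the paper's implementation: it assumes the mean-estimation setting, uses K-fold cross-fitting on the labeled data, and stands in a plain least-squares fit for the machine-learning predictor. Each fold's model imputes labels on the unlabeled data, while its held-out labeled fold supplies the debiasing correction (true label minus prediction).

```python
import numpy as np

def cross_prediction_mean(X_lab, y_lab, X_unlab, K=5, seed=0):
    """Cross-fitted, debiased estimate of E[Y] (illustrative sketch).

    Trains K models, each on K-1 folds of the labeled data. Each model
    imputes labels for the unlabeled data; its held-out labeled fold
    supplies the debiasing term (observed label minus prediction).
    """
    rng = np.random.default_rng(seed)
    n = len(y_lab)
    folds = np.array_split(rng.permutation(n), K)
    imputed_means, debias_terms = [], []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        # Stand-in "model": ordinary least squares with an intercept,
        # fit only on the K-1 training folds (any ML predictor works here).
        A = np.column_stack([np.ones(len(train)), X_lab[train]])
        coef, *_ = np.linalg.lstsq(A, y_lab[train], rcond=None)
        predict = lambda X: np.column_stack([np.ones(len(X)), X]) @ coef
        # Impute labels on the large unlabeled set.
        imputed_means.append(predict(X_unlab).mean())
        # Debias using the held-out labeled fold.
        debias_terms.append((y_lab[held_out] - predict(X_lab[held_out])).mean())
    # Average imputed mean, corrected by the average held-out residual.
    return np.mean(imputed_means) + np.mean(debias_terms)
```

Because every prediction used for debiasing comes from a model that never saw that fold, the residual correction removes the bias of the imputed labels while the large unlabeled set keeps the variance low, matching the abstract's claim of validity plus extra power.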

Citations (14)
