
Prediction De-Correlated Inference: A safe approach for post-prediction inference (2312.06478v3)

Published 11 Dec 2023 in stat.ME and stat.ML

Abstract: In modern data analysis, it is common to use machine learning methods to predict outcomes on unlabeled datasets and then use these pseudo-outcomes in subsequent statistical inference. Inference in this setting is often called post-prediction inference. We propose a novel assumption-lean framework for statistical inference in the post-prediction setting, called Prediction De-Correlated Inference (PDC). Our approach is safe, in the sense that PDC automatically adapts to any black-box machine-learning model and consistently outperforms its supervised counterpart. The PDC framework also extends easily to accommodate multiple predictive models. Both numerical results and real-world data analysis demonstrate the superiority of PDC over state-of-the-art methods.
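For orientation, the sketch below illustrates the general post-prediction inference setup the abstract describes: a supervised estimate from the labeled sample is corrected using black-box predictions on both the labeled and unlabeled samples. This is a minimal illustration of the idea, not the paper's exact PDC estimator; the weight `omega` and the simulated data are hypothetical choices made only for demonstration.

```python
import numpy as np

# Minimal sketch of prediction-assisted inference for a population mean.
# NOT the paper's exact PDC construction; `omega` is a hypothetical
# data-driven weight shown only to convey the de-correlation idea.

rng = np.random.default_rng(0)

# Labeled data (X, Y) and a larger unlabeled sample (X only).
n, N = 200, 5000
x_lab = rng.normal(size=n)
y_lab = 2.0 * x_lab + rng.normal(size=n)
x_unlab = rng.normal(size=N)

# Any black-box predictor f(X); here a simple linear fit stands in.
slope, intercept = np.polyfit(x_lab, y_lab, 1)
f_lab = slope * x_lab + intercept        # predictions on labeled X
f_unlab = slope * x_unlab + intercept    # predictions on unlabeled X

# Supervised estimator: labeled-sample mean of Y.
theta_sup = y_lab.mean()

# Prediction-assisted estimator: adjust the supervised estimate by the
# discrepancy between predictions on the two samples, scaled by a weight
# chosen from the data so the correction cannot hurt asymptotically.
cov_yf = np.cov(y_lab, f_lab)[0, 1]
omega = (N / (n + N)) * cov_yf / f_lab.var(ddof=1)   # hypothetical weight
theta_adj = theta_sup - omega * (f_lab.mean() - f_unlab.mean())

print(f"supervised mean:       {theta_sup:.3f}")
print(f"prediction-assisted:   {theta_adj:.3f}")
```

When the predictor is uninformative, the estimated weight shrinks toward zero and the adjusted estimator falls back to the supervised one, which is the "safety" property the abstract emphasizes.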

