Active Statistical Inference (2403.03208v2)
Abstract: Inspired by the concept of active learning, we propose active inference, a methodology for statistical inference with machine-learning-assisted data collection. Assuming a budget on the number of labels that can be collected, the methodology uses a machine learning model to identify which data points would be most beneficial to label, thus effectively utilizing the budget. It operates on a simple yet powerful intuition: prioritize the collection of labels for data points where the model exhibits uncertainty, and rely on the model's predictions where it is confident. Active inference constructs provably valid confidence intervals and hypothesis tests while leveraging any black-box machine learning model and handling any data distribution. The key point is that it achieves the same level of accuracy with far fewer samples than existing baselines relying on non-adaptively-collected data. This means that for the same number of collected samples, active inference enables smaller confidence intervals and more powerful p-values. We evaluate active inference on datasets from public opinion research, census analysis, and proteomics.
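The intuition in the abstract (label where the model is uncertain, trust predictions where it is confident) can be illustrated for the simplest target, a population mean. The sketch below is an assumption-laden illustration, not necessarily the paper's exact procedure: it assumes a fixed, pre-trained model, a per-point uncertainty score supplied by the user, and an inverse-probability-weighted correction so that the estimate stays unbiased regardless of model quality. The names `sampling_probs`, `active_mean_ci`, and `query_label`, and the `floor` clipping parameter, are hypothetical.

```python
import math
import random
from statistics import NormalDist


def sampling_probs(uncertainty, budget, floor=0.01):
    """Map uncertainty scores to labeling probabilities.

    Probabilities are proportional to uncertainty and scaled toward an
    expected `budget` labels, then clipped to [floor, 1] so that 1/p stays
    bounded. Clipping means the expected label count only roughly matches
    the budget (an assumption of this sketch).
    """
    n = len(uncertainty)
    total = sum(uncertainty)
    if total <= 0:
        # No uncertainty information: fall back to uniform sampling.
        return [min(1.0, max(floor, budget / n))] * n
    return [min(1.0, max(floor, budget * u / total)) for u in uncertainty]


def active_mean_ci(preds, probs, query_label, alpha=0.1, rng=None):
    """Estimate a mean from predictions plus a random subset of labels.

    Each point contributes its prediction f. With probability p its label y
    is collected and the correction (y - f) / p is added, so every term has
    expectation y. The interval is a normal approximation over these terms.
    """
    rng = rng or random.Random(0)
    terms, n_labeled = [], 0
    for i, (f, p) in enumerate(zip(preds, probs)):
        term = f
        if rng.random() < p:  # decide whether to spend budget on point i
            term += (query_label(i) - f) / p
            n_labeled += 1
        terms.append(term)
    n = len(terms)
    est = sum(terms) / n
    var = sum((t - est) ** 2 for t in terms) / (n - 1)
    half = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(var / n)
    return est, (est - half, est + half), n_labeled


if __name__ == "__main__":
    rng = random.Random(1)
    n = 5000
    labels = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Hypothetical model: accurate on the first half, noisy on the second.
    noise = [0.1 if i < n // 2 else 1.0 for i in range(n)]
    preds = [y + rng.gauss(0.0, s) for y, s in zip(labels, noise)]
    probs = sampling_probs(noise, budget=500)
    est, ci, used = active_mean_ci(preds, probs, lambda i: labels[i], rng=rng)
    print(f"estimate={est:.3f}, 90% CI=({ci[0]:.3f}, {ci[1]:.3f}), labels={used}")
```

In this sketch, points with small prediction error contribute little variance even when unlabeled, so the budget is concentrated where the correction term matters most. The normal-approximation interval assumes the per-point terms are independent, which holds here only because the model and sampling rule are fixed before any labels are collected.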