Estimating prevalence with precision and accuracy (2507.06061v1)
Abstract: Unlike classification, which aims to predict the class of each data point in a dataset, prevalence estimation or quantification aims to estimate the distribution of classes across the dataset as a whole. The two main tasks in prevalence estimation are to adjust for the bias induced by the class prevalence of the training dataset and to quantify the uncertainty in the resulting estimate. The standard methods for quantifying uncertainty in prevalence estimates are bootstrapping and Bayesian quantification methods. It is not clear which approach is preferable in terms of precision (i.e., the width of confidence intervals) and coverage (i.e., how well-calibrated those intervals are). Here, we propose the Precise Quantifier (PQ), a Bayesian quantifier that is more precise than existing quantifiers while maintaining well-calibrated coverage. We discuss the theory behind PQ and present experiments based on simulated and real-world datasets. Through these experiments, we establish the factors that influence quantification precision: the discriminatory power of the underlying classifier, the size of the labeled dataset used to train the quantifier, and the size of the unlabeled dataset for which prevalence is estimated. Our analysis provides deep insights into uncertainty quantification for quantification learning.
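To make the bootstrap baseline mentioned in the abstract concrete, below is a minimal sketch of prevalence estimation via adjusted classify-and-count (ACC, the standard bias correction from Forman 2005) with percentile-bootstrap confidence intervals. This is an illustrative assumption about the baseline setup, not the paper's PQ method; the function names are hypothetical. The interval width corresponds to the abstract's notion of precision, and coverage is the frequency with which such intervals contain the true prevalence.

```python
# Sketch: adjusted classify-and-count (ACC) prevalence estimation with
# percentile-bootstrap confidence intervals. Illustrates the standard
# bootstrap baseline discussed in the abstract, NOT the paper's PQ method.
import numpy as np

def acc_prevalence(y_pred_unlabeled, tpr, fpr):
    """Correct the raw predicted-positive rate for classifier error
    using true/false positive rates estimated on labeled data."""
    raw = np.mean(y_pred_unlabeled)   # fraction predicted positive
    denom = tpr - fpr
    if abs(denom) < 1e-12:            # classifier has no discriminatory power
        return raw
    return float(np.clip((raw - fpr) / denom, 0.0, 1.0))

def bootstrap_ci(y_pred_unlabeled, y_true_lab, y_pred_lab,
                 n_boot=1000, alpha=0.05, seed=None):
    """Percentile bootstrap resampling both the labeled set (uncertainty
    in tpr/fpr) and the unlabeled set (uncertainty in the raw rate)."""
    rng = np.random.default_rng(seed)
    n_lab, n_unlab = len(y_true_lab), len(y_pred_unlabeled)
    estimates = []
    for _ in range(n_boot):
        lab_idx = rng.integers(0, n_lab, n_lab)
        unlab_idx = rng.integers(0, n_unlab, n_unlab)
        yt, yp = y_true_lab[lab_idx], y_pred_lab[lab_idx]
        tpr = np.mean(yp[yt == 1]) if np.any(yt == 1) else 0.0
        fpr = np.mean(yp[yt == 0]) if np.any(yt == 0) else 0.0
        estimates.append(acc_prevalence(y_pred_unlabeled[unlab_idx], tpr, fpr))
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lo, hi   # interval width = precision; calibration = coverage
```

In practice, `y_pred_lab` would typically come from cross-validated predictions on the labeled set, so that the tpr/fpr estimates are not optimistically biased by training on the same points.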