Calibrate-Extrapolate: Rethinking Prevalence Estimation with Black Box Classifiers
Abstract: In computational social science, researchers often use a pre-trained, black box classifier to estimate the frequency of each class in unlabeled datasets. A variety of prevalence estimation techniques have been developed in the literature, each yielding an unbiased estimate if a certain stability assumption holds. This work introduces a framework that recasts prevalence estimation as two steps: calibrating the classifier outputs against ground truth labels to obtain the joint distribution of a base dataset, and then extrapolating to the joint distribution of a target dataset. We call this framework "Calibrate-Extrapolate". It clarifies what stability assumptions must hold for a prevalence estimation technique to yield accurate estimates. In the calibration phase, the techniques assume only a stable calibration curve between a calibration dataset and the full base dataset. This allows the classifier outputs to be used for disproportionate random sampling, thus improving the efficiency of calibration. In the extrapolation phase, some techniques assume a stable calibration curve while others assume stable class-conditional densities. We discuss the stability assumptions from a causal perspective. By specifying base and target joint distributions, we can generate simulated datasets as a way to build intuitions about the impacts of assumption violations. This also leads to a better understanding of how the classifier's predictive power affects the accuracy of prevalence estimates: the greater the predictive power, the lower the sensitivity to violations of stability assumptions in the extrapolation phase. We illustrate the framework with an application that estimates the prevalence of toxic comments on news topics over time on Reddit, Twitter/X, and YouTube, using Jigsaw's Perspective API as a black box classifier. Finally, we summarize several pieces of practical advice for prevalence estimation.
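To make the two extrapolation assumptions concrete, the following is a minimal sketch rather than the authors' implementation. It assumes classifier scores have been discretized into four bins and that the base joint distribution is already known from calibration. The numbers in `base_joint`, the function names, and the least-squares fit used for the class-conditional case are all illustrative assumptions that do not come from the paper.

```python
import numpy as np

# Hypothetical base joint distribution P(C=bin, Y=label) over 4 score bins.
# Row 0 holds negatives, row 1 holds positives; all entries sum to 1,
# so the base prevalence is 0.30.
base_joint = np.array([
    [0.40, 0.20, 0.08, 0.02],  # P(C=bin, Y=0)
    [0.02, 0.08, 0.10, 0.10],  # P(C=bin, Y=1)
])


def estimate_with_stable_calibration_curve(base_joint, target_scores):
    """Assume P(Y=1 | C) is the same in the base and target datasets."""
    curve = base_joint[1] / base_joint.sum(axis=0)  # P(Y=1 | C=bin)
    return float(target_scores @ curve)


def estimate_with_stable_class_conditionals(base_joint, target_scores):
    """Assume P(C | Y) is the same in the base and target datasets.

    The target score distribution is modeled as a two-component mixture,
    p * P(C | Y=1) + (1 - p) * P(C | Y=0), and p is fitted by least squares
    (an assumed fitting criterion, chosen here for its closed form).
    """
    neg = base_joint[0] / base_joint[0].sum()  # P(C=bin | Y=0)
    pos = base_joint[1] / base_joint[1].sum()  # P(C=bin | Y=1)
    diff = pos - neg
    p = diff @ (target_scores - neg) / (diff @ diff)
    return float(np.clip(p, 0.0, 1.0))


# Simulated target: class-conditional densities unchanged, prevalence 0.50.
true_target_prevalence = 0.5
neg = base_joint[0] / base_joint[0].sum()
pos = base_joint[1] / base_joint[1].sum()
target_scores = true_target_prevalence * pos + (1 - true_target_prevalence) * neg

est_curve = estimate_with_stable_calibration_curve(base_joint, target_scores)
est_mixture = estimate_with_stable_class_conditionals(base_joint, target_scores)
print(f"stable calibration curve estimate:   {est_curve:.3f}")
print(f"stable class-conditionals estimate:  {est_mixture:.3f}")
```

In this simulated target the class-conditional densities are stable by construction, so the mixture-style estimate recovers the true prevalence, while the estimate that wrongly assumes a stable calibration curve lands between the base prevalence (0.30) and the true target prevalence (0.50). Swapping in a target built with a fixed calibration curve instead would reverse which estimator is accurate.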