
Feasibility of collecting gold-standard validation labels for more than one-quarter of the sample

Determine, for applied economics studies that use large language models to automate the measurement of text-based economic concepts for downstream estimation, the proportion of applications in which it is practically feasible, given time and cost constraints, to collect gold-standard validation labels for more than 25% of the study sample.


Background

The paper shows that bias-correcting estimates with a validation sample can restore validity when using LLM outputs for estimation tasks. In Monte Carlo simulations based on Congressional bill topics, the authors find that precision gains from combining LLM labels with validation data dissipate when the validation sample reaches roughly 25% of the full sample.
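To make the mechanics concrete, the sketch below illustrates one common form of validation-based bias correction: the average LLM label over the full sample is adjusted by the LLM's average error, estimated on a gold-standard validation subsample. This is a minimal illustration of the general idea, not the paper's exact estimator, and all data, error rates, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: y is the true (gold-standard) binary topic label for
# each document; y_llm is the LLM-assigned label, correlated with y but
# systematically in error on a fraction of documents.
n = 10_000
y = rng.binomial(1, 0.30, size=n)        # true topic indicator
flip = rng.random(n) < 0.10              # LLM mislabels ~10% of documents
y_llm = np.where(flip, 1 - y, y)         # biased LLM labels

# Draw a gold-standard validation subsample covering a chosen share of the data.
val_share = 0.25
val_idx = rng.choice(n, size=int(val_share * n), replace=False)

# Naive estimator: average the LLM labels over the full sample (biased).
theta_naive = y_llm.mean()

# Bias-corrected estimator: subtract the LLM's average error, estimated on
# the validation subsample where both labels are observed.
bias_hat = (y_llm[val_idx] - y[val_idx]).mean()
theta_corrected = y_llm.mean() - bias_hat

# Validation-only estimator: use gold-standard labels alone.
theta_val_only = y[val_idx].mean()

print(f"true share:      {y.mean():.3f}")
print(f"naive LLM:       {theta_naive:.3f}")
print(f"bias-corrected:  {theta_corrected:.3f}")
print(f"validation only: {theta_val_only:.3f}")
```

In this style of estimator, the precision advantage over using the validation labels alone shrinks as the validation share grows, which is the intuition behind the roughly 25% threshold reported in the simulations.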

This raises a practical question about the real-world feasibility of collecting such a large proportion of gold-standard labels across applied settings, given the time and financial costs of human annotation. Quantifying this feasibility is important for determining when the proposed validation-based approach is operationally viable.

References

The share of applications in which it would actually be feasible (from a time and cost perspective) to collect gold-standard labels for more than a quarter of the sample is an open question.

Large Language Models: An Applied Econometric Framework (2412.07031 - Ludwig et al., 9 Dec 2024) in Section 5, Subsection "Monte Carlo Simulations based on Congressional Legislation"