
Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees (2509.02896v1)

Published 2 Sep 2025 in cs.DB and cs.AI

Abstract: LLMs are being increasingly used as a building block in data systems to process large text datasets. To do so, LLM model providers offer multiple LLMs with different sizes, spanning various cost-quality trade-offs when processing text at scale. Top-of-the-line LLMs (e.g., GPT-4o, Claude Sonnet) operate with high accuracy but are prohibitively expensive when processing many records. To avoid high costs, more affordable but lower quality LLMs (e.g., GPT-4o-mini, Claude Haiku) can be used to process records, but we need to ensure that the overall accuracy does not deviate substantially from that of the top-of-the-line LLMs. The model cascade framework provides a blueprint to manage this trade-off, by using the confidence of LLMs in their output (e.g., log-probabilities) to decide on which records to use the affordable LLM. However, existing solutions following this framework provide only marginal cost savings and weak theoretical guarantees because of poor estimation of the quality of the affordable LLM's outputs. We present BARGAIN, a method that judiciously uses affordable LLMs in data processing to significantly reduce cost while providing strong theoretical guarantees on the solution quality. BARGAIN employs a novel adaptive sampling strategy and statistical estimation procedure that uses data and task characteristics and builds on recent statistical tools to make accurate estimations with tight theoretical guarantees. Variants of BARGAIN can support guarantees on accuracy, precision, or recall of the output. Experimental results across 8 real-world datasets show that BARGAIN reduces cost, on average, by up to 86% more than state-of-the-art, while providing stronger theoretical guarantees on accuracy of output, with similar gains when guaranteeing a desired level of precision or recall.

Summary

  • The paper presents BARGAIN, which employs adaptive sampling and a betting-based estimator to achieve strong, finite-sample guarantees while cutting expensive oracle calls.
  • It demonstrates up to 86% greater reduction in oracle usage than prior cascade methods such as SUPG, along with significant improvements in recall and precision.
  • BARGAIN dynamically selects cascade thresholds through data-aware sampling, enabling scalable, cost-efficient LLM-powered data processing in real-world applications.

Cost-Efficient LLM-Powered Data Processing with Guarantees: An Analysis of BARGAIN

Introduction and Motivation

The paper "Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees" (2509.02896) addresses the challenge of deploying LLMs for data processing at scale, where the high cost of top-tier models (e.g., GPT-4o) is prohibitive for large datasets. The central problem is to minimize inference costs by judiciously using cheaper proxy models (e.g., GPT-4o-mini) while providing rigorous guarantees on output quality—specifically, accuracy, precision, or recall—relative to the oracle model. The work critiques prior model cascade approaches, particularly SUPG, for their weak (asymptotic) guarantees and suboptimal utility, and introduces BARGAIN, a method that leverages adaptive sampling and modern statistical estimation to provide strong, non-asymptotic guarantees with significantly improved cost savings.

Model Cascade Framework and Problem Formalization

The model cascade paradigm is formalized as follows: Given a dataset $D$, an expensive oracle model $\mathcal{O}$, and a cheap proxy model $\mathcal{P}$, the system must decide for each record whether to use $\mathcal{P}$ or $\mathcal{O}$, based on a proxy confidence score $\mathcal{S}(x)$. The decision is parameterized by a cascade threshold $\rho$; records with $\mathcal{S}(x) \geq \rho$ are handled by the proxy, the rest by the oracle. The goal is to set $\rho$ to maximize utility (e.g., cost savings, recall, or precision) while ensuring, with high probability $1-\delta$, that the output meets a user-specified quality target $T$ (accuracy, precision, or recall).

Three query types are considered:

  • Accuracy Target (AT): Minimize oracle calls while ensuring output accuracy $\geq T$.
  • Precision Target (PT): Maximize recall under a fixed oracle budget, ensuring precision $\geq T$.
  • Recall Target (RT): Maximize precision under a fixed oracle budget, ensuring recall $\geq T$.
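
To make the cascade decision concrete, here is a minimal Python sketch of the routing step (illustrative only: the record type and the proxy/oracle interfaces are assumptions, not the paper's implementation):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Record:
    text: str

def cascade(
    records: List[Record],
    proxy: Callable[[Record], Tuple[str, float]],  # returns (label, confidence score S(x))
    oracle: Callable[[Record], str],               # expensive; treated as ground truth
    rho: float,                                    # cascade threshold
) -> Tuple[List[str], int]:
    """Route each record to the proxy if its confidence clears rho, else to the oracle."""
    labels, oracle_calls = [], 0
    for x in records:
        label, score = proxy(x)                    # cheap call made for every record
        if score >= rho:
            labels.append(label)                   # keep the proxy's output
        else:
            labels.append(oracle(x))               # fall back to the oracle
            oracle_calls += 1
    return labels, oracle_calls
```

Lower values of $\rho$ send more records to the proxy and save more cost; the difficulty is choosing $\rho$ so that the quality target still holds with probability $1-\delta$.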

Limitations of Prior Work

Existing methods, notably SUPG [kang2020approximate], use importance sampling and central limit theorem (CLT)-based estimation to set cascade thresholds. However, these approaches:

  • Provide only asymptotic guarantees, failing to control the probability of missing the target at finite sample sizes.
  • Rely on worst-case union bounds and data-agnostic estimation, leading to conservative thresholds and poor utility.
  • Do not adapt sampling to the quality target or the empirical distribution of proxy scores and labels, resulting in inefficient use of the oracle budget.
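
For intuition, the kind of CLT-based lower confidence bound these approaches rely on can be sketched as follows (a simplification; SUPG additionally uses importance-weighted sampling). Its coverage holds only asymptotically, which is exactly the weakness identified above:

```python
from statistics import NormalDist

def clt_lower_bound(successes: int, n: int, delta: float) -> float:
    """Approximate one-sided (1 - delta) lower confidence bound on a Bernoulli mean
    via the normal approximation. Coverage is only asymptotic: for small n, or for
    means near 0 or 1, the true failure probability can exceed delta."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - delta)            # one-sided normal quantile
    return p_hat - z * (p_hat * (1 - p_hat) / n) ** 0.5

# Example: 95 correct proxy outputs in a sample of 100, delta = 0.05
print(clt_lower_bound(95, 100, 0.05))              # ~0.914, with no finite-sample guarantee
```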

The BARGAIN Framework

BARGAIN introduces a principled, data- and task-aware approach to model cascade threshold selection, with the following key innovations:

1. Adaptive Sampling

Rather than sampling uniformly or by fixed importance weights, BARGAIN adaptively samples records based on the current candidate threshold and observed labels. For each candidate threshold $\rho$, it samples from $D^\rho$ (records with $\mathcal{S}(x) \geq \rho$), focusing the oracle budget on the region most relevant for threshold estimation. Sampling continues until the estimation procedure is confident about whether $\rho$ meets the target.

Figure 1: Overview of Model Cascade. The cascade threshold $\rho$ partitions records for proxy or oracle processing; BARGAIN adaptively samples to estimate the optimal $\rho$.
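
A minimal sketch of the adaptive sampling loop (illustrative; the interfaces and stopping rule are assumptions, with `lower_bound` and `upper_bound` standing in for BARGAIN's statistical estimator):

```python
import random
from typing import Callable, List, Sequence

def sample_until_decided(
    d_rho: Sequence,                                   # records with S(x) >= rho
    proxy_matches_oracle: Callable[[object], bool],    # one oracle call per invocation
    target: float,                                     # quality target T
    lower_bound: Callable[[List[int], float], float],  # (labels, delta) -> lower confidence bound
    upper_bound: Callable[[List[int], float], float],  # (labels, delta) -> upper confidence bound
    delta: float,
    budget: int,
) -> bool:
    """Return True iff, with confidence 1 - delta, the proxy meets `target` on d_rho."""
    pool = list(d_rho)
    random.shuffle(pool)                               # uniform sampling without replacement
    labels: List[int] = []
    for record in pool[:budget]:
        labels.append(int(proxy_matches_oracle(record)))   # spend one oracle call
        if lower_bound(labels, delta) >= target:
            return True                                # confidently above the target
        if upper_bound(labels, delta) < target:
            return False                               # confidently below the target
    return False                                       # budget exhausted without certifying rho
```

Stopping the moment a bound crosses the target is only sound if the bounds are anytime-valid, which is what the betting-based estimator described next provides.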

2. Modern Statistical Estimation

BARGAIN replaces classical concentration inequalities (e.g., Hoeffding, Chernoff) with a hypothesis-testing approach based on the betting framework of Waudby-Smith and Ramdas [waudby2024estimating]. This estimator leverages both the empirical mean and variance of observed labels, yielding tighter, data-dependent confidence bounds, especially when the variance is low (i.e., when the proxy is highly accurate at high scores).

Figure 2: Comparison of lower bounds on true precision at fixed sample sizes. BARGAIN's betting-based estimator is significantly tighter than Hoeffding or Chernoff, especially at high precision.
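
The sketch below shows a betting-style lower confidence bound for $[0,1]$-valued observations in this spirit; the bet-sizing rule and the grid search over candidate means are simplifications chosen for illustration, not the paper's exact estimator:

```python
import math
from typing import Sequence

def betting_lower_bound(xs: Sequence[float], delta: float, grid: int = 200) -> float:
    """One-sided (1 - delta) lower confidence bound on the mean of [0, 1]-valued data,
    built from a nonnegative test supermartingale ("wealth" process) and Ville's
    inequality, in the spirit of Waudby-Smith and Ramdas. Illustrative only."""
    n = len(xs)
    # Predictable bets: each lambda_t may depend only on x_1..x_{t-1}; truncating to
    # [0, 0.75] keeps every wealth factor strictly positive for any null m in [0, 1].
    lams, mean, var = [], 0.5, 0.25
    for t, x in enumerate(xs, start=1):
        lams.append(min(math.sqrt(2 * math.log(1 / delta) / (n * max(var, 1e-6))), 0.75))
        mean += (x - mean) / t                    # crude running mean ...
        var += ((x - mean) ** 2 - var) / t        # ... and running variance proxy
    lower = 0.0
    for k in range(1, grid + 1):
        m = k / grid                              # null hypothesis: true mean <= m
        wealth, rejected = 1.0, False
        for x, lam in zip(xs, lams):
            wealth *= 1.0 + lam * (x - m)         # supermartingale when the null is true
            if wealth >= 1.0 / delta:             # Ville's inequality: level-delta rejection
                rejected = True
                break
        if rejected:
            lower = m                             # the null "mean <= m" is ruled out
        else:
            break                                 # wealth is nonincreasing in m, so stop
    return lower

# Example: 95 correct and 5 incorrect proxy outputs, delta = 0.05
# print(betting_lower_bound([1.0] * 95 + [0.0] * 5, 0.05))
```

Because the bets adapt to the observed variance, the bound tightens quickly when the proxy is highly accurate, which is exactly the regime relevant for high proxy scores.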

3. Data-Aware Threshold Selection

BARGAIN's threshold selection algorithm incorporates a tolerance parameter $\eta$ to exploit empirical monotonicity in real-world datasets (i.e., precision/accuracy typically decreases monotonically with decreasing proxy score). This allows the method to avoid unnecessary union bounds over all candidate thresholds, further tightening guarantees and improving utility.

Figure 3: BARGAIN iterates over candidate thresholds, adaptively sampling and estimating until the optimal threshold is found.
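
A sketch of the resulting threshold scan (illustrative; `certify` stands in for the adaptive sampling and estimation step from the earlier sketches, and the handling of $\eta$ is simplified):

```python
from typing import Callable, Optional, Sequence

def select_threshold(
    candidate_thresholds: Sequence[float],  # sorted from highest (most conservative) downward
    certify: Callable[[float], bool],       # e.g., sample_until_decided from the earlier sketch
    eta: int = 0,                           # tolerance: failures allowed before stopping the scan
) -> Optional[float]:
    """Scan thresholds from most to least conservative and return the lowest certified one.

    With eta = 0 the scan stops at the first threshold that cannot be certified,
    relying on the empirical monotonicity described above; eta > 0 tolerates a few
    non-monotone failures before giving up."""
    best: Optional[float] = None
    consecutive_failures = 0
    for rho in candidate_thresholds:
        if certify(rho):                    # consumes oracle samples drawn from D^rho
            best = rho                      # a lower certified threshold saves more cost
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures > eta:
                break
    return best                             # None: nothing certified, so use the oracle everywhere
```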

4. Query-Specific Variants

BARGAIN instantiates the above principles for each query type:

  • BARGAIN$_A$ (AT): Adaptive sampling with accuracy estimation; supports both single and per-class thresholds.
  • BARGAIN$_P$ (PT): Adaptive sampling with precision estimation; maximizes recall.
  • BARGAIN$_R$ (RT): Adaptive sampling with recall estimation; maximizes precision. For highly imbalanced datasets, BARGAIN$_R$ introduces a positive density parameter $\beta$ to relax guarantees in a controlled manner, as strict guarantees are impossible when positives are rare (a simplified sketch of the density-based pre-filtering follows Figure 4).

Figure 4: Positive density in real-world datasets. Most positives are concentrated at high proxy scores, motivating BARGAIN$_R$'s density-based pre-filtering.
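
As a rough illustration of the density-based pre-filtering (the bucketing scheme and the way $\beta$ is applied below are simplifications, not the paper's exact rule), one can estimate the fraction of positives per proxy-score region from a small oracle-labeled pilot sample and restrict the recall-target query to regions whose estimated positive density clears $\beta$:

```python
from typing import List, Sequence, Tuple

def prefilter_by_positive_density(
    scores: Sequence[float],                 # proxy score S(x) for every record
    pilot: Sequence[Tuple[float, bool]],     # (score, oracle-verified positive?) pilot sample
    beta: float,                             # minimum estimated positive density to keep a region
    num_buckets: int = 20,
) -> List[int]:
    """Return indices of records retained for the recall-target query (hypothetical rule)."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / num_buckets or 1.0   # guard against a degenerate score range

    def bucket(s: float) -> int:
        return min(max(int((s - lo) / width), 0), num_buckets - 1)

    pos, tot = [0] * num_buckets, [0] * num_buckets
    for s, is_pos in pilot:
        tot[bucket(s)] += 1
        pos[bucket(s)] += int(is_pos)
    keep = [tot[b] > 0 and pos[b] / tot[b] >= beta for b in range(num_buckets)]
    return [i for i, s in enumerate(scores) if keep[bucket(s)]]
```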

Theoretical Guarantees

BARGAIN provides non-asymptotic, finite-sample guarantees: for any user-specified $\delta$, the probability that the selected threshold fails to meet the target is at most $\delta$. This is achieved by:

  • Using anytime-valid hypothesis tests for mean estimation [waudby2024estimating].
  • Carefully accounting for multiple testing across thresholds via the tolerance parameter $\eta$ and union bounds only where necessary.
  • Ensuring that, for each threshold, estimation is performed on an i.i.d. sample from the relevant subset, even under adaptive sampling.
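
In symbols (writing $\hat{\rho}$ for the selected threshold and $Q(\hat{\rho})$ for the realized accuracy, precision, or recall of the resulting output, notation introduced here for brevity), the guarantee takes the form

$$\Pr\big[\, Q(\hat{\rho}) \geq T \,\big] \geq 1 - \delta,$$

holding for every finite dataset and sample size rather than only in the asymptotic limit.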

Empirical Evaluation

BARGAIN is evaluated on eight real-world datasets, including both LLM-based and classical ML tasks, and compared to SUPG and a Naive (Hoeffding-based) baseline. Key findings:

  • AT Queries: BARGAIN reduces oracle usage by up to 86% more than SUPG, with both BARGAIN$_A$-A (single threshold) and BARGAIN$_A$-M (per-class thresholds) outperforming baselines.
  • PT/RT Queries: BARGAIN$_P$-A and BARGAIN$_R$-A achieve up to 118% higher recall and 19% higher precision, respectively, than SUPG, especially on imbalanced datasets.
  • Robustness: BARGAIN maintains its guarantees under adversarial and noisy proxy score settings, whereas SUPG frequently fails to meet the target when $\delta$ is small or the data is adversarially constructed.

Figure 5: Summary of AT query results. BARGAIN achieves substantially higher cost savings than the SUPG and Naive baselines.

Figure 6: Meeting the target on the Onto dataset. BARGAIN consistently meets the required target, while SUPG fails as $\delta$ decreases.

Implementation Considerations

  • Parameter Selection: The number of candidate thresholds $M$ and the minimum samples per threshold $c$ exhibit diminishing returns beyond moderate values (e.g., $M=20$ and $c$ between 1% and 5% of the data size). The tolerance parameter $\eta$ should be set to zero in most real-world settings due to empirical monotonicity.
  • Sampling: BARGAIN supports both with- and without-replacement sampling, with the latter enabling sample reuse across thresholds.
  • Computational Overhead: The adaptive sampling and estimation procedures are lightweight compared to LLM inference costs, and the method is scalable to large datasets.

    Figure 7: Impact of $M$, $c$, and $\eta$ on BARGAIN$_A$-A. Utility stabilizes for moderate $M$ and is robust to $c$ and $\eta$.

Trade-offs and Limitations

  • Strictness of Guarantees: BARGAIN's guarantees are non-asymptotic and hold for any finite sample size, in contrast to SUPG's asymptotic guarantees. However, for extremely imbalanced datasets (very low positive rates), strict recall guarantees are impossible without sacrificing utility; BARGAIN$_R$-A allows controlled relaxation via $\beta$.
  • Calibration Dependence: While BARGAIN's guarantees do not require well-calibrated proxy scores, utility is maximized when proxy confidence is well-aligned with true correctness. Poor calibration can reduce the fraction of records eligible for proxy processing.
  • Extension to Open-Ended Tasks: The current framework is tailored to classification tasks; extending to open-ended generation or semantic joins requires further research on proxy score definition and estimation.

Practical Implications and Future Directions

BARGAIN provides a practical, theoretically sound solution for cost-efficient LLM-powered data processing in production systems, enabling substantial cost savings without sacrificing output quality. Its modular design allows integration into existing LLM orchestration and data management frameworks. Future work includes:

  • Extending BARGAIN to open-ended tasks and entity matching, where proxy score calibration and transitivity properties may be leveraged for further optimization.
  • Investigating adaptive candidate threshold selection and more sophisticated proxy routing in multi-model cascades.
  • Exploring tighter integration with uncertainty calibration techniques [krishnan_improving_2020, kapoor2024calibration] to further improve utility.

Conclusion

BARGAIN advances the state of the art in LLM-powered data processing by combining adaptive, task-aware sampling with modern statistical estimation to deliver strong, non-asymptotic guarantees and superior empirical utility. Its principled approach addresses the limitations of prior work and provides a robust foundation for scalable, cost-effective deployment of LLMs in data-centric applications.
