Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees (2509.02896v1)

Published 2 Sep 2025 in cs.DB and cs.AI

Abstract: LLMs are being increasingly used as a building block in data systems to process large text datasets. To do so, LLM model providers offer multiple LLMs with different sizes, spanning various cost-quality trade-offs when processing text at scale. Top-of-the-line LLMs (e.g., GPT-4o, Claude Sonnet) operate with high accuracy but are prohibitively expensive when processing many records. To avoid high costs, more affordable but lower quality LLMs (e.g., GPT-4o-mini, Claude Haiku) can be used to process records, but we need to ensure that the overall accuracy does not deviate substantially from that of the top-of-the-line LLMs. The model cascade framework provides a blueprint to manage this trade-off, by using the confidence of LLMs in their output (e.g., log-probabilities) to decide on which records to use the affordable LLM. However, existing solutions following this framework provide only marginal cost savings and weak theoretical guarantees because of poor estimation of the quality of the affordable LLM's outputs. We present BARGAIN, a method that judiciously uses affordable LLMs in data processing to significantly reduce cost while providing strong theoretical guarantees on the solution quality. BARGAIN employs a novel adaptive sampling strategy and statistical estimation procedure that uses data and task characteristics and builds on recent statistical tools to make accurate estimations with tight theoretical guarantees. Variants of BARGAIN can support guarantees on accuracy, precision, or recall of the output. Experimental results across 8 real-world datasets show that BARGAIN reduces cost, on average, by up to 86% more than state-of-the-art, while providing stronger theoretical guarantees on accuracy of output, with similar gains when guaranteeing a desired level of precision or recall.

Summary

The paper presents BARGAIN, which employs adaptive sampling and a betting-based estimator to achieve strong, finite-sample guarantees while cutting expensive oracle calls.
It reduces oracle usage by up to 86% more than the prior state of the art, SUPG, and achieves higher recall and precision on precision-target and recall-target queries.
BARGAIN dynamically selects cascade thresholds through data-aware sampling, enabling scalable, cost-efficient LLM-powered data processing in real-world applications.

Cost-Efficient LLM-Powered Data Processing with Guarantees: An Analysis of BARGAIN

Introduction and Motivation

The paper "Cut Costs, Not Accuracy: LLM-Powered Data Processing with Guarantees" (2509.02896) addresses the challenge of deploying LLMs for data processing at scale, where the cost of top-tier models (e.g., GPT-4o) is prohibitive for large datasets. The central problem is to minimize inference cost by judiciously using cheaper proxy models (e.g., GPT-4o-mini) while providing rigorous guarantees on output quality (accuracy, precision, or recall) relative to the oracle model. The work critiques prior model cascade approaches, particularly SUPG, for their weak (asymptotic) guarantees and suboptimal utility, and introduces BARGAIN, a method that uses adaptive sampling and modern statistical estimation to provide strong, non-asymptotic guarantees with significantly improved cost savings.

Model Cascade Framework and Problem Formalization

The model cascade paradigm is formalized as follows: Given a dataset $D$, an expensive oracle model $\mathcal{O}$, and a cheap proxy model $\mathcal{P}$, the system must decide for each record whether to use $\mathcal{P}$ or $\mathcal{O}$, based on a proxy confidence score $\mathcal{S}(x)$. The decision is parameterized by a cascade threshold $\rho$; records with $\mathcal{S}(x) \geq \rho$ are handled by the proxy, others by the oracle. The goal is to set $\rho$ to maximize utility (e.g., cost savings, recall, or precision) while ensuring, with high probability $1-\delta$, that the output meets a user-specified quality target $T$ (accuracy, precision, or recall).
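The routing rule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function name `run_cascade` and the callable signatures for `proxy` and `oracle` are assumptions made for the example, and the proxy is assumed to return its output together with a confidence score.

```python
from typing import Callable, Sequence, TypeVar

Record = TypeVar("Record")


def run_cascade(
    records: Sequence[Record],
    proxy: Callable[[Record], tuple[str, float]],  # hypothetical: (output, score)
    oracle: Callable[[Record], str],               # hypothetical: expensive model
    rho: float,
) -> tuple[list[str], int]:
    """Route each record to the proxy or the oracle using threshold rho.

    Returns the per-record outputs and the number of oracle calls made.
    """
    outputs: list[str] = []
    oracle_calls = 0
    for x in records:
        proxy_output, score = proxy(x)
        if score >= rho:
            # Proxy is confident enough: keep its answer.
            outputs.append(proxy_output)
        else:
            # Otherwise pay for the oracle.
            outputs.append(oracle(x))
            oracle_calls += 1
    return outputs, oracle_calls
```

Lowering `rho` sends more records to the proxy and reduces oracle calls, at the risk of lower output quality; choosing `rho` safely is the problem the rest of the analysis addresses.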

Three query types are considered:

Accuracy Target (AT): Minimize oracle calls while ensuring output accuracy $\geq T$.
Precision Target (PT): Maximize recall under a fixed oracle budget, ensuring precision $\geq T$.
Recall Target (RT): Maximize precision under a fixed oracle budget, ensuring recall $\geq T$.
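Because quality is measured relative to the oracle, the oracle's labels play the role of ground truth in all three targets. The sketch below computes the three metrics for a binary task under that convention. The helper name `quality_metrics` and the choice to return NaN for an undefined ratio are assumptions made for this illustration.

```python
import math
from typing import Hashable, Sequence


def quality_metrics(
    outputs: Sequence[Hashable],
    oracle_labels: Sequence[Hashable],
    positive: Hashable = True,
) -> dict[str, float]:
    """Accuracy, precision, and recall of cascade outputs against oracle labels."""
    if len(outputs) != len(oracle_labels):
        raise ValueError("outputs and oracle_labels must have the same length")
    n = len(outputs)
    pairs = list(zip(outputs, oracle_labels))
    correct = sum(o == y for o, y in pairs)
    true_pos = sum(o == positive and y == positive for o, y in pairs)
    predicted_pos = sum(o == positive for o, _ in pairs)
    actual_pos = sum(y == positive for _, y in pairs)
    return {
        # NaN marks a ratio with an empty denominator (an assumed convention).
        "accuracy": correct / n if n else math.nan,
        "precision": true_pos / predicted_pos if predicted_pos else math.nan,
        "recall": true_pos / actual_pos if actual_pos else math.nan,
    }
```

In practice the oracle labels are not available for every record (that is the cost being avoided), so these quantities have to be estimated from a sample.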

Limitations of Prior Work

Existing methods, notably SUPG [kang2020approximate], use importance sampling and central limit theorem (CLT)-based estimation to set cascade thresholds. However, these approaches:

Provide only asymptotic guarantees, failing to control the probability of missing the target at finite sample sizes.
Rely on worst-case union bounds and data-agnostic estimation, leading to conservative thresholds and poor utility.
Do not adapt sampling to the quality target or the empirical distribution of proxy scores and labels, resulting in inefficient use of the oracle budget.

The BARGAIN Framework

BARGAIN introduces a principled, data- and task-aware approach to model cascade threshold selection, with the following key innovations:

1. Adaptive Sampling

Rather than sampling uniformly or by fixed importance weights, BARGAIN adaptively samples records based on the current candidate threshold and observed labels. For each candidate threshold $\rho$, it samples from $D^\rho$ (records with $\mathcal{S}(x) \geq \rho$), focusing the oracle budget on the region most relevant for threshold estimation. Sampling continues until the estimation procedure is confident about whether $\rho$ meets the target.
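A simplified sketch of this loop for a single candidate threshold follows. It is an illustration under stated assumptions rather than the paper's algorithm: `is_proxy_correct` is a hypothetical callback that spends one oracle call to check the proxy's output on a record, the stopping rule is a basic constant-bet wealth test standing in for the paper's estimator, and sampling is with replacement so that draws from $D^\rho$ are i.i.d.

```python
import math
import random
from typing import Callable, Sequence, TypeVar

Record = TypeVar("Record")


def sample_until_confident(
    records: Sequence[Record],
    scores: Sequence[float],
    rho: float,
    is_proxy_correct: Callable[[Record], bool],  # hypothetical: costs one oracle call
    target: float,
    delta: float,
    budget: int,
    rng: random.Random,
    bet: float = 1.0,
) -> tuple[bool, int]:
    """Sample from D^rho until the target is confirmed or the budget runs out.

    Returns (accepted, oracle_calls). Acceptance means the null hypothesis
    "accuracy on D^rho <= target" was rejected at level delta.
    """
    if not 0.0 < target < 1.0 or not 0.0 < delta < 1.0:
        raise ValueError("target and delta must lie strictly between 0 and 1")
    if not 0.0 <= bet < 1.0 / target:
        raise ValueError("bet must lie in [0, 1/target) to keep wealth positive")
    eligible = [x for x, s in zip(records, scores) if s >= rho]  # D^rho
    if not eligible:
        return False, 0
    log_wealth = 0.0
    log_threshold = math.log(1.0 / delta)
    for calls in range(1, budget + 1):
        x = rng.choice(eligible)  # uniform draw with replacement
        correct = 1.0 if is_proxy_correct(x) else 0.0
        # Wealth grows on correct proxy outputs and shrinks on errors.
        log_wealth += math.log1p(bet * (correct - target))
        if log_wealth >= log_threshold:
            return True, calls  # confident that rho meets the target
    return False, budget  # undecided within budget: treat rho as not confirmed
```

The number of oracle calls adapts to the data: a proxy that is almost always correct above `rho` is confirmed after few samples, while a borderline one consumes more of the budget.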

Figure 1: Overview of Model Cascade. The cascade threshold $\rho$ partitions records for proxy or oracle processing; BARGAIN adaptively samples to estimate the optimal $\rho$.

2. Modern Statistical Estimation

BARGAIN replaces classical concentration inequalities (e.g., Hoeffding, Chernoff) with a hypothesis-testing approach based on the betting framework of Waudby-Smith and Ramdas [waudby2024estimating]. This estimator leverages both the empirical mean and variance of observed labels, yielding tighter, data-dependent confidence bounds, especially when the variance is low (i.e., when the proxy is highly accurate at high scores).
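To give a feel for why a betting-style test can be tighter than Hoeffding when the proxy is highly accurate, here is a deliberately simplified sketch. It uses a constant bet, whereas the estimator described above adapts to the observed mean and variance, so this is a stand-in and not the paper's procedure. The function names and the default `bet=1.0` are assumptions made for the example. The test tracks a wealth process $K_t = \prod_{i \le t} (1 + \lambda (X_i - T))$ and rejects the null hypothesis "mean $\leq T$" once $K_t \geq 1/\delta$.

```python
import math
from typing import Iterable, Optional


def betting_test(
    labels: Iterable[float], target: float, delta: float, bet: float = 1.0
) -> Optional[int]:
    """Sequentially test H0: mean <= target for observations in [0, 1].

    Returns the 1-based index of the first observation at which H0 is
    rejected at level delta, or None if it is never rejected.
    """
    if not 0.0 < target < 1.0 or not 0.0 < delta < 1.0:
        raise ValueError("target and delta must lie strictly between 0 and 1")
    if not 0.0 <= bet < 1.0 / target:
        raise ValueError("bet must lie in [0, 1/target) to keep wealth positive")
    log_wealth = 0.0
    log_threshold = math.log(1.0 / delta)
    for t, x in enumerate(labels, start=1):
        # Under H0 each factor has expectation at most 1, so wealth is a
        # nonnegative supermartingale and rarely reaches 1/delta.
        log_wealth += math.log1p(bet * (x - target))
        if log_wealth >= log_threshold:
            return t
    return None


def hoeffding_samples_needed(target: float, delta: float) -> int:
    """Smallest fixed n at which an all-correct sample yields a Hoeffding
    lower bound (1 - sqrt(ln(1/delta) / (2n))) of at least target."""
    return math.ceil(math.log(1.0 / delta) / (2.0 * (1.0 - target) ** 2))
```

With `target=0.9`, `delta=0.1`, and a sample in which every proxy output is correct, this constant-bet test rejects after 25 observations, while the Hoeffding bound needs 116. The Hoeffding figure is for a single fixed sample size and, unlike the wealth test, does not remain valid under optional stopping.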

Figure 2: Comparison of lower bounds on true precision at fixed sample sizes. BARGAIN's estimator (betting-based) is significantly tighter than Hoeffding or Chernoff, especially at high precision.

3. Data-Aware Threshold Selection

BARGAIN's threshold selection algorithm incorporates a tolerance parameter $\eta$ to exploit empirical monotonicity in real-world datasets (i.e., precision/accuracy typically decreases monotonically with decreasing proxy score). This allows the method to avoid unnecessary union bounds over all candidate thresholds, further tightening guarantees and improving utility.
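The control flow of such a scan might look like the sketch below. The semantics given to $\eta$ here are an assumption made for illustration, namely that $\eta$ counts how many failed candidates are tolerated before the scan stops; the summary above does not spell this out. `meets_target` is a hypothetical callback that runs the sampling and estimation step for one threshold, and the way the confidence budget $\delta$ is divided across tests is left to that callback.

```python
from typing import Callable, Iterable, Optional


def select_threshold(
    candidates: Iterable[float],
    meets_target: Callable[[float], bool],  # hypothetical per-threshold test
    eta: int = 0,                           # assumed: number of tolerated failures
) -> Optional[float]:
    """Scan candidate thresholds from highest to lowest.

    Returns the lowest threshold confirmed to meet the target before more
    than `eta` candidates have failed, or None if none was confirmed.
    """
    if eta < 0:
        raise ValueError("eta must be non-negative")
    best: Optional[float] = None
    failures = 0
    for rho in sorted(candidates, reverse=True):
        if meets_target(rho):
            best = rho  # a lower confirmed threshold means more proxy usage
        else:
            failures += 1
            if failures > eta:
                break  # under monotonicity, lower thresholds would fail too
    return best
```

With `eta=0` the scan stops at the first failure, which matches the intuition that if quality decreases monotonically with the threshold, no lower candidate can succeed.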

Figure 3: Overview of Model Cascade. BARGAIN iterates over candidate thresholds, adaptively sampling and estimating until the optimal threshold is found.

4. Query-Specific Variants

BARGAIN instantiates the above principles for each query type:

BARGAIN$_A$ (AT): Adaptive sampling with accuracy estimation; supports both single and per-class thresholds.
BARGAIN$_P$ (PT): Adaptive sampling with precision estimation; maximizes recall.
BARGAIN$_R$ (RT): Adaptive sampling with recall estimation; maximizes precision. For highly imbalanced datasets, BARGAIN$_R$ introduces a positive density parameter $\beta$ to relax guarantees in a controlled manner, as strict guarantees are impossible when positives are rare.

Figure 4: Positive density in real-world datasets. Most positives are concentrated at high proxy scores, motivating BARGAIN$_R$'s density-based pre-filtering.

Theoretical Guarantees

BARGAIN provides non-asymptotic, finite-sample guarantees: For any user-specified $\delta$, the probability that the selected threshold fails to meet the target is at most $\delta$. This is achieved by:

Using anytime-valid hypothesis tests for mean estimation [waudby2024estimating].
Carefully accounting for multiple testing across thresholds via the tolerance parameter $\eta$ and union bounds only where necessary.
Ensuring that, for each threshold, estimation is performed on an i.i.d. sample from the relevant subset, even under adaptive sampling.
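
In symbols, the guarantee can be written as follows, where $\hat{\rho}$ denotes the selected threshold and $Q(\rho)$ the quality metric (accuracy, precision, or recall) obtained at threshold $\rho$. Both symbols are notation introduced here for illustration, not taken from the paper.

$$
\Pr\left[\, Q(\hat{\rho}) < T \,\right] \leq \delta
$$

Anytime-valid tests of the betting type generally rest on Ville's inequality, stated here as background rather than as the paper's exact construction. For a nonnegative supermartingale $(K_t)_{t \geq 0}$ with $K_0 = 1$,

$$
\Pr\left[\, \exists\, t \geq 1 : K_t \geq 1/\delta \,\right] \leq \delta
$$

so rejecting a hypothesis the first time the wealth $K_t$ reaches $1/\delta$ keeps the error probability below $\delta$ no matter when sampling stops.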

Empirical Evaluation

BARGAIN is evaluated on eight real-world datasets, including both LLM-based and classical ML tasks, and compared to SUPG and a Naive (Hoeffding-based) baseline. Key findings:

AT Queries: BARGAIN reduces oracle usage by up to 86% more than SUPG, with both BARGAIN$_A$-A (single threshold) and BARGAIN$_A$-M (per-class thresholds) outperforming baselines.
PT/RT Queries: BARGAIN$_P$-A and BARGAIN$_R$-A achieve up to 118% higher recall and 19% higher precision, respectively, than SUPG, especially on imbalanced datasets.
Robustness: BARGAIN maintains guarantees under adversarial and noisy proxy score settings, whereas SUPG frequently fails to meet the target when $\delta$ is small or the data is adversarially constructed.

Figure 5: Summary of AT Query Results. BARGAIN achieves substantially higher cost savings than SUPG and Naive baselines.

Figure 6: Meeting target in Onto Dataset. BARGAIN consistently meets the required target, while SUPG fails as $\delta$ decreases.

Implementation Considerations

Parameter Selection: The number of candidate thresholds $M$ and minimum samples per threshold $c$ exhibit diminishing returns beyond moderate values (e.g., $M=20$ and $c$ between $1\%$ and $5\%$ of the data size). The tolerance parameter $\eta$ should be set to zero in most real-world settings due to empirical monotonicity.
Sampling: BARGAIN supports both with- and without-replacement sampling, with the latter enabling sample reuse across thresholds.
Computational Overhead: The adaptive sampling and estimation procedures are lightweight compared to LLM inference costs, and the method is scalable to large datasets.

Figure 7: Impact of $M$, $c$, and $\eta$ on BARGAIN$_A$-A. Utility stabilizes for moderate $M$ and is robust to $c$ and $\eta$.

Trade-offs and Limitations

Strictness of Guarantees: BARGAIN's guarantees are non-asymptotic and hold for any finite sample size, in contrast to SUPG's asymptotic guarantees. However, for extremely imbalanced datasets (very low positive rates), strict recall guarantees are impossible without sacrificing utility; BARGAIN$_R$-A allows for controlled relaxation via $\beta$.
Calibration Dependence: While BARGAIN's guarantees do not require well-calibrated proxy scores, utility is maximized when proxy confidence is well-aligned with true correctness. Poor calibration can reduce the fraction of records eligible for proxy processing.
Extension to Open-Ended Tasks: The current framework is tailored to classification tasks; extending to open-ended generation or semantic joins requires further research on proxy score definition and estimation.

Practical Implications and Future Directions

BARGAIN provides a practical, theoretically sound solution for cost-efficient LLM-powered data processing in production systems, enabling substantial cost savings without sacrificing output quality. Its modular design allows integration into existing LLM orchestration and data management frameworks. Future work includes:

Extending BARGAIN to open-ended tasks and entity matching, where proxy score calibration and transitivity properties may be leveraged for further optimization.
Investigating adaptive candidate threshold selection and more sophisticated proxy routing in multi-model cascades.
Exploring tighter integration with uncertainty calibration techniques [krishnan_improving_2020, kapoor2024calibration] to further improve utility.

Conclusion

BARGAIN advances the state of the art in LLM-powered data processing by combining adaptive, task-aware sampling with modern statistical estimation to deliver strong, non-asymptotic guarantees and superior empirical utility. Its principled approach addresses the limitations of prior work and provides a robust foundation for scalable, cost-effective deployment of LLMs in data-centric applications.