Probably Approximately Correct (PAC) Labels
- PAC labels are a hybrid labeling method that uses calibrated AI uncertainty estimates and selective expert reviews to ensure error rates remain below a user-defined tolerance.
- The framework dynamically adjusts an uncertainty threshold to delegate low-risk cases to AI models and route ambiguous data points for expert annotation.
- Empirical evidence across NLP, vision, and biochemical assays demonstrates substantial expert cost reductions while upholding strict labeling error guarantees.
A probably approximately correct (PAC) label is a label assigned to a data point such that, with high probability, the fraction of labeling errors does not exceed a user-specified tolerance. The methodology developed in "Probably Approximately Correct Labels" formalizes a rigorous yet cost-effective approach for dataset curation by supplementing limited expert annotation with AI model predictions, while controlling the proportion of labeling errors via explicit PAC-style statistical guarantees. The framework offers a general, model-agnostic recipe for constructing datasets where the average labeling error is smaller than a target value with specified confidence, substantially reducing the need for costly expert input.
1. The PAC Labeling Principle: Definition and Algorithmic Structure
The PAC labeling approach leverages predictions from pre-trained AI models, accompanied by model-specific uncertainty estimates, and routes only the most ambiguous cases to expert annotation. For an input $x$, the model provides a predicted label $\hat{Y}(x)$ and an uncertainty score $u(x)$. The core procedure is as follows (a code sketch follows the list):
- Uncertainty-Thresholded Hybrid Labeling
- Select an uncertainty threshold $\tau$ such that data with $u(x) \le \tau$ (lower uncertainty) are labeled by the model, and labels for $u(x) > \tau$ are obtained from the expert:

$$\tilde{Y}(x) = \begin{cases} \hat{Y}(x), & u(x) \le \tau \\ Y_{\text{expert}}(x), & u(x) > \tau \end{cases}$$
- The threshold is optimized to minimize expert cost subject to the PAC constraint on overall labeling error.
- Error Bounds and Statistical Guarantee
- Using a random calibration subset with expert-provided ground-truth, the error rate for each possible uncertainty threshold is estimated.
- For each threshold, a high-confidence upper bound (mean-based or CLT-based) on the average labeling error is computed:

$$U(\tau) \;\ge\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(\tilde{Y}(x_i), Y(x_i)\big) \quad \text{with probability at least } 1-\delta.$$

The final PAC guarantee is:

$$\Pr\!\left[\frac{1}{n}\sum_{i=1}^{n} \ell\big(\tilde{Y}(x_i), Y(x_i)\big) \le \alpha\right] \ge 1 - \delta,$$

where $\alpha$ is the user-specified error tolerance and $\delta$ the failure probability.
- Multi-Model Generalization (PAC Router)
- For $K$ models or annotation sources, a routing policy is trained to allocate data points among the models and the expert, balancing error and cost by solving a (smoothed) minimax slack-constrained optimization using the full calibration set.
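Below is a minimal Python sketch of the thresholded hybrid labeling rule from the list above. The array inputs and the `expert_fn` oracle are illustrative assumptions, not an interface from the paper.

```python
import numpy as np

def hybrid_labels(model_labels, uncertainties, expert_fn, tau):
    """Thresholded hybrid labeling: keep the model's label when
    u(x) <= tau, otherwise query the (costly) expert.

    model_labels:  array of model predictions Y_hat(x)
    uncertainties: array of uncertainty scores u(x)
    expert_fn:     callable mapping an index array to expert labels
    """
    labels = np.asarray(model_labels).copy()
    ambiguous = np.asarray(uncertainties) > tau   # points routed to the expert
    idx = np.flatnonzero(ambiguous)
    labels[idx] = expert_fn(idx)                  # expert labels are treated as exact
    return labels, ambiguous.mean()               # hybrid labels + expert fraction
```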
2. Quantified Labeling Error: Statistical Guarantees
The output labels are probably approximately correct in the sense of the classic PAC learning definition:
- Probably: The guarantee holds with high probability (at least $1-\delta$) over the random calibration set and the algorithm's randomization.
- Approximately correct: On average, the labeling error does not exceed $\alpha$ (for 0-1 loss; analogous results hold for squared loss or other metrics).
The PAC bound is certified over the entire labeled set, conditional on a fixed dataset (transductive setting).
The calibration procedure allocates expert budget to estimate the error of using model predictions at different uncertainty levels. The uncertainty threshold $\tau$ is set to the largest value such that the upper bound on the empirical error remains below $\alpha$.
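This threshold rule can be sketched directly: on the expert-labeled calibration set, compute a one-sided CLT upper bound on the average error for each candidate threshold and keep the largest threshold whose bound stays below $\alpha$. The variable names and the exact bound form below are illustrative.

```python
import numpy as np
from scipy.stats import norm

def select_threshold(cal_uncert, cal_model_correct, alpha, delta):
    """Largest tau whose CLT upper confidence bound on the average
    labeling error stays below alpha (error only accrues on points
    the model labels; expert labels are treated as exact)."""
    z = norm.ppf(1 - delta)                # one-sided (1 - delta) quantile
    best_tau = -np.inf                     # fallback: route everything to experts
    for tau in np.unique(cal_uncert):      # candidate thresholds
        accept = cal_uncert <= tau         # points the model would label
        err = np.where(accept, 1.0 - cal_model_correct, 0.0)
        ucb = err.mean() + z * err.std(ddof=1) / np.sqrt(len(err))
        if ucb <= alpha:
            best_tau = max(best_tau, tau)
    return best_tau
```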
3. Practical Implementation and Calibration
- Uncertainty Estimation: The framework relies on meaningful, calibrated uncertainty scores from models.
- For LLMs (e.g., GPT-4o), self-reported or prompt-based confidence is used.
- For vision models (e.g., ResNet-152), $u(x) = 1 - \max_c \hat{p}_c(x)$, with $\hat{p}_c(x)$ the predicted class probability.
- For domain-specific models (e.g., AlphaFold), domain-calibrated uncertainty (such as pLDDT for protein structures) is employed.
- Uncertainty Calibration: Multicalibration can be applied to remove global or subgroup-specific overconfidence, further tightening error control, particularly in the presence of distribution shifts or model blind spots. If an AI systematically under- or overestimates its reliability for particular regions or groups, calibration on the collected expert-verified batch adjusts the uncertainty scores per cluster/bin (a binned sketch follows this list).
- Statistical Estimation: The upper bound on error is estimated using the calibration set via concentration inequalities or the CLT, ensuring sample efficiency and valid PAC control. The routing and thresholding mechanisms can be made fully differentiable (e.g., by replacing step functions with sigmoids) to allow gradient-based policy optimization when coordinating multiple models.
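As a stand-in for the calibration step described above, the sketch below recalibrates raw uncertainty scores by the empirical error rate within quantile bins of an expert-verified batch. Full multicalibration additionally iterates over subgroups; this binned version is the simplest variant.

```python
import numpy as np

def fit_bin_calibrator(uncert, correct, n_bins=10):
    """Fit a binned recalibration map: within each quantile bin of the
    raw uncertainty, the calibrated score is the empirical error rate
    observed on the expert-verified calibration batch."""
    edges = np.quantile(uncert, np.linspace(0, 1, n_bins + 1))
    ids = np.clip(np.searchsorted(edges, uncert, side="right") - 1, 0, n_bins - 1)
    bin_err = np.full(n_bins, np.nan)      # NaN marks bins with no calibration data
    for b in range(n_bins):
        mask = ids == b
        if mask.any():
            bin_err[b] = 1.0 - correct[mask].mean()

    def calibrate(new_uncert):
        ids = np.clip(np.searchsorted(edges, new_uncert, side="right") - 1, 0, n_bins - 1)
        return bin_err[ids]                # calibrated uncertainty scores

    return calibrate
```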
4. Empirical Results and Applications
The framework is validated across diverse annotation applications:
- Text Labeling: PAC labeling with GPT-4o for political bias, misinformation, and stance detection demonstrates substantial reductions in expert labels without exceeding specified error rates.
- Image Classification: Applied to ImageNet with ResNet-152, the method matches or outperforms semi-automated baselines on error rate while substantially reducing expert labeling.
- Protein Structure Prediction: Using AlphaFold and pLDDT uncertainty, PAC labeling is effective for sequence-level classification, offering savings in costly biochemical assays.
In settings with multiple models of different costs or expertise (e.g., LLM and smaller anchor models), the PAC router routes points among sources. This results in further reductions in expert cost and achieves error rates aligned with the chosen threshold.
5. Advantages and Limitations
Advantages
- Rigorous Guarantees: Users can set the maximum labeling error with statistical confidence.
- Expert Efficiency: The majority of "easy" points can be labeled cheaply; human effort is reserved for uncertain cases.
- Adaptivity: Calibrates to the distribution and reliability of each model.
- Model-Agnostic: Works with any model with valid (or calibratable) uncertainty outputs.
Limitations
- Dependence on Uncertainty Estimates: Requires meaningful, calibrated uncertainty—miscalibrated scores can increase cost or violate error control.
- Expert Calibration Set: Some up-front expert annotation remains necessary to estimate error for threshold selection and model calibration.
- Transductive Guarantee: Error is controlled over the dataset at hand; there is no guarantee for new, unseen data points outside the original set.
- Complexity for Multi-Model Routing: Optimizing the router in multi-model settings entails additional implementation and compute.
6. Algorithmic Formulas and Optimization
Label Assignment:

$$\tilde{Y}(x) = \begin{cases} \hat{Y}(x), & u(x) \le \tau \\ Y_{\text{expert}}(x), & u(x) > \tau \end{cases}$$
Threshold Selection:

$$\tau^\star = \max\{\tau : U(\tau) \le \alpha\},$$

where $U(\tau)$ is a high-confidence upper bound on the error for uncertainty level $\tau$.
PAC-level Guarantee:

$$\Pr\!\left[\frac{1}{n}\sum_{i=1}^{n} \ell\big(\tilde{Y}(x_i), Y(x_i)\big) \le \alpha\right] \ge 1 - \delta.$$
Routing in Multi-Model Case (Differentiable Surrogate):

$$\min_{w}\; \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} w_k(x_i)\, c_k \quad \text{s.t.} \quad \frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} w_k(x_i)\, \widehat{\mathrm{err}}_k(x_i) \le \alpha,$$

where $w_k(x)$ are the (sigmoid- or softmax-smoothed) routing weights, $c_k$ the per-label cost of source $k$, and $\widehat{\mathrm{err}}_k$ its estimated error.
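A hedged PyTorch sketch of this surrogate: a softmax-smoothed linear policy produces routing weights, and a penalty term stands in for the slack constraint so the whole objective is differentiable. The linear policy, penalty weight `lam`, and optimizer are illustrative choices rather than the paper's exact formulation.

```python
import torch

def train_router(feats, err_est, costs, alpha, steps=500, lr=0.05, lam=10.0):
    """Smoothed routing surrogate.

    feats:   (n, d) float tensor of per-point features (e.g., model uncertainties)
    err_est: (n, K) float tensor of estimated error per source (expert column ~ 0)
    costs:   (K,)   float tensor of per-label costs (expert is most expensive)
    """
    n, K = err_est.shape
    W = torch.zeros(feats.shape[1], K, requires_grad=True)   # linear routing policy
    b = torch.zeros(K, requires_grad=True)
    opt = torch.optim.Adam([W, b], lr=lr)
    for _ in range(steps):
        w = torch.softmax(feats @ W + b, dim=1)              # smoothed weights w_k(x)
        exp_cost = (w * costs).sum(dim=1).mean()             # expected labeling cost
        exp_err = (w * err_est).sum(dim=1).mean()            # expected labeling error
        loss = exp_cost + lam * torch.relu(exp_err - alpha)  # penalized slack constraint
        opt.zero_grad(); loss.backward(); opt.step()
    return W.detach(), b.detach()
```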
7. Impact for Dataset Curation and Model Training Pipelines
PAC labeling provides a practical route to rigorous and efficient dataset creation—a major bottleneck in real-world applications involving language, vision, and structured scientific domains. The methodology reduces reliance on human labeling, leverages modern AI capabilities, and guarantees error control in the resulting datasets. This enables scalable dataset curation for supervised learning, integration in active learning loops, and robust test set construction in situations where perfect labeling is infeasible due to cost or volume. The technique is straightforward to implement for any setting where model uncertainty is accessible and can be extended to sequential, cost-aware, or hybrid annotation strategies.
Summary Table
| Aspect | PAC Labeling (per the paper) |
|---|---|
| Guarantee | Error $\le \alpha$ with probability $\ge 1-\delta$ (chosen by user) |
| Labeling Rule | Expert for uncertain points; AI for low-uncertainty points |
| Cost Impact | Substantially reduces expert effort (demonstrated empirically) |
| Applications | NLP (LLMs), vision (ResNet/ImageNet), science (AlphaFold) |
| Calibration | Multicalibration, CLT upper bounds, threshold tuning |
| Multi-model | PAC router assigns points to minimize aggregate cost/error |
This approach constitutes a statistically principled foundation for hybrid labeling pipelines in modern supervised learning.