
Error-Reject Curves

Updated 1 July 2025
  • Error-reject curves evaluate machine learning models with a reject option by plotting performance on non-rejected data against the fraction of data retained as the rejection threshold varies.
  • These curves are crucial in safety-critical applications like medical diagnostics where abstaining from uncertain predictions is preferable to making a potentially costly error.
  • Analyzing variants such as precision or recall-reject curves provides more application-specific insights, particularly for imbalanced datasets where standard accuracy can be misleading.

Error-reject curves, also known as accuracy-reject curves (ARCs), are fundamental tools for evaluating and calibrating machine learning models equipped with a reject option, i.e., the ability to abstain from prediction in cases of high uncertainty. These curves provide quantitative and visual insight into the trade-off between classification (or regression) error and the fraction of input points on which the model predicts rather than abstains. The study and formalization of error-reject curves inform practical system design, especially for critical applications where errors incur high cost.

1. Definition and Mathematical Foundation

Error-reject curves plot a performance metric (typically accuracy or error rate) on the subset of data not rejected by the model, against the fraction of non-rejected (retained) data points as the rejection threshold varies. The reject option enables a classifier or regressor to abstain—yielding a “don’t know” output—when its internal confidence or certainty measure is below a chosen threshold.

For a data set $X$ with associated ground-truth labels, and a certainty function $r(x)$ (higher $r(x)$ signals greater confidence):

  • Let $|\mathcal{L}|$ be the number of correctly classified points, $|\mathcal{L}_\theta|$ the number of correctly classified but rejected points ("false rejects"), and $|X_\theta|$ the number of non-rejected samples at threshold $\theta$.
  • The ARC is parameterized as:

$t_a(\theta) = \frac{|\mathcal{L}| - |\mathcal{L}_\theta|}{|X_\theta|}$

$t_c(\theta) = \frac{|X_\theta|}{|X|}$

Here, $t_a(\theta)$ is the retained-set accuracy and $t_c(\theta)$ is the coverage (fraction of data retained).

The concept generalizes to regression, multiclass, or cost-sensitive settings, where the performance metric may be mean squared error (with rejection) or precision/recall among the unrejected points.
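To make the definition concrete, the following minimal NumPy sketch (function and variable names are illustrative assumptions, not drawn from the cited literature) traces the ARC by sorting samples by certainty and sweeping the rejection threshold:

```python
import numpy as np

def accuracy_reject_curve(certainty, correct):
    """Sweep a global threshold over r(x); return (coverage, accuracy) arrays.

    certainty : array of r(x) values, higher = more confident
    correct   : boolean array, True where the base prediction is correct
    """
    order = np.argsort(certainty)                     # least confident first
    correct = np.asarray(correct, dtype=float)[order]
    n = len(correct)

    # Suffix sums count correct predictions among the retained (most
    # confident) points, so the sweep costs O(N log N), dominated by the sort.
    suffix_correct = np.cumsum(correct[::-1])[::-1]
    k = np.arange(n)                                  # number of rejected points
    coverage = (n - k) / n                            # t_c(theta)
    accuracy = suffix_correct / (n - k)               # t_a(theta)
    return coverage, accuracy
```

Each index $k$ corresponds to one candidate threshold (reject the $k$ least-confident points); plotting accuracy against coverage yields the ARC.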

2. Reject Option Strategies and Certainty Measures

A core step is defining the model's certainty function $r(x)$, which quantifies the confidence in the prediction for each $x$:

  • Bayes confidence: $\max_j P(j \mid x)$, where ground-truth posteriors are available.
  • Empirical/model confidence: $\max_j \hat{P}(j \mid x)$, estimated from the classifier's outputs.
  • Relative similarity (RelSim): used in prototype-based classifiers, e.g.,

$\mathrm{RelSim}(x) = \frac{d^{-} - d^{+}}{d^{-} + d^{+}}$

where $d^{+}$ and $d^{-}$ are the distances to the closest prototype of the predicted class and of the other classes, respectively.
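For a nearest-prototype classifier, RelSim follows directly from those two distances; a minimal sketch (the array layout and names are assumptions for illustration):

```python
import numpy as np

def relsim(x, prototypes, proto_labels):
    """RelSim certainty for one sample under a nearest-prototype classifier.

    prototypes   : (m, d) array of prototype vectors
    proto_labels : (m,) array of class labels, one per prototype
    """
    dists = np.linalg.norm(prototypes - x, axis=1)
    predicted = proto_labels[np.argmin(dists)]        # nearest-prototype decision
    d_plus = dists[proto_labels == predicted].min()   # closest prototype, predicted class
    d_minus = dists[proto_labels != predicted].min()  # closest prototype, other classes
    return (d_minus - d_plus) / (d_minus + d_plus)    # in [0, 1); higher = more certain
```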

Two principal thresholding strategies are used:

  • Global threshold: a single scalar threshold $\theta$ is applied to $r(x)$ for all $x$.
  • Local thresholds: separate thresholds $\theta_j$ are assigned per region (e.g., per Voronoi cell or per class), permitting finer control in heterogeneous data spaces.

The preferred choice depends on the data distribution and classifier homogeneity; local strategies are often more effective for models where the certainty scale varies across input regions.
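At prediction time the two strategies differ only in how the threshold is looked up; a small sketch of the acceptance rule under either strategy (names are illustrative):

```python
import numpy as np

def accept_mask(certainty, region, thresholds):
    """Boolean mask of accepted (non-rejected) points.

    thresholds : a scalar theta (global strategy) or a dict {region_id: theta_j}
                 (local strategy, e.g., keyed by predicted class)
    """
    if np.isscalar(thresholds):
        return certainty > thresholds                 # global: one theta for all x
    theta = np.array([thresholds[j] for j in region])
    return certainty > theta                          # local: theta_j per x's region
```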

3. Optimization Algorithms for Threshold Selection

Error-reject curves can be optimized by selecting thresholds that maximize true rejects (correctly rejected misclassifications) while minimizing false rejects (incorrectly rejected correct predictions). Key algorithms include:

  • Global optimization: sort samples by $r(x)$, scan over candidate thresholds, and select those yielding the best accuracy/coverage trade-off. The complexity is $O(N \log N)$.
  • Local (region-wise) optimization: the optimal vector of thresholds for regions or prototypes is obtained via dynamic programming (DP), with computational complexity $O(|\mathcal{L}| \cdot \xi \cdot \max_j |\Theta_j|)$, where $\xi$ is the number of regions, or via a greedy linear-time approximation ($O(|\mathcal{L}| \cdot \xi)$) that achieves near-optimal empirical results; a simplified sketch of the greedy idea follows this list.
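The sketch below is a plausible simplified reading of the greedy region-wise idea, not the exact published algorithm; keying regions by predicted class is an assumption:

```python
import numpy as np

def greedy_local_thresholds(certainty, correct, region):
    """Choose one threshold per region, maximizing true rejects minus
    false rejects within each region (a simplified greedy heuristic)."""
    thresholds = {}
    for j in np.unique(region):
        mask = region == j
        order = np.argsort(certainty[mask])           # least certain first
        r = certainty[mask][order]
        ok = correct[mask][order]
        # Gain of rejecting the k least-certain points in region j:
        # +1 per rejected error, -1 per rejected correct prediction.
        gain = np.concatenate(([0], np.cumsum(np.where(ok, -1, 1))))
        k = int(np.argmax(gain))
        # theta_j rejects points with certainty <= theta_j; -inf rejects none.
        thresholds[j] = -np.inf if k == 0 else r[k - 1]
    return thresholds
```

The returned dictionary plugs directly into the acceptance rule sketched in Section 2 (accept where certainty exceeds the region's threshold).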

For regression settings, the analogous process involves thresholding estimated conditional variances or entropy metrics, with thresholds determined in a semi-supervised, plug-in fashion to control the rejection rate precisely (Denis et al., 2020, 2503.23782).
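A sketch of the plug-in recipe, assuming the regressor exposes an estimate of its predictive variance (the model API shown is hypothetical):

```python
import numpy as np

def plugin_variance_threshold(var_unlabeled, reject_rate):
    """Plug-in threshold: the (1 - reject_rate) empirical quantile of
    estimated predictive variances on unlabeled data, so that roughly a
    reject_rate fraction of points (the highest-variance ones) is abstained on."""
    return np.quantile(var_unlabeled, 1.0 - reject_rate)

# Hypothetical usage: abstain where estimated variance exceeds the threshold.
# var_hat = model.predictive_variance(X_unlabeled)
# theta = plugin_variance_threshold(var_hat, reject_rate=0.10)
# accept = model.predictive_variance(X_test) <= theta
```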

4. Practical Applications and Performance Evaluation

Error-reject curves are essential in:

  • Safety-critical applications: Medical diagnostics, fault detection, open-set recognition, and cost-sensitive fields, where silent abstention is preferable to erroneous outputs.
  • Prototype-based classifiers: Methods such as Learning Vector Quantization (LVQ) and its variants pair naturally with reject options, with local rejection yielding ARCs nearly matching those of support vector machines (SVMs) while retaining a simpler, more interpretable model (1503.06549).
  • Multiclass and set-valued prediction: Reject and refine options in multicategory classification extend the curve concept to cases where partial abstention (predicting a set of plausible labels) bridges the gap between full prediction and total rejection (1701.02265).
  • Regression and distributional regression: Error-reject methodologies guide abstention on uncertain predictions, achieving reduction in mean squared (or CRPS) error among accepted points, with the ability to precisely fix the rejection fraction (Denis et al., 2020, 2503.23782).

Performance evaluation involves:

  • Benchmarking curves: Artificial data (e.g., Gaussian clusters) enables comparison to the Bayes-optimal reject curve. Real datasets (UCI sets, medical data) corroborate the improved trade-off from local and adaptive strategies.
  • Empirical findings: On heterogeneous or hard data, local rejection often brings the ARC close to the theoretical optimum; for flexible or well-calibrated models the added benefit is smaller, suggesting that the reject mechanism should be matched carefully to model complexity (1503.06549).

5. Extensions: Precision/Recall Reject Curves and Imbalanced Data

The ARC may be suboptimal for imbalanced or application-specific settings. Precision-reject and recall-reject curves have been formalized as

$\text{PRC}(\theta) = \frac{\mathit{TP}_\theta}{\mathit{TP}_\theta + \mathit{FP}_\theta} \qquad \text{RRC}(\theta) = \frac{\mathit{TP}_\theta}{\mathit{TP}_\theta + \mathit{FN}_\theta}$

where the counts are computed over the retained set $X_\theta$ (Fischer et al., 2023). These curves more faithfully characterize classifier utility when positive-class detection is paramount (as in medicine or rare-event prediction) and mitigate the misleading impression that high accuracy can give in severely imbalanced scenarios; a computational sketch follows the table below.

| Curve | Formula | Best For |
|-------|---------|----------|
| ARC | $\frac{TP + TN}{TP + FP + TN + FN}$ | Balanced/general |
| PRC | $\frac{TP}{TP + FP}$ | Precision/minority class |
| RRC | $\frac{TP}{TP + FN}$ | Recall/sensitivity |
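The following sketch traces PRC and RRC over the retained set $X_\theta$ as the threshold sweeps (binary task with the positive class encoded as 1; names are illustrative):

```python
import numpy as np

def precision_recall_reject_curves(certainty, y_true, y_pred):
    """Coverage, PRC, and RRC values as the k least-confident points are rejected."""
    order = np.argsort(certainty)                     # least confident first
    y_true = np.asarray(y_true)[order]
    y_pred = np.asarray(y_pred)[order]
    n = len(y_true)

    cov, prc, rrc = [], [], []
    for k in range(n):                                # reject k least-confident points
        t, p = y_true[k:], y_pred[k:]
        tp = np.sum((p == 1) & (t == 1))
        fp = np.sum((p == 1) & (t == 0))
        fn = np.sum((p == 0) & (t == 1))
        cov.append((n - k) / n)
        prc.append(tp / (tp + fp) if tp + fp else np.nan)  # undefined without predicted positives
        rrc.append(tp / (tp + fn) if tp + fn else np.nan)  # undefined without retained positives
    return np.array(cov), np.array(prc), np.array(rrc)
```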

6. Impact, Limitations, and Future Directions

Error-reject curves provide a standardized and interpretable assessment of rejection strategies across model classes and application domains. They enable model designers to select operating points that match an application's tolerance for risk and abstention. Local and adaptive thresholding handle data heterogeneity and model bias, yielding nearly optimal error-abstention trade-offs.

The effectiveness of ARCs and related curves is influenced by accurate calibration of certainty measures and the granularity of reject thresholds. Extensions to set-valued prediction, cost-sensitive regimes, multiclass, multilabel, and distributional targets have broadened their scope (1701.02265, 2503.23782). Open research areas include further improvements in reject region interpretability, e.g., via semifactual explanations (Artelt et al., 2022), and robust per-class rejection for open-set scenarios (2412.01425).

In conclusion, error-reject curves and their variants are central to the design and deployment of cautious, reliable machine learning systems in practical scenarios, supporting both quantitative evaluation and actionable calibration of abstention mechanisms.