Error-Reject Curves
- Error-reject curves evaluate machine learning models with a reject option by plotting performance on non-rejected data against the fraction of data retained as the rejection threshold varies.
- These curves are crucial in safety-critical applications like medical diagnostics where abstaining from uncertain predictions is preferable to making a potentially costly error.
- Analyzing variants such as precision- or recall-reject curves provides more application-specific insights, particularly for imbalanced datasets where standard accuracy can be misleading.
Error-reject curves, also known as accuracy-reject curves (ARCs), are fundamental tools for evaluating and calibrating machine learning models equipped with a reject option—where the model abstains from prediction in cases of high uncertainty. These curves provide quantitative and visual insights into the trade-off between classification (or regression) error and the fraction of input points for which the model makes predictions versus those it rejects. The theory and formalization of error-reject curves inform practical system design, especially for critical applications where errors incur high cost.
1. Definition and Mathematical Foundation
Error-reject curves plot a performance metric (typically accuracy or error rate) on the subset of data not rejected by the model, against the fraction of non-rejected (retained) data points as the rejection threshold varies. The reject option enables a classifier or regressor to abstain—yielding a “don’t know” output—when its internal confidence or certainty measure is below a chosen threshold.
For a data set $X = \{x_1, \dots, x_N\}$ with associated ground-truth labels and a certainty function $r(x)$ (higher $r(x)$ signals greater confidence):
- Let $T_\theta$ be the count of correctly classified non-rejected points; $F_\theta$ the number of correctly classified but rejected points (“false rejects”); and $|X_\theta|$ the number of non-rejected samples at threshold $\theta$, where $X_\theta = \{x \in X : r(x) \geq \theta\}$.
- The ARC is parameterized as:

$$\mathrm{ARC}(\theta) = \left( \frac{|X_\theta|}{|X|},\ \frac{T_\theta}{|X_\theta|} \right).$$

Here, $T_\theta / |X_\theta|$ is the retained-set accuracy and $|X_\theta| / |X|$ is coverage (fraction retained).
The concept generalizes to regression, multiclass, or cost-sensitive settings, where the performance metric may be mean squared error (with rejection) or precision/recall among the unrejected points.
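As a concrete illustration, here is a minimal NumPy sketch of the ARC parameterization above; the function name `accuracy_reject_curve` and the choice of every observed certainty value as a candidate threshold are illustrative assumptions, not prescriptions from the cited literature.

```python
import numpy as np

def accuracy_reject_curve(y_true, y_pred, certainty):
    """Sweep the rejection threshold over all certainty values and
    return (coverage, retained-set accuracy) pairs tracing the ARC."""
    order = np.argsort(certainty)          # ascending certainty r(x)
    correct = (y_true == y_pred)[order]
    n = len(y_true)
    coverages, accuracies = [], []
    for k in range(n):                     # reject the k least-certain points
        retained = correct[k:]
        coverages.append(len(retained) / n)
        accuracies.append(retained.mean())
    return np.array(coverages), np.array(accuracies)

# Toy usage with random stand-ins for a real model's outputs.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_pred = rng.integers(0, 2, size=200)
cov, acc = accuracy_reject_curve(y_true, y_pred, rng.random(200))
```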
2. Reject Option Strategies and Certainty Measures
A core step is defining the model’s certainty function $r(x)$, which quantifies the confidence in the prediction for each input $x$:
- Bayes confidence: $r_{\mathrm{Bayes}}(x) = \max_y p(y \mid x)$, usable where ground-truth posteriors $p(y \mid x)$ are available.
- Empirical/model confidence: $r_{\mathrm{conf}}(x) = \max_y \hat{p}(y \mid x)$, estimated from the classifier’s outputs.
- Relative similarity (RelSim): Used in prototype-based classifiers (sketched below), e.g.,

$$r_{\mathrm{RelSim}}(x) = \frac{d^{-} - d^{+}}{d^{-} + d^{+}},$$

where $d^{+}$ and $d^{-}$ are the distances to the closest prototype of the predicted class and of the other classes, respectively.
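A minimal sketch of RelSim for a nearest-prototype classifier follows, assuming Euclidean distances and at least two classes among the prototypes; the function name `relsim` is an illustrative choice, not from the cited papers.

```python
import numpy as np

def relsim(x, prototypes, proto_labels):
    """RelSim certainty (d_minus - d_plus) / (d_minus + d_plus) in [0, 1]:
    d_plus is the distance to the closest prototype of the predicted
    class, d_minus to the closest prototype of any other class."""
    proto_labels = np.asarray(proto_labels)
    dists = np.linalg.norm(prototypes - x, axis=1)
    pred = proto_labels[np.argmin(dists)]          # nearest-prototype prediction
    d_plus = dists[proto_labels == pred].min()
    d_minus = dists[proto_labels != pred].min()
    return pred, (d_minus - d_plus) / (d_minus + d_plus)
```

Values near 1 indicate a clear margin between the predicted class and its competitors; values near 0 indicate a point close to a decision boundary, a natural candidate for rejection.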
Two principal thresholding strategies are used:
- Global threshold: A single scalar threshold $\theta$ is applied to $r(x)$ for all $x$.
- Local thresholds: Separate thresholds $\theta_j$ are assigned per region (e.g., per Voronoi cell or per class), permitting finer control in heterogeneous data spaces.
The preferred choice depends on the data distribution and classifier homogeneity; local strategies are often more effective for models where the certainty scale varies across input regions.
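The difference between the two strategies reduces to how the accept mask is computed; a hedged sketch, under the assumption that each sample carries an integer index of its region (e.g., its winning Voronoi cell):

```python
import numpy as np

def accept_mask_global(certainty, theta):
    """Global strategy: one scalar threshold for every point."""
    return certainty >= theta

def accept_mask_local(certainty, region_idx, thetas):
    """Local strategy: sample i is accepted iff its certainty reaches
    the threshold of its own region (per Voronoi cell or per class)."""
    return certainty >= np.asarray(thetas)[region_idx]
```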
3. Optimization Algorithms for Threshold Selection
Error-reject curves can be optimized by selecting thresholds that maximize true rejects (correctly rejected misclassifications) while minimizing false rejects (incorrectly rejected correct predictions). Key algorithms include:
- Global optimization: Sort samples by $r(x)$, then scan over candidate thresholds and select the one yielding the best accuracy/coverage trade-off (see the sketch after this list). Sorting dominates, so the complexity is $O(N \log N)$.
- Local (region-wise) optimization: The optimal vector of thresholds $(\theta_1, \dots, \theta_Z)$ for the $Z$ regions or prototypes is obtained via dynamic programming (DP), at a cost polynomial in $N$ and $Z$, or via a greedy approximation that runs in linear time after sorting and achieves near-optimal empirical results.
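A hedged sketch of the global scan under a minimum-coverage constraint; the constraint formulation and the function name are illustrative assumptions rather than the exact objective of the cited works.

```python
import numpy as np

def best_global_threshold(y_true, y_pred, certainty, min_coverage=0.8):
    """Sort once (O(N log N)), then scan every candidate threshold and
    keep the one maximizing retained-set accuracy subject to coverage."""
    order = np.argsort(certainty)
    correct = (y_true == y_pred)[order].astype(float)
    n = len(y_true)
    suffix_correct = np.cumsum(correct[::-1])[::-1]   # sum of correct[k:]
    best_theta, best_acc = -np.inf, -1.0
    for k in range(n):                     # reject the k least-certain points
        coverage = (n - k) / n
        if coverage < min_coverage:
            break
        acc = suffix_correct[k] / (n - k)
        if acc > best_acc:
            best_acc = acc
            best_theta = certainty[order[k]]   # accept r(x) >= theta (ties permitting)
    return best_theta, best_acc
```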
For regression settings, the analogous process involves thresholding estimated conditional variances or entropy metrics, with thresholds determined in a semi-supervised, plug-in fashion to control the rejection rate precisely (Denis et al., 2020, 2503.23782).
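As a sketch of the plug-in idea, the rejection fraction can be pinned by thresholding predicted conditional variances at an empirical quantile; this illustrates the mechanism rather than reproducing the exact estimator of the cited works.

```python
import numpy as np

def variance_reject(var_hat, reject_rate=0.1):
    """Abstain on the reject_rate fraction of points with the largest
    predicted conditional variance, via an empirical quantile threshold."""
    theta = np.quantile(var_hat, 1.0 - reject_rate)
    accept = var_hat <= theta              # accept low-variance predictions
    return accept, theta
```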
4. Practical Applications and Performance Evaluation
Error-reject curves are essential in:
- Safety-critical applications: Medical diagnostics, fault detection, open-set recognition, and cost-sensitive fields, where explicit abstention is preferable to a silently erroneous output.
- Prototype-based classifiers: Methods such as Learning Vector Quantization (LVQ) and its variants implement ARCs effectively, with local reject options yielding performance nearly matching that of support vector machines (SVMs), especially for simpler, more interpretable models (1503.06549).
- Multiclass and set-valued prediction: Reject and refine options in multicategory classification extend the curve concept to cases where partial abstention (predicting a set of plausible labels) bridges the gap between full prediction and total rejection (1701.02265).
- Regression and distributional regression: Error-reject methodologies guide abstention on uncertain predictions, achieving a reduction in mean squared error (or CRPS) among accepted points, with the ability to fix the rejection fraction precisely (Denis et al., 2020, 2503.23782).
Performance evaluation involves:
- Benchmarking curves: Artificial data (e.g., Gaussian clusters) enables comparison to the Bayes-optimal reject curve (see the sketch after this list). Real datasets (UCI sets, medical data) corroborate the improved trade-off from local and adaptive strategies.
- Empirical findings: On heterogeneous or hard data, local rejection often brings the ARC close to the theoretical optimum; on flexible or well-calibrated models, the added benefit is less, suggesting that careful matching of complexity and reject mechanism is important (1503.06549).
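For benchmarking against the Bayes-optimal reject curve on artificial data, the optimal certainty measure is the maximum class posterior under the known generating densities; a minimal sketch for a 1-D two-Gaussian mixture, with all parameters illustrative:

```python
import numpy as np
from scipy.stats import norm

def bayes_certainty(x, mu0=-1.0, mu1=1.0, sigma=1.0, prior1=0.5):
    """Bayes-optimal certainty max_y p(y|x) for a two-Gaussian mixture
    with known means, shared variance, and class prior."""
    p0 = (1.0 - prior1) * norm.pdf(x, loc=mu0, scale=sigma)
    p1 = prior1 * norm.pdf(x, loc=mu1, scale=sigma)
    post1 = p1 / (p0 + p1)
    return np.maximum(post1, 1.0 - post1)
```

Feeding these certainties (together with the Bayes classifier’s own predictions) into an ARC routine such as the one sketched in Section 1 produces the reference curve against which empirical strategies are compared.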
5. Extensions: Precision/Recall Reject Curves and Imbalanced Data
The ARC may be suboptimal for imbalanced or application-specific settings. Precision-reject (PRC) and recall-reject (RRC) curves have been formalized analogously to the ARC:

$$\mathrm{PRC}(\theta) = \left( \frac{|X_\theta|}{|X|},\ \mathrm{precision}(X_\theta) \right), \qquad \mathrm{RRC}(\theta) = \left( \frac{|X_\theta|}{|X|},\ \mathrm{recall}(X_\theta) \right),$$

where the metrics are calculated over the retained set $X_\theta$ (Fischer et al., 2023); a sketch follows the summary table below. These curves more truthfully characterize classifier utility when positive-class detection is paramount (as in medicine or rare-event prediction), and they mitigate the misleading impression left by high accuracy in severely imbalanced scenarios.
Curve | Retained-set metric (plotted vs. coverage) | Best For
---|---|---
ARC | Accuracy on $X_\theta$ | Balanced/general
PRC | Precision on $X_\theta$ | Precision/minority class
RRC | Recall on $X_\theta$ | Recall/sensitivity
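A hedged sketch of the PRC/RRC computation follows; note that recall is evaluated here over the positives remaining in the retained set, which is one plausible reading of the formalization above rather than its verbatim definition.

```python
import numpy as np

def pr_reject_curves(y_true, y_pred, certainty, positive=1):
    """Sweep the rejection threshold and compute precision and recall
    for the positive class on the retained set X_theta."""
    order = np.argsort(certainty)
    yt = np.asarray(y_true)[order]
    yp = np.asarray(y_pred)[order]
    n = len(yt)
    cov, prec, rec = [], [], []
    for k in range(n):                     # reject the k least-certain points
        t, p = yt[k:], yp[k:]
        tp = np.sum((p == positive) & (t == positive))
        cov.append((n - k) / n)
        prec.append(tp / max(np.sum(p == positive), 1))  # 0 if nothing predicted positive
        rec.append(tp / max(np.sum(t == positive), 1))   # 0 if no retained positives
    return np.array(cov), np.array(prec), np.array(rec)
```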
6. Impact, Limitations, and Future Directions
Error-reject curves provide a standardized and interpretable assessment of rejection strategies across model classes and application domains. They enable model designers to select operating points that match an application's tolerance for risk and abstention. Local and adaptive thresholding accommodate data heterogeneity and model bias, often bringing the error-abstention trade-off close to optimal.
The effectiveness of ARCs and related curves is influenced by accurate calibration of certainty measures and the granularity of reject thresholds. Extensions to set-valued prediction, cost-sensitive regimes, multiclass, multilabel, and distributional targets have broadened their scope (1701.02265, 2503.23782). Open research areas include further improvements in reject region interpretability, e.g., via semifactual explanations (Artelt et al., 2022), and robust per-class rejection for open-set scenarios (2412.01425).
In conclusion, error-reject curves and their variants are central to the design and deployment of cautious, reliable machine learning systems in practical scenarios, supporting both quantitative evaluation and actionable calibration of abstention mechanisms.