Accuracy-Rejection AUC Scores & Metrics

Updated 30 September 2025
  • Accuracy-rejection AUC scores are metrics that quantify classifier performance by integrating accuracy over selective-classification rejection thresholds.
  • They extend standard AUC by incorporating uncertainty quantification and calibration to better address imbalanced and high-stakes evaluation scenarios.
  • Advanced methods such as PAC-Bayesian frameworks and quantile-based calibration drive improvements in model safety and reliable risk coverage.

Accuracy-rejection AUC scores summarize the trade-off between prediction accuracy and the probability of abstaining from uncertain classifications (“rejection”) across varying decision thresholds, particularly in selective classification, safety-critical, and clinical settings. This family of metrics and analytical tools extends or refines the classical area under the receiver operating characteristic curve (AUC), aiming to address specific limitations of traditional AUC in operational scenarios where uncertainty, class imbalance, or selective abstention is critical. Approaches in this domain span PAC-Bayesian frameworks, surrogate loss direct optimization, selective classification algorithms with formal guarantees, advanced rejection calibration, and newer multi-threshold risk coverage metrics.

1. Fundamental Concepts

Accuracy-rejection AUC scores generalize the standard AUC by quantifying classifier performance not only in discriminating between classes but also in the presence of a rejection option or varying uncertainty thresholds. The classic AUC is defined as the probability that a randomly chosen positive is ranked higher than a randomly chosen negative:

$$\mathrm{AUC} = \mathbb{P}\left[s(X^+) > s(X^-)\right]$$

where $s(\cdot)$ is the classifier's score function.
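
To make the pairwise definition concrete, it can be estimated directly on a small synthetic sample and checked against a standard library implementation. The sketch below uses illustrative random data and counts ties as one half, matching the convention used by scikit-learn; it is not tied to any of the cited works.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)                 # binary labels
s = y + rng.normal(scale=1.0, size=500)          # synthetic classifier scores

pos, neg = s[y == 1], s[y == 0]
# Pairwise definition P[s(X+) > s(X-)], counting ties as 1/2.
pairwise_auc = (np.mean(pos[:, None] > neg[None, :])
                + 0.5 * np.mean(pos[:, None] == neg[None, :]))

print(pairwise_auc, roc_auc_score(y, s))         # the two estimates agree
```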

In selective classification (or classification with a reject option), the classifier abstains on uncertain predictions, trading off coverage (fraction of accepted predictions) with accuracy. Here, performance is quantified using accuracy-rejection curves, accuracy-coverage curves, area under selective risk-coverage curves, or similar constructs. Some frameworks rigorously optimize the empirical AUC on the accepted subset, while others introduce generalized AUC-like metrics to capture risk under rejection mechanisms.
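
A minimal sketch of an accuracy-coverage (accuracy-rejection) curve follows, assuming only that each prediction comes with a correctness indicator and a confidence score; the function name and synthetic data are illustrative, not taken from any of the cited works.

```python
import numpy as np

def accuracy_coverage_curve(correct, confidence):
    """Trace selective accuracy against coverage by sorting on confidence.

    correct    : boolean array, True where the underlying prediction is correct
    confidence : array of scores used to decide which predictions to accept
    """
    order = np.argsort(-np.asarray(confidence))
    correct = np.asarray(correct, dtype=float)[order]
    n = len(correct)
    coverage = np.arange(1, n + 1) / n                    # fraction accepted
    accuracy = np.cumsum(correct) / np.arange(1, n + 1)   # accuracy on accepted subset
    return coverage, accuracy

rng = np.random.default_rng(1)
conf = rng.uniform(size=1000)
corr = rng.uniform(size=1000) < conf            # more confident -> more often correct
cov, acc = accuracy_coverage_curve(corr, conf)

# Area under the accuracy-coverage curve (trapezoidal rule) as a single summary.
area = np.sum(np.diff(cov) * (acc[1:] + acc[:-1]) / 2)
print(area)
```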

The motivation for these approaches includes:

  • Safe deployment in medical diagnostics and high-stakes applications
  • Model calibration for actionable decision-making, beyond mere ranking
  • Integration of explicit uncertainty quantification and abstention

2. Selective Classification and Rejection Mechanisms

Selective classification frameworks pair a base classifier with a selection (reject) function $g(x)$, accepting predictions only where the model is sufficiently confident, and refusing (abstaining) otherwise. Let coverage $c$ denote the accepted proportion of instances, and let the selective accuracy or selective AUC be computed only over accepted predictions.
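
One simple way to realize such a selection function is a confidence threshold chosen by a quantile so that a target coverage $c$ is met, with the AUC then computed on the accepted subset. The sketch below illustrates this; the variable names and the synthetic confidence measure are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def selective_auc(y_true, scores, confidence, coverage=0.8):
    """AUC restricted to the most-confident `coverage` fraction of instances."""
    tau = np.quantile(confidence, 1.0 - coverage)    # acceptance threshold for g(x)
    accept = confidence >= tau                       # g(x) = 1 -> accept, 0 -> reject
    return roc_auc_score(y_true[accept], scores[accept]), accept.mean()

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=400)
s = y + rng.normal(scale=1.2, size=400)              # classifier scores
conf = np.abs(s - np.median(s))                      # distance from the decision boundary

auc_at_70, achieved_cov = selective_auc(y, s, conf, coverage=0.7)
print(auc_at_70, achieved_cov)
```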

Formal developments in this area involve:

  • Analytical conditions under which rejecting specific instances will improve AUC on the accepted subset; for example, abstaining on instances whose scores fall in theoretically derived intervals where their participation in pairwise ranking would likely degrade observed AUC (Pugnana et al., 2022).
  • The AUCROSS algorithm, which uses cross-fitting and quantile estimation to calibrate optimal acceptance thresholds so as to maximize the AUC at a given minimum-coverage constraint (Pugnana et al., 2022).

In high-stakes medical imaging, empirical studies demonstrate that entropy-based and confidence-interval-based rejection mechanisms—accepting only predictions with low entropy or narrow confidence intervals—lead to substantial improvements in AUC among accepted cases, though at lower overall coverage rates (Aperstein et al., 12 Sep 2025). Such strategies are typically calibrated using quantile-based procedures on held-out data, and may be optimized globally or per-class to reflect the risk profiles of individual pathologies.
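
The sketch below gives one hedged illustration of a quantile-calibrated, entropy-based rejection rule of the kind described above; it is not the exact procedure of Aperstein et al., and the helper names, rejection fraction, and random softmax outputs are purely illustrative.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Predictive entropy of a (n_samples, n_classes) probability array."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def calibrate_entropy_threshold(val_probs, reject_fraction=0.2):
    """Quantile threshold on held-out data: reject the most-uncertain fraction."""
    return np.quantile(entropy(val_probs), 1.0 - reject_fraction)

def apply_rejection(test_probs, threshold):
    """Accept only predictions whose entropy falls below the calibrated threshold."""
    return entropy(test_probs) < threshold

# Illustrative usage with random softmax-style outputs.
rng = np.random.default_rng(3)
val_probs = rng.dirichlet(np.ones(5), size=1000)     # held-out calibration set
test_probs = rng.dirichlet(np.ones(5), size=500)
tau = calibrate_entropy_threshold(val_probs, reject_fraction=0.2)
accept = apply_rejection(test_probs, tau)
print(f"threshold={tau:.3f}, coverage={accept.mean():.2%}")
```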

3. Alternative and Generalized Accuracy-Rejection Metrics

Several recognized limitations of the standard AUC in selective answering and rejection contexts have motivated the development of alternative or generalized accuracy-rejection metrics:

  • Tail Sensitivity: Standard AUC weights all thresholds equally, potentially misrepresenting performance in the high-confidence, low-rejection region that matters most for safety-critical deployment (Mishra et al., 2022, Škvára et al., 2023).
  • Curve Pathologies: Non-monotonic accuracy-coverage curves (i.e., regions where accuracy increases as coverage grows rather than decreasing monotonically) and plateaus at high coverage in modern architectures (transformers) can render AUC misleading for selective answering (Mishra et al., 2022).
  • Composite Safety Metrics: Metrics such as DiSCA, DiDMA, and NiDMA explicitly combine elements: the maxprob threshold at first error, minimum coverage threshold for acceptable accuracy, penalization of curve fluctuations, and optionally model energy usage or out-of-distribution performance (Mishra et al., 2022).

The Area under the Generalized Risk Coverage curve (AUGRC) (Traub et al., 1 Jul 2024) addresses drawbacks of pointwise or single-threshold metrics by integrating the joint probability of accepting a prediction and that prediction being erroneous, over all possible acceptance thresholds:

$$\mathrm{AUGRC} = \int_0^1 P\left(Y_f = 1,\, g(x) \geq \tau\right) \, dP\left(g(x) \geq \tau\right)$$

where $Y_f$ indicates misclassification. AUGRC provides a direct measure of the average risk of undetected failures across the rejection–acceptance spectrum, delivering an interpretable and robust benchmark for selective models.
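
Under this definition, AUGRC can be estimated empirically by sweeping the acceptance threshold over the observed confidence values; the sketch below follows the formula above but is a simplified illustration under those assumptions, not the reference implementation of Traub et al.

```python
import numpy as np

def augrc(correct, confidence):
    """Empirical area under the generalized risk-coverage curve.

    At coverage level k/n (accepting the k most confident predictions), the
    generalized risk is the joint probability of being accepted AND wrong,
    i.e. the number of errors among accepted divided by the total sample size.
    """
    order = np.argsort(-np.asarray(confidence))
    wrong = (~np.asarray(correct, dtype=bool))[order].astype(float)
    n = len(wrong)
    coverage = np.concatenate(([0.0], np.arange(1, n + 1) / n))
    gen_risk = np.concatenate(([0.0], np.cumsum(wrong) / n))   # joint P(accept, error)
    # Trapezoidal integration of generalized risk over coverage in [0, 1].
    return np.sum(np.diff(coverage) * (gen_risk[1:] + gen_risk[:-1]) / 2)

rng = np.random.default_rng(4)
conf = rng.uniform(size=2000)
corr = rng.uniform(size=2000) < conf             # confidence correlates with correctness
print(augrc(corr, conf))
```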

| Metric | Coverage Type | Key Feature |
|---|---|---|
| Standard AUC | All thresholds | Ranks positives above negatives |
| Selective AUC | Accepted region | Excludes rejected predictions |
| AUGRC | Aggregate | Integrates risk and coverage jointly |
| Partial AUC | FPR ≤ α | Focuses on the critical FPR region |
| DiSCA, DiDMA | Tail, cost, OOD | Penalize early errors and model cost |

4. Properness, Calibration, and Operational Alignment

Properness of a scoring metric ensures that it incentivizes honest reporting of probabilistic beliefs; in this context, improper AUC scoring can lead practitioners to “game” model outputs to optimize empirical AUC while distorting true uncertainty (Byrne, 2015). AUC may fail to be strictly proper unless the number of positive examples is fixed, classes are independent, or certain conditional invariance conditions hold.

Calibration plays a crucial role in bridging the gap between the ranking-centric AUC and actual binary decision-making accuracy. When applying a threshold for deployment, models with identical AUC can yield substantially different accuracies due to calibration discrepancies. Methods such as Platt scaling, isotonic regression, or quantile-based confidence calibration can partially reconcile these differences (Opitz, 4 Apr 2024, Aperstein et al., 12 Sep 2025).
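
As a brief illustration of this point, the sketch below compares an often poorly calibrated model with a Platt-scaled (sigmoid-calibrated) counterpart on synthetic data: the ranking quality (AUC) typically changes little while accuracy at a fixed 0.5 threshold can shift. The model and data choices are arbitrary examples, not those of the cited studies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score, accuracy_score

X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

raw = GaussianNB().fit(X_tr, y_tr)                                  # often poorly calibrated
cal = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5).fit(X_tr, y_tr)

for name, model in [("raw", raw), ("Platt-calibrated", cal)]:
    p = model.predict_proba(X_te)[:, 1]
    # AUC usually changes little; accuracy at the 0.5 threshold can shift noticeably.
    print(name, roc_auc_score(y_te, p), accuracy_score(y_te, p >= 0.5))
```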

5. Advanced Computational and Statistical Methods

PAC-Bayesian approaches construct a pseudo-posterior over score parameters, leveraging the empirical AUC risk as an exponential penalty. Non-asymptotic bounds on discriminatory risk are derived, balancing empirical performance and Kullback–Leibler divergence to the prior (complexity penalty). When employing spike-and-slab priors, feature sparsity is encouraged, providing robustness and interpretability in high dimensions (Ridgway et al., 2014).

Direct optimization of the partial AUC—particularly in the low-FPR regime—is supported by generative and deep neural scoring functions, which facilitate differentiable approximations of the non-continuous pairwise AUC objective (Ueda et al., 2018). For high-dimensional, sparse, streamed data, algorithms such as FTRL-AUC use empirical saddle-point reformulation with lazy O(k)-per-iteration updates for online selective learning (Zhou et al., 2020).
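
For reference, scikit-learn's roc_auc_score exposes a max_fpr argument that returns a standardized partial AUC restricted to the low-FPR range, which is a convenient baseline before turning to direct surrogate optimization; the data in the sketch below is synthetic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
y = rng.integers(0, 2, size=2000)
scores = y + rng.normal(scale=1.5, size=2000)

full_auc = roc_auc_score(y, scores)
# Standardized partial AUC restricted to the low-FPR region (FPR <= 0.1).
partial_auc = roc_auc_score(y, scores, max_fpr=0.1)
print(full_auc, partial_auc)
```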

Influence function–based variance estimation enables efficient uncertainty quantification for accuracy-rejection metrics under missing data, and empirical studies confirm that robust metric estimation in rare-event settings depends on the absolute number of events rather than the nominal event rate (Choudhury et al., 2017, Minus et al., 22 Apr 2025).

6. Limitations, Pitfalls, and Practical Considerations

Despite their widespread usage, accuracy-rejection AUC scores can be misleading or unstable:

  • In rare-event scenarios, AUC estimates are reliable if the effective sample size (minimum class count) is sufficiently large; otherwise, bias and variance can be substantial (Minus et al., 22 Apr 2025).
  • For discrete or binary predictors, the choice of interpolation (linear vs. step function) when constructing the ROC curve significantly affects the AUC and its interpretability, with some software defaulting to overoptimistic estimates (Muschelli, 2019); see the sketch after this list.
  • If evaluation sets are not representative—e.g., anomalous test data in anomaly detection differ from operational anomalies—then AUC and its variants may provide a false sense of comfort (Škvára et al., 2023).
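
The interpolation pitfall can be reproduced with a binary predictor, whose ROC consists of a single interior operating point plus the two corners: linear (trapezoidal) interpolation yields a noticeably larger area than a conservative step-function construction. The numbers below are illustrative.

```python
import numpy as np

# ROC of a binary predictor: one interior operating point plus the corners.
fpr = np.array([0.0, 0.2, 1.0])
tpr = np.array([0.0, 0.7, 1.0])

# Linear (trapezoidal) interpolation between the ROC points.
trapezoid_auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

# Conservative step-function (lower staircase) construction.
step_auc = np.sum(np.diff(fpr) * tpr[:-1])

print(trapezoid_auc, step_auc)   # 0.75 vs 0.56: the interpolation choice matters
```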

Practical implementation of accuracy-rejection AUC scores in clinical and safety-critical contexts requires joint optimization of calibration, rejection mechanisms, and diagnostic performance criteria. Empirical benchmarks using multi-label data, quantile-calibrated rejection thresholds, and class-specific abstention strategies demonstrate improved reliability and actionable performance (Aperstein et al., 12 Sep 2025).

7. Outlook and Research Directions

Ongoing areas of research and anticipated future trends include:

  • Development of multi-threshold or groupwise metrics (e.g., deep ROC analysis, normalized partial/concordant AUCs) that better reflect performance in subpopulations or critical operational regions (Carrington et al., 2021).
  • Extension of selective AUC frameworks to multi-class settings and integration with fairness constraints (Pugnana et al., 2022).
  • Adaptive learning and evaluation protocols, including active/few-shot anomaly specification and robust open-set recognition metrics that link known and unknown class performance in a coupled manner (e.g., OpenAUC) (Wang et al., 2022).
  • Automated toolkits and pipelines for clinical deployment, facilitating transparent and reproducible calibration and evaluation of selective and uncertainty-aware classifiers.

In sum, accuracy-rejection AUC scores connect classical discriminative ranking theory with modern developments in selective prediction, uncertainty quantification, and operational calibration. Their continued evolution is essential for the reliable and safe deployment of machine learning systems in settings where abstention and confidence-adapted decision-making are paramount.
