Uncertainty-aware Accuracy (UAcc)
- Uncertainty-aware Accuracy (UAcc) is a metric that combines prediction correctness with uncertainty quantification to reward models that are confidently correct and appropriately uncertain when wrong.
- Different UAcc variants use approaches like thresholded confusion matrices, conformal prediction sets, and divergence-based trade-offs to balance calibration and informativeness.
- Empirical studies demonstrate that UAcc improves model evaluation in high-risk fields by penalizing overconfident errors and guiding selective abstention and ensemble methods.
Uncertainty-aware Accuracy (UAcc) is a class of performance metrics that integrate model prediction correctness and quantified uncertainty into a unified evaluation. UAcc metrics have been developed independently across several domains—including classification, selective prediction, vision–language modeling, LLM evaluation, and automated medical diagnosis—to address the limitations of standard accuracy, which does not penalize models for overconfidence, nor reward them for appropriately abstaining or expressing uncertainty in ambiguous or high-risk scenarios. Existing variants of UAcc take the form of thresholded confusion-matrix-based indices, conformal-prediction set-based summaries, information-theoretic or divergence-based trade-off scores, or abstention-curve functions, but all emphasize the alignment between model self-reported uncertainty and actual predictive performance.
1. Foundational Definitions and Formulae
Thresholded Confusion Matrix (Binary UAcc)
Several influential works employ a confusion-matrix formulation, cross-tabulating each prediction by correctness (match to ground truth) and binarized uncertainty (e.g., predictive entropy below or above a threshold) (Asgharnezhad et al., 12 Jun 2025, Mendes et al., 2024, Krishnan et al., 2020):
Each test example is assigned to one of four categories:
- Correct & Certain (CC)
- Correct & Uncertain (CU)
- Incorrect & Certain (IC)
- Incorrect & Uncertain (IU)
The Uncertainty-aware Accuracy is defined as:

$$\mathrm{UAcc} = \frac{N_{CC} + N_{IU}}{N_{CC} + N_{CU} + N_{IC} + N_{IU}}$$

Here, $N_{CC}$ is the count of correct-and-certain predictions, $N_{IU}$ is the count of incorrect-and-uncertain predictions, and so on. UAcc rewards models that are confident when correct and express high uncertainty when incorrect, directly penalizing overconfident errors and underconfident correct predictions (Krishnan et al., 2020, Asgharnezhad et al., 12 Jun 2025, Mendes et al., 2024).
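As a quick worked illustration, take hypothetical counts chosen only for the arithmetic (they are not taken from any of the cited studies): 70 CC, 10 CU, 5 IC, and 15 IU out of 100 test examples. Then

$$\mathrm{UAcc} = \frac{70 + 15}{70 + 10 + 5 + 15} = 0.85, \qquad \mathrm{Acc} = \frac{70 + 10}{100} = 0.80$$

The two numbers differ because UAcc credits the 15 errors that were flagged as uncertain and withholds credit from the 10 correct predictions that were flagged as uncertain.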
Set-based Conformal UAcc (Multiclass/Set-valued Outputs)
For multiclass or set-valued tasks, especially with conformal prediction or abstention wrappers, UAcc is defined via the informativeness and coverage of the returned prediction sets (Karim et al., 19 Sep 2025, Kostumov et al., 2024, Ye et al., 2024):
- Let $C(x)$ be the set-valued prediction for input $x$, with $|C(x)|$ its cardinality.
- The set-based UAcc (as in (Karim et al., 19 Sep 2025)) is:
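- Alternative: assign each correct case (true label in the set) a credit of $1/|C(x)|$ and average over the $N$ test examples: $\mathrm{UAcc} = \frac{1}{N}\sum_{i=1}^{N} \frac{\mathbb{1}[y_i \in C(x_i)]}{|C(x_i)|}$. This penalizes large, uninformative sets and standardizes scores across tasks (Ye et al., 2024).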
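The credit-per-set-size variant can be sketched in a few lines of Python. This is a minimal illustration and not code from any cited paper: the function name `credit_uacc` and the toy sets and labels are assumptions made for the example.

```python
def credit_uacc(prediction_sets, y_true):
    """Average credit of 1/|C(x)| for sets that contain the true label.

    Sets that miss the true label (or are empty) contribute zero credit.
    """
    if len(prediction_sets) != len(y_true):
        raise ValueError("need one prediction set per label")
    if not y_true:
        raise ValueError("need at least one example")
    total = 0.0
    for pred_set, y in zip(prediction_sets, y_true):
        if y in pred_set:
            total += 1.0 / len(pred_set)
    return total / len(y_true)


# Toy data (hypothetical): two singleton hits, one two-label hit, one miss.
sets = [{"A"}, {"A", "B"}, {"B", "C", "D", "E"}, {"C"}]
labels = ["A", "B", "A", "C"]
score = credit_uacc(sets, labels)  # (1 + 0.5 + 0 + 1) / 4
```

The third example shows the intended penalty: a four-label set that still misses the true label earns nothing, and the two-label hit earns only half the credit of a singleton hit.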
Trade-off (Imprecision-Accuracy) Objective for Credal/Evidential Models
For credal-set and belief function models, UAcc generalizes to include both a best-case divergence-to-target and a non-specificity (imprecision) penalty (Manchingal et al., 28 Jan 2025):

$$\mathcal{E} = \mathcal{D} + \lambda\,\mathcal{N}$$

where $\mathcal{D}$ is the minimum KL divergence to the target distribution over the vertices of the predicted credal set, $\mathcal{N}$ quantifies the log-cardinality-weighted size of credal mass assignments, and $\lambda$ controls the trade-off.
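The trade-off described above can be sketched as follows. This is an illustrative reading of the prose and not the cited authors' implementation. It rests on several assumptions: the names `tradeoff_score` and `lam` are hypothetical, the divergence is taken as KL from the target to each vertex, the non-specificity is computed as the sum of $m(A)\log|A|$ over focal sets, and the vertices and mass function are passed in separately instead of being derived from one another.

```python
import math


def kl_divergence(target, p):
    """KL(target || p) in nats; returns inf if p gives zero mass where target does not."""
    total = 0.0
    for t, q in zip(target, p):
        if t > 0:
            if q == 0:
                return math.inf
            total += t * math.log(t / q)
    return total


def non_specificity(mass):
    """Sum of m(A) * log|A| over focal sets A (dict: frozenset -> mass)."""
    return sum(m * math.log(len(focal)) for focal, m in mass.items())


def tradeoff_score(target, vertices, mass, lam):
    """Best-case divergence over the vertices plus a weighted imprecision penalty."""
    best_kl = min(kl_divergence(target, v) for v in vertices)
    return best_kl + lam * non_specificity(mass)


# Toy example (hypothetical): 3 classes, true class 0.
target = [1.0, 0.0, 0.0]
vertices = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25]]
mass = {frozenset({0}): 0.5, frozenset({0, 1}): 0.5}
score = tradeoff_score(target, vertices, mass, lam=1.0)
```

Under these assumptions lower scores are better: a model can shrink the divergence term by widening its credal set, but the non-specificity term then grows, and `lam` sets how heavily that widening is penalized.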
2. Motivation, Intuition, and Theoretical Properties
UAcc is motivated by the demand for models that “know what they know and know what they don’t.” Classic accuracy or coverage-based metrics ignore whether a model is correctly calibrated: overconfident errors (“confident and wrong”) are treated identically to uncertain errors, and abstention or set-valued outputs are not directly rewarded when reducing risk. UAcc or similar metrics penalize:
- Overconfident misclassifications (IC)
- Underconfident correct classifications (CU)
- Set-valued predictions that "hedge" by including too many candidates
The ideal UAcc-optimized model is one that is certain and correct, and only uncertain—or abstaining—in high-risk or ambiguous cases.
Theoretical properties of UAcc variants include:
- Boundedness: Typically $\mathrm{UAcc} \in [0, 1]$; variants normalized by set size may fall in a different range before scaling (Kostumov et al., 2024).
- Coverage guarantees: In conformal-prediction use, the credit-per-set-size UAcc cannot exceed the empirical coverage, because each example's credit is at most its coverage indicator.
- Model-agnosticism: UAcc operates on outputs from any classifier, provided uncertainty estimates are available.
3. Implementation and Computation
Binary Thresholded UAcc
- For each instance, compute:
  - Predicted class $\hat{y}$
  - Uncertainty score $u$ (e.g., entropy, confidence)
- Choose an uncertainty threshold $\tau$ (e.g., validated or domain-specific).
- Categorize each instance as CC, CU, IC, or IU.
- Tally the counts and compute UAcc. (Krishnan et al., 2020, Asgharnezhad et al., 12 Jun 2025, Mendes et al., 2024)
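The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: predictive entropy is used as the uncertainty score, an instance counts as "certain" when its entropy is at or below the threshold, and the function names, toy probabilities, and threshold of 0.5 nats are all hypothetical choices for the example.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def binary_uacc(y_true, y_pred, uncertainty, threshold):
    """Cross-tabulate correctness and certainty, then return (UAcc, counts)."""
    counts = {"CC": 0, "CU": 0, "IC": 0, "IU": 0}
    for y, y_hat, u in zip(y_true, y_pred, uncertainty):
        correctness = "C" if y == y_hat else "I"
        certainty = "C" if u <= threshold else "U"
        counts[correctness + certainty] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one example")
    return (counts["CC"] + counts["IU"]) / total, counts


# Toy binary problem (hypothetical probabilities and labels).
probs = [
    [0.95, 0.05],  # low entropy, predicts class 0
    [0.55, 0.45],  # high entropy, predicts class 0
    [0.10, 0.90],  # low entropy, predicts class 1
    [0.48, 0.52],  # high entropy, predicts class 1
]
y_true = [0, 1, 0, 1]
y_pred = [max(range(len(p)), key=p.__getitem__) for p in probs]
entropies = [predictive_entropy(p) for p in probs]
uacc, counts = binary_uacc(y_true, y_pred, entropies, threshold=0.5)
```

In this toy run each of the four cells receives exactly one instance, so plain accuracy and UAcc both come out at 0.5. Shifting the threshold changes which instances count as certain, which is why the threshold choice affects comparability across studies.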
Set-valued/Conformal UAcc
- For output probabilities, compute conformal prediction sets at the desired risk level $\alpha$ (e.g., $\alpha = 0.1$ for 90% coverage).
- On the test set:
  - For each example, check whether $y \in C(x)$ and record $|C(x)|$.
  - Aggregate coverage (fraction with $y \in C(x)$) and average set size.
- Calculate UAcc via the appropriate formula. (Manchingal et al., 28 Jan 2025, Karim et al., 19 Sep 2025, Ye et al., 2024, Kostumov et al., 2024)
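The set-construction and aggregation steps can be sketched with split conformal prediction. This is a sketch under stated assumptions and not the procedure of any one cited paper: the nonconformity score is taken to be one minus the probability of the true label (a common choice), the threshold is the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score, and the function names and toy data are hypothetical. The example uses $\alpha = 0.25$ only to keep the seven-point calibration set small.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibration quantile of the scores 1 - p(true label)."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points: keep every label
    return scores[rank - 1]


def prediction_set(probs, q_hat):
    """All labels whose score 1 - p falls at or below the threshold."""
    return {k for k, p in enumerate(probs) if 1.0 - p <= q_hat}


def coverage_and_size(test_probs, test_labels, q_hat):
    """Empirical coverage and average set size on the test split."""
    sets = [prediction_set(p, q_hat) for p in test_probs]
    covered = sum(y in s for s, y in zip(sets, test_labels))
    avg_size = sum(len(s) for s in sets) / len(sets)
    return covered / len(sets), avg_size


# Toy calibration split (hypothetical), 3 classes.
cal_probs = [
    [0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7], [0.6, 0.3, 0.1],
    [0.25, 0.5, 0.25], [0.3, 0.3, 0.4], [0.3, 0.5, 0.2],
]
cal_labels = [0, 1, 2, 0, 1, 2, 0]
q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

# Toy test split (hypothetical).
test_probs = [[0.7, 0.2, 0.1], [0.45, 0.45, 0.1], [0.2, 0.3, 0.5], [0.5, 0.3, 0.2]]
test_labels = [0, 1, 2, 1]
coverage, avg_size = coverage_and_size(test_probs, test_labels, q_hat)
```

The resulting coverage and average set size are the two ingredients that the set-based UAcc variants combine. With a calibration set this small the coverage guarantee is loose, so the numbers here illustrate only the mechanics.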
Rejection/Abstention UAcc
For Dirichlet wrappers or selective classification approaches:
- Sort predictions by uncertainty.
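- For each rejection threshold $\tau$, compute accuracy over the retained instances $R_\tau$: $\mathrm{UAcc}(\tau) = \frac{1}{|R_\tau|}\sum_{i \in R_\tau} \mathbb{1}[\hat{y}_i = y_i]$. Then plot UAcc versus coverage, or average over thresholds for a scalar summary. (Mena et al., 2019)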
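A rejection curve of this kind can be sketched as below. The sketch parameterizes rejection by a rate (the fraction of most-uncertain instances dropped) instead of a raw uncertainty threshold, which is an assumption made for convenience. The function name `rejection_curve` and the toy data are hypothetical.

```python
def rejection_curve(y_true, y_pred, uncertainty, rejection_rates):
    """Return (coverage, accuracy) pairs, dropping the most uncertain instances first."""
    n = len(y_true)
    if n == 0:
        raise ValueError("need at least one example")
    # Indices ordered from least to most uncertain.
    order = sorted(range(n), key=lambda i: uncertainty[i])
    curve = []
    for rate in rejection_rates:
        n_keep = max(n - int(round(rate * n)), 1)  # always retain one instance
        kept = order[:n_keep]
        accuracy = sum(y_true[i] == y_pred[i] for i in kept) / n_keep
        curve.append((n_keep / n, accuracy))
    return curve


# Toy data (hypothetical): the two errors carry the highest uncertainty.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
uncertainty = [0.1, 0.2, 0.9, 0.3, 0.8]
curve = rejection_curve(y_true, y_pred, uncertainty, [0.0, 0.2, 0.4])
mean_accuracy = sum(acc for _, acc in curve) / len(curve)  # scalar summary
```

Because the errors in this toy data sit at the high-uncertainty end, accuracy on the retained instances rises as more instances are rejected. A model whose uncertainty is unrelated to its errors would produce a flat curve.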
4. Use Cases and Model Comparison
UAcc metrics have been deployed in:
- Medical image diagnosis, assessing DL classifier reliability with ensemble, MC Dropout, and EMCD approaches. Ensembles consistently deliver higher UAcc; MC Dropout increases uncertainty sensitivity but can lower UAcc (Asgharnezhad et al., 12 Jun 2025).
- LLM benchmarks and Vision-LLMs evaluated via conformal prediction sets, demonstrating that high top-1 accuracy does not guarantee high UAcc if uncertainty sets are large or uninformative (Kostumov et al., 2024, Ye et al., 2024, Karim et al., 19 Sep 2025).
- Selective classification and API audit scenarios, where rejecting high-uncertainty cases can yield substantial UAcc improvements under distributional or domain shift (Mena et al., 2019).
- Error-driven training and Bayesian networks, where UAcc serves as both a validation metric and a training objective for explicitly shaping the model’s uncertainty-accuracy coupling (Krishnan et al., 2020, Mendes et al., 2024).
The following table summarizes key UAcc formulae and their primary context:
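| UAcc Variant | Formula / Key Expression | Use Case Domain |
|---|---|---|
| Binary thresholded UAcc | $(N_{CC} + N_{IU}) / (N_{CC} + N_{CU} + N_{IC} + N_{IU})$ | Classification, uncertainty threshold |
| Set-valued conformal UAcc | Combines coverage and average prediction-set size | Multiclass, conformal prediction |
| Credit-per-set-size UAcc | $\frac{1}{N}\sum_i \mathbb{1}[y_i \in C(x_i)] / \lvert C(x_i) \rvert$ | LLMs, VQA, confidence sets |
| Distance+imprecision UAcc | Minimum KL divergence plus weighted non-specificity penalty | Credal, epistemic, belief-function models |
| Rejection/abstention UAcc | Accuracy over retained instances at rejection rate $r$ | Selective/rejection classifiers |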
5. Empirical Findings and Best Practices
Empirical studies demonstrate that UAcc is sensitive to both model calibration and uncertainty alignment:
- Deep ensembles and well-tuned thresholding generally outperform single models and MC Dropout in maximizing UAcc (Asgharnezhad et al., 12 Jun 2025, Krishnan et al., 2020).
- Small, well-calibrated ensembles (5–6 members) provide a maximal UAcc/uncertainty trade-off for clinical deployments (Asgharnezhad et al., 12 Jun 2025).
- For set-valued outputs, models that can reliably produce singleton prediction sets without sacrificing coverage (e.g., Llama-3 8B vs. Qwen 3B in essay scoring tasks) achieve distinctly higher UAcc (Karim et al., 19 Sep 2025).
- UAcc is robust to calibration-split fraction, error rate $\alpha$, and other hyperparameters in LLM evaluation (Ye et al., 2024).
- Selective abstention on even 10–20% of most uncertain cases can recover large fractions of accuracy lost under domain shift (Mena et al., 2019).
6. Limitations and Trade-offs
While UAcc addresses deficiencies of standard accuracy and calibration scores, its application carries trade-offs:
- Binary thresholded UAcc and coverage-based UAcc require selection of an uncertainty threshold, which can be domain-specific and affect comparability (Asgharnezhad et al., 12 Jun 2025, Krishnan et al., 2020).
- Set-size-based UAcc penalizes large sets, but does not distinguish among types of errors or handle ordinal label structure unless extended (Karim et al., 19 Sep 2025).
- For credal and belief models, the choice of imprecision penalty weight in UAcc directly shapes which models are favored, requiring careful tuning to reflect domain-specific risk/certainty priorities (Manchingal et al., 28 Jan 2025).
- In clinical/critical settings, UAcc should be interpreted alongside sensitivity/specificity at the selected threshold, as it does not account for class-conditional error costs (Asgharnezhad et al., 12 Jun 2025).
- UAcc is not directly suitable for unconstrained generative tasks or open-ended outputs without further adaptation (Ye et al., 2024, Karim et al., 19 Sep 2025).
- Some UAcc variants can mask per-sample variability (e.g., average set size) or dilute insight when a few samples require very large prediction sets (Karim et al., 19 Sep 2025).
7. Extensions and Open Directions
Proposed and ongoing extensions to UAcc include:
- Ordinal-aware conformal scores to address label structure in grading/rubric-based tasks (Karim et al., 19 Sep 2025).
- Parameterized trade-off tuning (e.g., of the imprecision weight in credal UAcc) to interpolate between precision and informativeness (Manchingal et al., 28 Jan 2025).
- Ensemble combinations of multiple uncertainty scoring functions to reduce method dependence (Ye et al., 2024).
- Application to structured prediction, sequence modeling, and regression, extending beyond classification (Mendes et al., 2024, Krishnan et al., 2020).
- Exploring UAcc in fully open-set or multi-modal reasoning tasks, particularly for next-generation foundation models and cross-domain benchmarks (Kostumov et al., 2024).
Uncertainty-aware Accuracy thus provides a principled, operational evaluation axis for selecting, calibrating, and deploying predictive models in high-stakes and risk-sensitive environments, complementing and extending the scope of traditional accuracy and calibration metrics.