Reliable Accuracy (R-Acc) Metric

Updated 6 May 2026

R-Acc is a composite metric that integrates prediction accuracy with uncertainty calibration, offering a clear measure of model trustworthiness.
It is applied in diverse fields such as probabilistic forecasting, quantum error mitigation, agent evaluations, and recommender systems.
R-Acc methodologies balance scoring rules, conformal calibration, and ICC estimators to achieve optimal trade-offs between sharpness and reliability.

Reliable Accuracy (R-Acc) is a contemporary metric and methodological paradigm that captures both prediction performance (accuracy) and the associated uncertainty calibration (reliability) of computational models. The concept is domain-agnostic but has concrete instantiations in probabilistic forecasting, black-box AI certification, quantum computation, agent-based evaluation, recommender systems, and beyond. R-Acc extends classical accuracy by explicitly integrating statistical consistency and confidence guarantees, thus furnishing a single reportable quantity (or composite metric) with direct implications for trustworthiness in scientific and engineering applications.

1. Formal Definitions and Cross-Domain Scope

R-Acc combines sharpness (accuracy with respect to the ground truth) and calibration (reliability of probabilistic or confidence statements). In classical regression and uncertainty quantification, R-Acc is realized through a convex combination of a proper scoring rule (e.g., CRPS) and a statistical reliability score (RS) as in the Accuracy–Reliability (AR) cost function (Camporeale, 2018). For black-box AI systems, R-Acc manifests as the "reliability level"—the exact marginal coverage rate of the most frequent (mode) answer under self-consistency sampling with conformal calibration (Mouzouni, 24 Feb 2026). In quantum circuits, R-Acc quantifies the proximity of error-mitigated observables to the ideal, accounting for both bias and variance within a certified reliability interval (Aharonov et al., 14 Aug 2025). In agentic multi-trial evaluation, R-Acc emerges as an accuracy estimate modulated by the intraclass correlation coefficient (ICC), quantifying the proportion of variability attributable to genuine task difficulty versus model stochasticity (Mustahsan et al., 7 Dec 2025). In recommender systems, R-Acc fuses measures of prediction and recommendation reliability to reflect alignment between confidence and empirical accuracy (Bobadilla et al., 2024).

2. Core Mathematical Constructs

2.1 General Regression and Forecasting

Given $N$ prediction–observation triples $(\mu_i, \sigma_i, y^o_i)$, each consisting of a predicted mean, a predicted standard deviation, and an observation, with errors $\epsilon_i = y^o_i - \mu_i$, the AR cost for Gaussian forecasts is

$\mathrm{AR} = \beta \cdot \frac{1}{N} \sum_{i=1}^N \mathrm{CRPS}_i + (1-\beta) \cdot \mathrm{RS},$

where the per-sample CRPS and the reliability score RS are:

$\mathrm{CRPS}_i = \sigma_i \left[ \frac{\epsilon_i}{\sigma_i} \operatorname{erf}\left(\frac{\epsilon_i}{\sqrt{2}\sigma_i}\right) + \sqrt{\frac{2}{\pi}} e^{-\frac{\epsilon_i^2}{2\sigma_i^2}} \right] - \frac{\sigma_i}{\sqrt{\pi}}$

$\mathrm{RS} = \sum_{i=1}^N \left[ \frac{\eta_i}{N}(\operatorname{erf}(\eta_i)+1) - \frac{(2i - 1)\eta_i}{N^2} + \frac{e^{-\eta_i^2}}{\sqrt{\pi}N} \right] - \text{const},\quad \eta_i = \epsilon_i / (\sqrt{2}\sigma_i)$

Here the standardized errors $\eta_i$ are sorted in ascending order, so the index $i$ in the $(2i - 1)$ term is a rank, and "const" is an offset that does not depend on the forecasts.

The hyperparameter $\beta$ is chosen so the two terms contribute equally at their respective minima.
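As an illustration of the formulas above, the following Python sketch evaluates the Gaussian CRPS, the reliability score, and the AR cost using only the standard library. The function names are illustrative and do not come from the cited papers. Two points are assumptions: the offset written as "const" is taken to be $\tfrac{1}{2}\sqrt{2/\pi}$, the value for which RS equals the integrated squared difference between the empirical distribution of the $\eta_i$ and the reference $\tfrac{1}{2}(\operatorname{erf}(\eta)+1)$; and `balanced_beta` reads "contribute equally at their respective minima" as $\beta\,\mathrm{CRPS}_{\min} = (1-\beta)\,\mathrm{RS}_{\min}$.

```python
import math

def crps_gaussian(eps, sigma):
    """CRPS of a Gaussian forecast with error eps = y_obs - mu and std sigma."""
    z = eps / sigma
    bracket = (z * math.erf(z / math.sqrt(2.0))
               + math.sqrt(2.0 / math.pi) * math.exp(-0.5 * z * z))
    return sigma * bracket - sigma / math.sqrt(math.pi)

def reliability_score(eps, sigma):
    """RS over N samples; eta values are sorted so that index i is a rank."""
    n = len(eps)
    eta = sorted(e / (math.sqrt(2.0) * s) for e, s in zip(eps, sigma))
    total = 0.0
    for i, h in enumerate(eta, start=1):
        total += (h / n) * (math.erf(h) + 1.0)
        total -= (2 * i - 1) * h / n ** 2
        total += math.exp(-h * h) / (math.sqrt(math.pi) * n)
    # Assumption: the constant offset "const" equals 0.5 * sqrt(2 / pi).
    return total - 0.5 * math.sqrt(2.0 / math.pi)

def ar_cost(eps, sigma, beta):
    """Convex combination of mean CRPS and RS with weight beta."""
    mean_crps = sum(crps_gaussian(e, s) for e, s in zip(eps, sigma)) / len(eps)
    return beta * mean_crps + (1.0 - beta) * reliability_score(eps, sigma)

def balanced_beta(crps_min, rs_min):
    """Assumed reading: beta * crps_min == (1 - beta) * rs_min."""
    return rs_min / (crps_min + rs_min)

if __name__ == "__main__":
    errors = [0.3, -1.2, 0.5, 0.1]   # toy data, not from any cited study
    sigmas = [1.0, 1.0, 1.0, 1.0]
    print(ar_cost(errors, sigmas, beta=0.5))
```

Sorting the $\eta_i$ inside `reliability_score` keeps the caller free to pass errors in any order while preserving the rank interpretation of the $(2i - 1)$ term.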

2.2 Black-Box Reliability in AI

For a system calibrated on $n$ conformal calibration examples, R-Acc (denoted $1 - \alpha^\star$) is the largest confidence level at which the coverage of the single-answer (mode-vote) predictor is valid:

$1 - \alpha^\star = \frac{|\{\,i : s_i \leq 1\,\}|}{n + 1}$

where $s_i$ is the rank of the ground-truth answer among the self-consistency samples for calibration query $i$, so that $s_i \leq 1$ means the ground truth is the mode answer. This coverage guarantee is distribution-free, with slack at most $1/(n+1)$ (Mouzouni, 24 Feb 2026).
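The reliability level can be computed directly once each calibration query has a rank for its ground-truth answer. The sketch below is illustrative only: the helper names, the convention that rank 1 is the most frequently sampled answer, the pessimistic tie-breaking, and the infinite rank for an answer that was never sampled are all assumptions, not details taken from Mouzouni (24 Feb 2026).

```python
import math
from collections import Counter

def truth_rank(samples, truth):
    """Rank of the ground-truth answer by sample frequency (1 = mode).

    Assumptions: ties are broken against the ground truth, and an answer
    that never appears in the samples gets an infinite rank.
    """
    counts = Counter(samples)
    if truth not in counts:
        return math.inf
    # Number of distinct answers at least as frequent as the truth.
    return sum(1 for c in counts.values() if c >= counts[truth])

def reliability_level(ranks):
    """1 - alpha* = |{i : s_i <= 1}| / (n + 1) over n calibration ranks."""
    n = len(ranks)
    return sum(1 for s in ranks if s <= 1) / (n + 1)

if __name__ == "__main__":
    # Toy calibration set: (sampled answers, ground truth) per query.
    calibration = [
        (["12", "12", "13"], "12"),
        (["7", "8", "8"], "7"),
        (["5", "5", "5"], "5"),
    ]
    ranks = [truth_rank(samples, truth) for samples, truth in calibration]
    print(ranks, reliability_level(ranks))
```

Canonicalizing answers before counting (for example, normalizing numeric formats) matters in practice, because `Counter` treats any two distinct strings as different answers.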

2.3 Quantum Error Mitigation

Consider an error-mitigated estimator of an observable and the exact (ideal) quantum value of that observable. R-Acc at a given confidence level quantifies how close the estimator is to the exact value, in terms of the total measured or projected variance of the estimator and the Gaussian quantile at that level. This collapses bias and variance into a unified worst-case guarantee (Aharonov et al., 14 Aug 2025).

2.4 Agentic Multi-Trial Evaluations

Given $T$ tasks and $K$ sampling trials per task, compute the accuracy $\hat{p}$ as the mean outcome over all $T \times K$ trials, together with the variance decomposition

$\sigma^2_{\text{total}} = \sigma^2_B + \sigma^2_W,$

where $\sigma^2_B$ (between-query) and $\sigma^2_W$ (within-query) are estimated from the data. The ICC is $\sigma^2_B / (\sigma^2_B + \sigma^2_W)$. Reliability-adjusted accuracy is interpreted as the accuracy plus its confidence interval, explicitly tracking ICC (Mustahsan et al., 7 Dec 2025).
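One common way to estimate the between-query and within-query variances is a one-way random-effects ANOVA with method-of-moments estimators, under which the ICC is the between-query share of the total variance. The sketch below assumes that estimator, a balanced design (the same number of trials for every task), and truncation of a negative between-query estimate at zero; the function name and return keys are hypothetical, and the estimator used by Mustahsan et al. may differ.

```python
import math

def icc_summary(outcomes):
    """outcomes: list of tasks, each a list of per-trial scores (e.g. 0 or 1)."""
    n_tasks = len(outcomes)
    n_trials = len(outcomes[0])
    if n_tasks < 2 or n_trials < 2 or any(len(r) != n_trials for r in outcomes):
        raise ValueError("need a balanced design with >= 2 tasks and >= 2 trials")

    task_means = [sum(row) / n_trials for row in outcomes]
    accuracy = sum(task_means) / n_tasks

    # Mean squares of the one-way ANOVA decomposition.
    ms_between = n_trials * sum((m - accuracy) ** 2 for m in task_means) / (n_tasks - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(outcomes, task_means) for x in row
    ) / (n_tasks * (n_trials - 1))

    # Method-of-moments variance components (between truncated at zero).
    var_between = max((ms_between - ms_within) / n_trials, 0.0)
    var_within = ms_within
    total = var_between + var_within
    icc = var_between / total if total > 0 else 0.0

    # Standard error of the overall accuracy, treating tasks as clusters.
    std_err = math.sqrt(ms_between / (n_tasks * n_trials))
    return {"accuracy": accuracy, "var_between": var_between,
            "var_within": var_within, "icc": icc, "std_err": std_err}

if __name__ == "__main__":
    runs = [[1, 1, 0, 1], [0, 0, 0, 1], [1, 1, 1, 1]]  # toy data
    print(icc_summary(runs))
```

Reporting `accuracy` together with `std_err` and `icc` matches the practice described above of pairing accuracy with its uncertainty rather than quoting a single number.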

2.5 Recommender System Reliability

Two quantities are defined: reliability–prediction improvement (RPI) and reliability–recommendation improvement (RRI). R-Acc is then a composite of the two (Bobadilla et al., 2024).

3. Methodologies for Estimation and Calibration

R-Acc calculation depends on the modeling framework and domain:

Probabilistic Regression/Uncertainty Quantification: Parameters for the Gaussian forecast distribution $\mathcal{N}(\mu_i, \sigma_i^2)$ (or more general distribution families in non-Gaussian extensions) are optimized using the AR or ACCRUE loss via neural nets or polynomial regressors (Camporeale, 2018, Bandy et al., 9 Apr 2026).
Black-Box AI Systems: Reliability certificates are computed through $k$-fold self-consistency sampling, canonicalization of outputs, and inductive conformal calibration, with a guarantee that rests on exchangeability (Mouzouni, 24 Feb 2026).
Quantum Computing: Characterization batches and runtime allocation are used to minimize variance for a specified bias bound, with R-Acc tracked as an explicit experiment-wide metric (Aharonov et al., 14 Aug 2025).
Agentic Evaluations: Multiple (8–64) trials per task are recommended for stable ICC estimation. R-Acc is not collapsed to a single number but always paired with reporting of CI width and ICC (Mustahsan et al., 7 Dec 2025).
Recommender Systems: RPI and RRI are batch-computed for each candidate reliability signal; online deployment leverages these scores for filtering, ranking, and user feedback (Bobadilla et al., 2024).

4. Theoretical Guarantees and Properties

The R-Acc instantiations come with the following guarantees and properties:

Distribution-Free Marginal Coverage: Conformal calibration ensures that the reported R-Acc is a lower bound on the true probability of correctness, up to $1/(n+1)$ slack (Mouzouni, 24 Feb 2026).
Optimal Trade-Off: $\beta$-weighting in the AR and ACCRUE losses guarantees that neither sharpness nor calibration dominates, avoiding pathological over-fitting or under-confidence (Camporeale, 2018, Bandy et al., 9 Apr 2026).
Variance Scaling: In quantum mitigation, R-Acc scaling is characterized by exponential dependence on active volume and gate infidelity, dictating the regimes where certified high reliability is attainable (Aharonov et al., 14 Aug 2025).
Stability: In agentic evaluations, high ICC is required for trustworthy gains; otherwise accuracy improvements may be illusory (Mustahsan et al., 7 Dec 2025).
Interpretation: R-Acc quantitatively distinguishes models that "know what they know" from those whose high accuracy arises by chance or overconfident miscalibration.

5. Empirical Performance and Application Benchmarks

5.1 Black-Box AI

R-Acc values for leading LLMs on arithmetic and factual benchmarks are:

Model	Task	R-Acc (%)	Calibration Size
GPT-4.1	GSM8K	94.6	not available
GPT-4.1	TruthfulQA	96.8	not available
GPT-4.1-nano	GSM8K	89.8	not available
GPT-4.1-nano	MMLU	66.5	not available

These are strict lower bounds with coverage validated on additional test data (Mouzouni, 24 Feb 2026). Conditional coverage on solvable items consistently exceeds 0.93.

5.2 Quantum Error Mitigation

QESEM achieves higher R-Acc for Kicked-Ising and VQE energy targets (103-qubit and molecular cases) than zero-noise extrapolation methods, which yield consistently lower and systematically biased scores (Aharonov et al., 14 Aug 2025).

5.3 Agentic Evaluation

ICC values for multi-trial accuracy evaluation:

Task & Model	Accuracy (%)	ICC
GAIA L1, GPT-5 Search	62.3	0.774
GAIA L3, GPT-5 Search	44.2	0.629
FRAMES, GPT-4o Search	63.5	0.735

For convergence, the recommended number of trials per task lies within the 8–64 range: up to 16 for structured tasks, and more for complex reasoning (Mustahsan et al., 7 Dec 2025).

5.4 Recommender Systems

KNN-variability and fast-resample reliability measures yield an RPI of up to 25% (measured as MAE reduction) and an RRI of up to 60% over random baselines on the MovieLens 1M and Netflix datasets (Bobadilla et al., 2024).
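To make "MAE reduction" concrete, the sketch below compares the mean absolute error over all predictions with the error over only the most reliable fraction. This is one plausible reading, not the definition of RPI from Bobadilla et al. (2024), which is not reproduced here; the function names, the `keep_fraction` parameter, and the toy ratings are all hypothetical.

```python
import math

def mean_absolute_error(preds, truths):
    return sum(abs(p - t) for p, t in zip(preds, truths)) / len(preds)

def mae_reduction(preds, truths, reliabilities, keep_fraction=0.5):
    """Relative MAE reduction when keeping only the most reliable predictions.

    Assumption: higher reliability values mean more trustworthy predictions.
    Returns (baseline_mae - filtered_mae) / baseline_mae, or 0.0 when the
    baseline error is already zero.
    """
    n = len(preds)
    n_keep = max(1, math.ceil(keep_fraction * n))
    # Sort by reliability, most reliable first.
    ranked = sorted(zip(reliabilities, preds, truths),
                    key=lambda item: item[0], reverse=True)
    kept = ranked[:n_keep]
    baseline = mean_absolute_error(preds, truths)
    filtered = mean_absolute_error([p for _, p, _ in kept],
                                   [t for _, _, t in kept])
    if baseline == 0:
        return 0.0
    return (baseline - filtered) / baseline

if __name__ == "__main__":
    predicted = [3.0, 4.0, 2.0, 5.0]     # toy ratings
    actual = [3.0, 4.0, 4.0, 1.0]
    reliability = [0.9, 0.8, 0.2, 0.1]
    print(mae_reduction(predicted, actual, reliability))
```

A positive value indicates that the reliability signal concentrates accurate predictions at the top; a negative value indicates a signal that is misaligned with actual error.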

6. Deployment, Tuning, and Practical Considerations

Deployment of R-Acc involves:

Model/Task Drift: Periodic recalibration on fresh batches is necessary to maintain reliability guarantees in dynamic or distribution-shifting environments (Mouzouni, 24 Feb 2026, Aharonov et al., 14 Aug 2025).
Sequential Stopping: Sample-efficient reliability certification using sequential stopping reduces required sampling rates, preserving coverage (Mouzouni, 24 Feb 2026).
Reporting Standards: Mandatory co-reporting of point accuracy, reliability metric (AR/RS/ICC), and uncertainty bands is recommended to avoid misinterpretation of single-run accuracy improvements (Mustahsan et al., 7 Dec 2025).
Hyperparameter Tuning: $\beta$ (trade-off parameter) is selected via grid search on a validation split or using theoretical minima alignment (Camporeale, 2018, Bandy et al., 9 Apr 2026).

In recommender systems, offline analysis is supplemented by real-time reliability filtering and interface transparency, with continuous reevaluation on holdout splits (Bobadilla et al., 2024).

7. Interpretations, Open Problems, and Future Directions

The role of R-Acc in high-stakes and robust machine learning remains an active research area:

The generalization from scalar Gaussian to non-Gaussian (e.g., two-piece Gaussian, asymmetric Laplace) error laws shows improved tail calibration at the cost of analytic complexity (Bandy et al., 9 Apr 2026).
For agentic systems, convergence and interpretability of ICC under adversarial noise or unbalanced task distributions pose open challenges (Mustahsan et al., 7 Dec 2025).
In quantum settings, projections suggest that R-Acc can be maintained at scale with improved hardware (lower gate infidelity and larger active volume), potentially demarcating the threshold of quantum computational advantage (Aharonov et al., 14 Aug 2025).
A plausible implication is that as AI systems, quantum devices, and recommender systems become more integral to decision-making, R-Acc or directly analogous reliability metrics will supplant raw accuracy as the primary trust criterion.

R-Acc thus offers a mathematically grounded, empirically validated foundation for reporting and acting on model outputs in domains where certainty and validity, not just performance, matter.