Nonconformity Scores in Conformal Prediction

Updated 25 February 2026

Nonconformity scores are functions that quantify how atypical a candidate label is relative to observed data and a learned model.
The calibration and design of these scores—including normalization, adaptivity, and energy-based adjustments—directly affect the efficiency and size of prediction sets.
Specialized constructions extend nonconformity scores to domains like clustering, recommender systems, and online adaptations, balancing coverage guarantees with practical informativeness.

A nonconformity score is a function, typically denoted α or S, that quantifies the "strangeness" or atypicality of a candidate label y relative to a given input x (and learned model f) by comparison to a reference set of observed data. In the conformal prediction (CP) framework, nonconformity scores are central to constructing set- or interval-valued predictions that achieve rigorous statistical guarantees for coverage, in both classification and regression, as well as in more specialized contexts such as clustering and recommender systems. The specific definition of the nonconformity score, and the choice of how to calibrate and adapt it, strongly affects the efficiency and adaptivity of the prediction sets. This article systematically surveys the mathematical definitions, constructions, design considerations, and empirical impacts of nonconformity scores, as well as their extensions and specializations across modern CP methodologies.

1. Foundational Principles and Definitions

A nonconformity score is a measurable function

$S:\mathcal X \times \mathcal Y \to \mathbb R$

that, given a test observation x and a candidate label y, assigns a scalar value indicating how "atypical" y is in the context of the learned model f and the calibration data. In most conformal algorithms, S is constructed so that larger values indicate greater nonconformity. For regression, classical choices include absolute residuals

$S(x, y) = |y - f(x)|$

or normalized residuals

$S(x, y) = \frac{|y - f(x)|}{\hat\sigma(x)}$

with $\hat\sigma(x)$ a local scale predictor. In classification, commonly employed choices are the inverse predicted probability

$S(x, y) = 1 - p(y \mid x)$

the negative log-probability (cross-entropy),

$S(x, y) = -\log p(y \mid x)$

or the margin score,

$S(x, y) = \max_{k \neq y} p(k \mid x) - p(y \mid x)$

Such functions are evaluated on a held-out calibration set to generate a set of "reference" nonconformity scores $\{S(x_i, y_i)\}$. Given a new observation $x_{\mathrm{test}}$, one computes the set $\{y : S(x_{\mathrm{test}}, y) \leq q_{1-\alpha}\}$, where $q_{1-\alpha}$ is the (appropriately corrected) $(1-\alpha)$ quantile of calibration scores, attaining marginal coverage at the specified level under exchangeability.
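The calibration-and-threshold procedure can be sketched in a few lines. This is a minimal illustration for the score $S(x, y) = 1 - p(y \mid x)$, assuming the common finite-sample correction that takes the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score; the function names `conformal_quantile` and `prediction_sets` and the synthetic softmax data are illustrative assumptions, not part of any cited method.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the order statistic to use
    if k > n:
        return np.inf  # too few calibration points: only the full label set is valid
    return np.sort(scores)[k - 1]

def prediction_sets(probs_test, q_hat):
    """Boolean mask; entry (i, y) is True when 1 - p(y | x_i) <= q_hat."""
    return (1.0 - probs_test) <= q_hat

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Synthetic stand-in for a classifier's outputs (assumption, for illustration only).
rng = np.random.default_rng(0)
n_cal, n_classes = 500, 4
probs_cal = softmax(rng.normal(size=(n_cal, n_classes)))
labels_cal = np.array([rng.choice(n_classes, p=p) for p in probs_cal])
probs_test = softmax(rng.normal(size=(5, n_classes)))

# Calibration scores use the true labels; the test sets use every candidate label.
cal_scores = 1.0 - probs_cal[np.arange(n_cal), labels_cal]
q_hat = conformal_quantile(cal_scores, alpha=0.1)
sets = prediction_sets(probs_test, q_hat)
```

Returning an infinite threshold when the calibration set is too small keeps the coverage guarantee intact at the cost of a trivial prediction set.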

The design of $S$, in particular its informativeness and its adaptivity to statistical heterogeneity or sample difficulty, directly controls the width or size of the CP prediction set. Coverage guarantees, by contrast, rest not on the form of $S$ but on the exchangeability of calibration and test samples.

2. Normalized, Adaptive, and Context-aware Nonconformity Scores

Adaptive Normalization: Heteroskedasticity and Instance Difficulty

To enhance efficiency, nonconformity scores are often locally normalized. For regression, this takes the form

$\gamma(x, y) = \frac{|y - f(x)|}{\sigma(x)}$

where $\sigma(x)$ is fit specifically to predict $|y-f(x)|$ (absolute residual). Calibration of $\gamma(x, y)$ yields intervals that shrink for "easy" (low-noise) inputs and widen for "hard" (high-noise) ones, closely matching instance-specific uncertainty (Seedat et al., 2023).
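To make the shrink-and-widen behavior concrete, the sketch below calibrates the normalized score and inverts it into intervals $f(x) \pm \hat q\,\sigma(x)$. It assumes the point predictions and scale estimates have already been produced by fitted models; the function name `normalized_intervals` and the small stabilizer `eps` are illustrative assumptions.

```python
import numpy as np

def normalized_intervals(pred_cal, y_cal, sigma_cal, pred_test, sigma_test,
                         alpha, eps=1e-8):
    """Split-conformal intervals from the score |y - f(x)| / sigma(x)."""
    # Normalized calibration scores; eps guards against a zero scale estimate.
    scores = np.abs(y_cal - pred_cal) / (sigma_cal + eps)
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q_hat = np.inf if k > n else np.sort(scores)[k - 1]
    # Inverting the score gives a half-width proportional to the local scale.
    half_width = q_hat * (sigma_test + eps)
    return pred_test - half_width, pred_test + half_width

# Two test points with the same prediction but different estimated difficulty.
lo, hi = normalized_intervals(
    pred_cal=np.zeros(9), y_cal=np.arange(1.0, 10.0), sigma_cal=np.ones(9),
    pred_test=np.zeros(2), sigma_test=np.array([1.0, 2.0]), alpha=0.5,
)
```

In this toy call the second test point has twice the estimated scale of the first, so its interval is twice as wide even though both share the same calibrated quantile.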

An extension introduces self-supervised features: an auxiliary model is trained using a domain-specific pretext task (e.g., autoencoding or VIME-style feature corruption), and its error $\ell_{ss}(x)$ is concatenated to $x$ when training $\sigma$. The final normalized score is

$\gamma_{ss}(x,y) = \frac{|y - f(x)|}{\sigma([x, \ell_{ss}(x)])}$

Empirically, adding self-supervised error as a feature to the residual prediction model yields intervals that are more responsive to local data density, providing substantial gains in efficiency and adaptivity, especially in the long-tailed or data-sparse regimes (Seedat et al., 2023).

Context-Aware and Learnable Functions

In robotics and high-dimensional tasks, nonconformity scores constructed as task-agnostic functions can be overly conservative or insufficiently adaptive. Learnable Conformal Prediction (LCP) replaces $s(x, y)$ with a context-sensitive neural function $s_\theta(x, y) = f_\theta(\phi(x, y))$, where $\phi$ extracts geometric, semantic, and problem-specific features. Empirical results across robotics and vision tasks demonstrate that LCP reduces prediction set sizes (e.g., 4.7–9.9% for classification; up to 54% for detection intervals), while maintaining valid coverage (Kumar et al., 26 Sep 2025).

3. Advanced Nonconformity Score Constructions

Singleton-Optimized Scores

Standard score functions primarily target average set size minimization. Singleton-Optimized Conformal Prediction (SOCOP) introduces a score crafted to directly reduce the probability of non-singleton predictions. This construction solves a geometric optimization over monotone set-valued predictors, showing that the optimal nonconformity for the singleton cost is given in closed form as the smallest slope of the lower convex hull of cumulative label probabilities above a given class rank. The resulting algorithm is $O(K)$ in the number of classes and increases singleton prediction frequency by up to 20% with little change in average set size (Wang et al., 28 Sep 2025).

Penalized and Regularized Family

The Penalized Inverse Probability (PIP) and its regularized variant RePIP combine inverse probability and a penalty that aggregates information from higher-probability classes, allowing fine-grained control of the efficiency–informativeness tradeoff. These scores are defined as

$\Delta^{\mathrm{PIP}}(y) = 1 - \hat p^y + \sum_{r=1}^{R(y)-1} \frac{\hat p^{[r]}}{r}$

and

$\Delta^{\mathrm{RePIP}}(y) = \Delta^{\mathrm{PIP}}(y) + \gamma(R(y)-k_{\rm reg})^+$

where $R(y)$ is the rank of class $y$ and $\gamma$ is a regularization weight. Empirically, PIP and RePIP provide balanced coverage, set size, and singleton rate, outperforming both pure inverse probability and margin scores in complex image classification (Melki et al., 2024).
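The two formulas can be evaluated directly from a probability vector. The sketch below follows the definitions as written above, with $\hat p^{[r]}$ read as the $r$-th largest predicted probability; the helper names (`class_ranks`, `pip_scores`, `repip_scores`) are illustrative and ties are broken by class index, which is an assumption.

```python
import numpy as np

def class_ranks(probs):
    """Return (order, ranks): classes by decreasing probability, and 1-based ranks."""
    order = np.argsort(-probs, kind="stable")
    ranks = np.empty(len(probs), dtype=int)
    ranks[order] = np.arange(1, len(probs) + 1)
    return order, ranks

def pip_scores(probs):
    """PIP score of every candidate class for one probability vector."""
    probs = np.asarray(probs, dtype=float)
    order, _ = class_ranks(probs)
    sorted_p = probs[order]
    # Term r of the penalty is p^[r] / r; a class of rank R sums terms 1..R-1.
    terms = sorted_p / np.arange(1, len(probs) + 1)
    penalty_sorted = np.concatenate(([0.0], np.cumsum(terms)[:-1]))
    scores = np.empty_like(probs)
    scores[order] = 1.0 - sorted_p + penalty_sorted
    return scores

def repip_scores(probs, gamma, k_reg):
    """RePIP: PIP plus gamma * (R(y) - k_reg)^+."""
    probs = np.asarray(probs, dtype=float)
    _, ranks = class_ranks(probs)
    return pip_scores(probs) + gamma * np.maximum(ranks - k_reg, 0)

p = np.array([0.2, 0.5, 0.3])
pip = pip_scores(p)                            # top class keeps the plain 1 - p score
repip = repip_scores(p, gamma=0.1, k_reg=1)    # lower-ranked classes are pushed up further
```

For the top-ranked class the penalty sum is empty, so PIP reduces to the inverse probability score; the penalty only affects classes further down the ranking.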

Energy-Based and Epistemic Uncertainty Enhanced Scores

To correct the overconfidence and lack of adaptivity of softmax-derived scores, recent methodology reweights nonconformity functions using the Helmholtz free energy, computed from the pre-softmax logits $f_j(x)$ at temperature $T$:

$F(x) = -T\log \sum_j \exp(f_j(x)/T)$

The reweighted score is

$S_{\rm EB}(x, y) = S(x, y)\,\phi(F(x))$

with $\phi$ a positive monotonic function of the energy. This approach improves adaptivity and efficiency, especially in difficult, imbalanced, or OOD circumstances, producing smaller sets for easy inputs and larger sets for hard or unfamiliar samples (Attar et al., 23 Feb 2026).
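A minimal sketch of the reweighting follows. The free energy is computed exactly as in the formula above; the choice $\phi(F) = \log(1 + e^{-F})$ (a softplus of the negative energy) and the base score $1 - p(y \mid x)$ are assumptions made here for illustration, since the text only requires $\phi$ to be positive and monotonic.

```python
import numpy as np

def free_energy(logits, T=1.0):
    """F(x) = -T * log sum_j exp(f_j(x) / T), computed with a stable log-sum-exp."""
    z = logits / T
    m = z.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -T * lse

def energy_reweighted_scores(logits, T=1.0):
    """Base score 1 - softmax(logits), multiplied by phi(F(x)) per input."""
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    base = 1.0 - probs
    # Assumed phi: softplus(-F), positive and decreasing in the energy.
    phi = np.logaddexp(0.0, -free_energy(logits, T))
    return base * phi[..., None]

logits = np.array([[0.0, 0.0],     # flat logits: high energy, ambiguous input
                   [10.0, 0.0]])   # one dominant logit: low energy, confident input
F = free_energy(logits)
scores = energy_reweighted_scores(logits)
```

With a decreasing $\phi$, low-energy (confident) inputs have their scores scaled up, so fewer labels fall below a fixed threshold and the sets for easy inputs shrink; other monotone choices of $\phi$ would change the strength of this effect.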

Similarly, the EPICSCORE framework augments base scores by a multiple of the model's epistemic uncertainty, estimated via Bayesian models (GP, MC dropout, BART), so that prediction intervals are inflated in data-sparse regions. Formally,

$s_{\rm EPIC}(x, y) = s_{\rm base}(x, y) + \lambda \sigma_{\rm epi}(x)$

where $\sigma_{\rm epi}(x)$ is the posterior predictive standard deviation conditional on x. This approach combines finite-sample marginal coverage and asymptotic conditional coverage, yielding sets that expand adaptively in uncertain regions (Cabezas et al., 10 Feb 2025).

4. Data Dependency, Aggregation, and Adaptive Extensions

Model Aggregation

Symmetric Aggregated Conformal Prediction (SACP) employs nonconformity score vectors produced by an ensemble of $K$ base models, transforms each via exchangeable e-values, and aggregates them with a symmetric, monotonic function $f$ (e.g., sum, mean, power mean):

$F_i(y) = f(E^{(1)}_i(y), \ldots, E^{(K)}_i(y)), \quad F_{\rm test}(y) = f(E^{(1)}_{\rm test}(y), \ldots, E^{(K)}_{\rm test}(y))$

Calibration and prediction proceed via the aggregated scores, attaining exact coverage. Empirically, the SACP pipeline, especially when p-optimized over the family of power sums, consistently yields sharper uncertainty sets than majority-vote or naive ensembling (Alami et al., 7 Dec 2025).

Semi-Supervised, Local, and Clustering-Driven Scores

The Nearest-Neighbor Matching (NNM) score extends CP to semi-supervised calibration. For each unlabeled point $\tilde x$ with pseudo-label $\hat y$, the score is imputed by matching its pseudo-label nonconformity to the closest labeled sample $(x_{j^*}, y_{j^*})$ and bias-correcting with that sample's labeled-vs-pseudo nonconformity gap:

$\tilde S_{\rm nnm}(\tilde x) = S(\tilde x, \hat y) + [S(x_{j^*}, y_{j^*}) - S(x_{j^*}, \hat y_{j^*})]$

This approach augments calibration pools, improving coverage stability and set size in data-scarce regimes (Zhou et al., 27 May 2025).
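The imputation formula can be sketched as follows. The code assumes that "closest" means nearest in pseudo-label score value, which is one reading of the description; matching in feature space would be another. The function name `nnm_scores` and the toy numbers are illustrative.

```python
import numpy as np

def nnm_scores(pseudo_unlabeled, true_labeled, pseudo_labeled):
    """Impute scores for unlabeled points via nearest-neighbor matching.

    pseudo_unlabeled: S(x~, y_hat) for each unlabeled point
    true_labeled:     S(x_j, y_j) for each labeled point
    pseudo_labeled:   S(x_j, y_hat_j) for each labeled point
    """
    # j*: labeled sample whose pseudo-label score is closest (assumed matching rule).
    dist = np.abs(pseudo_unlabeled[:, None] - pseudo_labeled[None, :])
    j_star = dist.argmin(axis=1)
    # Bias correction: gap between true-label and pseudo-label scores at j*.
    gap = true_labeled[j_star] - pseudo_labeled[j_star]
    return pseudo_unlabeled + gap

imputed = nnm_scores(
    pseudo_unlabeled=np.array([0.15, 0.60]),
    true_labeled=np.array([0.30, 0.40]),
    pseudo_labeled=np.array([0.10, 0.50]),
)
```

The imputed scores would then be pooled with the labeled calibration scores before the conformal quantile is taken.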

Clustering methods use nonconformity-based groupings of calibration data to partition and calibrate locally; for drug-target interaction uncertainty, clustering on residual CDF percentiles yields the tightest, most reliable confidence intervals among various group-conditioned and nearest-neighbor methods (Rakhshaninejad et al., 24 May 2025).

Localized calibration may also employ tree-based weights (QRF) or kernel methods to define data-dependent local distributions of nonconformity, as in adaptive conformal prediction by reweighting scores (Amoukou et al., 2023).

Adaptive, online, or dynamic extensions (e.g., AdaptNC) explicitly optimize both the nonconformity function's parameters and the conformal threshold in a joint online manner, which is essential in nonstationary environments such as robotics, outperforming threshold-scaling-only approaches in maintaining tight sets and exact coverage under shift (Tumu et al., 2 Feb 2026).

5. Specialized Nonconformity Scores in Domains Beyond Standard Prediction

Spectral Clustering

Conformal Prediction-based Spectral Clustering constructs asymmetric affinities using nonconformity scores derived from either k-nearest-neighbor distances or negative kernel density estimates, quantifying the "strangeness" of a point relative to neighborhood subgraphs. The resulting affinity matrix,

$A_{ij} = P(z_i, Nbd(z_j))$

where $P$ is the conformal p-value, adapts to heterogeneous density and local cluster structure, improving normalized mutual information and stability over kernel-only or self-tuning schemes (Chintalapati et al., 2019).
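The sketch below shows how one entry of such an affinity matrix might be computed with a k-nearest-neighbor score. It uses the standard conformal p-value (the fraction of reference scores at least as large as the test score, with a +1 correction); the leave-one-out scoring of neighborhood members and the function names are assumptions for illustration, not the exact construction of the cited work.

```python
import numpy as np

def conformal_p_value(ref_scores, test_score):
    """Share of reference scores at least as nonconforming as the test score."""
    ref_scores = np.asarray(ref_scores)
    return (np.sum(ref_scores >= test_score) + 1) / (len(ref_scores) + 1)

def knn_score(point, others, k):
    """Mean distance from `point` to its k nearest neighbors among `others`."""
    d = np.sort(np.linalg.norm(others - point, axis=1))
    return d[:k].mean()

def p_value_against_neighborhood(z, nbd, k=2):
    """P(z, Nbd): how well z conforms to the points in the neighborhood `nbd`."""
    # Each member is scored against the remaining members (leave-one-out).
    ref = [knn_score(nbd[i], np.delete(nbd, i, axis=0), k) for i in range(len(nbd))]
    return conformal_p_value(ref, knn_score(z, nbd, k))

nbd = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
p_inside = p_value_against_neighborhood(np.array([0.5, 0.5]), nbd)    # central point
p_outside = p_value_against_neighborhood(np.array([10.0, 10.0]), nbd)  # distant point
```

Because the p-value of $z_i$ relative to the neighborhood of $z_j$ generally differs from that of $z_j$ relative to the neighborhood of $z_i$, the resulting matrix is asymmetric, as the text notes.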

Recommendation Systems

Conformal recommendation frameworks utilize specialized nonconformity measures suited to sequential or association-mining recommenders. For group recommenders, nonconformity is constructed as a weighted product over association probabilities, normalized by item support, with leave-one-out recalculation to ensure exchangeability:

$\alpha_i = \frac{\mathrm{Support}(o'_j)}{|U|} \cdot \prod_{o_l\in O^4_i} w_l \cdot P(o_l \mid o'_j)$

This score tightly calibrates the confidence of group recommendations. General inductive conformal recommenders, in turn, offer a taxonomy of 17 distinct conformity/nonconformity aggregations leveraging precedences, supports, and propagation, with empirical evidence favoring median/mean aggregations for validity and efficiency (Kagita et al., 2023, Kagita et al., 2021).

6. Nonconformity Scores under Data Scarcity and Their Empirical Efficiency

Empirical studies emphasize that—in small-sample regimes—absolute-error, normalized absolute error, and quantile-based nonconformity measures display pronounced trade-offs in efficiency. Absolute residuals are robust and simple under homoscedastic noise. Normalized error scores adapt to heteroskedasticity but can be unstable at small n, while quantile-based (CQR-type) measures outperform under strong asymmetry or heavy-tailed noise but can be overly conservative if model misspecification or small-sample effects cause quantile estimates to disagree (Kato et al., 2024).
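For reference, the quantile-based score compared here is usually written as $S(x, y) = \max\{\hat q_{\rm lo}(x) - y,\; y - \hat q_{\rm hi}(x)\}$, the standard conformalized quantile regression form. The sketch below assumes the lower and upper quantile predictions come from some already-fitted quantile regressor; the function names and the toy numbers are illustrative.

```python
import numpy as np

def cqr_scores(y, q_lo, q_hi):
    """Signed distance of y outside [q_lo, q_hi]; negative when y is strictly inside."""
    return np.maximum(q_lo - y, y - q_hi)

def conformal_quantile(scores, alpha):
    """Finite-sample corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(scores)[k - 1]

def cqr_intervals(q_lo_test, q_hi_test, q_hat):
    """Widen (q_hat > 0) or tighten (q_hat < 0) the raw quantile band."""
    return q_lo_test - q_hat, q_hi_test + q_hat

# Toy calibration set with a constant raw band [0, 1] (assumption for illustration).
y_cal = np.array([0.5, 2.0, -1.0])
s_cal = cqr_scores(y_cal, q_lo=np.zeros(3), q_hi=np.ones(3))
q_hat = conformal_quantile(s_cal, alpha=0.5)
lo, hi = cqr_intervals(np.array([0.0]), np.array([1.0]), q_hat)
```

Unlike the absolute-residual score, this score can be negative, which lets calibration tighten a band that the quantile model made too wide.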

Model selection and careful residual analysis are necessary for mitigating high-variance interval widths; practitioners should monitor standard error and interval outlier frequency across random splits. Notably, all such measures maintain marginal coverage regardless of efficiency in finite-sample settings.

7. Theoretical Guarantees and Design Trade-offs

All standard CP coverage guarantees—finite-sample marginal coverage, training-conditional coverage, and, under suitable conditions, asymptotic conditional coverage—hold for any measurable nonconformity function, provided exchangeability assumptions are met. Localized, adaptively normalized, or energy/epistemic reweighted scores preserve all theoretical guarantees if the calibration protocol retains the exchangeability structure (Seedat et al., 2023, Amoukou et al., 2023, Cabezas et al., 10 Feb 2025, Kumar et al., 26 Sep 2025, Tumu et al., 2 Feb 2026, Attar et al., 23 Feb 2026).

However, the choice of score directly modulates prediction efficiency, set adaptivity, and practical informativeness; standard choices such as inverse-probability minimize average set size, margin scores maximize the fraction of singletons, and hybrid or regularized constructions strike a balance between these desiderata (Aleksandrova et al., 2021, Melki et al., 2024, Wang et al., 28 Sep 2025).

In backward conformal prediction (BCP), Markov's inequality is used to convert e-variable–based nonconformity distributions into prediction sets with size guarantees; transforming scores to two-point distributions can shrink the gap between estimated and actual miscoverage, materially tightening theoretical bounds without affecting prediction sets (Liu et al., 2 Feb 2026).
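The Markov step can be written out generically. The notation below ($E$ for an e-variable, $\hat C_\alpha$ for the resulting set) is introduced here for illustration and is not taken from the cited work. For any nonnegative $E(X, Y)$ with $\mathbb E[E(X, Y)] \leq 1$, Markov's inequality gives

$\mathbb P\big(E(X, Y) \geq 1/\alpha\big) \leq \alpha\,\mathbb E[E(X, Y)] \leq \alpha$

so the set

$\hat C_\alpha(x) = \{y : E(x, y) < 1/\alpha\}$

contains the true label with probability at least $1-\alpha$. The bound is loose whenever $E$ places little mass near $1/\alpha$, which is the slack that score transformations aim to reduce.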

References

(Seedat et al., 2023): Improving Adaptive Conformal Prediction Using Self-Supervised Learning
(Kagita et al., 2023): Conformal Group Recommender System
(Chintalapati et al., 2019): Conformal Prediction based Spectral Clustering
(Melki et al., 2024): The Penalized Inverse Probability Measure for Conformal Classification
(Tumu et al., 2 Feb 2026): AdaptNC: Adaptive Nonconformity Scores for Uncertainty-Aware Autonomous Systems in Dynamic Environments
(Wang et al., 28 Sep 2025): Singleton-Optimized Conformal Prediction
(Braun et al., 28 Jul 2025): Multivariate Conformal Prediction via Conformalized Gaussian Scoring
(Kumar et al., 26 Sep 2025): Learnable Conformal Prediction with Context-Aware Nonconformity Functions for Robotic Planning and Perception
(Rakhshaninejad et al., 24 May 2025): Conformal Prediction for Uncertainty Estimation in Drug-Target Interaction Prediction
(Zhou et al., 27 May 2025): Semi-Supervised Conformal Prediction With Unlabeled Nonconformity Score
(Aleksandrova et al., 2021): How Nonconformity Functions and Difficulty of Datasets Impact the Efficiency of Conformal Classifiers
(Amoukou et al., 2023): Adaptive Conformal Prediction by Reweighting Nonconformity Score
(Gupta et al., 2019): Nested conformal prediction and quantile out-of-bag ensemble methods
(Alami et al., 7 Dec 2025): Symmetric Aggregation of Conformity Scores for Efficient Uncertainty Sets
(Liu et al., 2 Feb 2026): ST-BCP: Tightening Coverage Bound for Backward Conformal Prediction via Non-Conformity Score Transformation
(Attar et al., 23 Feb 2026): Softmax is not Enough (for Adaptive Conformal Classification)
(Cabezas et al., 10 Feb 2025): Epistemic Uncertainty in Conformal Scores: A Unified Approach
(Kagita et al., 2021): Inductive Conformal Recommender System
(Kato et al., 2024): Inductive Conformal Prediction under Data Scarcity: Exploring the Impacts of Nonconformity Measures