Nonconformity Scores in Conformal Prediction

Updated 25 February 2026
  • Nonconformity scores are functions that quantify how atypical a candidate label is relative to observed data and a learned model.
  • The calibration and design of these scores—including normalization, adaptivity, and energy-based adjustments—directly affect the efficiency and size of prediction sets.
  • Specialized constructions extend nonconformity scores to domains like clustering, recommender systems, and online adaptations, balancing coverage guarantees with practical informativeness.

A nonconformity score is a function, typically denoted α or S, that quantifies the "strangeness" or atypicality of a candidate label y relative to a given input x (and learned model f) by comparison to a reference set of observed data. In the conformal prediction (CP) framework, nonconformity scores are central to constructing set- or interval-valued predictions that achieve rigorous statistical guarantees for coverage, in both classification and regression, as well as in more specialized contexts such as clustering and recommender systems. The specific definition of the nonconformity score, and the choice of how to calibrate and adapt it, strongly affect the efficiency and adaptivity of the prediction sets. This article surveys the mathematical definitions, constructions, design considerations, and empirical impacts of nonconformity scores, as well as their extensions and specializations across modern CP methodologies.

1. Foundational Principles and Definitions

A nonconformity score is a measurable function

$$S : \mathcal X \times \mathcal Y \to \mathbb R$$

that, given a test observation x and a candidate label y, assigns a scalar value indicating how "atypical" y is in the context of the learned model f and the calibration data. In most conformal algorithms, S is constructed so that larger values indicate greater nonconformity. For regression, classical choices include absolute residuals

$$S(x, y) = |y - f(x)|$$

or normalized residuals

$$S(x, y) = \frac{|y - f(x)|}{\hat\sigma(x)}$$

with $\hat\sigma(x)$ a local scale predictor. In classification, commonly employed choices are the inverse predicted probability

$$S(x, y) = 1 - p(y \mid x)$$

the negative log-probability (cross-entropy),

$$S(x, y) = -\log p(y \mid x)$$

or the margin score,

$$S(x, y) = \max_{k \neq y} p(k \mid x) - p(y \mid x)$$
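The three classification scores above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the model output is a list of class probabilities indexed by label; the function names and the example probabilities are hypothetical.

```python
import math

def inverse_probability_score(probs, y):
    """S(x, y) = 1 - p(y | x)."""
    return 1.0 - probs[y]

def neg_log_probability_score(probs, y, eps=1e-12):
    """S(x, y) = -log p(y | x); eps guards against log(0)."""
    return -math.log(max(probs[y], eps))

def margin_score(probs, y):
    """S(x, y) = max_{k != y} p(k | x) - p(y | x)."""
    best_other = max(p for k, p in enumerate(probs) if k != y)
    return best_other - probs[y]

# Hypothetical predicted probabilities for one input with three classes.
probs = [0.7, 0.2, 0.1]
margins = [margin_score(probs, y) for y in range(len(probs))]
```

In all three cases a larger value means the candidate label conforms less well to the model's prediction, so the most probable class always receives the smallest score.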

Such functions are evaluated on a held-out calibration set to generate a set of "reference" nonconformity scores $\{S(x_i, y_i)\}$. Given a new observation $x_{\mathrm{test}}$, one computes the set $\{y : S(x_{\mathrm{test}}, y) \leq q_{1-\alpha}\}$, where $q_{1-\alpha}$ is the finite-sample-corrected $(1-\alpha)$ quantile of the calibration scores, that is, the $\lceil (n+1)(1-\alpha) \rceil$-th smallest of the $n$ scores. This set attains marginal coverage at the specified level under exchangeability.
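The calibration and prediction steps just described can be sketched as follows. This is a minimal split-conformal illustration in plain Python; the calibration scores, the toy probability table, and the helper names are assumptions made for the example.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return math.inf  # too few calibration points: every label is kept
    return sorted(cal_scores)[k - 1]

def prediction_set(score_fn, x_test, labels, q):
    """All candidate labels whose score does not exceed the threshold q."""
    return [y for y in labels if score_fn(x_test, y) <= q]

# Hypothetical calibration scores S(x_i, y_i) and a toy probabilistic model.
cal_scores = [0.1, 0.3, 0.2, 0.5, 0.4, 0.05, 0.6, 0.25, 0.35]
toy_probs = {"x_new": [0.7, 0.2, 0.1]}

def score_fn(x, y):
    return 1.0 - toy_probs[x][y]  # inverse-probability score

q = conformal_quantile(cal_scores, alpha=0.2)
pred = prediction_set(score_fn, "x_new", labels=[0, 1, 2], q=q)
```

With nine calibration points and alpha = 0.2, the threshold is the eighth smallest score. When the calibration set is too small for the requested level, the threshold is infinite and the set contains every label.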

The design of $S$ (its informativeness and adaptivity to statistical heterogeneity or sample difficulty) directly controls the width or size of the CP prediction set. Coverage guarantees rest not on the form of $S$ but on the exchangeability of calibration and test samples.

2. Normalized, Adaptive, and Context-aware Nonconformity Scores

Adaptive Normalization: Heteroskedasticity and Instance Difficulty

To enhance efficiency, nonconformity scores are often locally normalized. For regression, this takes the form

$$\gamma(x, y) = \frac{|y - f(x)|}{\sigma(x)}$$

where $\sigma(x)$ is fit specifically to predict $|y - f(x)|$ (absolute residual). Calibration of $\gamma(x, y)$ yields intervals that shrink for "easy" (low-noise) inputs and widen for "hard" (high-noise) ones, closely matching instance-specific uncertainty (Seedat et al., 2023).
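A small sketch may help show why normalization produces input-dependent widths. It assumes point predictions and scale estimates are already available as numbers; the calibration triples and helper names below are hypothetical.

```python
import math

def normalized_score(y, y_hat, sigma_hat):
    """gamma(x, y) = |y - f(x)| / sigma(x)."""
    return abs(y - y_hat) / sigma_hat

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(cal_scores)[k - 1]

def normalized_interval(y_hat, sigma_hat, q):
    """Invert gamma(x, y) <= q: the interval is f(x) +/- q * sigma(x)."""
    half_width = q * sigma_hat
    return (y_hat - half_width, y_hat + half_width)

# Hypothetical calibration triples (y, f(x), sigma(x)).
calibration = [(1.0, 1.5, 0.5), (2.0, 1.5, 1.0), (0.0, 0.5, 2.0), (3.0, 2.0, 0.5)]
cal_scores = [normalized_score(y, f, s) for y, f, s in calibration]
q = conformal_quantile(cal_scores, alpha=0.2)

easy = normalized_interval(y_hat=1.0, sigma_hat=0.25, q=q)  # low predicted noise
hard = normalized_interval(y_hat=1.0, sigma_hat=1.0, q=q)   # high predicted noise
```

The same calibrated threshold q is shared by all inputs; only the multiplication by the local scale estimate changes the interval width.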

An extension introduces self-supervised features: an auxiliary model is trained using a domain-specific pretext task (e.g., autoencoding or VIME-style feature corruption), and its error $\ell_{ss}(x)$ is concatenated to $x$ when training $\sigma$. The final normalized score is

$$\gamma_{ss}(x, y) = \frac{|y - f(x)|}{\sigma([x, \ell_{ss}(x)])}$$

Empirically, adding the self-supervised error as a feature of the residual prediction model yields intervals that respond more closely to local data density, improving efficiency and adaptivity, especially in long-tailed or data-sparse regimes (Seedat et al., 2023).

Context-Aware and Learnable Functions

In robotics and high-dimensional tasks, nonconformity scores constructed as task-agnostic functions can be overly conservative or insufficiently adaptive. Learnable Conformal Prediction (LCP) replaces $s(x, y)$ with a context-sensitive neural function $s_\theta(x, y) = f_\theta(\phi(x, y))$, extracting geometric, semantic, and problem-specific features. Empirical results across robotics and vision tasks demonstrate that LCP reduces prediction set sizes (e.g., 4.7–9.9% for classification; up to 54% for detection intervals), while maintaining valid coverage (Kumar et al., 26 Sep 2025).

3. Advanced Nonconformity Score Constructions

Singleton-Optimized Scores

Standard score functions primarily target average set size minimization. Singleton-Optimized Conformal Prediction (SOCOP) introduces a score crafted to directly reduce the probability of non-singleton predictions. This construction solves a geometric optimization over monotone set-valued predictors, showing that the optimal nonconformity for the singleton cost is given in closed form as the smallest slope of the lower convex hull of cumulative label probabilities above a given class rank. The resulting algorithm is $O(K)$ in the number of classes and increases singleton prediction frequency by up to 20% with little change in average set size (Wang et al., 28 Sep 2025).

Penalized and Regularized Family

The Penalized Inverse Probability (PIP) and its regularized variant RePIP combine inverse probability and a penalty that aggregates information from higher-probability classes, allowing fine-grained control of the efficiency–informativeness tradeoff. These scores are defined as

$$\Delta^{\mathrm{PIP}}(y) = 1 - \hat p^y + \sum_{r=1}^{R(y)-1} \frac{\hat p^{[r]}}{r}$$

and

$$\Delta^{\mathrm{RePIP}}(y) = \Delta^{\mathrm{PIP}}(y) + \gamma\,(R(y) - k_{\rm reg})^+$$

where $R(y)$ is the rank of class $y$ when classes are sorted by decreasing predicted probability, $\hat p^{[r]}$ is the probability of the class at rank $r$, $(\cdot)^+$ denotes the positive part, $k_{\rm reg}$ is the rank beyond which the extra penalty applies, and $\gamma$ is a regularization weight. Empirically, PIP and RePIP provide balanced coverage, set size, and singleton rate, outperforming both pure inverse probability and margin scores in complex image classification (Melki et al., 2024).
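The two formulas can be evaluated directly. The sketch below is one reading of them, assuming that rank 1 is the most probable class, that the superscript [r] refers to the r-th largest predicted probability, and that ties are broken by class index; the function names and the example probabilities are hypothetical.

```python
def ranked_classes(probs):
    """Class indices sorted from most to least probable (ties keep index order)."""
    return sorted(range(len(probs)), key=lambda k: -probs[k])

def pip_score(probs, y):
    """1 - p_y plus the rank-discounted probabilities of all classes ranked above y."""
    order = ranked_classes(probs)
    rank = order.index(y) + 1  # R(y), with 1 = most probable class
    penalty = sum(probs[order[r - 1]] / r for r in range(1, rank))
    return 1.0 - probs[y] + penalty

def repip_score(probs, y, gamma, k_reg):
    """PIP plus a linear penalty for ranks beyond k_reg."""
    rank = ranked_classes(probs).index(y) + 1
    return pip_score(probs, y) + gamma * max(rank - k_reg, 0)

# Hypothetical predicted probabilities for a four-class problem.
probs = [0.5, 0.25, 0.1875, 0.0625]
pip = [pip_score(probs, y) for y in range(4)]
repip = [repip_score(probs, y, gamma=0.5, k_reg=2) for y in range(4)]
```

For the top-ranked class the sum is empty, so PIP reduces to the plain inverse-probability score; the penalty only grows for labels that sit below other, more probable classes.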

Energy-Based and Epistemic Uncertainty Enhanced Scores

To correct the overconfidence and lack of adaptivity of softmax-derived scores, recent methodology reweights nonconformity functions using the Helmholtz free energy, computed from the pre-softmax logits $f_j(x)$ at temperature $T$:

$$F(x) = -T \log \sum_j \exp(f_j(x)/T)$$

The reweighted score is $S_{\rm EB}(x, y) = S(x, y)\,\phi(F(x))$, with $\phi$ a positive monotonic function of the energy. This approach improves adaptivity and efficiency, especially in difficult, imbalanced, or out-of-distribution (OOD) settings, producing smaller sets for easy inputs and larger sets for hard or unfamiliar samples (Attar et al., 23 Feb 2026).
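The energy computation and the multiplicative reweighting can be sketched as below. The free energy follows the formula above; the specific choice of phi (a softplus of the negative energy) is an assumption made only for illustration, since the text requires only that phi be positive and monotonic, and all function names are hypothetical.

```python
import math

def free_energy(logits, temperature=1.0):
    """F(x) = -T * log(sum_j exp(f_j(x) / T)), computed with a stable log-sum-exp."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse

def softplus(v):
    """Numerically stable log(1 + exp(v)); always positive."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def energy_weighted_score(base_score, logits, temperature=1.0, phi=None):
    """S_EB(x, y) = S(x, y) * phi(F(x))."""
    if phi is None:
        # ASSUMED weighting, not taken from the source: positive and decreasing in F.
        phi = lambda energy: softplus(-energy)
    return base_score * phi(free_energy(logits, temperature))

confident_logits = [4.0, 0.0, 0.0]  # hypothetical easy input
flat_logits = [1.0, 1.0, 1.0]       # hypothetical hard input
s_easy = energy_weighted_score(0.3, confident_logits)
s_hard = energy_weighted_score(0.3, flat_logits)
```

Under this assumed phi, a confident input (low energy) has its scores scaled up, so fewer labels fall under a shared threshold; a flat logit vector is scaled less, so more labels survive.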

Similarly, the EPICSCORE framework augments base scores by a multiple of the model's epistemic uncertainty, estimated via Bayesian models (GP, MC dropout, BART), so that prediction intervals are inflated in data-sparse regions. Formally,

$$s_{\rm EPIC}(x, y) = s_{\rm base}(x, y) + \lambda\,\sigma_{\rm epi}(x)$$

where $\sigma_{\rm epi}(x)$ is the posterior predictive standard deviation conditional on $x$. This approach combines finite-sample marginal coverage and asymptotic conditional coverage, yielding sets that expand adaptively in uncertain regions (Cabezas et al., 10 Feb 2025).

4. Data Dependency, Aggregation, and Adaptive Extensions

Model Aggregation

Symmetric Aggregated Conformal Prediction (SACP) employs nonconformity score vectors produced by an ensemble of $K$ base models, transforms each via exchangeable e-values, and aggregates them with a symmetric, monotonic function $f$ (e.g., sum, mean, power mean):

$$F_i(y) = f(E^{(1)}_i(y), \ldots, E^{(K)}_i(y)), \quad F_{\rm test}(y) = f(E^{(1)}_{\rm test}(y), \ldots)$$

Calibration and prediction proceed via the aggregated scores, attaining exact coverage. Empirically, the SACP pipeline consistently yields sharper uncertainty sets than majority-vote or naive ensembling, especially when the exponent $p$ is optimized over the family of power sums (Alami et al., 7 Dec 2025).

Semi-Supervised, Local, and Clustering-Driven Scores

The Nearest-Neighbor Matching (NNM) score extends CP to semi-supervised calibration. For each unlabeled point, the score is imputed by matching its pseudo-label nonconformity to the closest labeled sample and bias-correcting with that sample's labeled-vs-pseudo nonconformity gap:

$$\tilde S_{\rm nnm}(\tilde x) = S(\tilde x, \hat y) + [S(x_{j^*}, y_{j^*}) - S(x_{j^*}, \hat y_{j^*})]$$

where $j^*$ indexes the matched labeled sample. This approach augments calibration pools, improving coverage stability and set size in data-scarce regimes (Zhou et al., 27 May 2025).
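The imputation rule can be illustrated with a short sketch. It assumes that "closest" means nearest in pseudo-label score, which is one reading of the description rather than a confirmed detail of the method; the function name and the numbers are hypothetical.

```python
def nnm_imputed_scores(unlabeled_pseudo, labeled_true, labeled_pseudo):
    """Impute a calibration score for each unlabeled point.

    unlabeled_pseudo[i] : S(x~_i, y^_i), score at the pseudo-label
    labeled_true[j]     : S(x_j, y_j), score at the true label
    labeled_pseudo[j]   : S(x_j, y^_j), score at the pseudo-label
    """
    imputed = []
    for s_u in unlabeled_pseudo:
        # ASSUMPTION: match to the labeled sample with the closest pseudo-label score.
        j_star = min(range(len(labeled_pseudo)),
                     key=lambda j: abs(labeled_pseudo[j] - s_u))
        gap = labeled_true[j_star] - labeled_pseudo[j_star]  # bias correction
        imputed.append(s_u + gap)
    return imputed

# Hypothetical scores for three labeled and two unlabeled points.
labeled_pseudo = [0.25, 0.5, 1.0]
labeled_true = [0.5, 0.5, 1.5]
unlabeled_pseudo = [0.3125, 0.875]
imputed = nnm_imputed_scores(unlabeled_pseudo, labeled_true, labeled_pseudo)

# The augmented calibration pool combines true and imputed scores.
calibration_pool = labeled_true + imputed
```

The correction term is zero whenever the matched labeled sample's pseudo-label agrees with its true label, so well-predicted regions contribute their pseudo-label scores unchanged.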

Clustering methods use nonconformity-based groupings of calibration data to partition and calibrate locally; for drug-target interaction uncertainty, clustering on residual CDF percentiles yields the tightest, most reliable confidence intervals among various group-conditioned and nearest-neighbor methods (Rakhshaninejad et al., 24 May 2025).

Localized calibration may also employ tree-based weights (QRF) or kernel methods to define data-dependent local distributions of nonconformity, as in adaptive conformal prediction by reweighting scores (Amoukou et al., 2023).

Adaptive, online, or dynamic extensions (e.g., AdaptNC) explicitly optimize both the nonconformity function's parameters and the conformal threshold in a joint online manner, which is essential in nonstationary environments such as robotics, outperforming threshold-scaling-only approaches in maintaining tight sets and exact coverage under shift (Tumu et al., 2 Feb 2026).
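For contrast, a threshold-only online update of the kind that such joint methods are compared against can be sketched in a few lines. This is a generic quantile-tracking rule, not the AdaptNC algorithm; the step size, the score stream, and the function name are assumptions.

```python
def update_threshold(q, covered, alpha, eta):
    """One online step: raise q after a miss, lower it slightly after a hit."""
    err = 0.0 if covered else 1.0
    return q + eta * (err - alpha)

q = 1.0
history = []
for score in [0.5, 2.0, 0.75, 0.25]:  # hypothetical stream of true-label scores
    covered = score <= q
    history.append(covered)
    q = update_threshold(q, covered, alpha=0.25, eta=0.5)
```

The rule only moves the threshold; the score function itself stays fixed, which is the limitation that jointly adapting the score parameters is meant to address.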

5. Specialized Nonconformity Scores in Domains Beyond Standard Prediction

Spectral Clustering

Conformal Prediction-based Spectral Clustering constructs asymmetric affinities using nonconformity scores derived from either k-nearest-neighbor distances or negative kernel density estimates, quantifying the "strangeness" of a point relative to neighborhood subgraphs. The resulting affinity matrix,

$$A_{ij} = P(z_i, \mathrm{Nbd}(z_j))$$

where $P$ is the conformal p-value, adapts to heterogeneous density and local cluster structure, improving normalized mutual information and stability over kernel-only or self-tuning schemes (Chintalapati et al., 2019).
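A conformal p-value of a point relative to a neighborhood can be sketched as follows. This is a simplified illustration using a k-nearest-neighbor distance as the nonconformity score; how the neighborhood is built and the exact score used by the cited method are not specified here, so the helper names and toy points are assumptions.

```python
import math

def knn_score(z, others, k):
    """Nonconformity of z: mean distance to its k nearest points in `others`."""
    dists = sorted(math.dist(z, o) for o in others)
    return sum(dists[:k]) / k

def conformal_p_value(z, neighborhood, k=1):
    """Fraction of points in neighborhood + [z] that are at least as strange as z."""
    bag = list(neighborhood) + [z]
    scores = [knn_score(p, bag[:i] + bag[i + 1:], k) for i, p in enumerate(bag)]
    test_score = scores[-1]
    return sum(s >= test_score for s in scores) / len(bag)

# Hypothetical neighborhood of some point z_j: four corners of a unit square.
nbd = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
p_near = conformal_p_value((0.5, 0.5), nbd)  # point inside the neighborhood
p_far = conformal_p_value((5.0, 5.0), nbd)   # point far from the neighborhood
```

Because the p-value of z_i is computed against the neighborhood of z_j and not the other way around, the resulting affinity is generally asymmetric, as noted above.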

Recommendation Systems

Conformal recommendation frameworks utilize specialized nonconformity measures suited to sequential or association-mining recommenders. For group recommenders, nonconformity is constructed as a weighted product over association probabilities, normalized by item support, with leave-one-out recalculation to ensure exchangeability:

$$\alpha_i = \frac{\mathrm{Support}(o'_j)}{|U|} \cdot \prod_{o_l \in O^4_i} w_l \cdot P(o_l \mid o'_j)$$

This score tightly calibrates the confidence of group recommendations. General inductive conformal recommenders, in turn, offer a taxonomy of 17 distinct conformity/nonconformity aggregations leveraging precedences, supports, and propagation, with empirical evidence favoring median/mean aggregations for validity and efficiency (Kagita et al., 2023, Kagita et al., 2021).

6. Nonconformity Scores under Data Scarcity and Their Empirical Efficiency

Empirical studies emphasize that, in small-sample regimes, absolute-error, normalized absolute error, and quantile-based nonconformity measures display pronounced trade-offs in efficiency. Absolute residuals are robust and simple under homoskedastic noise. Normalized error scores adapt to heteroskedasticity but can be unstable at small n. Quantile-based (CQR-type) measures perform best under strong asymmetry or heavy-tailed noise, but can be overly conservative when model misspecification or small-sample effects degrade the quantile estimates (Kato et al., 2024).
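For reference, the standard CQR-type score and its interval can be sketched as below, assuming lower and upper conditional quantile estimates are already available; the calibration values and helper names are hypothetical.

```python
import math

def cqr_score(y, q_lo, q_hi):
    """Signed distance of y outside [q_lo, q_hi]; negative when y is inside."""
    return max(q_lo - y, y - q_hi)

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(cal_scores)[k - 1]

def cqr_interval(q_lo, q_hi, q):
    """Widen (q > 0) or shrink (q < 0) the estimated quantile band by q."""
    return (q_lo - q, q_hi + q)

# Hypothetical calibration triples (y, lower quantile estimate, upper quantile estimate).
calibration = [(1.0, 0.0, 2.0), (3.0, 0.0, 2.0), (0.5, 1.0, 2.0), (2.0, 1.0, 3.0)]
cal_scores = [cqr_score(y, lo, hi) for y, lo, hi in calibration]
q = conformal_quantile(cal_scores, alpha=0.2)
interval = cqr_interval(q_lo=1.0, q_hi=2.0, q=q)
```

With only four calibration points the threshold is the largest score, which illustrates the small-sample conservativeness discussed above: a single poorly covered calibration point determines the widening for every test input.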

Model selection and careful residual analysis are necessary for mitigating high-variance interval widths; practitioners should monitor the standard error of interval widths and the frequency of outlier intervals across random splits. Notably, all such measures retain finite-sample marginal coverage regardless of their efficiency.

7. Theoretical Guarantees and Design Trade-offs

All standard CP coverage guarantees (finite-sample marginal coverage, training-conditional coverage, and, under suitable conditions, asymptotic conditional coverage) hold for any measurable nonconformity function, provided exchangeability assumptions are met. Localized, adaptively normalized, or energy/epistemic reweighted scores preserve these guarantees if the calibration protocol retains the exchangeability structure; in the split setting this means that any fitted component of the score, such as a scale model or a learned weighting, is trained on data disjoint from the calibration set (Seedat et al., 2023, Amoukou et al., 2023, Cabezas et al., 10 Feb 2025, Kumar et al., 26 Sep 2025, Tumu et al., 2 Feb 2026, Attar et al., 23 Feb 2026).

However, the choice of score directly modulates prediction efficiency, set adaptivity, and practical informativeness: standard choices such as inverse probability tend to minimize average set size, margin scores tend to maximize the fraction of singletons, and hybrid or regularized constructions strike a balance between these desiderata (Aleksandrova et al., 2021, Melki et al., 2024, Wang et al., 28 Sep 2025).

In backward conformal prediction (BCP), Markov's inequality is used to convert e-variable–based nonconformity distributions into prediction sets with size guarantees; transforming scores to two-point distributions can shrink the gap between estimated and actual miscoverage, materially tightening theoretical bounds without affecting prediction sets (Liu et al., 2 Feb 2026).
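The Markov step underlying this construction can be written out as a short derivation. It is a sketch under the assumption that $E(y)$ is a nonnegative e-variable, meaning its expectation at the true test label is at most one; the notation $E(\cdot)$ and $Y_{\rm test}$ is introduced here for illustration.

```latex
% Assumption: E(y) >= 0 and \mathbb{E}[E(Y_{\rm test})] <= 1 (e-variable property).
\mathbb{P}\big(E(Y_{\rm test}) \ge 1/\alpha\big)
  \;\le\; \alpha\,\mathbb{E}\big[E(Y_{\rm test})\big]
  \;\le\; \alpha,
\qquad\text{so}\qquad
\mathbb{P}\big(Y_{\rm test} \in \{\,y : E(y) < 1/\alpha\,\}\big) \;\ge\; 1 - \alpha .
```

The first inequality is Markov's inequality applied to the nonnegative variable $E(Y_{\rm test})$, and the second uses the e-variable property. The slack in Markov's inequality is what the two-point transformation mentioned above is meant to reduce.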

