Nonconformity Score in Conformal Prediction

Updated 26 August 2025

A nonconformity score is a real-valued measure that quantifies a data point’s atypicality relative to a reference set, forming the basis of conformal prediction.
It underpins methods across anomaly detection, classification, and clustering by balancing prediction interval efficiency with coverage guarantees.
Recent advances include adaptive, Bayesian, and synthetic calibration approaches that enhance robustness and accuracy in diverse application domains.

A nonconformity score, also referred to as a nonconformity measure (NCM), is a real-valued function that quantifies how "different" or "atypical" a data point is relative to a reference set or model. In conformal prediction and related inferential frameworks, nonconformity scores are used to attach probabilistic or risk-based assessments to predictions, and they support anomaly detection, clustering, risk control, and a variety of domain-specific tasks. The design and selection of the nonconformity score strongly affect the efficiency, informativeness, and adaptivity of the resulting prediction sets or decision rules.

1. Formal Definitions and General Role

The nonconformity score is defined as a function $\mathrm{NCM}: \mathcal{Z}^* \times \mathcal{Z} \rightarrow \mathbb{R}$, where $\mathcal{Z}$ is the sample space (inputs, outputs, or objects) and $\mathcal{Z}^*$ is the set of finite sequences over it. The function assigns to each candidate $z$ a numerical value reflecting its "degree of nonconformity" with respect to a dataset in $\mathcal{Z}^*$. Points with higher nonconformity scores are deemed less compatible with prior data by the underlying learning algorithm, which often leads to larger or lower-confidence prediction sets.

In classical conformal prediction, the nonconformity score $\alpha_i$ for example $z_i = (x_i, y_i)$ might be

$\alpha_i = S(x_i, y_i) = |y_i - \hat{g}(x_i)|$

for regression with a point estimator $\hat{g}$, or

$\alpha_i = 1 - \hat{P}_{h}(y_i \mid x_i)$

for classification using a base classifier $h$ and associated probabilities.
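The two scores above can be sketched in a few lines of NumPy. The helper names, the toy numbers, and the `conformal_threshold` rule (the usual split-conformal choice of the $\lceil (n+1)(1-\text{miscoverage}) \rceil$-th smallest calibration score) are assumptions made for illustration and are not taken from the cited works. The level is called `miscoverage` here to avoid a clash with the score symbol $\alpha_i$.

```python
import numpy as np

def absolute_residual_scores(y, y_pred):
    """Regression nonconformity: alpha_i = |y_i - g_hat(x_i)|."""
    return np.abs(np.asarray(y, dtype=float) - np.asarray(y_pred, dtype=float))

def inverse_probability_scores(probs, labels):
    """Classification nonconformity: alpha_i = 1 - P_hat(y_i | x_i)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    return 1.0 - probs[np.arange(len(labels)), labels]

def conformal_threshold(cal_scores, miscoverage):
    """k-th smallest calibration score, k = ceil((n + 1) * (1 - miscoverage)).

    Returns +inf when the calibration set is too small for the requested level.
    """
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1.0 - miscoverage)))
    return np.inf if k > n else scores[k - 1]

# Toy calibration data (made up for illustration).
y_cal = np.array([1.0, 2.0, 3.0, 4.0])
pred_cal = np.array([1.5, 1.0, 3.2, 4.1])
cal_scores = absolute_residual_scores(y_cal, pred_cal)
q = conformal_threshold(cal_scores, miscoverage=0.25)

# Prediction interval for a new point whose point prediction is 2.5.
interval = (2.5 - q, 2.5 + q)
```

With only four calibration points the threshold is the largest residual, which shows how small calibration sets force wide intervals.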

Nonconformity scores are central in constructing p-values, which, through the conformal paradigm, enable marginal coverage, conditional risk control, or probabilistic guarantees in a range of tasks (Burnaev et al., 2016, Ishimtsev et al., 2017, Aleksandrova et al., 2021, Farinhas et al., 2023).

2. Nonconformity Scores in Anomaly Detection

For time-series anomaly detection, nonconformity scores are typically based on statistical distances or density estimates:

k-NN Distance Score: The average (or sum) of the distances from an observation to its $k$ nearest neighbors in a reference set:

$\alpha = \frac{1}{k} \sum_{j=1}^k d(z, \mathrm{NN}_j(z, X_T))$

where $X_T$ is the "proper training" set and $d$ denotes a metric, e.g., Mahalanobis or Euclidean (Burnaev et al., 2016).

LOF-based Score: Based on the Local Outlier Factor, quantifying density relative to neighbors (Burnaev et al., 2016).

After computing scores for the calibration set and a test point, the conformal anomaly score (p-value) is

$p = \frac{|\{ i=1, \ldots, C\ :\ \alpha_i \geq \alpha_t \}|}{C}$

with $\alpha_t$ the test point score. Low $p$ identifies anomalous or nonconforming points. In adaptive or non-stationary settings, sliding windows for calibration and training allow scores to track local changes in the data distribution (Ishimtsev et al., 2017).
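A minimal sketch of the k-NN score and the p-value above, assuming Euclidean distance and synthetic Gaussian data. The function names and the data are hypothetical and only serve to show the mechanics.

```python
import numpy as np

def knn_score(z, reference, k):
    """Average Euclidean distance from z to its k nearest neighbours in `reference`."""
    reference = np.asarray(reference, dtype=float)
    dists = np.linalg.norm(reference - np.asarray(z, dtype=float), axis=1)
    return float(np.sort(dists)[:k].mean())

def conformal_p_value(cal_scores, test_score):
    """Fraction of calibration scores that are at least as large as the test score."""
    cal_scores = np.asarray(cal_scores, dtype=float)
    return float(np.mean(cal_scores >= test_score))

# Synthetic data: a proper training set and a separate calibration set.
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 2))
calibration = rng.normal(size=(100, 2))
cal_scores = [knn_score(z, train, k=5) for z in calibration]

p_typical = conformal_p_value(cal_scores, knn_score([0.0, 0.0], train, k=5))
p_anomaly = conformal_p_value(cal_scores, knn_score([10.0, 10.0], train, k=5))
```

The point far from the training cloud receives a p-value of zero under this formula. Many presentations of conformal prediction use the variant $(|\{i : \alpha_i \geq \alpha_t\}| + 1)/(C + 1)$ instead, which never returns exactly zero.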

These methodologies facilitate anomaly detection in domains such as healthcare, finance, and security, as shown using the Numenta Anomaly Benchmark (Burnaev et al., 2016).

3. Nonconformity Measures for Classification

In classification, the nonconformity score determines the trade-off between prediction set size (efficiency) and the fraction of singleton (unambiguous) predictions (informativeness):

Inverse Probability Score (IP/Hinge loss):

$\Delta_\mathrm{IP}(x, y) = 1 - \hat{P}_h(y | x)$

Efficient: produces small prediction sets, but fewer singletons (Aleksandrova et al., 2021, Melki et al., 13 Jun 2024).

Margin Score:

$\Delta_\mathrm{MARGIN}(x, y) = \max_{y' \ne y} \hat{P}_h(y' | x) - \hat{P}_h(y | x)$

More informative: yields more singleton predictions, at the cost of larger sets.

Penalized Inverse Probability (PIP) and Regularized PIP (RePIP):

$\Delta_\mathrm{PIP}(y) = [1 - \hat{p}^y] + \sum_{r=1}^{R(y)-1} \frac{\hat{p}^{[r]}}{r}$

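Here $R(y)$ denotes the rank of label $y$ when classes are sorted by decreasing estimated probability, and $\hat{p}^{[r]}$ is the $r$-th largest estimated probability. RePIP introduces an additional $\gamma(R(y) - k_\mathrm{reg})^+$ penalty. These measures interpolate between IP and margin, enhancing both efficiency and informativeness (Melki et al., 13 Jun 2024).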
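The classification scores in this section can be compared on a single probability vector. This is a sketch under stated assumptions: $R(y)$ is read as the 1-based rank of label $y$ among classes sorted by decreasing estimated probability, $\hat{p}^{[r]}$ as the $r$-th largest probability, and the function names and example values are hypothetical.

```python
import numpy as np

def ip_score(p, y):
    """Inverse probability: 1 - p_hat(y | x)."""
    return 1.0 - p[y]

def margin_score(p, y):
    """Largest probability among the other labels minus the probability of y."""
    return np.delete(p, y).max() - p[y]

def label_rank(p, y):
    """1-based rank R(y) of label y under decreasing probability (assumed reading)."""
    order = np.argsort(-p, kind="stable")
    return int(np.where(order == y)[0][0]) + 1, order

def pip_score(p, y):
    """IP score plus the probabilities ranked above y, each divided by its rank."""
    rank, order = label_rank(p, y)
    higher = p[order[: rank - 1]]            # p^[1], ..., p^[R(y)-1]
    penalty = np.sum(higher / np.arange(1, rank))
    return (1.0 - p[y]) + penalty

def repip_score(p, y, gamma, k_reg):
    """PIP plus the regularisation term gamma * max(R(y) - k_reg, 0)."""
    rank, _ = label_rank(p, y)
    return pip_score(p, y) + gamma * max(rank - k_reg, 0)

p = np.array([0.5, 0.3, 0.2])
top = (ip_score(p, 0), margin_score(p, 0), pip_score(p, 0))
last = (ip_score(p, 2), margin_score(p, 2), pip_score(p, 2))
```

For the top-ranked label PIP coincides with IP, since no label is ranked above it. For lower-ranked labels the penalty grows with the probability mass placed on better-ranked competitors.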

Choice and optimization of nonconformity scores (e.g., via hyperparameters or application-specific objectives such as maximizing F1 or singleton counts) are pivotal in high-stakes domains like medical diagnostics and gravitational wave glitch classification (Malz et al., 16 Dec 2024).

4. Conformal Risk Control and Extensions

The nonconformity framework generalizes to monotone loss functions for arbitrary risk control tasks. Here, the score is the (bounded, nonincreasing) loss $L(\lambda; x, y)$, which may represent error, miscoverage, false-negative rate, or specialized risk (e.g., best-token F1):

$\hat{R}_n(\lambda) = \frac{1}{N_w} \sum_i w_i L(\lambda; x_i, y_i)$

$\hat{\lambda} = \inf\left\{ \lambda:\ \frac{N_w}{N_w+1}\hat{R}_n(\lambda) + \frac{B}{N_w+1} \leq \alpha \right\}$

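Here $w_i$ are per-example weights, $N_w = \sum_i w_i$ is their total, $B$ is an upper bound on the loss, and $\alpha$ is the target risk level. By weighting examples (e.g., based on temporal proximity or relevance), nonconformity-based risk control can adapt to non-exchangeable settings, such as time series with changepoints (Farinhas et al., 2023).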
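The two formulas above can be evaluated on a grid of candidate $\lambda$ values. The sketch below assumes a miscoverage-style loss $L(\lambda; x_i, y_i) = \mathbb{1}[s_i > \lambda]$ on made-up scores $s_i$, reads $N_w$ as the sum of the weights and $B$ as the upper bound on the loss, and uses hypothetical function names. It is an illustration of the selection rule, not the cited authors' implementation.

```python
import numpy as np

def weighted_risk(losses, weights):
    """R_hat(lambda) = (1 / N_w) * sum_i w_i * L(lambda; x_i, y_i), one value per lambda.

    `losses` has shape (n_lambdas, n_examples).
    """
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return losses @ weights / weights.sum()

def select_lambda(lambdas, losses, weights, alpha, B=1.0):
    """Smallest grid value whose adjusted weighted risk is at most alpha, else None."""
    weights = np.asarray(weights, dtype=float)
    n_w = weights.sum()
    bound = n_w / (n_w + 1.0) * weighted_risk(losses, weights) + B / (n_w + 1.0)
    feasible = np.flatnonzero(bound <= alpha)
    return None if feasible.size == 0 else float(np.asarray(lambdas)[feasible[0]])

# Made-up scores, ordered from oldest to most recent observation.
scores = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85])
lambdas = np.linspace(0.0, 1.0, 11)
# Loss is 1 when the score exceeds lambda, so it is nonincreasing in lambda.
losses = (scores[None, :] > lambdas[:, None]).astype(float)

uniform = np.ones(len(scores))
recency = 0.9 ** np.arange(len(scores) - 1, -1, -1)   # recent examples weigh more

lam_uniform = select_lambda(lambdas, losses, uniform, alpha=0.35)
lam_recency = select_lambda(lambdas, losses, recency, alpha=0.35)
```

Because the largest scores are the most recent ones in this toy series, recency weighting raises the estimated risk and pushes the selected threshold from 0.7 to 0.8.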

5. Nonconformity in Adaptive, Robust, and Hybrid Methods

a. Adaptive Nonconformity via Local and Bayesian Models

Quantile Regression Forests (QRFs): Weights from QRFs model the local distribution of residuals, adaptively reweighting calibration scores to produce locally valid prediction intervals. The conformal threshold is chosen via a quantile computed from the (test-point dependent) weighted empirical CDF of residuals (Amoukou et al., 2023).
Bayesian Epistemic Uncertainty (EPICSCORE): The nonconformity score is "lifted" by the posterior predictive CDF $F(s(x, y) \mid x, D)$ estimated via Gaussian Processes, MC Dropout, or Bayesian Additive Regression Trees:

$s'(x, y) = F(s(x, y) \mid x, D)$

This ensures adaptive intervals that widen in regions of epistemic uncertainty, while retaining finite-sample marginal coverage and achieving asymptotic conditional coverage (Cabezas et al., 10 Feb 2025).
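The lifting step $s'(x, y) = F(s(x, y) \mid x, D)$ can be approximated by Monte Carlo when samples of the score under the posterior predictive distribution are available. The sketch below assumes such samples exist and fakes them with Gaussian draws of different spread. The function name and the sampling setup are hypothetical and do not reproduce the EPICSCORE implementation.

```python
import numpy as np

def lifted_score(score, predictive_score_samples):
    """Monte Carlo estimate of F(s | x, D): share of sampled scores <= observed score."""
    samples = np.asarray(predictive_score_samples, dtype=float)
    return float(np.mean(samples <= score))

rng = np.random.default_rng(0)
# Stand-ins for posterior predictive draws of the absolute residual at two inputs:
# one where the model is confident (narrow) and one where it is uncertain (wide).
narrow = np.abs(rng.normal(0.0, 0.5, size=5000))
wide = np.abs(rng.normal(0.0, 2.0, size=5000))

# The same raw residual of 1.0 is lifted to different values at the two inputs.
lifted_narrow = lifted_score(1.0, narrow)
lifted_wide = lifted_score(1.0, wide)
```

A residual of 1.0 looks extreme where the predictive distribution is narrow and ordinary where it is wide, so thresholding the lifted score yields wider intervals in uncertain regions.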

b. Semi-supervised and Synthetic-Data-Driven Calibration

Nearest Neighbor Matching (NNM) for Unlabeled Data: In low-label regimes, estimation of the nonconformity score for an unlabeled sample $\tilde{x}$ uses the bias correction derived from a labeled neighbor’s pseudo-score, improving the quality of the calibration threshold (Zhou et al., 27 May 2025).
Score Transporter for Synthetic Calibration: Aligns nonconformity score distributions between small real calibration sets and larger synthetic ones, constructing prediction sets via transported scores and synthetic-derived quantiles, significantly improving efficiency in data-scarce environments (Bashari et al., 19 May 2025).
Label Noise Robustness: When calibration labels are noisy, the expected noise-free nonconformity score is computed analytically from the noise model, e.g., for random flip noise of rate $\epsilon$ over $k$ classes:

$\hat{S}(x, \tilde{y}, \epsilon) = (1 - \epsilon) S(x, \tilde{y}) + \epsilon \cdot \frac{1}{k} \sum_{i=1}^k S(x, i)$

yielding smaller, more meaningful prediction sets at test time (Penso et al., 4 May 2024).
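The noise-adjusted score above is a convex combination of the score at the observed label and the average score over all labels. A minimal sketch, assuming the per-label scores $S(x, i)$ are already available as a vector. The function name and the numbers are hypothetical.

```python
import numpy as np

def noise_adjusted_score(scores_all_labels, noisy_label, eps):
    """(1 - eps) * S(x, y_tilde) + eps * mean_i S(x, i), following the formula above."""
    s = np.asarray(scores_all_labels, dtype=float)
    return float((1.0 - eps) * s[noisy_label] + eps * s.mean())

# Inverse-probability scores for one input with class probabilities (0.7, 0.2, 0.1).
scores = 1.0 - np.array([0.7, 0.2, 0.1])

# Observed (possibly flipped) label is class 1; assumed flip rate of 20%.
adjusted = noise_adjusted_score(scores, noisy_label=1, eps=0.2)
unadjusted = noise_adjusted_score(scores, noisy_label=1, eps=0.0)
```

With a flip rate of zero the adjusted score reduces to the plain score at the observed label.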

6. Domain-Specific Nonconformity Measures

a. Recommender and Group Systems

Nonconformity measures in these systems use statistics reflecting item precedence or association:

$\mathrm{CM1}(o) = \frac{\mathrm{Sup}(o)}{n_u} \prod^{I}_{o_l \in O^t} P(o_l \mid o)$

or, for group recommendations,

$\alpha^i(o'_j) = (\mathrm{Support}(o'_j) / n_\text{user}) \prod_{o_l} [w_l P(o_l | o'_j)]$

These are designed to ensure both efficiency (computational and statistical) and exchangeability for valid coverage (Kagita et al., 2021, Kagita et al., 2023).

b. Clustering and Metric Space Inference

Spectral Clustering: Nonconformity-based affinity between two points $z_i, z_j$ uses the p-value (degree of conformity) of one relative to the other's neighborhood:

$\hat{A}(z_i, z_j) = \frac{P(z_i, \mathrm{Nbd}(z_j)) + P(z_j, \mathrm{Nbd}(z_i))}{2}$

This approach generalizes contextual similarity, improving robustness to varied cluster structure (Chintalapati et al., 2019).

Random Objects and Metric Spaces: The average transport cost between the response $Y$'s conditional distance profile and a candidate profile,

$C(\omega \mid x) = E\left[ \int_0^\infty |F_{\omega,x}(t) - F_{Y,x}(t)| dt \mid X = x \right]$

becomes the conformity score, particularly well-suited to non-Euclidean and multimodal data (Zhou et al., 1 May 2024).

c. Multivariate Regression

Mahalanobis-Based Scoring: For $Y|X \sim \mathcal{N}(f_\theta(X), \Sigma_\phi(X))$, the nonconformity score is

$S_\text{mah}(X, Y) = \| \Sigma_\phi(X)^{-1/2} (Y - f_\theta(X)) \|_2$

enabling closed-form, heteroskedastic, and transformation-adaptive conformal sets in high-dimensional and structured-output settings (Braun et al., 28 Jul 2025).
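The Mahalanobis score can be computed without forming $\Sigma_\phi(X)^{-1/2}$ explicitly: solving against a Cholesky factor of the covariance gives a whitened residual with the same Euclidean norm. The sketch below assumes the predicted mean and covariance are given as arrays. The function name and inputs are hypothetical.

```python
import numpy as np

def mahalanobis_score(y, mean, cov):
    """|| cov^{-1/2} (y - mean) ||_2, computed through a Cholesky factor of cov."""
    resid = np.asarray(y, dtype=float) - np.asarray(mean, dtype=float)
    chol = np.linalg.cholesky(np.asarray(cov, dtype=float))   # cov = chol @ chol.T
    # chol^{-1} resid has squared norm resid^T cov^{-1} resid.
    whitened = np.linalg.solve(chol, resid)
    return float(np.linalg.norm(whitened))

# Predicted mean and input-dependent covariance for one test input (made up).
mean = np.array([0.0, 0.0])
cov = np.array([[4.0, 0.0],
                [0.0, 1.0]])

score = mahalanobis_score([2.0, 1.0], mean, cov)
```

Thresholding this score at a calibrated quantile gives an ellipsoidal prediction set whose shape follows the predicted covariance.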

7. Impact, Limitations, and Perspectives

The choice and computation of nonconformity scores are fundamental to the practical guarantees (efficiency, adaptivity, informativeness, and coverage) of modern statistical learning and uncertainty quantification. While various classes of scores are appropriate for regression, classification, clustering, anomaly detection, risk control, and structured inference, all require careful calibration to the domain, data characteristics (e.g., noise, scarcity, heterogeneity), and inferential objectives.

Limitations include increased computational overhead for nonparametric or adaptive scoring, the need for additional modeling in noisy or semi-supervised settings, and the possibility that no single score is optimal across all data regimes. Multiple works demonstrate that trade-offs must be empirically evaluated: for instance, more complex or locally-adaptive nonconformity measures may improve prediction interval tightness under heteroscedasticity or data sparsity, but can introduce instability or inefficiency when sample sizes are very low (Kato et al., 13 Oct 2024).

Continued research aims to develop robust, adaptive, and domain-aligned nonconformity scores for emerging applications, leveraging advances in Bayesian inference, generative modeling, and large-scale data integration (Cabezas et al., 10 Feb 2025, Bashari et al., 19 May 2025).

(Burnaev et al., 2016, Ishimtsev et al., 2017, Chintalapati et al., 2019, Aleksandrova et al., 2021, Kagita et al., 2021, Amoukou et al., 2023, Kagita et al., 2023, Farinhas et al., 2023, Zhou et al., 1 May 2024, Penso et al., 4 May 2024, Melki et al., 13 Jun 2024, Kato et al., 13 Oct 2024, Malz et al., 16 Dec 2024, Cabezas et al., 10 Feb 2025, Bashari et al., 19 May 2025, Zhou et al., 27 May 2025, Braun et al., 28 Jul 2025)