Conformal Scores: Foundations and Advances
- Conformal scores are numerical measures that quantify the atypicality of predictions relative to calibration data, ensuring finite-sample coverage guarantees.
- Localized and adaptive methods refine these scores by weighting nearby calibration data and normalizing error scales to address feature space heterogeneity.
- Advanced techniques—including OT-based, noise-robust, and Bayesian score adjustments—enhance prediction reliability in high-dimensional and uncertain label environments.
Conformal scores are the central statistical device underpinning modern distribution-free uncertainty quantification in machine learning. They assign a numerical value to a candidate prediction, providing a means to assess its typicality relative to calibration data. Recent research has substantially expanded the design, computation, and application of conformal scores, especially in settings requiring adaptivity, local coverage, general metric space outputs, robustness to label noise, and high-dimensional inference. This article reviews foundational aspects of conformal scores and synthesizes advances from recent literature, with special attention to localized conformal prediction, adaptive score rescaling, robustness, and coverage guarantees.
1. Conformal Scores: Definition and Role
Conformal scores, also called nonconformity or conformity scores, quantify the degree of alignment of a candidate outcome with respect to a calibration sample. Formally, given a predictor $\hat{f}$ and a candidate response $y$ at input $x$, a conformal score $s(x, y)$ encodes how “atypical” $y$ is for $x$. In conformal prediction (CP), these scores enable construction of prediction sets (regression intervals, classification sets, etc.) with distribution-free finite-sample coverage guarantees. The general CP procedure uses a training/calibration split: scores are computed on calibration data, and a candidate response at a test input is accepted if its score falls within a suitably chosen quantile of the empirical score distribution.
The choice and construction of $s$ is application-dependent:
- Regression: the absolute residual, $s(x, y) = |y - \hat{f}(x)|$.
- Classification: one minus the estimated probability of the candidate class, $s(x, y) = 1 - \hat{p}_y(x)$.
- Multivariate/multi-output: scores reflect joint discrepancy, often via norms, Mahalanobis distance, or latent/probabilistic transformations.
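To make the general recipe concrete, here is a minimal sketch of split-conformal prediction with the absolute-residual regression score; the toy model, synthetic data, and miscoverage level $\alpha = 0.1$ are illustrative assumptions, not taken from any cited work:

```python
import numpy as np

def split_conformal_interval(model, X_cal, y_cal, x_new, alpha=0.1):
    """Split-conformal interval with the absolute-residual score."""
    # Conformal scores on the calibration set: s_i = |y_i - f(x_i)|
    scores = np.abs(y_cal - model(X_cal))
    n = len(scores)
    # Finite-sample-valid empirical quantile level: ceil((n + 1)(1 - alpha)) / n
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, level, method="higher")
    pred = model(x_new)
    return pred - q, pred + q

# Toy example with a fixed "model" and synthetic calibration data.
rng = np.random.default_rng(0)
model = lambda x: 2.0 * x
X_cal = rng.uniform(0.0, 1.0, 500)
y_cal = 2.0 * X_cal + rng.normal(0.0, 0.1, 500)
lo, hi = split_conformal_interval(model, X_cal, y_cal, x_new=0.5, alpha=0.1)
```

The conservative `method="higher"` quantile and the $(n+1)$ correction are what deliver the finite-sample guarantee rather than only asymptotic coverage.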
2. Localized and Adaptive Conformal Scores
Localized Conformal Prediction (LCP)
Traditional CP aggregates calibration scores uniformly, leading to coverage that may ignore feature-space heterogeneity. LCP generalizes this mechanism by emphasizing calibration scores from data points similar to the test sample. This is achieved via a localizer function $H(x, x')$ that assigns weights based on similarity (e.g., $H(x, x') = \exp(-d(x, x')^2 / h^2)$ for some metric $d$ and bandwidth $h$).
The weighted empirical distribution thus adapts quantile calculation to local calibration data, enabling prediction sets that contract in regions of low uncertainty and widen where heterogeneity is higher. LCP retains finite-sample marginal coverage under exchangeability, and, under additional regularity assumptions (e.g., Lipschitz continuity of the conditional distribution of the score), asymptotic conditional coverage can also be achieved. The tuning of the bandwidth $h$ offers a principled bias–variance trade-off, calibrated via cross-validation or data-driven heuristics (Guan, 2021).
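The weighted-quantile step can be sketched as follows; this is an illustrative one-dimensional version with a Gaussian localizer, where the point mass reserved for the test point follows the usual weighted-quantile construction rather than any specific paper's pseudocode:

```python
import numpy as np

def localized_conformal_quantile(scores, X_cal, x_test, h=0.2, alpha=0.1):
    """Weighted (1 - alpha)-quantile of calibration scores for localized CP.

    Uses a Gaussian localizer H(x, x') = exp(-|x - x'|^2 / h^2) and reserves
    a point mass at +infinity for the (unknown) test-point score.
    """
    w = np.exp(-np.abs(X_cal - x_test) ** 2 / h ** 2)
    w = np.append(w, 1.0)          # weight for the test point itself
    w = w / w.sum()
    s = np.append(scores, np.inf)  # the test point's score is unknown
    order = np.argsort(s)
    cum = np.cumsum(w[order])
    # Smallest score at which the cumulative weight reaches 1 - alpha.
    return s[order][np.searchsorted(cum, 1 - alpha)]
```

When all calibration points sit at the same location as the test point, the weights are uniform and the procedure reduces to the standard split-conformal quantile.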
Adaptive Score Rescaling
When the scale or heterogeneity of prediction errors varies across the feature space, global conformal scores can lead to non-uniform coverage. Adaptive methods—such as Jackknife+ rescaled scores—remedy this by normalizing the raw score by an estimate of its local conditional expectation or quantile,
$\tilde{s}(x, y) = \frac{s(x, y)}{\hat{\sigma}(x)}, \qquad \hat{\sigma}(x) = \sum_{i} w_i(x)\, s(x_i, y_i),$
where the $w_i(x)$ are kernel weights on the calibration set.
The leave-one-out (or leave-two-out) resampling preserves exchangeability, ensuring global marginal coverage, while local adaptivity tightens intervals where uncertainty is lower, as supported by mutual information-based bounds on conditional coverage (Deutschmann et al., 2023).
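A leave-one-out kernel rescaling of this kind can be sketched as follows; this is illustrative only, implementing the local-normalization idea rather than the full Jackknife+ machinery, with a Gaussian kernel and bandwidth chosen for the example:

```python
import numpy as np

def kernel_rescaled_scores(X_cal, scores, h=0.1):
    """Divide each raw score by a leave-one-out kernel estimate of the
    local score scale sigma_hat(x_i) = sum_j w_j(x_i) * s_j."""
    n = len(scores)
    rescaled = np.empty(n)
    for i in range(n):
        w = np.exp(-np.abs(X_cal - X_cal[i]) ** 2 / h ** 2)
        w[i] = 0.0                     # leave-one-out: exclude the point itself
        sigma_i = np.average(scores, weights=w)
        rescaled[i] = scores[i] / max(sigma_i, 1e-12)
    return rescaled

# Heteroscedastic toy data: the score scale grows linearly with x.
X = np.linspace(0.1, 1.0, 200)
raw = X.copy()                         # deterministic stand-in for |residuals|
adapted = kernel_rescaled_scores(X, raw, h=0.1)
```

After rescaling, the scores are far more homogeneous across the feature space, which is exactly what lets a single calibrated threshold produce locally adaptive intervals.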
3. Extensions to Complex Output Spaces and Data Types
Metric Space and Multivariate Outputs
For object-valued or multi-output responses, canonical ordering is lacking. Advanced conformal scores have been introduced based on:
- Conditional profile average transport costs: For data in a metric space, the cost to transport the distance profile centered at one point to that centered at another is evaluated and used as the basic score, with the conditional profile score guiding point inclusion in prediction sets (Zhou et al., 1 May 2024).
- Optimal transport rankings: Vector-valued conformity scores are mapped via a Brenier optimal transport map to a reference uniform distribution on the unit ball. The transformed score is then ranked to produce conformal sets, preserving finite-sample marginal coverage and geometric fidelity in high dimensions (Klein et al., 5 Feb 2025).
Multivariate and Generalized Conformity Scores
Recent studies introduce conformity score classes with improved conditional coverage and computational efficiency for multi-output regression:
- CDF-based conformity scores: $s_{\mathrm{CDF}}(x, y) = F_{s \mid X = x}\big(s(x, y)\big)$, with $s$ a base generative score and $F_{s \mid X = x}$ its conditional CDF, providing a uniform push-forward and improved adaptivity.
- Latent-based conformity scores: For an invertible generative model $g$, a score defined on the latent representation $g^{-1}(y; x)$ leverages the latent space for region construction, achieving conditional coverage as calibration size increases (Dheur et al., 17 Jan 2025).
These strategies enable coverage for high-dimensional, structured, or non-Euclidean outputs (e.g., networks, compositions, geospatial coordinates).
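An unconditional analogue of the CDF push-forward is easy to sketch: rank a base score against its calibration distribution. The conditional construction above would replace the marginal empirical CDF with an estimate of $F_{s \mid X = x}$; this simplified marginal variant is my illustration, not the cited method:

```python
import numpy as np

def empirical_cdf_score(base_cal, base_new):
    """Push a base conformity score through the empirical CDF of the
    calibration scores, yielding an approximately uniform transformed score."""
    return np.searchsorted(np.sort(base_cal), base_new, side="right") / len(base_cal)
```

Because the transformed score is (approximately) uniform on [0, 1], thresholds become directly comparable across heterogeneous base scores.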
4. Robustness, Calibration, and Specialized Designs
Robustness to Label Noise
Label noise impairs the reliability of conformal calibration. Noise-robust conformal scores invert the noise process: for an observed noisy label $\tilde{y}$ and noise level $\epsilon$, the clean score is estimated as
$\hat{s}(x, \tilde{y}) = \frac{s(x, \tilde{y}) - \epsilon\, \bar{s}(x)}{1 - \epsilon},$
with $\bar{s}(x)$ the average of $s(x, y)$ over all possible labels $y$.
Thresholding is performed on the estimated noise-free scores, and at test time, the prediction set is formed by reverting to the original (clean) score $s(x, y)$ for each candidate $y$, yielding efficiency gains in set size without loss of coverage (Penso et al., 4 May 2024).
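Under a simple symmetric-noise model, where with probability $\epsilon$ the observed label is replaced by a uniformly random one, the inversion can be sketched as follows; the noise model is my illustrative assumption and the cited work's model may differ:

```python
import numpy as np

def denoised_score(s_obs, s_all_labels, eps):
    """Estimate the clean score from the noisy one.

    Under symmetric noise, E[s(x, y_noisy)] = (1 - eps) * s(x, y_clean)
    + eps * mean_y s(x, y); solving for the clean score gives this estimate.
    """
    s_bar = np.mean(s_all_labels)    # average score over all candidate labels
    return (s_obs - eps * s_bar) / (1.0 - eps)

# Sanity check on synthetic numbers: recover a known clean score exactly.
s_all = np.array([0.1, 0.7, 0.9, 0.8])   # scores of the 4 candidate labels
s_clean = 0.1                            # score of the true label
eps = 0.2
s_noisy_expected = (1 - eps) * s_clean + eps * s_all.mean()
```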
Weighted Aggregation of Scores
Multiple base score functions can be aggregated for improved predictive efficiency. Given component scores $s_1, \dots, s_K$, their weighted linear combination
$s_w(x, y) = \sum_{k=1}^{K} w_k\, s_k(x, y),$
with weights $w = (w_1, \dots, w_K)$ chosen to minimize prediction set size under a valid coverage constraint, yields empirical sets sharply outperforming single-score approaches, as substantiated both theoretically (via VC dimension arguments) and empirically (e.g., on CIFAR-100) (Luo et al., 14 Jul 2024).
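A grid search over a convex combination of two score families illustrates the idea; the synthetic data, the grid, and the restriction to $K = 2$ are my illustrative assumptions, and the cited work optimizes the weights more carefully:

```python
import numpy as np

def avg_set_size(S_cal, S_test, w, alpha=0.1):
    """Average prediction-set size for the combined score w*s1 + (1-w)*s2.

    S_cal:  (n, 2) array of the true label's two base scores.
    S_test: (m, K, 2) array of both scores for every candidate label.
    """
    s_cal = w * S_cal[:, 0] + (1 - w) * S_cal[:, 1]
    n = len(s_cal)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(s_cal, level, method="higher")
    s_test = w * S_test[..., 0] + (1 - w) * S_test[..., 1]
    return (s_test <= q).sum(axis=1).mean()

rng = np.random.default_rng(1)
n, m, K = 300, 200, 10
# Score family 0 is informative (true-label scores low, others high);
# score family 1 is pure noise.
S_cal = np.column_stack([rng.uniform(0, 0.1, n), rng.uniform(0, 1, n)])
S_test = np.empty((m, K, 2))
S_test[..., 0] = rng.uniform(0.5, 1.0, (m, K))
S_test[:, 0, 0] = rng.uniform(0, 0.1, m)      # candidate 0 plays the true label
S_test[..., 1] = rng.uniform(0, 1, (m, K))

grid = np.linspace(0, 1, 21)
sizes = [avg_set_size(S_cal, S_test, w) for w in grid]
best_w = grid[int(np.argmin(sizes))]
```

On this toy problem the search correctly puts most of the weight on the informative score family, shrinking the average set size relative to the noise score alone.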
Incorporation of Epistemic Uncertainty
Traditional conformal scores are insensitive to epistemic uncertainty. EPICSCORE integrates Bayesian predictive CDFs for the score into the conformal pipeline: the adjusted score is $\hat{F}\big(s(x, y) \mid x, \mathcal{D}\big)$, where $\hat{F}(\cdot \mid x, \mathcal{D})$ is the posterior predictive CDF of the score given calibration data $\mathcal{D}$. This expands prediction sets in data-sparse regions and contracts them where data is abundant, recovering both finite-sample marginal coverage and asymptotic conditional coverage using model-agnostic Bayesian machinery (e.g., Gaussian processes, BART, MC dropout with MDNs) (Cabezas et al., 10 Feb 2025).
Conditional Score Rectification
Conditional coverage can be approximated by rectifying the original score with a learned local quantile: the raw score $s(x, y)$ is compared against $\hat{q}_\beta(x)$, an estimate of the conditional $\beta$-quantile of the score at $x$ computed via quantile regression. Split-conformal calibration applied to these rectified scores improves local adaptivity, with theoretical upper bounds relating conditional coverage deviation to quantile estimation error (Plassier et al., 22 Feb 2025).
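A crude stand-in for the quantile-regression step, a binned conditional quantile estimate, illustrates the rectification; both the subtraction form and the binning are my illustrative choices, not the cited paper's estimator:

```python
import numpy as np

def rectified_scores(X, scores, beta=0.9, n_bins=10):
    """Subtract a binned estimate of the conditional beta-quantile q_beta(x)
    from each raw score, so the rectified score is centered per region."""
    edges = np.quantile(X, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, X, side="right") - 1, 0, n_bins - 1)
    q_hat = np.array([np.quantile(scores[idx == b], beta) for b in range(n_bins)])
    return scores - q_hat[idx], q_hat, idx

rng = np.random.default_rng(2)
X = rng.uniform(0.1, 1.0, 2000)
raw = X * rng.uniform(0.0, 1.0, 2000)    # heteroscedastic: scale grows with x
rect, q_hat, idx = rectified_scores(X, raw)
```

By construction, the conditional $\beta$-quantile of the rectified score is (approximately) zero in every region, so a single split-conformal threshold on the rectified score adapts to local scale.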
5. Joint, Sequential, and Transductive Scenarios
The behavior of conformal p-values and their joint distributions is critical in multi-point transductive prediction, multiple testing, and ranking:
- Joint distribution: The vector of p-values for the test points, combined with the calibration scores, follows a discrete Pólya urn model capturing positive dependence. This enables explicit non-asymptotic concentration inequalities for the empirical CDF of the p-values—central to controlling error measures such as the false coverage proportion (FCP) or false discovery proportion (FDP) uniformly in high-dimensional multiple prediction tasks (Gazin et al., 2023).
- Ranking: With partially known ranks, upper and lower bounds on unobserved conformity scores are constructed, and the maximal attainable (proxy) score is used for marginal and simultaneous coverage of ranking predictions. Advanced results allow FCP control across all test items (Fermanian et al., 20 Jan 2025).
- E-values: An alternative to p-values in CP, e-values allow multiplicative aggregation across sequential batches, enabling anytime-valid guarantees (batch conformal validity via Ville’s inequality), fixed-size prediction set control, and handling of ambiguous ground truth via multi-annotator averaging (Gauthier et al., 17 Mar 2025).
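A minimal sketch of a conformal e-value and its multiplicative batch aggregation follows; the specific construction $e = (n+1)\, s_{\text{test}} / \sum_i s_i$ for nonnegative scores is one standard e-value, assumed here for illustration rather than taken from the cited paper:

```python
import numpy as np

def conformal_e_value(cal_scores, test_score):
    """E-value for nonnegative exchangeable scores: by symmetry,
    E[(n + 1) * s_test / (s_1 + ... + s_n + s_test)] = 1 under the null."""
    n = len(cal_scores)
    return (n + 1) * test_score / (cal_scores.sum() + test_score)

def anytime_reject(e_values, alpha=0.1):
    """Multiply e-values across batches; by Ville's inequality the running
    product exceeds 1/alpha with probability at most alpha under the null."""
    running = np.cumprod(np.asarray(e_values, dtype=float))
    return bool(np.any(running >= 1.0 / alpha))
```

Unlike p-values, these quantities can be multiplied across sequential batches while keeping the overall guarantee, which is what enables the anytime-valid behavior described above.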
6. Practical Considerations, Computational Trade-offs, and Coverage Properties
The practical implementation of advanced conformal scores involves several trade-offs and data-driven choices:
- Computational cost: Weighted, kernel-based, or OT-based score transformation can be computationally intensive in dimension or sample size; approximations (e.g., AMP for leave-one-out solutions in high dimensions (Clarté et al., 21 Oct 2024)) and entropic regularization (for OT maps) offer tractable alternatives.
- Tuning and validation: Localization and adaptive rescaling require hyperparameter tuning—bandwidth choice, kernel normalization, or local sample size—often addressed by automated cross-validation or calibration error minimization.
- Coverage guarantees: All methods discussed target distribution-free marginal coverage at level $1 - \alpha$. Under exchangeability, marginal coverage is preserved; local coverage improves with successful adaptation or localization and is quantified via explicit theoretical bounds under regularity or large-sample assumptions.
The following table summarizes representative methodologies by their salient attributes:
| Methodology | Principle | Targeted Coverage |
|---|---|---|
| Localized CP (LCP) | Weighted local scores | Finite-sample & local |
| Jackknife+ Rescaling | Local error normalization | Marginal & improved local |
| OT-based Scores | Multivariate ranking | Marginal (high-dim) |
| CDF- & Latent-based Scores | Uniformization / invertible transform | Marginal & asymptotic conditional |
| Weighted Score Aggregation | Data-driven linear combo | Marginal |
| Noise-Robust CP | De-bias via noise model | Marginal |
| EPICSCORE (Bayesian) | Epistemic adaptivity | Marginal & conditional |
| Score Rectification | Calibrated quantile scaling | Approx. conditional |
| E-Values | Martingale / anytime validity | Marginal |
7. Applications and Outlook
Conformal scores enable principled uncertainty quantification in a range of modern settings:
- Transfer learning and covariate shift: Adaptive scores, incorporating test-time covariate information, tighten prediction intervals and control FCP in domain-shifted environments.
- Object- and structure-valued data: Conditional profile/transport-based scores address prediction for networks, catalogs, or compositions—critical in transportation, energy, and social network data.
- LLMs: Lexical- or claim-level scores are filtered using data-adaptive conditional validity, boosting utility while maintaining credibility in generated text (Cherian et al., 14 Jun 2024).
- Selection and multiple testing: Multivariate conformal selection with regionally monotone scores ensures FDR control in drug discovery and LLM output alignment (Bai et al., 1 May 2025).
- High-dimensional regression: AMP-based algorithms replace costly exact computation, preserving coverage in the large-n, large-d regime (Clarté et al., 21 Oct 2024).
Looking ahead, advances will focus on scalable and robust score computation, integrating epistemic uncertainty, extending to non-exchangeable, federated, or privacy-preserving settings, and refining finite-sample and conditional guarantees in complex data regimes.
In sum, conformal scores—through continuous advances in their design and calibration—remain foundational to trustworthy, interpretable, and distribution-free predictive inference in modern machine learning.