Split Conformal Prediction & Unsupervised Calibration
- The paper introduces an adaptive method that conformalizes any predictive model to ensure finite-sample coverage guarantees using empirical quantiles.
- It leverages unsupervised calibration via pseudo-labels and optimized weighting to maintain robust performance in non-IID or partially labeled settings.
- Practical implications include efficient split and cross-conformal techniques that adjust prediction intervals for both regression and classification tasks.
Split conformal prediction with unsupervised calibration refers to a suite of methodologies for constructing statistically valid prediction sets or distributions using only a “base” predictive system, while minimizing or entirely eliminating the need for labeled calibration data. This is achieved through adaptive post-processing—termed conformalization—which delivers finite-sample (marginal) probability calibration guarantees and, under certain regularity, can be effective even in non-IID or partially labeled settings. This approach relies on data-driven empirical quantiles from calibration samples, strategic weighting or optimization for pseudo-labels, and randomized tie-breaking, and extends naturally to both regression and classification. The following sections review foundational theoretical results, calibration algorithms, relaxation of the exchangeability assumption, unsupervised calibration schemes, performance guarantees, and contemporary applications.
1. General Framework and Calibration Methodology
The core idea is to start from any predictive system $\hat{D}$—which may provide point predictions or probabilistic outputs—and “conformalize” it by comparing its predictions to those observed in a calibration sample. For a new test input $x$ (and candidate label $y$), a conformity score $\alpha^y = A(x, y)$ is evaluated. The calibration module then compares $\alpha^y$ to the scores $\alpha_1, \dots, \alpha_n$ observed on calibration data. The split-conformal predictive system (SCPS) produces a calibrated predictive distribution
$$Q(y \mid x, \tau) \;=\; \frac{\#\{i : \alpha_i < \alpha^y\} \;+\; \tau\bigl(\#\{i : \alpha_i = \alpha^y\} + 1\bigr)}{n + 1},$$
where $\tau \sim \mathrm{Uniform}[0,1]$ provides randomized tie-breaking. This ensures that, under exchangeability, the calibrated predictive CDF satisfies $\mathbb{P}\{Q(y_{n+1} \mid x_{n+1}, \tau) \le u\} = u$ for all $u \in [0,1]$. This construction is fully adaptive: the output system can adjust both the location and shape of the predictive distribution for each $x$.
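As a minimal illustration of this construction, the sketch below (assuming the calibration conformity scores are already computed and stored in a NumPy array; the function name `scps_cdf` is illustrative) evaluates the split-conformal predictive distribution at a candidate label, including the randomized tie-breaking term.

```python
import numpy as np

def scps_cdf(cal_scores, test_score, tau=None, rng=None):
    """Evaluate the split-conformal predictive distribution Q(y | x, tau).

    cal_scores : conformity scores alpha_i = A(x_i, y_i) on the calibration set.
    test_score : conformity score alpha^y = A(x, y) for the candidate label y.
    tau        : tie-breaking variable in [0, 1]; drawn uniformly if None.
    """
    cal_scores = np.asarray(cal_scores)
    rng = np.random.default_rng() if rng is None else rng
    tau = rng.uniform() if tau is None else tau
    n = cal_scores.size
    n_less = np.sum(cal_scores < test_score)
    n_equal = np.sum(cal_scores == test_score)
    # Q = (#{alpha_i < alpha^y} + tau * (#{alpha_i = alpha^y} + 1)) / (n + 1)
    return (n_less + tau * (n_equal + 1)) / (n + 1)
```

Sweeping `test_score` over a grid of candidate labels (or target values in regression) traces out the full calibrated predictive CDF.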
The adaptivity and universality of this method mean that no prior assumptions (such as calibration or validity) are required of $\hat{D}$. In practice, $\hat{D}$ can be a pointwise scoring rule, a quantile regression, or a full predictive density.
2. Split and Cross-Conformal Calibration Algorithms
The split-conformal approach divides the dataset into training and calibration partitions: the training partition is used to fit $\hat{D}$, while the calibration partition is used to evaluate conformity scores. This split offers computational efficiency by requiring only one model fit, and remains robust even when the base system is poorly calibrated.
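A minimal sketch of this pipeline is given below, assuming a scikit-learn-style regressor and absolute-residual conformity scores; all names (`split_conformal_interval`, `model`, and the data arrays) are illustrative rather than taken from any specific library.

```python
import numpy as np

def split_conformal_interval(model, X_train, y_train, X_cal, y_cal, X_test, alpha=0.1):
    """Split-conformal prediction intervals with absolute-residual conformity scores."""
    model.fit(X_train, y_train)                        # single fit on the training partition
    residuals = np.abs(y_cal - model.predict(X_cal))   # conformity scores on the calibration partition
    n = residuals.size
    # k-th order statistic with k = ceil((1 - alpha) * (n + 1)) gives >= 1 - alpha marginal coverage
    k = int(np.ceil((1 - alpha) * (n + 1)))
    q_hat = np.sort(residuals)[min(k, n) - 1]          # clamp k when n is too small for the requested alpha
    preds = model.predict(X_test)
    return preds - q_hat, preds + q_hat
```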
Cross-conformal methods extend this principle by using multiple splits (folds), constructing several SCPSs calibrated on different subsets, and aggregating the resulting outputs. Cross-conformal calibration achieves higher data efficiency at the cost of weaker (non-uniform) theoretical guarantees: the marginal calibration guarantee becomes approximate, but overall performance can improve in small-to-moderate sample regimes.
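The sketch below illustrates one simple cross-conformal variant, in which out-of-fold residual scores from the folds are pooled into a single calibration set and fold predictions are averaged; this is an illustrative aggregation choice (NumPy arrays and a scikit-learn-style model factory are assumed), not the only one, and its marginal coverage is approximate as noted above.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_conformal_interval(make_model, X, y, X_test, alpha=0.1, n_folds=5):
    """Cross-conformal intervals: pool out-of-fold absolute residuals across folds."""
    oof_scores, fold_models = [], []
    for train_idx, cal_idx in KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X):
        m = make_model()
        m.fit(X[train_idx], y[train_idx])                              # fit on all folds but one
        oof_scores.append(np.abs(y[cal_idx] - m.predict(X[cal_idx])))  # score the held-out fold
        fold_models.append(m)
    scores = np.concatenate(oof_scores)
    q_hat = np.quantile(scores, 1 - alpha, method="higher")            # pooled calibration quantile
    preds = np.mean([m.predict(X_test) for m in fold_models], axis=0)  # average fold predictions
    return preds - q_hat, preds + q_hat
```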
When labels are only weakly observed, or unavailable in the calibration set, unsupervised calibration can leverage pseudo-labels, partial labels, or weights derived by matching the predicted class distribution of the test data to that of the training set via an optimization problem. This can involve solving for a weight $w_{i,y}$ for each calibration sample $i$ and each possible label $y$, so that the weighted score distribution matches the labeled training data. The prediction quantile is then computed over the weighted scores:
$$\hat{q} \;=\; \inf\Bigl\{\, q : \textstyle\sum_{i}\sum_{y} \tilde{w}_{i,y}\,\mathbf{1}\{s(x_i, y) \le q\} \;\ge\; 1 - \alpha \,\Bigr\}, \qquad \tilde{w}_{i,y} = \frac{w_{i,y}}{\sum_{j,y'} w_{j,y'}}.$$
Theoretical guarantees bound the excess coverage loss as a function of the chosen function class, optimization error, and statistical complexity terms such as Rademacher complexity.
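A sketch of the weighted-quantile step is shown below, assuming the weights $w_{i,y}$ have already been obtained from the distribution-matching optimization; the helper names are illustrative.

```python
import numpy as np

def weighted_quantile(scores, weights, alpha=0.1):
    """Smallest threshold q whose weighted score mass (at or below q) reaches 1 - alpha.

    scores  : flattened conformity scores s(x_i, y) over calibration inputs and candidate labels.
    weights : matching nonnegative weights w_{i,y} from the distribution-matching optimization.
    """
    order = np.argsort(scores)
    s, w = np.asarray(scores)[order], np.asarray(weights)[order]
    cdf = np.cumsum(w) / w.sum()                       # normalized weighted empirical CDF
    idx = np.searchsorted(cdf, 1 - alpha, side="left")
    return s[min(idx, s.size - 1)]

def prediction_set(test_scores_by_label, q_hat):
    """Keep every candidate label whose score does not exceed the weighted quantile."""
    return [y for y, s in test_scores_by_label.items() if s <= q_hat]
```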
3. Relaxing Exchangeability: Calibration under Dependence and Data Contamination
Standard SCPS coverage guarantees rely on the exchangeability or IID assumption. However, in practical settings—such as time series, spatiotemporal, or contaminated data—exchangeability may fail. Several frameworks address this via explicit error quantification:
- Concentration and decoupling: Coverage errors are bounded by a sum of a calibration concentration term ($\varepsilon_{\mathrm{cal}}$), a decoupling/test error term ($\varepsilon_{\mathrm{test}}$), and small slack terms due to sample size.
- Coverage penalty: The marginal coverage becomes
$$\mathbb{P}\{Y \in C(X)\} \;\ge\; 1 - \alpha - \varepsilon_{\mathrm{cal}} - \varepsilon_{\mathrm{test}}.$$
- Data contamination: In the presence of Huber-type contamination, coverage bounds incorporate the Kolmogorov–Smirnov distance between the clean and contaminated score CDFs, and bias-corrected procedures (Contamination Robust Conformal Prediction, CRCP) adjust the selection quantile to correct for contamination effects.
For time series, the “switch coefficient” measures how much temporal dependence disrupts exchangeability; if the underlying process is stationary and mixing, coverage penalties scale with the mixing rate.
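As a concrete illustration of such corrections, the sketch below inflates the nominal level by an assumed upper bound on the Kolmogorov–Smirnov distance between the clean and contaminated score CDFs before taking the calibration quantile. It is a simplified stand-in for bias-corrected procedures such as CRCP, not the published algorithm.

```python
import numpy as np

def ks_adjusted_quantile(cal_scores, alpha=0.1, ks_bound=0.05):
    """Calibration quantile with a slack for contaminated calibration scores.

    ks_bound is an assumed upper bound on the Kolmogorov-Smirnov distance between
    the clean and contaminated score CDFs; inflating the target level by this
    amount aims to preserve >= 1 - alpha coverage under the clean distribution.
    """
    cal_scores = np.asarray(cal_scores)
    n = cal_scores.size
    level = np.ceil((1 - alpha + ks_bound) * (n + 1)) / n   # finite-sample + contamination slack
    return np.quantile(cal_scores, min(level, 1.0), method="higher")
```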
4. Unsupervised and Weakly Supervised Calibration
In scenarios where calibration labels are missing, only partially available, or costly to acquire, unsupervised calibration methods are employed.
- Unlabeled or partially labeled calibration sets: Conformal sets are constructed by estimating the distribution of conformity scores using model predictions in place of labels. If the prediction model has accuracy $1-\epsilon$, the coverage guarantee degrades additively, yielding
$$\mathbb{P}\{Y \in C(X)\} \;\ge\; 1 - \alpha - \epsilon.$$
This reveals a principled tradeoff between model error and label-free coverage.
- Weighted calibration via optimization: The use of convex optimization or kernel methods aligns the empirical (weighted) score distribution of the unlabeled calibration set with that of the labeled training set. Under certain regularity conditions, the marginal coverage loss is bounded by terms characterizing optimization error, RKHS norm, and empirical complexities.
- Self-supervised calibration: Augmenting nonconformity score computation with auxiliary self-supervised losses allows the model to capture local uncertainty or “difficulty,” leading to improved adaptation without requiring additional labeled calibration data.
- Partial label or set-valued supervision: Score functions are designed to be pessimistic (worst-case over all candidates in a set) or aggregate over candidate labels, ensuring that coverage is controlled given the available ambiguity.
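A minimal sketch of a pessimistic (worst-case) score for set-valued supervision is shown below; the per-label scores are assumed to be available from the base model, and the function name is illustrative.

```python
def pessimistic_score(label_scores, candidate_set):
    """Worst-case conformity score over a set-valued (partial) label.

    label_scores  : mapping from each label to its conformity score s(x, y).
    candidate_set : labels that may be the (unobserved) true label.

    Taking the maximum over candidates makes the calibration quantile conservative,
    so coverage is controlled despite the label ambiguity.
    """
    return max(label_scores[y] for y in candidate_set)
```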
5. Calibration Guarantees and Finite Sample Behavior
Analyses of SCPS yield finite-sample marginal and conditional coverage guarantees. If $n$ is the calibration size and the conformal threshold is chosen as the $k$-th order statistic of the calibration scores, with $k = \lceil (1-\alpha)(n+1) \rceil$, then under exchangeability (and a regular conformity function) the empirical coverage over $m$ test samples follows a Beta–Binomial law. As $m \to \infty$, the limiting conditional coverage is $\mathrm{Beta}(a, b)$ with $a = k$ and $b = n + 1 - k$. This permits explicit sample-size selection for target coverage intervals.
In small-data settings, the Small Sample Beta Correction (SSBC) adjusts the nominal risk level $\alpha$ to attain a desired probably approximately correct (PAC) guarantee, by selecting an adjusted level $\alpha'$ so that, with probability at least $1-\delta$ over the calibration set, coverage of at least $1-\alpha$ is achieved.
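The sketch below computes the Beta law of conditional coverage and performs a simple grid search for an adjusted nominal level in the spirit of the SSBC idea; it is illustrative and not necessarily the published SSBC procedure.

```python
import numpy as np
from scipy.stats import beta

def coverage_distribution(n, alpha):
    """Beta law of conditional coverage when the threshold is the k-th order statistic."""
    k = min(int(np.ceil((1 - alpha) * (n + 1))), n)   # clamp so the Beta parameters stay valid
    return beta(k, n + 1 - k)                         # coverage | calibration ~ Beta(k, n + 1 - k)

def pac_adjusted_alpha(n, alpha, delta, grid=2000):
    """Largest nominal level alpha' whose conditional coverage is >= 1 - alpha
    with probability >= 1 - delta over the calibration draw (simple grid search)."""
    for a_prime in np.linspace(alpha, 1.0 / (n + 1), grid):   # tighten alpha' until the PAC condition holds
        if coverage_distribution(n, a_prime).cdf(1 - alpha) <= delta:
            return a_prime
    return None   # calibration set too small to certify the requested guarantee
```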
6. Extensions, Applications, and Limitations
Extensions
- Function-valued outputs: For neural operators and other infinite-dimensional regression settings, conformal calibration is achieved through discretization. Guarantees are lifted to the function space by controlling discretization error, calibration statistical error, and model misspecification.
- Circular data: Adapting conformity scores to angular prediction, with out-of-bag methods enabling unsupervised calibration in the absence of explicit calibration labels.
- Distribution shifts: Adaptive techniques reweight calibration data using domain classifiers or similarity in embedding space to guard coverage under subpopulation or covariate shift (a simplified weighted-quantile sketch follows this list). Optimal transport principles (e.g., the 1-Wasserstein distance between calibration and test scores) provide bounds on the loss of coverage due to arbitrary distributional changes.
- Healthcare, vision–language, and low-data regimes: Split conformal prediction augments black-box deep models in critical domains, with unsupervised or weakly supervised calibration approaches filling the gap when labeled calibration sets are limited or unavailable.
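As referenced in the distribution-shift item above, the following sketch computes a calibration quantile reweighted by density ratios estimated from a domain classifier. It is a simplified version of covariate-shift-weighted conformal calibration: the classifier probabilities are assumed to be given, and the weight placed on the test point in the full procedure is omitted.

```python
import numpy as np

def shift_weighted_quantile(cal_scores, p_test_given_x, alpha=0.1):
    """Calibration quantile reweighted by estimated covariate-shift density ratios.

    p_test_given_x : domain-classifier probability that each calibration point was drawn
                     from the test distribution; w(x) = p / (1 - p) estimates the density
                     ratio between test and calibration covariates.
    """
    cal_scores = np.asarray(cal_scores)
    w = np.asarray(p_test_given_x) / (1.0 - np.asarray(p_test_given_x))  # likelihood-ratio weights
    order = np.argsort(cal_scores)
    s, w = cal_scores[order], w[order]
    cdf = np.cumsum(w) / w.sum()
    idx = np.searchsorted(cdf, 1 - alpha, side="left")
    return s[min(idx, s.size - 1)]
```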
Limitations
- Computational cost: Unsupervised calibration leveraging optimization (e.g., in the kernelized setup) introduces nontrivial computational overhead, particularly as the number of classes or calibration samples grows.
- Trade-off in coverage: Replacing labels with predictions may degrade the guarantee by the model’s inaccuracy rate $\epsilon$; the resulting coverage bound $1 - \alpha - \epsilon$ is sharp.
- Calibration under strong dependence or adversarial contamination: Performance guarantees degrade as the dependence or contamination severity increases; additional corrections or slack are required to avoid undercoverage.
7. Unified Perspective and Practical Implications
Split conformal prediction with unsupervised calibration constitutes a powerful, distribution-free approach to uncertainty quantification, requiring minimal model assumptions and offering tangible guarantees even in semi-supervised, non-IID, or data-limited scenarios. Core advantages include computational efficiency, adaptability to arbitrary predictive models, and robustness under moderate violations of exchangeability or data contamination, which is further enhanced by domain adaptation or optimal transport-based weighting. The framework is applicable to both regression and classification, and adapts to function-valued and structured data. By leveraging empirical quantiles, weighting strategies, or auxiliary objectives (e.g., self-supervised loss), unsupervised calibration mechanisms ensure prediction sets or distributions are reliable and informative for real-world deployment where labeled resources are constrained.
A summary of the principal methodologies for split conformal prediction with unsupervised calibration, together with their guarantees, is provided in the table below:
| Methodology | Calibration Data Type | Key Guarantee / Feature |
|---|---|---|
| Split-conformal with labels | Labeled | Exact finite-sample marginal coverage at level $1-\alpha$ |
| Unsupervised/pseudo-label | Unlabeled | Coverage degrades additively to $1-\alpha-\epsilon$ with model error $\epsilon$ |
| Weighted/optimized calibration | Unlabeled | Coverage gap bounded by optimization error and statistical complexity terms |
| Self-supervised augmentation | Labeled/Unlabeled | Improved adaptive efficiency; maintains finite-sample coverage |
| Cross-conformal methods | Labeled/Unlabeled | Improved data efficiency; approximate marginal coverage |
This comprehensive framework demonstrates the versatility of split conformal prediction with unsupervised calibration, highlighting its theoretical rigor, practical adaptability, and the fundamental role of calibration set construction—be it labeled, weighted, or pseudo-labeled—in reliable, model-agnostic uncertainty quantification.