
Conditional Coverage Diagnostics for Conformal Prediction (2512.11779v1)

Published 12 Dec 2025 in stat.ML, cs.AI, and cs.LG

Abstract: Evaluating conditional coverage remains one of the most persistent challenges in assessing the reliability of predictive systems. Although conformal methods can give guarantees on marginal coverage, no method can guarantee to produce sets with correct conditional coverage, leaving practitioners without a clear way to interpret local deviations. To overcome sample-inefficiency and overfitting issues of existing metrics, we cast conditional coverage estimation as a classification problem. Conditional coverage is violated if and only if any classifier can achieve lower risk than the target coverage. Through the choice of a (proper) loss function, the resulting risk difference gives a conservative estimate of natural miscoverage measures such as L1 and L2 distance, and can even separate the effects of over- and under-coverage, and non-constant target coverages. We call the resulting family of metrics excess risk of the target coverage (ERT). We show experimentally that the use of modern classifiers provides much higher statistical power than simple classifiers underlying established metrics like CovGap. Additionally, we use our metric to benchmark different conformal prediction methods. Finally, we release an open-source package for ERT as well as previous conditional coverage metrics. Together, these contributions provide a new lens for understanding, diagnosing, and improving the conditional reliability of predictive systems.

Summary

  • The paper introduces a functional reformulation of conditional coverage as a classification task using ERT metrics to diagnose miscoverage.
  • The paper demonstrates that advanced classifiers significantly enhance statistical power and sample efficiency compared to traditional partition-based methods.
  • The paper reveals a tradeoff between improved conditional coverage and larger prediction sets, emphasizing practical implications for safety and fairness.

Conditional Coverage Diagnostics for Conformal Prediction

Introduction and Motivation

Conformal prediction (CP) provides distribution-free prediction sets with guaranteed marginal coverage under the exchangeability assumption, making it an attractive framework for reliable uncertainty quantification in modern predictive modeling. However, marginal coverage only ensures coverage on average over the population, and does not guarantee correct coverage for subpopulations or individual feature values—a property known as conditional coverage. Conditional validity is essential for safety and fairness in high-stakes applications but is unattainable without strong assumptions. Moreover, diagnosing violations of conditional coverage in practice remains an unresolved challenge due to the inadequacy of existing metrics, which either rely on coarse partitioning or are prohibitively sample-inefficient in higher dimensions.

Reformulating Conditional Coverage Assessment

This work introduces a functional reformulation of conditional coverage diagnostics by casting the problem as a supervised binary classification task. CP prediction sets are treated as black boxes outputting indicator labels $Z = \mathbf{1}\{Y \in C_\alpha(X)\}$. The conditional coverage probability function $p(x) = \mathbb{P}(Y \in C_\alpha(X) \mid X = x)$ is estimated by regressing the binary label $Z$ on $X$ with a probabilistic classifier, bypassing the limitations of partition-based approaches. Critically, if a learned classifier $h$ can surpass the risk of the constant baseline predictor $1-\alpha$ under a proper loss $\ell$, this signals a systematic violation of conditional coverage.

To formalize this, the excess risk of the target coverage (ERT) family is introduced:

$\ell\text{-ERT} = R_\ell(1-\alpha) - R_\ell(h^*)$

where $R_\ell(h)$ denotes the expected loss of classifier $h$, and $h^*$ is the Bayes-optimal predictor. ERT metrics parameterized by different proper losses $\ell$ provide lower bounds on natural miscoverage measures, including $L_1$, $L_2$, and KL divergences, and can be further decomposed to isolate under- and over-coverage components (Figure 1).

Figure 1: Illustration of conditional coverage estimation; true conditional coverage, ERT-estimated coverage, and comparison with partition-based methods and target coverage.
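
To make the diagnostic logic behind this definition explicit: for a strictly proper loss $\ell$, the Bayes-optimal predictor is $h^* = p$, and the excess risk of the constant baseline admits a divergence representation. The identity below is a standard consequence of properness, stated here for intuition rather than as the paper's exact formulation:

$\ell\text{-ERT} = R_\ell(1-\alpha) - R_\ell(p) = \mathbb{E}\bigl[d_\ell\bigl(p(X),\, 1-\alpha\bigr)\bigr] \ge 0$

with equality if and only if $p(X) = 1-\alpha$ almost surely, where $d_\ell$ is the divergence generated by $\ell$ (squared distance for the Brier loss, binary KL divergence for the log loss). This is why any classifier that beats the constant baseline certifies a violation of conditional coverage, and why different losses bound different miscoverage measures.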

Practical Estimation Protocol

ERT metrics are computed on held-out samples using cross-validation to avoid overfitting, leveraging state-of-the-art tabular classifiers (gradient-boosted trees, tabular foundation networks, etc.). This approach is adaptable: the effectiveness of conditional coverage violation detection improves with classifier expressivity, as strong learners can approximate the conditional event probability $p(x)$ more tightly.
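
A minimal sketch of this protocol might look as follows. It is illustrative only, not the authors' released package: `X` and `y` are held-out feature/label arrays, `in_pred_set(x, y)` is a hypothetical helper returning whether $y \in C_\alpha(x)$, and the gradient-boosted classifier with five-fold cross-fitting is one reasonable instantiation of the cross-validated, strong-classifier setup described above.

```python
# Illustrative cross-fitted ERT estimate (not the paper's reference implementation).
# Assumes arrays X, y of held-out samples and a hypothetical helper
# in_pred_set(x, y) that returns True iff y lies in the conformal set C_alpha(x).
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_predict

alpha = 0.1  # target miscoverage, so the target coverage is 1 - alpha

# Coverage indicators Z_i = 1{y_i in C_alpha(x_i)}: the classification labels.
Z = np.array([in_pred_set(x, yi) for x, yi in zip(X, y)], dtype=int)

# Risk of the constant baseline that predicts coverage 1 - alpha everywhere.
risk_const = log_loss(Z, np.full(len(Z), 1.0 - alpha), labels=[0, 1])

# Out-of-fold estimates of p(x) = P(Z = 1 | X = x): each Z_i is scored by a
# model that never saw sample i, avoiding the optimism of in-sample fits.
clf = HistGradientBoostingClassifier(random_state=0)
p_oof = cross_val_predict(clf, X, Z, cv=5, method="predict_proba")[:, 1]
risk_clf = log_loss(Z, p_oof, labels=[0, 1])

# Excess risk of the target coverage: values clearly above zero flag a
# systematic violation of conditional coverage.
print("log-loss ERT estimate:", risk_const - risk_clf)
```

Replacing `log_loss` with `sklearn.metrics.brier_score_loss` would give a squared-error variant of the same diagnostic.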

The paper confirms empirically that ERT metrics with strong classifiers exhibit superior statistical power and sample efficiency compared to group-based (CovGap, WCovGap, FSC, EOC, SSC) and geometry-based (WSC) diagnostics, whose effectiveness collapses in high-dimensional regimes or with limited data (Figure 2).

Figure 2: $L_1$-ERT estimation for different classifiers on two datasets; improvement in statistical power is evident for advanced classifiers relative to partition-based estimators.

Figure 3: Additional datasets show similar trends—advanced classifiers yield more accurate ERT diagnostics, outperforming partition-based methods as the number of samples increases.

Theoretical Generalizations

ERT subsumes prior nonparametric approaches as special cases—partition-based gap estimators are recoverable by restricting the classifier class to piecewise constants. Furthermore, the ERT framework enables flexible estimation of general convex distances and supports extension to adaptive (feature-dependent) coverage targets, thus subsuming scenarios with non-uniform or context-aware coverage prescriptions.

The construction also supports asymmetric diagnostics, decomposing miscoverage into contributions from over- and under-coverage via convex function splitting, thus improving interpretability and supporting targeted methodological refinement.
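
As a concrete illustration of this splitting, consider the $L_1$ miscoverage: separating the positive and negative parts of the deviation isolates the over- and under-coverage contributions. The identity below is the generic positive/negative-part decomposition, shown for intuition; the paper's convex splitting may be stated in a different but equivalent form:

$\mathbb{E}\bigl|\,p(X) - (1-\alpha)\,\bigr| = \underbrace{\mathbb{E}\bigl[\bigl(p(X) - (1-\alpha)\bigr)_+\bigr]}_{\text{over-coverage}} + \underbrace{\mathbb{E}\bigl[\bigl((1-\alpha) - p(X)\bigr)_+\bigr]}_{\text{under-coverage}}, \qquad (u)_+ = \max(u, 0).$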

Experimental Results: Synthetic and Real Data

On controlled synthetic data—where oracles produce conditionally valid sets and naive CP methods do not—the ERT metrics rapidly align with theoretical values and correctly flag miscoverage with orders-of-magnitude fewer samples than group-based or scanning-geometric diagnostics (Figure 4).

Figure 4: Sample size versus estimated metric values (top: WSC, middle: CovGap, bottom: $L_1$-ERT); functional ERT metrics converge rapidly, while group-based metrics are unstable and require excessive data.

Figure 5: Visualizations: In the naive scenario, ERT identifies regions of both under- and over-coverage missed by partitions; in the oracle scenario, ERT validates conditional validity robustly.

On real regression and classification benchmarks—ranging from univariate to high-dimensional multivariate tasks—the authors present a systematic comparison of CP procedures using ERT and classical diagnostics (Figure 6).

Figure 6: Multi-dataset averages of $L_1$-ERT (left) and WCovGap (right) for regression; ERT metrics not only reveal larger violations but also differentiate between methods more consistently.

Figure 7: Contrasting $L_2$-ERT (left) and WSC (right); ERT metrics offer fine-grained differentiation of conditional coverage quality, while WSC can be anti-correlated in high dimensions.

Figure 8: Normalized prediction set sizes; aggressive conditional coverage correction is often linked to larger volumes, highlighting the coverage-volume tradeoff.

Key findings include:

  • ERT metrics, particularly with $L_1$ losses, provide robust and conservative estimates of conditional miscoverage, requiring significantly fewer samples to uncover true violations, especially in non-homogeneous or high-dimensional covariate regimes.
  • Strong classifiers (gradient boosting, tabular foundation networks) substantially enhance detection power relative to partition-based and unsophisticated methods.
  • There is a demonstrated tradeoff between improved conditional coverage and larger prediction set sizes, emphasizing the necessity of joint assessment for practical deployment.

Implications, Limitations, and Future Directions

The main implication of this functional diagnostic approach is improved auditability and interpretability of CP-based uncertainty quantification. By shifting the locus of assessment from marginal to conditional measures supported by rich classifier classes, practitioners can robustly identify and localize miscoverage, which is critical for fairness, regulatory compliance, and safe deployment.

However, the effectiveness of ERT hinges on the capacity of the chosen classifier—if the model class is underpowered relative to the complexity of the conditional pattern, true violations may go undetected. This classifier-dependency introduces a new axis of methodological risk, but also aligns with the broader perspective of treating coverage diagnostics as hypothesis tests parameterized by classifier capacity.

From a theoretical perspective, unifying coverage diagnostics with risk minimization and functional estimation supports future developments at the interface of model-based and nonparametric inference. Practical directions include adapting ERT metrics for image, text, and time series via embedding regressors, tuning model classes for optimal power, and integrating coverage diagnostics into end-to-end (e.g., federated or privacy-preserving) learning systems.
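
For non-tabular modalities, one hypothetical way such an extension could look is to replace raw inputs with pretrained embeddings before fitting the ERT classifier. The `encode` helper and `raw_inputs` below are placeholders for any fixed feature extractor and dataset (e.g., an image or text encoder) and are not part of the paper; `Z` denotes the coverage indicators built as in the earlier sketch.

```python
# Hypothetical sketch: ERT on embedded inputs (images, text, time series).
# `encode` is a placeholder for a pretrained, frozen feature extractor and
# `Z` are the coverage indicators built as in the earlier snippet.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_predict

alpha = 0.1
E = np.stack([encode(x) for x in raw_inputs])  # embeddings used as tabular features

risk_const = log_loss(Z, np.full(len(Z), 1.0 - alpha), labels=[0, 1])
p_oof = cross_val_predict(HistGradientBoostingClassifier(), E, Z,
                          cv=5, method="predict_proba")[:, 1]
print("embedding-based ERT:", risk_const - log_loss(Z, p_oof, labels=[0, 1]))
```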

Conclusion

This work provides an operational and theoretically justified functional framework for diagnosing conditional coverage in conformal prediction. By leveraging modern classifiers to estimate the excess risk of the target coverage, the approach improves on traditional metrics in both statistical efficiency and interpretability. Empirical results confirm that principled functional diagnostics can drive methodological advances and ensure reliability where marginal coverage guarantees alone are insufficient. The release of reproducible, open-source tooling should further facilitate the adoption and evolution of these techniques in both academic and industrial contexts.
