CAOS: Conformal Aggregation of One-Shot Predictors

Updated 11 January 2026
  • The paper introduces CAOS, a unified framework that aggregates limited-resource conformal predictors using principled p-value and score-level fusion to ensure finite-sample coverage guarantees.
  • It leverages diverse aggregation methods—such as p-value combination, symmetric score-level fusion, and quantile-of-quantiles—to enhance prediction efficiency and robustness across multi-view, federated, and ensemble settings.
  • Practical applications include multi-input classification, federated learning, and distributional robustness, with empirical findings demonstrating substantial reductions in prediction set sizes while maintaining coverage.

Conformal Aggregation of One-Shot Predictors (CAOS) is a collection of statistical methodologies that enable the principled aggregation of predictions, prediction sets, and uncertainty quantification from multiple “one-shot” or limited-resource conformal predictors. CAOS leverages both p-value and score-level aggregation—in settings ranging from multi-view classification and federated learning to multi-source distributional robustness and model ensemble uncertainty quantification—while providing finite-sample, distribution-free marginal or class-conditional coverage guarantees. The unifying principle is to formally combine nonconformity information or conformal p-values from distinct predictors, tasks, views, or sources, so as to achieve improved efficiency (smaller prediction sets or intervals) while rigorously retaining operational coverage under realistic assumptions.

1. Foundational Principles and Problem Settings

CAOS generalizes traditional conformal prediction to settings where multiple predictive “views,” models, data sources, or calibration splits are available, each yielding a one-shot or limited-resource predictor. The core objective is to construct a single aggregate prediction set—typically a subset of the label or response space—retaining rigorous statistical coverage, such as

\Pr\{ Y \in \widehat C(X) \} \geq 1 - \alpha

in the marginal or class-conditional sense.

Key settings addressed include:

  • Multi-input/multi-view classification: Multiple observations (e.g., citizen science images of a single specimen), each providing complementary evidence for the (unknown) ground-truth label (Fermanian et al., 9 Jul 2025).
  • One-shot federated and distributed learning: Multiple agents/clients, each with a disjoint calibration set, aggregate their local conformal statistics at a server without data sharing (Humbert et al., 2023, Humbert et al., 2024).
  • Ensemble/model aggregation: Aggregating nonconformity information across predictive models trained on the same task (Alami et al., 7 Dec 2025).
  • Multi-distribution robustness: Aggregating across sources with different underlying distributions to achieve uniform validity (Yang et al., 6 Jan 2026).
  • Mixture-of-experts and mixture weighting: Aggregating multiple experts or modes with either fixed or data-dependent weights (Wong et al., 17 May 2025).

Universal to CAOS is the treatment of each “component” as a one-shot conformal module: that is, a predictor or view that can, for a test input, yield a nonconformity score distribution or a conformal p-value for each candidate label or response.
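
To make this notion concrete, the following minimal Python sketch captures the only capability CAOS requires of a component; the interface name and method signature are illustrative assumptions, not part of any cited paper.

```python
from typing import Protocol
import numpy as np

class OneShotConformalModule(Protocol):
    """Hypothetical CAOS component: any predictor that, for a test
    input, can return one conformal p-value per candidate label."""

    def p_value(self, x: np.ndarray, y: int) -> float:
        """Conformal p-value for candidate label y at input x, computed
        against this module's own held-out calibration scores."""
        ...
```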

2. Aggregation Methodologies

At the heart of CAOS are statistically principled aggregation schemes for fusing information from several one-shot predictors:

a) P-Value Aggregation

For each candidate label $y$, each predictor or view $j$ produces a conformal p-value $p_{j,y}$. These are aggregated through a non-increasing function $g \colon [0,1]^k \to [0,1]$ to obtain an aggregate p-value $p_{\rm agg}(y) = g(p_{1,y}, \ldots, p_{k,y})$. Classical choices for $g$ include the Bonferroni min, Simes’ procedure, Fisher’s method, order-statistic-based quantiles, and other non-increasing combiners. The aggregated set

\widehat C_\alpha(x_1, \ldots, x_k) = \{\, y : p_{\rm agg}(y) > \alpha \,\}

is then used for final prediction (Fermanian et al., 9 Jul 2025).
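
A minimal sketch of this construction, assuming standard split-conformal p-values; the function names are illustrative. Bonferroni tolerates arbitrary dependence between views, while Fisher's combiner presumes independence.

```python
import numpy as np
from scipy import stats

def conformal_p_value(cal_scores: np.ndarray, test_score: float) -> float:
    """Split-conformal p-value: rank of the test nonconformity score
    among the held-out calibration scores (higher score = stranger)."""
    return (1 + np.sum(cal_scores >= test_score)) / (len(cal_scores) + 1)

def aggregate_p_values(p: np.ndarray, method: str = "bonferroni") -> float:
    """Non-increasing combiner g applied to k per-view p-values."""
    k = len(p)
    if method == "bonferroni":   # valid under arbitrary dependence
        return min(1.0, k * float(p.min()))
    if method == "fisher":       # valid if views are independent
        return float(stats.chi2.sf(-2.0 * np.log(p).sum(), df=2 * k))
    raise ValueError(f"unknown combiner: {method}")

def prediction_set(p_agg: dict[int, float], alpha: float) -> set[int]:
    """Keep every label whose aggregated p-value exceeds alpha."""
    return {y for y, p in p_agg.items() if p > alpha}
```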

b) Score-Level Symmetric Aggregation

Given nonconformity scores from multiple predictors, scores are normalized (e.g., to e-values) and combined with any symmetric aggregation function (sum, product, order statistic, power sum, etc.). The resulting statistic is calibrated via standard split- or full-conformal prediction, ensuring coverage for the aggregated set (Alami et al., 7 Dec 2025). This subsumes “set-level” (e.g., majority-vote, union, or intersection) CAOS as a special case.
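
A sketch of the score-level recipe under assumed choices: empirical-CDF normalization and a sum aggregate. The function names and the `method="higher"` quantile convention are illustrative, not prescribed by the cited work.

```python
import numpy as np

def cdf_normalize(cal: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Map one model's raw scores to (0, 1) via the empirical CDF of
    its calibration scores, making models comparable before fusion."""
    return np.searchsorted(np.sort(cal), s, side="right") / (len(cal) + 1)

def aggregated_threshold(cal_scores: np.ndarray, alpha: float,
                         agg=np.sum) -> float:
    """cal_scores: (n, m) normalized scores of n calibration points
    under their true labels across m models. Fuse each row with a
    symmetric function, then take the finite-sample conformal quantile."""
    fused = agg(cal_scores, axis=1)
    n = len(fused)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(fused, q, method="higher"))

# At test time, label y enters the prediction set iff the fused,
# normalized score of (x, y) across the m models is <= the threshold.
```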

c) Quantile-of-Quantiles (QQ) Construction

Each participant computes a local quantile of its nonconformity scores, and the server aggregates these quantiles via another quantile operator (over agents) (Humbert et al., 2023, Humbert et al., 2024). The threshold is $\hat Q_{(\ell,k)} = k\text{-th smallest of } \{\ell\text{-th local quantile from each agent}\}$, which provides a tight, finite-sample guarantee for both marginal and training-conditional coverage.
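
The construction amounts to one order statistic per agent and one at the server. A minimal sketch, with the helper name an assumption and the paired choice of $(\ell, k)$ taken as given:

```python
import numpy as np

def quantile_of_quantiles(local_scores: list[np.ndarray],
                          ell: int, k: int) -> float:
    """Each of the m agents reports the ell-th smallest of its own
    calibration scores (one scalar, one communication round); the
    server returns the k-th smallest of the m reports as threshold."""
    reported = [np.sort(s)[ell - 1] for s in local_scores]  # 1-indexed ell
    return float(np.sort(reported)[k - 1])                  # 1-indexed k
```

Tuning the pair $(\ell, k)$ via the Beta–Beta order-statistic expression discussed in Section 4 trades marginal against training-conditional guarantees.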

d) Max-p and Weighted Aggregation

For multi-source or robustness settings, the aggregate p-value is the maximum across sources ($p(x,y) = \max_k p^{(k)}(x,y)$), yielding finite-sample uniform validity across all distributions. Alternatively, weighted sums or convex combinations of p-values, optionally with data-dependent or learned weights $w_i(x)$, allow interpolation between worst-case robustness and efficiency (Wong et al., 17 May 2025, Yang et al., 6 Jan 2026).
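
Both rules are one-liners; a sketch (names illustrative), with the validity trade-off noted in comments:

```python
import numpy as np

def max_p(p_sources: np.ndarray) -> float:
    """Max over per-source p-values: a label is excluded only if every
    source finds it nonconforming, giving uniform validity over all
    source distributions and their mixtures."""
    return float(p_sources.max())

def weighted_p(p_sources: np.ndarray, w: np.ndarray) -> float:
    """Convex combination of per-source p-values. Thresholding at alpha
    is, in the worst case, valid at level 2*alpha for uniform weights,
    tightening toward alpha as a single weight dominates."""
    return float(np.dot(w, p_sources))
```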

3. Algorithmic Workflow and Computational Considerations

The canonical algorithmic pipeline for CAOS consists of:

  1. Compute local nonconformity scores or p-values: For each predictor/view/agent and each candidate label or response, evaluate a nonconformity measure $s_j(x, y)$ and form conformal p-values against held-out calibration data.
  2. Aggregate scores or p-values: Use the chosen aggregation rule $g$, symmetric function $f$, QQ operator, or weighted sum/max to yield a single statistic per label.
  3. Calibrate via empirical quantiles: For the aggregated score across calibration data, select the appropriate empirical quantile to control the miscoverage level $\alpha$ with a finite-sample correction.
  4. Return the conformal prediction set: The final set consists of the labels/responses whose aggregated statistic falls on the admissible side of the threshold (e.g., aggregated p-value above $\alpha$, or aggregated score below the calibrated quantile).

Computational costs are dominated by sorting and aggregation at either the score or p-value level. For vectorized implementations, all operations can be batched over labels and predictors. In federated settings, communication cost is reduced to a single round of one scalar per agent (Humbert et al., 2023, Humbert et al., 2024). Memory usage is typically dominated by storage of the calibration scores.
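
Putting the four steps together for the multi-view classification case, reusing `conformal_p_value` and `aggregate_p_values` from the sketch in Section 2a; the module interface (`cal_scores`, `score`) is a hypothetical stand-in for whatever per-view predictor is in use.

```python
import numpy as np

def caos_predict(modules, x_views, labels, alpha=0.1,
                 combine="bonferroni") -> set:
    """End-to-end CAOS pipeline sketch: per-view p-values for each
    candidate label, aggregation, and thresholding at alpha."""
    kept = set()
    for y in labels:
        p = np.array([
            conformal_p_value(m.cal_scores(y), m.score(xv, y))  # step 1
            for m, xv in zip(modules, x_views)
        ])
        if aggregate_p_values(p, combine) > alpha:              # steps 2-4
            kept.add(y)
    return kept
```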

4. Theoretical Guarantees

CAOS methods provide exact or near-exact finite-sample marginal and/or class-conditional coverage guarantees under minimal assumption sets. The main theoretical statements include:

  • Class-conditional Coverage: For p-value aggregation, if the aggregated p-value $p_{\rm agg}(y)$ is constructed via any non-increasing function of marginally uniform inputs under the true label, the prediction set satisfies

\Pr(y \in \widehat C_\alpha(x_1, \ldots, x_k) \mid Y = y) \ge 1 - \alpha

for each class $y$ (Fermanian et al., 9 Jul 2025).

  • Marginal and Training-Conditional Validity: Quantile-of-quantiles methods admit exact expressions for the coverage probability as a Beta–Beta order-statistic (or via concentration bounds) and can be tuned for both marginal and PAC-style training-conditional guarantees (Humbert et al., 2023, Humbert et al., 2024).
  • Uniform Multi-Distribution Coverage: Max-p aggregation ensures that, for all source distributions (and arbitrary test-time mixtures), the constructed set achieves worst-case coverage exceeding $1 - \alpha$ (Yang et al., 6 Jan 2026).
  • Efficiency Bounds: Aggregated sets via symmetric functions enjoy worst-case set-length bounds proportional to the maximal per-model set size and model disagreement, and empirical minimization over aggregation rules enables further tightening (Alami et al., 7 Dec 2025).
  • Weighted Aggregation and Adaptive Coverage: For convex combinations of p-values, coverage falls between $1 - 2\alpha$ (uniform weights) and $1 - \alpha$ (single weight dominance), and empirical correction factors maintain validity even with data-dependent weights (Wong et al., 17 May 2025); a quick simulation illustrating the bound follows this list.
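
As a sanity check of the last bullet, the following simulation (assuming independent, exactly uniform p-values under the true label, which is far from the dependence worst case) confirms that uniform-weight averaging stays well inside the $2\alpha$ miscoverage budget:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, k, trials = 0.1, 5, 200_000

# Valid conformal p-values are (super-)uniform under the true label;
# simulate the exactly-uniform, independent case.
p = rng.uniform(size=(trials, k))
w = np.full(k, 1 / k)                      # uniform weights

miscoverage = float(np.mean(p @ w <= alpha))
print(f"empirical miscoverage {miscoverage:.5f} vs bound 2*alpha = {2 * alpha}")
```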

5. Practical Applications and Experimental Findings

Applications of CAOS span diverse settings with consistent empirical improvements in efficiency, coverage, or both:

  • Multi-view/multi-input classification (e.g., Pl@ntNet plant identification): Aggregating p-values across views reduces average set size dramatically (e.g., from ~100 to ~30 candidate species when 3–5 images per specimen are available) while preserving class-conditional coverage (Fermanian et al., 9 Jul 2025).
  • Federated and distributed learning: One-round communication methods (QQ estimators) outperform naive averaging and match centralized conformal predictors in both coverage and set size. Differential privacy can be integrated via local DP mechanisms (Humbert et al., 2023, Humbert et al., 2024).
  • Ensemble model aggregation: Symmetric score-level aggregation (SACP/CAOS++) achieves set-size reductions of up to 20–40% relative to the best individual models or naive set unions, while consistently retaining nominal coverage on regression and image-classification benchmarks (Alami et al., 7 Dec 2025).
  • Distributional robustness/fairness: Max-p CAOS yields prediction sets uniformly valid across arbitrary mixtures of source domains, supporting fair resource allocation and robustness to sub-population shift without knowledge of group or source labels at test time (Yang et al., 6 Jan 2026).
  • Mixture-of-experts: Weighted p-value aggregation with learned weights ensures local adaptivity and closes coverage gaps in regions where standard split-conformal fails, especially on underrepresented subgroups (Wong et al., 17 May 2025).

6. Design Choices and Limitations

Fundamental design decisions in CAOS include:

  • Choice of nonconformity score (compatibility with model, label structure, and calibration set size).
  • Aggregation function: Bonferroni for arbitrary dependence (conservative), Simes/Fisher for positive dependence/independence scenarios, score-level symmetric aggregates for data-driven efficiency, and quantile-of-quantiles for federated regimes.
  • Calibration sample size: Coverage guarantees hold in finite samples, but nontrivial prediction sets require sufficient per-class or per-agent calibration data; $n \gg 1/\alpha$ is recommended to control quantile estimation error (Fermanian et al., 9 Jul 2025).
  • Weight learning strategy: For weighted aggregation, careful cross-validation or held-out adaptive calibration is advised to prevent overfitting and to provide accurate correction for empirical p-value inflation.

Limitations include the potential conservativeness with large numbers of predictors in Bonferroni-style schemes, and challenges in realizing optimal efficiency when component predictors are heterogeneous or highly correlated.

7. Connections and Theoretical Extensions

CAOS methodology unifies several lines of research in conformal prediction, robust statistics, and federated learning:

  • Distributionally robust optimization: Uniform coverage across mixtures of sources generalizes the classical worst-case risk principle (Yang et al., 6 Jan 2026).
  • Group and sub-population fairness: Guarantees at the group or subgroup level (demographic slices, domain adaptation tasks) are formalized via the multi-source CAOS framework.
  • Ensemble and score-level fusion: SACP shows that detailed nonconformity information can always be more efficiently aggregated at the score level, and that set-level CAOS (majority, intersection, union) is a special, often conservative, case (Alami et al., 7 Dec 2025).
  • Conditional inference and adaptive coverage: Weighted p-value CAOS methods interpolate between global and expert-local coverage, supporting adaptivity in modern mixture-of-experts systems (Wong et al., 17 May 2025).

These frameworks enable systematic, theoretically justified advances in uncertainty quantification that are robust to data fragmentation, heterogeneity, and limited calibration resources, and are matched by broad empirical validation.
