Credal Model Averaging (CMA)
- Credal Model Averaging (CMA) is a generalization of Bayesian Model Averaging that uses a set of priors to capture epistemic uncertainty and produce interval-valued predictions.
- It has been applied in statistical model selection, ensemble methods like SPODE/AODE, and deep learning frameworks to enhance decision robustness under uncertain conditions.
- CMA combines rigorous sensitivity analysis with practical performance metrics such as indeterminacy rate and calibration error, offering a principled approach to risk-averse prediction.
Credal Model Averaging (CMA) is a generalization of Bayesian Model Averaging (BMA) that replaces the single prior over model space with a set of priors (a "credal set") to rigorously address prior uncertainty and propagate epistemic uncertainty through to model predictions. CMA has been applied in classical statistical model selection, ensemble methods such as SPODEs, and deep learning architectures including Bayesian neural networks and deep ensembles. Its main objectives are reliable classification under model uncertainty and explicit quantification of the sensitivity of predictions to prior assumptions (Corani et al., 2014; Corani et al., 2012; Wang et al., 2024).
1. Formalization of Credal Model Averaging
Let $\mathcal{M}$ denote a finite model space. Given data $D$, Bayesian Model Averaging predicts the probability of a class label $c$ as:

$$P(c \mid D) = \sum_{m \in \mathcal{M}} P(c \mid m, D)\, P(m \mid D),$$

where $P(m \mid D) \propto P(D \mid m)\, P(m)$. Here $P(m)$ is the prior over models and $P(D \mid m)$ is the model's marginal likelihood (Bayesian evidence).
CMA replaces the single prior $P(m)$ with a credal set $\mathcal{K}$ of priors over $\mathcal{M}$. Predictions become interval-valued:

$$\underline{P}(c \mid D) = \min_{P \in \mathcal{K}} \sum_{m \in \mathcal{M}} P(c \mid m, D)\, P(m \mid D), \qquad \overline{P}(c \mid D) = \max_{P \in \mathcal{K}} \sum_{m \in \mathcal{M}} P(c \mid m, D)\, P(m \mid D),$$

where, for each prior $P \in \mathcal{K}$, $P(m \mid D) \propto P(D \mid m)\, P(m)$. The credal set of priors provides automatic sensitivity analysis and formalizes both prior ignorance and domain-informed prior elicitation (Corani et al., 2014).
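As a concrete sketch (all class probabilities and evidences below are invented), the bounds can be computed by evaluating the BMA mixture at each extreme prior of a finitely generated credal set, since the optimum of a linear-fractional objective over the polytope is attained at a vertex:

```python
import numpy as np

def cma_interval(class_probs, evidences, prior_vertices):
    """Lower/upper class probabilities over a finite set of extreme priors.

    class_probs:    (n_models, n_classes) array of P(c | m, D)
    evidences:      (n_models,) array of marginal likelihoods P(D | m)
    prior_vertices: (n_vertices, n_models) extreme points of the credal set
    """
    lo = np.full(class_probs.shape[1], np.inf)
    hi = np.full(class_probs.shape[1], -np.inf)
    for prior in prior_vertices:
        post = evidences * prior
        post = post / post.sum()      # P(m | D) under this particular prior
        p = post @ class_probs        # BMA prediction for this prior
        lo, hi = np.minimum(lo, p), np.maximum(hi, p)
    return lo, hi

# Two models, binary problem, credal set spanned by two extreme priors
class_probs = np.array([[0.9, 0.1],
                        [0.3, 0.7]])
evidences = np.array([0.02, 0.01])
vertices = np.array([[0.2, 0.8],
                     [0.8, 0.2]])
lo, hi = cma_interval(class_probs, evidences, vertices)
```

A wide gap between `lo` and `hi` signals a prediction that is sensitive to the choice of prior.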
2. CMA Design in Classical and Ensemble Methods
CMA algorithms have been developed for various model classes, including logistic regression ensembles and SPODE-based ensembles.
Logistic Regression Ensembles (Corani et al., 2014)
- CMA (independent Bernoulli): each of the $q$ candidate covariates is included with a common probability $\theta$, so that $P(m) \propto \theta^{k_m}(1-\theta)^{q-k_m}$, where $k_m$ is the number of covariates in model $m$. Near-ignorance is encoded by letting $\theta$ range over a wide interval $[\underline{\theta}, \overline{\theta}]$.
- CMA (non-identical Bernoulli): each covariate $j$ has its own inclusion-probability interval $[\underline{\theta}_j, \overline{\theta}_j]$, such that $P(m) \propto \prod_{j \in m} \theta_j \prod_{j \notin m} (1-\theta_j)$. Expert opinion is incorporated by narrowing these intervals per covariate.
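A minimal sketch of the independent-Bernoulli scheme, assuming the prior form $P(m) \propto \theta^{k_m}(1-\theta)^{q-k_m}$ and solving the 1-D optimization by a simple grid sweep (the model space, evidences, and class probabilities are invented):

```python
import numpy as np

def bernoulli_cma_interval(class_prob, evidences, k, q, theta_lo, theta_hi,
                           grid=401):
    """Interval for P(c | D) when P(m) is proportional to
    theta^k_m * (1-theta)^(q-k_m) and theta sweeps [theta_lo, theta_hi].

    class_prob: (n_models,) P(c | m, D) for the class of interest
    evidences:  (n_models,) marginal likelihoods P(D | m)
    k:          (n_models,) number of covariates included in each model
    q:          total number of candidate covariates
    """
    vals = []
    for theta in np.linspace(theta_lo, theta_hi, grid):
        prior = theta ** k * (1.0 - theta) ** (q - k)  # unnormalized P(m)
        post = evidences * prior
        post = post / post.sum()                       # P(m | D) for this theta
        vals.append(float(post @ class_prob))
    return min(vals), max(vals)

# Four models over q = 2 covariates: {}, {x1}, {x2}, {x1, x2}
k = np.array([0, 1, 1, 2])
class_prob = np.array([0.45, 0.80, 0.55, 0.75])     # invented P(c | m, D)
evidences = np.array([0.010, 0.030, 0.015, 0.020])  # invented P(D | m)
lo, hi = bernoulli_cma_interval(class_prob, evidences, k, q=2,
                                theta_lo=0.1, theta_hi=0.9)
```

The grid sweep stands in for the exact 1-D optimization described in the source; any bound-constrained scalar optimizer would serve equally well.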
SPODE/AODE Ensembles (Corani et al., 2012)
- In the SPODE context, class-posterior aggregation using BMA is dominated by a single model, degrading performance. To overcome this, compression-based averaging applies a logarithmic smoothing to the posteriors. CMA replaces the unique model prior with a credal set, e.g., all priors satisfying
$$P(s_i) \ge \epsilon \ \text{for every SPODE } s_i, \qquad \sum_i P(s_i) = 1,$$
yielding interval-valued class posteriors and robust (maximality / interval-dominance based) decisions.
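An illustrative sketch of such a credal set with a lower bound $\epsilon$ on each SPODE's prior weight; because the class posterior is linear-fractional in the prior, its extrema are attained at the polytope's vertices, where all slack mass $1 - n\epsilon$ sits on a single model (all numbers invented):

```python
import numpy as np

def spode_cma_interval(class_prob, evidences, eps):
    """Interval class posterior under the credal set
    {P : P(s_i) >= eps for all i, sum_i P(s_i) = 1}.
    Extrema are attained at vertices: slack mass 1 - n*eps on one model."""
    n = len(evidences)
    vals = []
    for j in range(n):
        prior = np.full(n, eps)
        prior[j] += 1.0 - n * eps     # vertex j of the credal polytope
        post = evidences * prior
        post = post / post.sum()      # P(s_i | D) under this vertex prior
        vals.append(float(post @ class_prob))
    return min(vals), max(vals)

# Three SPODEs, one class of interest (all numbers invented)
class_prob = np.array([0.2, 0.6, 0.9])   # P(c | s_i, D)
evidences = np.array([0.3, 0.5, 0.2])    # P(D | s_i)
lo, hi = spode_cma_interval(class_prob, evidences, eps=0.1)
```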
Deep Ensemble Credal Wrappers (Wang et al., 2024)
- Given $N$ predictive distributions $\{p_i\}_{i=1}^{N}$ (e.g., deep ensemble predictions), the class interval is
$$[\underline{p}(c), \overline{p}(c)] = \left[\min_{i} p_i(c),\ \max_{i} p_i(c)\right] \ \text{for each class } c.$$
The credal set is the convex polytope of all probability distributions $q$ with $\underline{p}(c) \le q(c) \le \overline{p}(c)$ for each class.
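A minimal sketch of this wrapper for a toy ensemble (the softmax outputs are invented):

```python
import numpy as np

def credal_wrapper(ensemble_probs):
    """Class-wise probability intervals from an ensemble of softmax outputs.
    ensemble_probs: (n_members, n_classes); each row sums to one."""
    return ensemble_probs.min(axis=0), ensemble_probs.max(axis=0)

# Three ensemble members, three classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.5, 0.3, 0.2],
                  [0.6, 0.1, 0.3]])
lo, hi = credal_wrapper(probs)
```

Wider intervals directly reflect disagreement among ensemble members, i.e., epistemic uncertainty.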
3. Decision Rules and Prior-Dependent Instances
CMA supports interval-valued predictions and robust decisions.
- Interval-dominance rule (binary case): for classes $c_1$ and $c_2$, predict $c_1$ if $\underline{P}(c_1 \mid D) > 0.5$, predict $c_2$ if $\overline{P}(c_1 \mid D) < 0.5$, and otherwise remain indeterminate (return $\{c_1, c_2\}$).
- Prior-dependent instance: any instance for which $\underline{P}(c_1 \mid D) \le 0.5 \le \overline{P}(c_1 \mid D)$, i.e., the predicted class changes as the prior varies within the credal set. In such cases, CMA returns the set of plausible classes, thereby suspending judgment where model-prior arbitrariness would produce brittle results (Corani et al., 2014; Corani et al., 2012).
In multiclass problems (e.g., SPODE/AODE), maximality rules and interval-dominance are used to define non-dominated class sets based on lower/upper bounds optimized over the credal set.
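A sketch of the multiclass interval-dominance rule on class-probability intervals (the bounds below are invented for illustration):

```python
import numpy as np

def non_dominated_classes(lo, hi):
    """Interval dominance: class c is discarded iff some class c' has
    lo[c'] > hi[c]; the survivors form the set-valued prediction."""
    return [c for c in range(len(lo)) if not np.any(lo > hi[c])]

# Clear winner: class 0 interval-dominates both others
pred1 = non_dominated_classes(np.array([0.5, 0.1, 0.05]),
                              np.array([0.7, 0.3, 0.2]))
# Overlapping intervals: classes 0 and 1 both survive (indeterminate)
pred2 = non_dominated_classes(np.array([0.3, 0.25, 0.05]),
                              np.array([0.5, 0.45, 0.1]))
```

A singleton result is a determinate prediction; a larger set is the credal classifier's way of suspending judgment.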
- Intersection-probability transformation in deep ensemble settings selects a single normalized distribution lying on the segment between the lower and upper corners of the credal polytope:
$$\hat{p}(c) = \underline{p}(c) + \beta\,\bigl(\overline{p}(c) - \underline{p}(c)\bigr), \qquad \beta = \frac{1 - \sum_{c'} \underline{p}(c')}{\sum_{c'} \bigl(\overline{p}(c') - \underline{p}(c')\bigr)},$$
where $\beta$ normalizes the sum to one.
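A sketch of this transformation, assuming the standard intersection-probability formula with a single $\beta$ shared across classes so that $\hat{p}$ sums to one:

```python
import numpy as np

def intersection_probability(lo, hi):
    """Map a credal polytope [lo, hi] to a single distribution on the
    segment between its lower and upper corners, normalized to one."""
    denom = hi.sum() - lo.sum()
    if denom == 0.0:                 # degenerate (precise) credal set
        return lo.copy()
    beta = (1.0 - lo.sum()) / denom  # shared normalizing coefficient
    return lo + beta * (hi - lo)

p_hat = intersection_probability(np.array([0.5, 0.1, 0.1]),
                                 np.array([0.7, 0.3, 0.3]))
```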
4. Elicitation of Prior Ignorance and Expert Knowledge
CMA schemes vary the width of the credal set to encode different sources of prior information.
- Prior ignorance: in CMA, using the widest feasible interval for the inclusion probability $\theta$ encodes minimal prior knowledge.
- Expert knowledge: In CMA, expert-elicited intervals for each covariate sharply limit the credal set and enable more informative and less conservative inference.
- In ensemble contexts, the prior credal set can be defined by lower bounds on each model or covariate inclusion probability (Corani et al., 2014; Corani et al., 2012).
A relevant implication is that the trade-off between robustness (wider credal sets) and informativeness (narrower credal sets) is tunable, and empirical results show that expert-informed priors yield sharper inferences for covariates strongly supported by data (Corani et al., 2014).
5. Reliability, Utility, and Performance Metrics
CMA metrics extend conventional ML evaluation to account for interval-valued and set-valued outputs (Corani et al., 2014; Corani et al., 2012; Wang et al., 2024):
- Indeterminacy rate: Fraction of instances for which CMA prediction is set-valued (i.e., prior-dependent).
- Accuracy on determinate instances: Performance where CMA gives a single class.
- Set-accuracy: Proportion of times the returned set of classes contains the true label.
- Quadratic utility ($u_{65}$, $u_{80}$): rewards correct set-valued predictions with a score discounted by their cardinality.
- AUC, recall, Brier loss: Determinate metrics for compatibility with BMA and standard classifiers.
- Expected Calibration Error (ECE), AUROC, AUPRC: Used especially in deep ensemble applications to measure calibration and OOD detection ability.
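The set-valued metrics above can be sketched as follows; the quadratic-utility coefficients correspond to the $u_{65}$/$u_{80}$ scores commonly used in the credal classification literature, and the example predictions are invented:

```python
import numpy as np

def credal_metrics(pred_sets, y_true):
    """Metrics for set-valued predictions.
    pred_sets: list of sets of candidate labels; y_true: true labels."""
    hits = [y in s for s, y in zip(pred_sets, y_true)]
    indet = [len(s) > 1 for s in pred_sets]
    det_hits = [h for h, i in zip(hits, indet) if not i]
    # Discounted accuracy: 1/|S| if the true label is in S, else 0.
    d = np.array([1.0 / len(s) if h else 0.0
                  for s, h in zip(pred_sets, hits)])
    return {
        "indeterminacy_rate": float(np.mean(indet)),
        "determinate_accuracy": (float(np.mean(det_hits))
                                 if det_hits else float("nan")),
        "set_accuracy": float(np.mean(hits)),
        # Quadratic utilities rescale discounted accuracy so that a correct
        # binary set prediction scores 0.65 (u65) or 0.80 (u80).
        "u65": float(np.mean(-0.6 * d**2 + 1.6 * d)),
        "u80": float(np.mean(-1.2 * d**2 + 2.2 * d)),
    }

m = credal_metrics([{0}, {0, 1}, {1}, {0, 1}], [0, 1, 0, 0])
```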
Key empirical findings:
- CMA's indeterminacy rate is $5$–$8$% for independent-Bernoulli priors, up to $20$% for non-identical Bernoulli priors; it decreases with sample size.
- On prior-dependent instances, BMA's accuracy degrades to random-guessing levels (≈$60$%), whereas CMA retains high reliability by abstaining or returning sets.
- Under the more conservative quadratic utility, CMA always outperforms BMA in risk-averse scenarios; under the less conservative one, the most imprecise (broadest) CMA configuration is optimal when data are scarce.
- In deep learning, the credal wrapper improves EU-based AUROC for OOD detection by $2$–$4$ points and reduces ECE by $20$–$30$% compared to vanilla ensemble/BMA softmax averaging (Wang et al., 2024).
6. Algorithmic Procedure and Computational Considerations
In classical CMA for model ensembles:
- Posterior intervals for classes or models are computed by optimizing linear-fractional (BMA) or non-linear (compression-weighted) functions over the credal set, with constraints given by minimum prior weights.
- For logistic regression, independent-Bernoulli CMA reduces intervals to 1D optimizations, while non-identical Bernoulli CMA requires multi-dimensional (but tractable) optimization, commonly solved via NLopt (Corani et al., 2014).
- In credal-ensemble meta-algorithms (SPODEs), BMA-based CMA and compression-based CMA only add a classification-time overhead (for small optimization subproblems); overall computational complexity remains polynomial in the number of models and classes (Corani et al., 2012).
In deep learning, the credal wrapper calculation is efficient:
- Lower and upper class bounds are computed via class-wise minima/maxima over the ensemble members.
- The intersection-probability transformation is closed-form and efficient for inference, although some measures of generalized uncertainty require more computational resources (Wang et al., 2024).
7. Applications, Extensions, and Theoretical Properties
CMA and its extensions have been deployed in diverse settings:
- Ecological inference: Predicting Alpine marmot burrow presence in a heterogeneous landscape (Corani et al., 2014).
- Tabular datasets: 40 UCI-like datasets, demonstrating reliability and calibration improvement over standard AODE, BMA, and COMP-AODE (Corani et al., 2012).
- Deep learning OOD detection: Tasks on CIFAR, ImageNet, and variants showing improved calibration and uncertainty quantification (Wang et al., 2024).
Theoretical properties:
- Reliability: CMA maintains high set-accuracy and reliability on prior-dependent instances, never making confident errors.
- Sensitivity analysis: CMA automates and formalizes the quantification of prior-sensitivity in model-based prediction.
- Complexity: Additional computational cost is confined to classification time but remains tractable for practical ensemble sizes and moderate feature/label cardinality.
- Convexity: the credal set forms a convex polytope in the probability simplex, supporting principled operations such as the intersection-probability transform.
A plausible implication is that as high-dimensional and ensemble-based models become more prevalent, credal methods offer a principled framework for epistemic uncertainty management, with potential for seamless integration into risk-averse decision-making and active learning. Proposed future work includes efficient entropy approximation on credal polytopes, extending CMA to regression and structured outputs, and leveraging interval widths for active data acquisition (Wang et al., 2024).