
Stacked Model Probability: Methods and Insights

Updated 10 February 2026
  • Stacked model probability combines candidate models' predictions into a convex mixture, with weights chosen to maximize cross-validated predictive performance under a proper scoring rule.
  • Computational strategies such as Pareto-smoothed importance sampling (PSIS) efficiently approximate the leave-one-out predictive densities needed for robust model evaluation, and hierarchical weighting extends fixed weights to input-dependent functions.
  • Theoretical guarantees such as no-worse-than-selection, elpd dominance, and asymptotic consistency underpin its strong predictive performance.

Stacked model probability refers to the optimal combination of predictive distributions (or point predictions) from a collection of candidate models. The combination is expressed as a convex mixture with weights chosen to maximize predictive accuracy, typically under a proper scoring rule and often evaluated via cross-validation. Stacked probabilities generalize and often outperform model selection and Bayesian model averaging, especially in situations where the true data-generating mechanism is not among the candidate models. Recent work provides rigorous theoretical justification, computational strategies, and extensions to hierarchical and non-linear weighting schemes, as well as generalizations beyond linear pooling.

1. Mathematical Formulation of Stacked Model Probability

Let $\{M_1, \ldots, M_K\}$ denote $K$ candidate models for data $D$, each defining a predictive density $p_k(\hat y \mid D)$. The classical stacking procedure constructs the aggregated predictive density as a linear pool:

$$p_{\text{stack}}(\hat y \mid D) = \sum_{k=1}^K w_k\, p_k(\hat y \mid D),$$

where $w = (w_1, \ldots, w_K)$ lies on the probability simplex, i.e., $w_k \geq 0$ and $\sum_k w_k = 1$ (Yao, 2019, Yao et al., 2017). The stacking weights are determined by optimizing predictive performance under a proper scoring rule, principally the log score:

$$w^* = \arg\max_{w \in \Delta^{K-1}} \frac{1}{n} \sum_{i=1}^n \log\left(\sum_{k=1}^K w_k\, p_k(y_i \mid D_{-i})\right),$$

where $p_k(y_i \mid D_{-i})$ is the leave-one-out (LOO) predictive density for $y_i$, estimated without the $i$th data point (Yao, 2019, Wadsworth et al., 4 Sep 2025). For point predictions, stacking of means instead minimizes squared error over LOO predictions.
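As a minimal sketch of this optimization (the LOO density values below are made up for illustration), the simplex constraint can be handled with exponentiated-gradient ascent, whose multiplicative updates keep the weights nonnegative and normalized at every step:

```python
import math

# Hypothetical LOO predictive densities p_k(y_i | D_{-i}):
# rows index data points i, columns index models k.
loo_dens = [
    [0.40, 0.10],
    [0.35, 0.05],
    [0.05, 0.30],
    [0.08, 0.45],
]

def stacking_weights(dens, iters=2000, eta=0.5):
    """Maximize (1/n) * sum_i log(sum_k w_k p_ik) over the simplex via
    exponentiated-gradient ascent: multiplicative updates preserve
    nonnegativity, and renormalization keeps sum_k w_k = 1."""
    n, K = len(dens), len(dens[0])
    w = [1.0 / K] * K
    for _ in range(iters):
        grad = [sum(row[k] / sum(wj * pj for wj, pj in zip(w, row))
                    for row in dens) / n
                for k in range(K)]
        w = [wk * math.exp(eta * g) for wk, g in zip(w, grad)]
        total = sum(w)
        w = [wk / total for wk in w]
    return w

w = stacking_weights(loo_dens)
```

Each vertex of the simplex corresponds to selecting a single model, so the optimized mixture can never score worse than the best constituent on the cross-validated objective.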

2. Computational Strategies and Implementation

Exact computation of LOO predictive densities is often infeasible, especially for complex hierarchical models. Pareto-smoothed importance sampling (PSIS) is the common approximation: full-data posterior draws $\theta_k^s$ are reweighted to estimate $p_k(y_i \mid D_{-i})$ using smoothed importance ratios (Yao, 2019, Yao et al., 2017, Yao et al., 2020). The stacked weights are then obtained by convex optimization constrained to the simplex (e.g., via L-BFGS-B or projected gradient methods).
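A minimal sketch of the importance-sampling idea behind PSIS-LOO, using raw (unsmoothed) ratios on a made-up matrix of per-draw likelihoods; actual PSIS additionally fits a generalized Pareto distribution to the largest ratios and replaces them with smoothed quantiles to stabilize the estimate:

```python
import random

random.seed(0)

# Hypothetical per-draw likelihood values p(y_i | theta^s):
# S posterior draws by n data points.
S, n = 500, 4
lik = [[random.uniform(0.05, 0.6) for _ in range(n)] for _ in range(S)]

def is_loo(lik):
    """Self-normalized importance sampling with ratios r_s proportional to
    1 / p(y_i | theta^s); the LOO density estimate reduces to a harmonic
    mean of the per-draw likelihoods for observation i."""
    S, n = len(lik), len(lik[0])
    loo = []
    for i in range(n):
        ratios = [1.0 / lik[s][i] for s in range(S)]
        weighted = sum(r * lik[s][i] for s, r in enumerate(ratios))  # equals S
        loo.append(weighted / sum(ratios))
    return loo

loo_dens = is_loo(lik)
```

The raw ratios can have heavy tails when the full-data posterior is far from the LOO posterior; that is exactly the failure mode PSIS diagnoses via the fitted Pareto shape parameter.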

For special model classes, such as conjugate linear-Gaussian spatial models, closed-form posteriors and LOO predictors obviate the need for MCMC, allowing fast, fully parallelizable stacking procedures (Zhang et al., 2023).

Stacked Gaussian process (StackedGP) models exploit nodewise independence and analytic moment propagation to efficiently compute output means and variances through arbitrary depth/layered network compositions (Abdelfatah et al., 2016).

Hierarchical or input-dependent stacking generalizes fixed weights to functions $w_j(x)$, parameterized (e.g., by a softmax over basis expansions or Gaussian processes) and inferred via full Bayesian modeling, typically with MCMC or variational inference (Yao et al., 2021).
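A toy sketch of such input-dependent weights (all coefficients below are invented for illustration): each model receives a score linear in a basis expansion of $x$, and a softmax maps the scores onto the simplex pointwise:

```python
import math

def basis(x):
    # Hypothetical basis expansion: intercept, linear, and quadratic terms.
    return [1.0, x, x * x]

def input_weights(x, alphas):
    """w_j(x) = softmax_j(alpha_j . phi(x)): each model's weight varies
    smoothly with the input x while staying on the simplex."""
    scores = [sum(a * b for a, b in zip(alpha, basis(x))) for alpha in alphas]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative coefficients: model 0 favored for negative x, model 1 for positive x.
alphas = [[0.0, -1.0, 0.0], [0.0, 1.0, 0.0]]
w_low = input_weights(-2.0, alphas)
w_high = input_weights(2.0, alphas)
```

In the full Bayesian treatment the coefficients get partial-pooling priors, which shrinks the weight functions toward a constant when the data do not support input dependence.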

3. Theoretical Guarantees and Optimality

Stacked model probabilities inherit several rigorous properties:

  • No-worse-than-selection: The stacking ensemble achieves expected predictive utility at least as large as that of the best single model, asymptotically (Yao, 2019, Yao et al., 2017).
  • Elpd dominance: The difference in expected log predictive density (elpd) between stacking and the best constituent is bounded below, quantifying the gain achievable under model heterogeneity (Yao et al., 2021).
  • Interpretability under separability: If the candidate models are locally well-separated over the input/output space, the stacking weight for model $j$ approximates the probability that its region is the best fit (i.e., $w_j^{\text{stacking}} \approx \Pr((x,y)\in \mathcal J_j)$) (Yao et al., 2021).
  • Asymptotic consistency: Under regularity conditions, stacking weights converge to the minimizer of posterior expected loss (e.g., squared error) even in $\mathcal M$-complete or $\mathcal M$-open settings where no single true model exists (Le et al., 2016).

Stacked survival models admit analogous guarantees for mean squared error decomposition, ultimately converging to the true conditional survival function if at least one candidate estimator is consistent (Wey et al., 2013).

For Gibbs-posterior stacking, if the scoring rule (e.g., CRPS) admits a unique global minimizer of the risk, the posterior on weights concentrates at this minimizer as $n \rightarrow \infty$ (Wadsworth et al., 4 Sep 2025).

4. Generalizations Beyond Linear Convex Pooling

Classical linear stacking can be generalized or regularized using:

  • Gibbs posteriors: The empirical risk under a proper scoring rule $S$ yields the action posterior

$$\pi_n^{(\eta)}(\omega) \propto \exp\{-\eta\, n\, R_n(\omega)\}\, \pi(\omega),$$

where $\omega$ are the weights, $\pi$ is a Dirichlet prior, and $R_n$ is the empirical scoring-rule risk (e.g., CRPS) (Wadsworth et al., 4 Sep 2025). Uncertainty in the weight estimates is thus propagated to the ensemble predictive distribution.

  • Log-linear pooling ("locking") and quantum superposition ("quacking"): These represent, respectively, log-convex and "superposed" (complex-amplitude) mixtures of predictive densities. The weights (and phases, for quacking) are fit using the Hyvärinen score, which avoids intractable normalizing constants by depending only on derivatives of the log unnormalized density (Yao et al., 2023). Locking preserves unimodality in predictions if all components are log-concave, and quacking can produce new intermediate modes.
  • Hierarchical and covariate-dependent weights: Stacking weights can be parameterized as functions of input features (via softmax over basis expansions, splines, or GPs), with partial pooling priors for adaptivity and regularization (Yao et al., 2021).
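The Gibbs action posterior above can be sketched with a toy random-walk Metropolis sampler (illustrative densities, a log-score risk in place of CRPS, and a flat prior on a single free logit standing in for the Dirichlet prior):

```python
import math, random

random.seed(1)

# Hypothetical LOO predictive densities (rows: points, columns: two models).
dens = [[0.40, 0.10], [0.35, 0.05], [0.05, 0.30], [0.08, 0.45]]
eta = 5.0   # Gibbs posterior temperature (learning rate)

def risk(w):
    # Empirical scoring-rule risk R_n, here the negative mean log score.
    return -sum(math.log(sum(wk * pk for wk, pk in zip(w, row)))
                for row in dens) / len(dens)

def softmax2(x):
    # Two weights from one free logit (the second logit is pinned at 0),
    # written to avoid overflow for large |x|.
    if x >= 0:
        e = math.exp(-x)
        return [1.0 / (1.0 + e), e / (1.0 + e)]
    e = math.exp(x)
    return [e / (1.0 + e), 1.0 / (1.0 + e)]

def gibbs_draws(n_draws=4000, step=0.4):
    """Random-walk Metropolis targeting pi_n(w) proportional to
    exp(-eta * n * R_n(w))."""
    n = len(dens)
    x, cur = 0.0, -eta * n * risk(softmax2(0.0))
    draws = []
    for _ in range(n_draws):
        xp = x + random.gauss(0.0, step)
        proposed = -eta * n * risk(softmax2(xp))
        if math.log(random.random()) < proposed - cur:
            x, cur = xp, proposed
        draws.append(softmax2(x))
    return draws

draws = gibbs_draws()
post_mean_w = [sum(d[k] for d in draws) / len(draws) for k in range(2)]
```

Reporting the spread of the draws, rather than a single optimized $w^*$, is what propagates weight uncertainty into the ensemble predictive.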
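The contrast between linear and log-linear ("locking") pooling can be seen numerically on a grid with two illustrative Gaussian components: the linear pool keeps both modes, while the equal-weight log-linear pool of Gaussians is again Gaussian and unimodal:

```python
import math

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Evaluate both pools on a grid over [-5, 5] with step 0.01.
grid = [i * 0.01 - 5.0 for i in range(1001)]
w = 0.5
p1 = [normal_pdf(y, -1.5, 1.0) for y in grid]
p2 = [normal_pdf(y, 1.5, 1.0) for y in grid]

# Linear pool: convex mixture of densities (bimodal for separated components).
linear = [w * a + (1.0 - w) * b for a, b in zip(p1, p2)]

# Log-linear pool ("locking"): geometric mixture, renormalized numerically on
# the grid since the normalizing constant is generally intractable in closed form.
unnorm = [a ** w * b ** (1.0 - w) for a, b in zip(p1, p2)]
z = sum(u * 0.01 for u in unnorm)   # Riemann-sum normalization
loglinear = [u / z for u in unnorm]
```

In quacking, complex amplitudes replace the real exponents, so interference between components can create new intermediate modes rather than a single compromise mode.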

5. Empirical Performance and Applications

Stacked model probabilities consistently yield robust predictive performance across a variety of domains:

  • Geostatistics: Predictive stacking in closed-form Gaussian spatial models achieves MCMC-equivalent predictions at a fraction of computational cost, with robust infill asymptotic behavior under misspecification (Zhang et al., 2023).
  • Time series and forecast ensembles: Stacked Gibbs-posterior ensembles outperform BMA, adaptive selection, and equal-weight baselines in both simulations and real-world competitions (e.g., CDC FluSight), with grounded calibration and uncertainty quantification (Wadsworth et al., 4 Sep 2025).
  • Classification and regression: Probability-vector stacking with meta-learners (e.g., multiresponse least-squares regression) strictly dominates stacking of hard labels and surpasses both best single model and unweighted voting in error reduction (Ting et al., 2011). Mean squared error decomposition confirms systematic gains across scenarios for survival analysis and regression (Wey et al., 2013).
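A stripped-down illustration of probability-vector stacking (all numbers invented): two base classifiers' class-probability vectors are combined with a single convex weight fitted by least squares against one-hot labels, a much-simplified stand-in for the multiresponse least-squares meta-learner:

```python
# Base classifiers' predicted class-probability vectors on four held-out points.
probs_a = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8], [0.4, 0.6]]
probs_b = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7], [0.1, 0.9]]
onehot = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

def squared_error(w):
    # Squared error of the convex combination w * probs_a + (1 - w) * probs_b.
    return sum((w * a + (1.0 - w) * b - y) ** 2
               for pa, pb, t in zip(probs_a, probs_b, onehot)
               for a, b, y in zip(pa, pb, t))

# One-dimensional grid search over the convex weight.
best_w = min((i / 1000.0 for i in range(1001)), key=squared_error)
```

Because the meta-learner sees full probability vectors rather than hard labels, it can exploit calibration differences between the base models, which is why probability stacking dominates stacking of hard labels.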

Stacked Gaussian processes enable uncertainty propagation and flexible forward uncertainty quantification in multilayered systems, finding use in environmental modeling (Abdelfatah et al., 2016).

6. Limitations, Extensions, and Open Issues

Critical assumptions and limitations for stacked model probability include:

  • Dependence on differentiability: Extensions using the Hyvärinen score require twice-differentiable predictive densities for all models, making direct application to discrete data non-trivial (Yao et al., 2023).
  • Sufficiency of model library: The ensemble is only as good as its components; stacking cannot compensate for an entirely misspecified or uninformative model set. However, it is robust to the inclusion of redundant or poor models due to its proper scoring rule optimization (Yao, 2019).
  • Multimodality and posterior computation: For highly multimodal posteriors, stacking can approximate mixtures of modes even when full Bayesian posteriors fail due to non-mixing, yielding improved predictive (cross-validation) performance (Yao et al., 2020).
  • Computational scaling: While stacking is embarrassingly parallel in model fitting and prediction, the combinatorial cost of high-dimensional or hierarchical stacking with structured priors may be nontrivial. For convolutional and deep architectures, further methodological development is ongoing.

Active research directions include regularization toward uniform weights (especially in high-variance or weak-signal regimes), development of efficient algorithms for sampling from non-linear mixture posteriors (locking, quacking), and formal extension to non-differentiable or discrete-output model classes (Yao et al., 2023, Wadsworth et al., 4 Sep 2025).
