Deep Ensembles
- Deep ensembles are a method that aggregates predictions from multiple independently trained neural networks, improving accuracy and uncertainty estimation.
- They decompose uncertainty into aleatoric and epistemic components, enabling reliable predictive intervals and enhanced risk assessment.
- Empirical studies indicate that, at fixed compute, increasing model capacity often outperforms explicit diversity-promoting strategies, shifting practical focus toward capacity, calibration, and efficient inference.
Deep ensembles are a foundational approach in modern deep learning for improving predictive performance, uncertainty quantification, and robustness through the aggregation of independently trained neural networks. In practice, a deep ensemble consists of multiple models, typically with shared architectures but distinct random weight initializations, which are trained independently on the same task. At inference, each model produces a prediction, and their outputs are combined—usually by simple averaging—to yield the ensemble prediction. Deep ensembles have been extensively employed across diverse domains, including statistical downscaling of climate models, human motion forecasting, and low-data transfer learning, and have become a core comparative baseline for advances in uncertainty quantification, out-of-distribution (OOD) detection, and Bayesian neural inference.
1. Core Formulation and Uncertainty Decomposition
The canonical deep ensemble comprises $M$ neural networks with parameters $\theta_1, \dots, \theta_M$, independently initialized and trained. For probabilistic regression/classification tasks, each member outputs a predictive distribution $p_{\theta_m}(y \mid x)$ (e.g., a Gaussian or softmax over classes). The aggregate ensemble predictive distribution is the uniform mixture

$$p_{\mathrm{ens}}(y \mid x) = \frac{1}{M} \sum_{m=1}^{M} p_{\theta_m}(y \mid x).$$

For Gaussian members with mean $\mu_{\theta_m}(x)$ and variance $\sigma^2_{\theta_m}(x)$, the ensemble predictive mean and variance are

$$\mu_{\mathrm{ens}}(x) = \frac{1}{M} \sum_{m=1}^{M} \mu_{\theta_m}(x), \qquad
\sigma^2_{\mathrm{ens}}(x) = \frac{1}{M} \sum_{m=1}^{M} \sigma^2_{\theta_m}(x) + \frac{1}{M} \sum_{m=1}^{M} \bigl(\mu_{\theta_m}(x) - \mu_{\mathrm{ens}}(x)\bigr)^2.$$
This decomposition allows explicit separation of:
- Aleatoric uncertainty: captured by the averaged within-member variance $\frac{1}{M}\sum_{m} \sigma^2_{\theta_m}(x)$,
- Epistemic uncertainty: quantified by the variance of the member means $\mu_{\theta_m}(x)$ across ensemble members.
This explicit decomposition supports reliable construction of predictive intervals, crucial for applications such as risk assessment under climate change (González-Abad et al., 2023).
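The decomposition can be computed directly from member outputs. The following is a minimal NumPy sketch for the Gaussian-output regression case, assuming each of the $M$ members returns a predicted mean and variance per test input; the array names and shapes are illustrative rather than taken from any particular library.

```python
import numpy as np

def ensemble_predict(member_means, member_vars):
    """Combine per-member Gaussian predictions into the ensemble mixture.

    member_means, member_vars: arrays of shape (M, N) holding each member's
    predicted mean and variance for N inputs. Returns the ensemble mean and
    the aleatoric, epistemic, and total predictive variances.
    """
    mu_ens = member_means.mean(axis=0)                       # ensemble predictive mean
    aleatoric = member_vars.mean(axis=0)                     # average within-member variance
    epistemic = ((member_means - mu_ens) ** 2).mean(axis=0)  # variance of member means
    return mu_ens, aleatoric, epistemic, aleatoric + epistemic

# Toy usage: M = 5 members, N = 3 test points.
rng = np.random.default_rng(0)
means = rng.normal(size=(5, 3))
variances = rng.uniform(0.1, 0.5, size=(5, 3))
mu, alea, epi, total = ensemble_predict(means, variances)
```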
2. Theoretical Foundations and Bayesian Connections
Originally motivated by bootstrap aggregation (“bagging”), deep ensembles explore distinct basins of the neural network loss landscape via random initializations. Empirical evidence demonstrates that such independent optimization populates markedly different function-space modes, which is essential for uncertainty representation—contrasting sharply with Bayesian neural networks (BNNs) trained via variational inference, which typically capture only local uncertainties around a single mode (Fort et al., 2019).
Recent theoretical advances establish deep ensembles as a specific instance of empirical Bayes inference: the ensemble is equivalent to Bayesian model averaging under a learned, data-dependent prior concentrated on the ensemble members' parameter values (e.g., a mixture of point masses at the trained weights $\theta_1, \dots, \theta_M$). Under this perspective, the predictive distribution of the ensemble is exact Bayesian averaging with respect to that empirical prior. This result provides a rigorous justification for the empirical success of deep ensembles and explains their superiority over BNNs with misspecified priors (Loaiza-Ganem et al., 29 Jan 2025, Hoffmann et al., 2021). Furthermore, deep ensembles can be rigorously interpreted as variational approximations on the space of probability measures, with Wasserstein gradient flows unifying standard ensembles, repulsive ensembles, and variational methods under a generalized optimization framework (Wild et al., 2023).
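For concreteness, the simplest Bayesian reading, a mixture-of-Diracs approximation to the weight posterior (the view referenced again in Section 5, not the full empirical-Bayes prior construction of Loaiza-Ganem et al.), recovers the uniform ensemble average as approximate Bayesian model averaging:

$$q(\theta \mid \mathcal{D}) = \frac{1}{M} \sum_{m=1}^{M} \delta(\theta - \theta_m)
\quad \Longrightarrow \quad
p(y \mid x, \mathcal{D}) \approx \int p(y \mid x, \theta)\, q(\theta \mid \mathcal{D})\, d\theta
= \frac{1}{M} \sum_{m=1}^{M} p(y \mid x, \theta_m).$$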
3. Role of Diversity and Optimization Interventions
While the variance-reduction benefit of ensembles is classically linked to diversity among their members, recent large-scale empirical studies establish that, in high-capacity deep networks, artificially increasing inter-model diversity through methods such as negative correlation learning (NCL), bagging, or heterogeneous architectures does not yield further improvements—and often harms performance compared to investing computational resources in growing model capacity (Abe et al., 2023, Abe et al., 2022). Counterintuitively, discouraging predictive diversity (selecting more similar models) is often benign or even beneficial in the overparameterized regime.
In the low-capacity (underparameterized) regime, however, the classical intuition holds—bagging, boosting, and NCL can meaningfully reduce ensemble error by decorrelating member errors. Opportunity-cost analyses repeatedly show that expanding single-model size outperforms diversity-promoting strategies for fixed compute, leading to the practical guideline: allocate additional resources to model capacity, not explicit diversity (Abe et al., 2023, Abe et al., 2022).
4. Training, Calibration, and Computational Strategies
Deep ensembles are straightforward to implement: train independent instances of the base architecture, typically differing only in random seed and data order. Best practices for deployment and calibration include:
- Ensemble-aware temperature scaling: Apply a single global temperature parameter to all member logits, optimized for ensemble Negative Log-Likelihood (NLL), rather than per-member scaling (Fredsgaard et al., 6 Nov 2025); see the sketch after this list.
- Joint early stopping: Monitor ensemble NLL on a shared holdout set and stop all members simultaneously, which frequently yields better performance and calibration versus per-model stopping.
- Overlapping validation splits: Employ partially overlapping holdouts to maximize training data usage without sacrificing joint evaluation (Fredsgaard et al., 6 Nov 2025).
- Parameter-efficient variants: Techniques such as BatchEnsemble or Layer Ensembles provide exponential sample diversity at reduced inference cost by factorizing randomness at the layer-level instead of the network level (Oleksiienko et al., 2022).
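As a minimal illustration of the ensemble-aware temperature scaling in the first item above, the sketch below fits one global temperature on held-out data by minimizing the NLL of the averaged ensemble probabilities. It assumes precomputed member logits in PyTorch and is a hedged sketch, not the reference implementation of Fredsgaard et al.

```python
import torch

def fit_ensemble_temperature(member_logits, labels, steps=200, lr=0.05):
    """Fit a single global temperature T shared by all ensemble members.

    member_logits: tensor of shape (M, N, C) with raw logits from M members
    on N held-out examples with C classes; labels: tensor of shape (N,).
    T is optimized against the NLL of the *ensemble* prediction, i.e. the
    average of the members' tempered softmax probabilities.
    """
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        probs = torch.softmax(member_logits / log_t.exp(), dim=-1).mean(dim=0)  # (N, C)
        nll = -torch.log(probs.gather(1, labels.unsqueeze(1)).squeeze(1) + 1e-12).mean()
        nll.backward()
        optimizer.step()
    return log_t.exp().item()
```

Optimizing the ensemble NLL directly, rather than each member's NLL separately, accounts for the fact that averaging member probabilities already tempers overconfidence.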
At inference, computational burden grows linearly with the ensemble size $M$, motivating efficient techniques such as optimized parallel inference, selective member sampling, and, for sequential data, time-multiplexed ensemble evaluation (DESOT), which spreads members over time without increasing per-frame cost (Meding et al., 2023).
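One way to realize the time-multiplexing idea is sketched below: each frame triggers evaluation of only one member, whose output is merged into a rolling buffer covering the most recent $M$ frames. This is an illustrative simplification that assumes slowly varying sequential inputs; it is not the published DESOT algorithm, and the class and method names are hypothetical.

```python
from collections import deque
import numpy as np

class TimeMultiplexedEnsemble:
    """Evaluate one ensemble member per frame and aggregate over a rolling window.

    `members` is a list of callables mapping an input frame to class
    probabilities. Only one member runs per frame, so per-frame cost matches a
    single model, while the rolling average approximates the full ensemble
    whenever consecutive frames are similar.
    """

    def __init__(self, members):
        self.members = members
        self.buffer = deque(maxlen=len(members))  # most recent per-member outputs
        self.t = 0

    def predict(self, frame):
        member = self.members[self.t % len(self.members)]  # round-robin member choice
        self.buffer.append(member(frame))
        self.t += 1
        return np.mean(self.buffer, axis=0)  # average over recently evaluated members
```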
5. Extensions: Bayesian Refinements and Repulsion
While standard deep ensembles approximate the posterior with a mixture of delta functions, recent directions extend this by:
- Gaussian Process formalisms: Interpreting the ensemble as an empirical mean and covariance process over functions, tuning via functional evidence lower bounds (fELBO), and relating to GP priors (Deng et al., 2022).
- Greedy and submodular construction: Formulating ensemble selection as KL-divergence minimization in function space, exploiting submodularity to greedily assemble high-quality posterior approximations with explicit diversity regularization (Tiulpin et al., 2021).
- Repulsive ensembles: Enforcing member diversity directly in parameter or function space via kernel-based repulsion terms during training, thereby approximating Bayesian posterior sampling as a Wasserstein gradient flow (d'Angelo et al., 2021, Wild et al., 2023).
Improved posterior approximations that place Gaussian mixture components around MAP solutions further enlarge epistemic uncertainty, better capturing the true posterior (Hoffmann et al., 2021).
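A hedged sketch of the weight-space repulsion idea follows: each member's training loss gains a kernel term that pushes its parameters away from the other members' current parameters. The RBF kernel and penalty below are a simplified illustration, not the exact Wasserstein-gradient-flow updates of d'Angelo et al. or Wild et al.

```python
import torch

def rbf_repulsion(param_vectors, bandwidth=1.0):
    """Kernel-based repulsion penalty over flattened member parameters.

    param_vectors: tensor of shape (M, D), one row of flattened weights per
    ensemble member. Returns the mean pairwise RBF kernel value; minimizing
    it pushes members apart in weight space.
    """
    sq_dists = torch.cdist(param_vectors, param_vectors) ** 2    # (M, M) squared distances
    kernel = torch.exp(-sq_dists / (2.0 * bandwidth ** 2))
    off_diag = kernel - torch.diag(torch.diag(kernel))           # drop self-similarity terms
    M = param_vectors.shape[0]
    return off_diag.sum() / (M * (M - 1))

# During training, member m would minimize
#   task_loss(model_m) + lambda_rep * rbf_repulsion(stacked_member_params)
# where lambda_rep trades fit against diversity (illustrative weighting, not
# taken from the cited papers).
```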
6. Empirical Properties: Accuracy, Calibration, and OOD Robustness
Deep ensembles robustly improve both accuracy and uncertainty calibration compared to single models and most Bayesian neural inference approaches (e.g., variational Bayes, MC-dropout), especially under distributional shift and in OOD detection. For example, ensembles achieve lower NLL and ECE on CIFAR-10/100 and ImageNet corruptions, and assign systematically higher uncertainty to OOD inputs, yielding superior error-vs-uncertainty trade-offs (Fort et al., 2019, d'Angelo et al., 2021, Deng et al., 2022).
Nevertheless, when compared to a single larger model with equivalent parameter count and in-distribution accuracy, deep ensembles offer essentially indistinguishable OOD uncertainty, calibration, and robustness. Thus, the gains result from increased overall capacity rather than the unique structure of the ensemble (Abe et al., 2022). The practice of pruning ensembles using focal diversity metrics—measuring “failure independence” on misclassified examples—enables significant cost reductions with minimal loss of accuracy or robustness (Wu et al., 2023).
7. Practical Applications and Impact
Deep ensembles are broadly applicable across scientific and engineering domains:
- Statistical downscaling in climate modeling: Deliver uncertainty-calibrated, high-resolution climate projections under nonstationary, climate-changed conditions (González-Abad et al., 2023).
- Robotic perception and human motion forecasting: Provide uncertainty-aware predictions critical for safe human–robot collaboration, leveraging model and inference-time stochasticity for volumetric confidence sets (Eltouny et al., 2023).
- Low-data transfer learning: Exploit upstream diversity from pre-trained model pools and construct ensembles with state-of-the-art performance in regimes where labeled data is limited (Mustafa et al., 2020).
- Resource-constrained deployment: Methods such as DESOT (Deep Ensembles Spread Over Time) enable full ensemble benefits for sequential data at the computational cost of a single model (Meding et al., 2023).
Distillation and optimization of ensembles into compact student models, especially with strategies that transfer member diversity, further enhance their practicality for edge and embedded applications (Nam et al., 2021).
Deep ensembles, by virtue of their structural simplicity, ability to capture function-space multimodality, and empirical calibration under shift, have become a standard tool in both theoretical analyses and demanding real-world applications. Ongoing work focuses on closing the gap between their empirical success and the statistical optimality of explicit Bayesian inference, as well as improving their computational efficiency and understanding their fundamental statistical properties.