BatchEnsemble: Efficient Neural Ensembles
- BatchEnsemble is a neural ensemble technique that approximates deep ensembles with rank-1 perturbations of shared weights, dramatically reducing computational and memory overhead.
- It estimates uncertainty by averaging predictions across members and employs metrics such as NLL, ECE, and JSD, achieving competitive performance on various tasks.
- Vectorized training and inference enable scalable application, though its rank-1 parameterization limits diversity in high-dimensional scenarios.
BatchEnsemble is a neural architecture paradigm that addresses the computational bottleneck of traditional deep ensembles by representing multiple predictive functions as efficient, rank-1 perturbations of a shared weight base. The method was introduced by Wen et al. (2020) to provide ensemble-like epistemic uncertainty with near-single-model memory and parameter footprint, and has since prompted a series of studies re-evaluating its calibration, uncertainty quantification, and member diversity properties in low-latency and resource-constrained scenarios (Wen et al., 2020, Zamyatin et al., 23 Jan 2026, Blørstad et al., 29 Jan 2026).
1. Mathematical Formulation
BatchEnsemble replaces the full parameter duplication of independent neural networks with the encoding of each ensemble member via a shared weight matrix modulated by learned input and output adapters. For a layer with base weight $W \in \mathbb{R}^{m \times n}$, member $i$ is specified as

$$\overline{W}_i = W \odot (r_i s_i^\top), \qquad r_i \in \mathbb{R}^m,\ s_i \in \mathbb{R}^n,$$

where $\odot$ represents the Hadamard (element-wise) product. The effective function for an input $x$ is given by

$$y_i = \phi\big(\overline{W}_i^\top x\big) = \phi\big((W^\top (x \odot r_i)) \odot s_i\big),$$

with $\phi$ the layer activation. This admits batched inference and training: by stacking the data so that each sample is repeated $M$ times (or the batch is partitioned into $M$ chunks), all ensemble members are computed in a single forward pass (Wen et al., 2020). The per-member parameter growth is $O(m + n)$, versus $O(mn)$ for naive deep ensembles. In practice, the addition of rank-1 adapters reduces model overhead from $M \cdot mn$ parameters to $mn + M(m + n)$, a dramatic reduction for large networks and small $M$ (Blørstad et al., 29 Jan 2026).
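The vectorization can be made concrete with a minimal NumPy sketch of a single BatchEnsemble linear layer. All shapes, names, and values are illustrative assumptions, not the authors' code; the check at the end confirms that one shared matmul reproduces the explicit per-member computation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, M, B = 5, 3, 4, 8            # input dim, output dim, ensemble size, batch size

W = rng.normal(size=(m, n))        # shared base weight
R = rng.normal(size=(M, m))        # input adapters, row i = r_i
S = rng.normal(size=(M, n))        # output adapters, row i = s_i

def member_forward(x, i):
    """Explicit per-member pass: x (W o r_i s_i^T)."""
    Wi = W * np.outer(R[i], S[i])  # rank-1 modulated weight of member i
    return x @ Wi

def batched_forward(X):
    """All members in one pass: scale inputs by r_i, outputs by s_i."""
    Xr = X[None, :, :] * R[:, None, :]   # (M, B, m) input scaling
    return (Xr @ W) * S[:, None, :]      # (M, B, n): one shared matmul

X = rng.normal(size=(B, m))
Y = batched_forward(X)
print(Y.shape)  # (4, 8, 3): per-member outputs from a single pass
```

The identity $x^\top(W \odot r s^\top) = ((x \odot r)^\top W) \odot s^\top$ is what lets the shared matmul with $W$ be amortized across all members.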
2. Uncertainty Estimation and Calibration
BatchEnsemble is positioned as an approximate Bayesian ensemble. At test time, output probabilities (classification) or predictive means and variances (regression) are averaged across members:

$$\bar{p}(y \mid x) = \frac{1}{M} \sum_{i=1}^{M} p_i(y \mid x).$$
Uncertainty metrics used include negative log-likelihood (NLL), expected calibration error (ECE), predictive entropy $H[\bar{p}]$, and Jensen–Shannon divergence (JSD) as a measure of member disagreement (a proxy for epistemic uncertainty) (Zamyatin et al., 23 Jan 2026, Blørstad et al., 29 Jan 2026). For regression, epistemic and aleatoric uncertainties can be disentangled through variance decomposition:

$$\mathrm{Var}[y \mid x] = \underbrace{\frac{1}{M}\sum_{i=1}^{M} \sigma_i^2}_{\text{aleatoric}} + \underbrace{\frac{1}{M}\sum_{i=1}^{M} (\mu_i - \bar{\mu})^2}_{\text{epistemic}},$$

where $\bar{\mu} = \frac{1}{M}\sum_{i} \mu_i$ is the ensemble mean. Entropy decomposition for classification follows a similar rationale.
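Both decompositions reduce to a few lines of arithmetic. The sketch below uses hypothetical member outputs (the means, variances, and class probabilities are made-up illustration values, not results from any paper): the regression variance splits into aleatoric and epistemic terms, and the classification entropy splits into total entropy minus mean member entropy, a JSD-style disagreement term.

```python
import numpy as np

# --- Regression: variance decomposition (hypothetical member outputs) ---
mu = np.array([1.0, 1.2, 0.9, 1.1])           # member predictive means
sigma2 = np.array([0.20, 0.25, 0.15, 0.20])   # member predictive variances

aleatoric = sigma2.mean()                      # (1/M) sum_i sigma_i^2
epistemic = ((mu - mu.mean()) ** 2).mean()     # (1/M) sum_i (mu_i - mean)^2
total_var = aleatoric + epistemic

# --- Classification: entropy decomposition (hypothetical probabilities) ---
def entropy(p):
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

P = np.array([[0.7, 0.2, 0.1],                 # member class probabilities
              [0.6, 0.3, 0.1],
              [0.8, 0.1, 0.1]])
total_ent = entropy(P.mean(axis=0))            # H[p_bar]: total uncertainty
aleatoric_ent = entropy(P).mean()              # mean_i H[p_i]
epistemic_ent = total_ent - aleatoric_ent      # JSD-style member disagreement
```

By concavity of entropy, `epistemic_ent` is non-negative and vanishes exactly when all members agree, which is the collapse observed for BatchEnsemble on vision tasks.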
Empirical results show competitive uncertainty estimates on tabular/time-series data, with BatchEnsemble matching or outperforming deep ensembles and MC dropout on NLL and calibration metrics (Blørstad et al., 29 Jan 2026). However, for vision models (CIFAR-10/10C/SVHN), BatchEnsemble accuracy and calibration nearly coincide with the single-model baseline, and JSD remains near zero—indicating minimal epistemic uncertainty (Zamyatin et al., 23 Jan 2026).
3. Training, Inference, and Computational Complexity
BatchEnsemble leverages vectorized computation to enable full-ensemble evaluation within a single forward and backward pass. During training, gradients with respect to $W$, $r_i$, and $s_i$ are computed using the chain rule, keeping all member parameters updated simultaneously:

$$\frac{\partial \mathcal{L}}{\partial W} = \sum_{i=1}^{M} \frac{\partial \mathcal{L}}{\partial \overline{W}_i} \odot (r_i s_i^\top),$$

where $\frac{\partial \mathcal{L}}{\partial \overline{W}_i}$ is the gradient with respect to member $i$'s effective weight (Wen et al., 2020). Because the adapters are low-dimensional, the additional computational cost is marginal compared to the matmul operations, and the runtime overhead at inference is typically 1.1–1.3× the single-model cost, as opposed to $M\times$ for full ensembles (Blørstad et al., 29 Jan 2026). For instance, ResNet-18 on CIFAR-10:

| Model/Ensemble | Params (M) | Training time | Inference cost |
|-----------------------|:----------:|---------------|----------------|
| Single model | 11.17 | 15 min | 1× |
| Deep Ensemble ($M=4$) | 44.71 | 1 h 15 min | 4× |
| BatchEnsemble ($M=4$) | 11.22 | 1 h 1 min | 1.3× |

(Zamyatin et al., 23 Jan 2026, Wen et al., 2020)
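The parameter counts in such comparisons follow directly from the rank-1 scheme. A minimal sketch, assuming an illustrative square layer rather than the actual ResNet-18 architecture:

```python
# Parameter overhead for one m-by-n layer (illustrative sizes, not ResNet-18):
m, n, M = 512, 512, 4
deep_ensemble = M * m * n                # M independent copies of W
batch_ensemble = m * n + M * (m + n)     # shared W plus M rank-1 adapter pairs
overhead = batch_ensemble / (m * n)      # fractional footprint vs. one model
print(deep_ensemble, batch_ensemble, overhead)
```

For this layer the overhead is about 1.6% over a single model, versus the 300% extra of a four-member deep ensemble, mirroring the 11.22 M vs. 44.71 M split in the table.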
4. Empirical Performance and Diversity Properties
Benchmark studies juxtapose BatchEnsemble with single models, MC dropout, and deep ensembles using standardized metrics across image and tabular/time-series datasets (Zamyatin et al., 23 Jan 2026, Blørstad et al., 29 Jan 2026). On tabular and sequential tasks, BatchEnsemble achieves parity with deep ensembles in accuracy, RMSE, NLL, and ECE (at matched ensemble size $M$) while using roughly 10–20% of the parameters. In time-series forecasting (e.g., the Electric dataset):

| Model | RMSE | NLL | RMSCE |
|---------------|---------------|--------------|-------------|
| Single | 0.035 ± 0.003 | –2.80 ± 0.05 | 0.10 ± 0.02 |
| MC dropout | 0.033 ± 0.003 | –2.90 ± 0.04 | 0.09 ± 0.02 |
| BatchEnsemble | 0.028 ± 0.002 | –3.50 ± 0.03 | 0.07 ± 0.01 |
| Deep ensemble | 0.029 ± 0.002 | –3.45 ± 0.03 | 0.07 ± 0.01 |

(Blørstad et al., 29 Jan 2026)
Conversely, in high-dimensional classification, member outputs are nearly identical. Functional diversity is quantified as the fraction of test inputs assigned different predictions by different members, and parameter diversity by the cosine similarity between member weight vectors. Deep ensembles show 10–20% disagreement and cosine similarity well below 1, whereas BatchEnsemble disagreement and parameter distance approach zero, causing epistemic uncertainty to collapse (Zamyatin et al., 23 Jan 2026).
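Both diversity measures are simple to compute. A sketch with made-up member predictions and flattened weight vectors (all values hypothetical, chosen only to illustrate the two metrics):

```python
import numpy as np

# Hypothetical hard predictions of two members on 10 test inputs
preds_a = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
preds_b = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 1])   # differs on one input
disagreement = float(np.mean(preds_a != preds_b))     # functional diversity

# Hypothetical flattened member weight vectors
w_a = np.array([1.0, 0.0, 1.0, -0.5])
w_b = np.array([1.0, 0.1, 0.9, -0.4])
cos_sim = float(w_a @ w_b / (np.linalg.norm(w_a) * np.linalg.norm(w_b)))
print(disagreement, cos_sim)
```

A healthy ensemble shows nonzero disagreement and cosine similarity well below 1; the collapsed BatchEnsemble regime drives the first toward 0 and the second toward 1.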
5. Theoretical Limitations
The set of functions expressible by BatchEnsemble is fundamentally limited. The parameter tuples that define members form a strict, measure-zero subset of the unconstrained product space for deep ensembles. As a consequence, BatchEnsemble is confined to local, rank-1 multiplicative “twists” of one central solution, incapable of spanning the full diversity needed to realize distinct predictive modes under severe distribution shift. This provides a theoretical basis for the empirical finding that BatchEnsemble members have near-identical parameters and produce nearly identical outputs (Zamyatin et al., 23 Jan 2026).
6. Extensions and Practical Applications
BatchEnsemble has been extended to sequential neural architectures, notably in GRUBE, a BatchEnsemble GRU cell. GRUBE integrates adapter scaling into all GRU gate calculations, enabling low-overhead uncertainty quantification in time-series modeling (Blørstad et al., 29 Jan 2026). Lifelong learning scenarios exploit the parameter-sharing scheme by freezing the shared weights $W$ after the first task and learning new adapters per task, yielding performance competitive with progressive neural networks while reducing memory overhead; e.g., Split-ImageNet (100 tasks) sees only 20% parameter growth over the base network (Wen et al., 2020).
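The per-task cost of this lifelong scheme is easy to bound, since each task adds only one $(r, s)$ pair per layer. A sketch with an illustrative square layer (not the Split-ImageNet architecture, where the aggregate growth across all layers comes out near 20%):

```python
# Freeze shared W after task 1; each later task adds only rank-1 adapters.
# Illustrative single 512x512 layer over 100 tasks:
m, n, tasks = 512, 512, 100
base = m * n                          # frozen shared weights
per_task = m + n                      # one (r, s) adapter pair per task
growth = tasks * per_task / base      # fractional growth over the base
print(growth)
```

The growth fraction scales as $M(m+n)/mn$, so it shrinks for wider layers; the exact percentage depends on the layer shapes in the network.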
In resource-constrained settings requiring epistemic uncertainty for OOD detection or selective prediction, BatchEnsemble can provide scalable, robust estimates—especially in MLPs and GRUs—matching Deep Ensembles in uncertainty quality and computational cost (Blørstad et al., 29 Jan 2026). However, if member diversity or strong calibration is critical (high-dimensional CNNs, severe distribution shift), current BatchEnsemble formulations may underperform due to theoretical constraints on expressible member diversity (Zamyatin et al., 23 Jan 2026).
7. Implementation Guidelines and Future Directions
Recommended implementation strategies include initializing the adapters with random signs; including both $r_i$ and $s_i$ in each layer; and vectorizing forward/backward passes over the ensemble index for hardware efficiency (Blørstad et al., 29 Jan 2026). Ablation studies confirm that omitting adapters, or applying them only to select layers, significantly degrades BatchEnsemble's performance. For GRUBE, all gate adapters must be present for optimal sequential uncertainty modeling.
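Random-sign initialization amounts to drawing each adapter entry from ±1, so every member starts at the shared weights up to a sign pattern that seeds diversity. A minimal sketch (names and sizes assumed):

```python
import numpy as np

rng = np.random.default_rng(42)
M, m, n = 4, 16, 8                 # ensemble size, layer in/out dims (assumed)
# Each adapter entry is +-1: members initially share |weights| exactly but
# differ in sign pattern, which breaks symmetry between members.
r_init = rng.choice([-1.0, 1.0], size=(M, m))
s_init = rng.choice([-1.0, 1.0], size=(M, n))
```

Constant all-ones initialization, by contrast, would make every member identical at the start of training.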
A plausible implication is that efficient-ensemble designs must incorporate higher-rank or non-multiplicative perturbations, rather than rank-1 multiplicative scaling, to approach the diversity and robustness of true deep ensembles. Current theoretical and empirical analyses suggest BatchEnsemble “behaves more like a single model than a true ensemble” in critical uncertainty tasks (Zamyatin et al., 23 Jan 2026). Subsequent research is likely to refine adapter parameterization or develop hybrid architectures that trade increased parameter overhead for improved epistemic diversity.
Key references:
- Wen et al., “BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning” (Wen et al., 2020)
- D’Angelo et al., “Is BatchEnsemble a Single Model? On Calibration and Diversity of Efficient Ensembles” (Zamyatin et al., 23 Jan 2026)
- Mickisch et al., “Evaluating Prediction Uncertainty Estimates from BatchEnsemble” (Blørstad et al., 29 Jan 2026)