Bayesian Active Learning by Disagreement (BALD)
- Bayesian Active Learning by Disagreement (BALD) is an information-theoretic acquisition function that selects data points with high epistemic uncertainty to maximize information gain.
- It has been applied across deep learning, Gaussian processes, and preference learning, reducing the number of labeled samples needed to reach a target accuracy.
- Recent adaptations enhance BALD’s scalability and robustness by addressing challenges like distribution shift, weak supervision, and structured outputs.
Bayesian Active Learning by Disagreement (BALD) is an information-theoretic acquisition function designed to optimize the selection of data points in active learning via Bayesian models. Its primary objective is to maximize the expected information gain about the model parameters by preferentially sampling datapoints where the Bayesian predictive model exhibits high epistemic uncertainty. The BALD criterion has proved foundational across classical machine learning, deep learning, regression, preference learning, and scientific applications. Recent innovations have extended its reach, adapted its core principles, and addressed its limitations in areas such as distributional shift, weak supervision, batch selection, and structured outputs.
1. Principle of Information Gain and Predictive Entropy
BALD formalizes active learning as querying the input $x^*$ whose label yields the maximal expected reduction in the entropy of the posterior over model parameters $\theta$, conditioned on the current dataset $\mathcal{D}$. In Bayesian settings, this reduction is the mutual information between the unknown label $y$ and the parameters:

$$x^* = \arg\max_{x} \; I[\theta; y \mid x, \mathcal{D}] = \arg\max_{x} \; \Big( H[\theta \mid \mathcal{D}] - \mathbb{E}_{y \sim p(y \mid x, \mathcal{D})}\big[ H[\theta \mid \mathcal{D} \cup \{(x, y)\}] \big] \Big).$$
However, when the model (such as a Gaussian process classifier) has an infinite-dimensional parameter space, direct computation of these parameter entropies is infeasible. The pivotal insight is that, for such nonparametric models, the mutual information can be recast purely in terms of predictive entropies, written here with the latent function $f$ playing the role of the parameters:

$$I[f; y \mid x, \mathcal{D}] = H[y \mid x, \mathcal{D}] - \mathbb{E}_{f \sim p(f \mid \mathcal{D})}\big[ H[y \mid x, f] \big].$$
Here, $H[y \mid x, \mathcal{D}]$ is the entropy of the predictive distribution for $y$ given the current data, and $\mathbb{E}_{f}\big[ H[y \mid x, f] \big]$ quantifies the average (conditional) entropy when the underlying function value is known. This formulation is computationally tractable, since the output variable $y$ is low-dimensional.
In Gaussian process classifiers with probit likelihoods and approximate inference (e.g., Expectation Propagation), BALD involves evaluating the binary entropy of the averaged predictive probability and a closed-form, approximation-based expectation over the function’s posterior. The resulting objective seeks points with large overall predictive uncertainty but low average conditional uncertainty—i.e., maximal disagreement among plausible functions (1112.5745).
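Concretely, with an approximate Gaussian posterior marginal $f_x \sim \mathcal{N}(\mu_x, \sigma_x^2)$ for the latent function at a candidate input $x$, the probit GP objective admits a closed-form approximation; the following is a sketch of the standard result from 1112.5745, with $\mathrm{h}$ the binary entropy function and $\Phi$ the standard normal CDF:

$$I[f; y \mid x, \mathcal{D}] \;\approx\; \mathrm{h}\!\left( \Phi\!\left( \frac{\mu_x}{\sqrt{\sigma_x^2 + 1}} \right) \right) \;-\; \frac{C}{\sqrt{\sigma_x^2 + C^2}} \, \exp\!\left( - \frac{\mu_x^2}{2\,(\sigma_x^2 + C^2)} \right), \qquad C = \sqrt{\tfrac{\pi \ln 2}{2}}.$$

The first term is the entropy of the averaged predictive probability, and the second approximates the expected conditional entropy; their difference is largest where plausible latent functions disagree most.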
2. Extensions to Deep Learning and Large-Scale Models
Deep learning systems, including convolutional neural networks and recurrent architectures, have adopted approximate Bayesian inference to enable uncertainty quantification. Bayesian Active Learning by Disagreement is commonly operationalized via Monte Carlo dropout, which treats dropout as a variational approximation to the Bayesian posterior. For each unlabeled sample $x$, the active learning system performs $T$ stochastic forward passes to approximate the predictive distribution:

$$p(y \mid x, \mathcal{D}) \approx \frac{1}{T} \sum_{t=1}^{T} p(y \mid x, \hat{\omega}_t),$$

where $\hat{\omega}_t \sim q(\omega)$ represents sampled model weights (i.e., dropout masks).
The Bayesian mutual information (BALD) acquisition function is computed as:

$$\mathbb{I}[y; \omega \mid x, \mathcal{D}] \approx -\sum_{c} \bar{p}_c \log \bar{p}_c \;+\; \frac{1}{T} \sum_{t=1}^{T} \sum_{c} p(y = c \mid x, \hat{\omega}_t) \log p(y = c \mid x, \hat{\omega}_t), \qquad \bar{p}_c = \frac{1}{T} \sum_{t=1}^{T} p(y = c \mid x, \hat{\omega}_t).$$
Samples with high disagreement among forward passes but low average conditional entropy are prioritized. Empirical studies on image and language data demonstrate that BALD can substantially reduce the number of required labeled samples to reach a given accuracy, outperforming random acquisition and classic uncertainty sampling. For instance, fewer than half the labeled images are needed on MNIST compared to random sampling, with similar trends for complex medical or scientific datasets (Gal et al., 2017, Siddhant et al., 2018, Walmsley et al., 2019).
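The following is a minimal NumPy sketch of this computation, assuming `probs` is a stack of softmax outputs from $T$ dropout-active forward passes; the helper name `bald_scores` and the `stochastic_forward_pass` call in the usage comment are illustrative assumptions, not APIs from the cited papers.

```python
import numpy as np

def bald_scores(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """BALD mutual information from MC-dropout softmax samples.

    probs: array of shape (T, N, C) -- T stochastic forward passes,
           N pool examples, C classes.
    Returns an array of shape (N,) with one acquisition score per example.
    """
    mean_probs = probs.mean(axis=0)                                              # (N, C) predictive distribution
    predictive_entropy = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)   # H[y | x, D]
    expected_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)  # E_w[H[y | x, w]]
    return predictive_entropy - expected_entropy                                 # I[y; w | x, D]

# Usage sketch: pick the top-k most informative pool points.
# probs = np.stack([stochastic_forward_pass(model, pool) for _ in range(T)])
# query_idx = np.argsort(-bald_scores(probs))[:k]
```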
3. Adaptations to Structured Outputs and Scientific Domains
BALD has been extended to handle specialized outputs and domain-specific requirements:
- Semantic Segmentation and Scientific Imaging: In applications such as agronomic semantic segmentation, BALD is computed over dense prediction maps using MC dropout. Individual pixel uncertainties are aggregated (commonly by summation) into per-image acquisition scores, as sketched after this list. Extensions such as PowerBALD incorporate temperature scaling to enhance diversity among selected samples. However, high class imbalance or redundancy, as seen in precision agriculture or LiDAR scans, can limit the effectiveness of BALD, suggesting that adjustments to uncertainty aggregation or alternative acquisition strategies may be required (Marrewijk et al., 3 Apr 2024, Duong et al., 2023).
- Preference Learning: By reformulating binary preference learning as binary classification over pairs, BALD can be directly applied to preference data. The induced “difference-based” kernel in the GP preference learning model preserves anti-symmetry and allows for querying pairs that maximize model disagreement—yielding rapid information gain about underlying user or volunteer preferences (1112.5745, Walmsley et al., 2019).
- Handling Censored or Weakly-Supervised Data: When only partially observed or imprecise labels are available, as in censored regression or weak supervision, BALD has been generalized (e.g., $\mathcal{C}$-BALD) to explicitly handle joint label and censorship variables. These methods derive new entropy and mutual information formulations for censored or noisy observations and empirically show improved performance over standard BALD (Hüttel et al., 19 Feb 2024, Olmin et al., 2022).
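As referenced in the segmentation item above, the following is a minimal sketch of per-image aggregation of dense BALD maps, together with a simplified PowerBALD-style stochastic selection step; the function names and the Gumbel-perturbation formulation are illustrative assumptions rather than the exact procedures in the cited works.

```python
import numpy as np

def image_scores_from_pixel_bald(pixel_bald: np.ndarray) -> np.ndarray:
    """Aggregate dense per-pixel BALD maps of shape (N, H, W) into per-image scores (N,)."""
    return pixel_bald.reshape(pixel_bald.shape[0], -1).sum(axis=1)  # sum aggregation over pixels

def power_select(scores: np.ndarray, k: int, beta: float = 1.0, rng=None) -> np.ndarray:
    """Simplified PowerBALD-style selection: perturb scaled log-scores with Gumbel noise
    and take the top-k, which spreads the batch over more diverse images."""
    rng = rng or np.random.default_rng()
    gumbel = rng.gumbel(size=scores.shape)
    return np.argsort(-(beta * np.log(scores + 1e-12) + gumbel))[:k]

# Usage sketch (pixel_bald assumed precomputed from MC-dropout segmentation outputs):
# scores = image_scores_from_pixel_bald(pixel_bald)
# query_idx = power_select(scores, k=16, beta=1.0)
```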
4. Algorithmic Variants, Computational Scalability, and Redundancy Mitigation
Active learning in the batch setting motivates further development and approximation of BALD:
- BatchBALD and k-BALD: BatchBALD generalizes BALD to select batches using joint mutual information, but incurs significant computational cost due to high-dimensional entropy computations. The k-BALD family approximates BatchBALD via inclusion–exclusion principles up to k-wise interactions, providing significant speedups while maintaining similar sample efficiency. A dynamic approach to setting $k$ is proposed, adjusting the approximation order as model predictions become increasingly independent (Kirsch, 2023).
- Geometric Core-Set Construction (GBALD): By performing uncertainty estimation after ellipsoidal (rather than spherical) core-set construction, GBALD mitigates the effect of an uninformative prior and reduces redundant acquisitions. Experiments reveal that GBALD selects more diverse, representative acquisitions and outperforms standard BALD under challenging label imbalance or data redundancy (Cao et al., 2021).
- Balanced Entropy Acquisition (BalEntAcq): BalEntAcq aims to “balance” posterior and label uncertainties using Beta approximations of class probabilities. It offers closed-form, standalone batch scoring that reduces the selection of overly redundant or noisy samples, outperforming BALD and related criteria on standard image benchmarks (Woo, 2021).
- Distribution Disagreement in Regression (BALSA): In regression with models such as normalizing flows, BALSA adapts BALD to quantify epistemic uncertainty by computing pairwise or grid-based divergences (e.g., KL or Earth Mover's Distance) between predictive distributions obtained from MC dropout; a pairwise-divergence sketch follows this list. This enables robust detection of epistemic uncertainty in settings where standard entropy measures confound noise and model uncertainty (Werner et al., 2 Jan 2025).
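The following is a minimal sketch of the pairwise-disagreement idea behind BALSA, assuming each MC-dropout pass yields a predictive density evaluated and normalised on a shared grid of candidate outputs; the exact divergences, gridding, and normalisation used in BALSA differ in detail, so treat this as an illustration only.

```python
import numpy as np

def pairwise_kl_disagreement(densities: np.ndarray, eps: float = 1e-12) -> float:
    """Average pairwise KL divergence between T predictive densities
    evaluated on a shared grid of candidate outputs.

    densities: array of shape (T, G) -- T MC-dropout predictive densities,
               each normalised over the same G grid points.
    Returns a scalar epistemic-disagreement score for one candidate input.
    """
    T = densities.shape[0]
    p = densities + eps
    score = 0.0
    for i in range(T):        # compare every ordered pair of dropout samples
        for j in range(T):
            if i != j:
                score += np.sum(p[i] * np.log(p[i] / p[j]))
    return score / (T * (T - 1))
```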
5. Test Distribution Awareness and Predictive-Oriented Acquisition
A key criticism of classic BALD is its lack of consideration for the test-time input distribution. Acquiring data points that reduce parameter uncertainty can be suboptimal if these points are irrelevant for the ultimate predictive task, especially in the presence of distribution shift or outliers. To address this limitation, new acquisition functions have been developed:
- Expected Predictive Information Gain (EPIG): EPIG quantifies the expected reduction in predictive entropy over the input distribution of actual interest, focusing on how a candidate label will affect predictions on relevant test inputs; a Monte Carlo estimator is sketched after this list. Empirical results demonstrate that EPIG can yield stronger predictive performance than BALD, especially when pools are uncurated or heterogeneous (Smith et al., 2023, Kirsch et al., 2021, Smith et al., 26 Apr 2024).
- Joint EPIG (JEPIG): JEPIG integrates classical parameter-centric and prediction-centric uncertainty, discounting information about model parameters that is irrelevant for downstream prediction. It enables more robust acquisition under distribution shift, as confirmed by lower acquisition of out-of-distribution samples and improved accuracy in benchmark studies (Kirsch et al., 2021).
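The following is a minimal sketch of one Monte Carlo estimator consistent with the EPIG definition, assuming the same posterior (dropout) samples are used to score both pool candidates and inputs drawn from the target distribution; the cited papers discuss several estimators, so the names and shapes here are illustrative assumptions.

```python
import numpy as np

def epig_scores(pool_probs: np.ndarray, target_probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Monte Carlo EPIG estimate for classification.

    pool_probs:   (T, N, C) softmax outputs for N candidate pool points, one per posterior sample t.
    target_probs: (T, M, C) softmax outputs for M inputs drawn from the target (test-time) distribution,
                  using the same T posterior samples.
    Returns (N,) scores: expected mutual information between a candidate's label y and a
    target label y*, averaged over target inputs.
    """
    T = pool_probs.shape[0]
    # Joint predictive p(y, y* | x, x*): average over parameter samples of outer products.
    joint = np.einsum('tnc,tmk->nmck', pool_probs, target_probs) / T     # (N, M, C, C)
    marg_pool = joint.sum(axis=-1)                                       # (N, M, C)  p(y  | x, x*)
    marg_target = joint.sum(axis=-2)                                     # (N, M, C)  p(y* | x, x*)
    mi = (joint * (np.log(joint + eps)
                   - np.log(marg_pool[..., :, None] + eps)
                   - np.log(marg_target[..., None, :] + eps))).sum(axis=(-1, -2))  # (N, M)
    return mi.mean(axis=1)                                               # average over target inputs
```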
6. Impact, Limitations, and Application-Specific Insights
BALD and its variants have been shown to efficiently reduce label acquisition costs, accelerate learning, and achieve high accuracy with fewer samples in settings including image classification, scientific data acquisition, preference learning, and high-stakes medical tasks. For instance, in the Galaxy Zoo project, combining a generative model for volunteer votes with BALD selection enabled up to 60% fewer labels for equivalent classification accuracy, illustrating high practical impact (Walmsley et al., 2019).
However, limitations remain:
- Redundancy and Class Imbalance: In domains with high label redundancy or severe class imbalance (e.g., agricultural segmentation, LiDAR scans), classic BALD may select similar or uninformative samples, providing limited added value over random sampling and occasionally increasing annotation costs (Marrewijk et al., 3 Apr 2024, Duong et al., 2023).
- Computational Overhead: Full Bayesian inference and repeated MC dropout are computationally intensive, particularly in large models or when batched acquisition is needed. Approximations such as k-BALD, parallelizable analytic formulations, or fixed encoders trained with semi-supervised learning ameliorate this cost (Kirsch, 2023, Woo, 2021, Smith et al., 26 Apr 2024).
- Uninformative or Biased Priors: When the initial labeled set is biased, standard BALD may underperform, while core-set and diversity-based methods (e.g., GBALD) provide greater robustness (Cao et al., 2021).
The table below summarizes representative BALD variants and their domains:
| BALD Variant | Application Domain | Key Technical Feature |
|---|---|---|
| BALD (classic) | GP classification, DL | Predictive entropy difference |
| BatchBALD / k-BALD | Batch AL (deep learning) | Joint mutual info, inclusion–exclusion |
| GBALD | Deep AL (vision) | Geometric ellipsoidal core-set |
| BalEntAcq | Deep AL (vision) | Balanced entropy, Beta approximation |
| BALSA | Regression (flows) | Distributional divergence (KL/EMD) |
| EPIG / JEPIG | Test-shifted AL | Test-distribution-weighted info gain |
| $\mathcal{C}$-BALD | Censored regression | Censor-aware mutual information |
7. Directions, Implications, and Ongoing Research
BALD remains central in Bayesian active learning practice but is increasingly regarded as one pillar in a broader landscape of acquisition strategies. Recent work emphasizes test-distribution-aware acquisition (EPIG, JEPIG), explicitly disentangling epistemic and aleatoric uncertainty (e.g., BALSA in regression), and robust batch/diverse selection (k-BALD, GBALD, BalEntAcq). Advances in semi-supervised learning highlight the benefit of leveraging large pools of unlabeled data, and adaptation to streaming or online scenarios further broadens applicability (Huang et al., 2019, Smith et al., 26 Apr 2024, Smith et al., 2023).
Challenges include reliable uncertainty quantification under data bias, efficient large-batch computations, managing labeling cost versus precision trade-offs, and integration with domain-specific constraints (e.g., censoring, expert annotation costs). Future directions involve developing flexible acquisition functions that account for both predictive utility and computational scalability, as well as hybrid and meta-active learning strategies that adapt acquisition criteria in response to changing data and application demands.
BALD’s continued influence is manifest in frequent empirical evaluation as a baseline and in the creative extension of its fundamental principle—selecting datapoints in regions of maximal epistemic model disagreement across probabilistic models and acquisition frameworks.