Bayesian Active Learning by Disagreement (BALD)
- Bayesian Active Learning by Disagreement (BALD) is an information-theoretic acquisition function that selects data points with high epistemic uncertainty to maximize information gain.
- It has been applied across deep learning, Gaussian processes, and preference learning, reducing the number of labeled samples needed to reach a target accuracy.
- Recent adaptations enhance BALD’s scalability and robustness by addressing challenges like distribution shift, weak supervision, and structured outputs.
Bayesian Active Learning by Disagreement (BALD) is an information-theoretic acquisition function designed to optimize the selection of data points in active learning via Bayesian models. Its primary objective is to maximize the expected information gain about the model parameters by preferentially sampling datapoints where the Bayesian predictive model exhibits high epistemic uncertainty. The BALD criterion has proved foundational across classical machine learning, deep learning, regression, preference learning, and scientific applications. Recent innovations have extended its reach, adapted its core principles, and addressed its limitations in areas such as distributional shift, weak supervision, batch selection, and structured outputs.
1. Principle of Information Gain and Predictive Entropy
BALD formalizes active learning as querying the input $x^*$ whose label yields the maximal expected reduction in the entropy of the posterior over model parameters $\theta$, conditioned on the current dataset $\mathcal{D}$. In Bayesian settings, this reduction is the mutual information between the unknown label $y$ and the parameters:

$$x^* = \arg\max_{x} \; I[\theta; y \mid x, \mathcal{D}] = \arg\max_{x} \; \Big( H[\theta \mid \mathcal{D}] - \mathbb{E}_{y \sim p(y \mid x, \mathcal{D})}\big[ H[\theta \mid \mathcal{D} \cup \{(x, y)\}] \big] \Big).$$
However, when the model (such as a Gaussian process classifier) has an infinite-dimensional parameter space, direct computation of these parameter entropies is infeasible. The pivotal insight is that, for such nonparametric models, the mutual information can be recast purely in terms of predictive entropies, written here with the latent function $f$ playing the role of the parameters:

$$I[f; y \mid x, \mathcal{D}] = H[y \mid x, \mathcal{D}] - \mathbb{E}_{f \sim p(f \mid \mathcal{D})}\big[ H[y \mid x, f] \big].$$
Here, $H[y \mid x, \mathcal{D}]$ is the entropy of the predictive distribution for $y$ given the current data, and $\mathbb{E}_{f}\big[ H[y \mid x, f] \big]$ quantifies the average (conditional) entropy when the underlying function value is known. This formulation is computationally tractable, since the output variable $y$ is low-dimensional.
In Gaussian process classifiers with probit likelihoods and approximate inference (e.g., Expectation Propagation), BALD involves evaluating the binary entropy of the averaged predictive probability and a closed-form, approximation-based expectation over the function’s posterior. The resulting objective seeks points with large overall predictive uncertainty but low average conditional uncertainty—i.e., maximal disagreement among plausible functions (1112.5745).
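Concretely, with an approximate Gaussian posterior marginal $f_x \sim \mathcal{N}(\mu_x, \sigma_x^2)$ for the latent function at a candidate input $x$, the probit GP objective admits a closed-form approximation; the following is a sketch of the standard result from 1112.5745, with $\mathrm{h}$ the binary entropy function and $\Phi$ the standard normal CDF:

$$I[f; y \mid x, \mathcal{D}] \;\approx\; \mathrm{h}\!\left( \Phi\!\left( \frac{\mu_x}{\sqrt{\sigma_x^2 + 1}} \right) \right) \;-\; \frac{C}{\sqrt{\sigma_x^2 + C^2}} \, \exp\!\left( - \frac{\mu_x^2}{2\,(\sigma_x^2 + C^2)} \right), \qquad C = \sqrt{\tfrac{\pi \ln 2}{2}}.$$

The first term is the entropy of the averaged predictive probability, and the second approximates the expected conditional entropy; their difference is largest where plausible latent functions disagree most.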
2. Extensions to Deep Learning and Large-Scale Models
Deep learning systems, including convolutional neural networks and recurrent architectures, have adopted approximate Bayesian inference to enable uncertainty quantification. Bayesian Active Learning by Disagreement is commonly operationalized via Monte Carlo dropout, which treats dropout as a variational approximation to the Bayesian posterior. For each unlabeled sample $x$, the active learning system performs $T$ stochastic forward passes to approximate the predictive distribution:

$$p(y \mid x, \mathcal{D}) \approx \frac{1}{T} \sum_{t=1}^{T} p(y \mid x, \hat{\omega}_t),$$

where $\hat{\omega}_t \sim q(\omega)$ represents sampled model weights (i.e., dropout masks).
The Bayesian mutual information (BALD) acquisition function is computed as:

$$\mathbb{I}[y; \omega \mid x, \mathcal{D}] \approx -\sum_{c} \bar{p}_c \log \bar{p}_c \;+\; \frac{1}{T} \sum_{t=1}^{T} \sum_{c} p(y = c \mid x, \hat{\omega}_t) \log p(y = c \mid x, \hat{\omega}_t), \qquad \bar{p}_c = \frac{1}{T} \sum_{t=1}^{T} p(y = c \mid x, \hat{\omega}_t).$$
Samples with high disagreement among forward passes but low average conditional entropy are prioritized. Empirical studies on image and language data demonstrate that BALD can substantially reduce the number of required labeled samples to reach a given accuracy, outperforming random acquisition and classic uncertainty sampling. For instance, fewer than half the labeled images are needed on MNIST compared to random sampling, with similar trends for complex medical or scientific datasets (Gal et al., 2017, Siddhant et al., 2018, Walmsley et al., 2019).
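The following is a minimal NumPy sketch of this computation, assuming `probs` is a stack of softmax outputs from $T$ dropout-active forward passes; the helper name `bald_scores` and the `stochastic_forward_pass` call in the usage comment are illustrative assumptions, not APIs from the cited papers.

```python
import numpy as np

def bald_scores(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """BALD mutual information from MC-dropout softmax samples.

    probs: array of shape (T, N, C) -- T stochastic forward passes,
           N pool examples, C classes.
    Returns an array of shape (N,) with one acquisition score per example.
    """
    mean_probs = probs.mean(axis=0)                                              # (N, C) predictive distribution
    predictive_entropy = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)   # H[y | x, D]
    expected_entropy = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)  # E_w[H[y | x, w]]
    return predictive_entropy - expected_entropy                                 # I[y; w | x, D]

# Usage sketch: pick the top-k most informative pool points.
# probs = np.stack([stochastic_forward_pass(model, pool) for _ in range(T)])
# query_idx = np.argsort(-bald_scores(probs))[:k]
```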
3. Adaptations to Structured Outputs and Scientific Domains
BALD has been extended to handle specialized outputs and domain-specific requirements:
- Semantic Segmentation and Scientific Imaging: In applications such as agronomic semantic segmentation, BALD is computed over dense prediction maps using MC dropout. Individual pixel uncertainties are aggregated (commonly by summation) into per-image acquisition scores, as sketched after this list. Extensions such as PowerBALD incorporate temperature scaling to enhance diversity among selected samples. However, high class imbalance or redundancy, as seen in precision agriculture or LiDAR scans, can limit the effectiveness of BALD, suggesting that adjustments to uncertainty aggregation or alternative acquisition strategies may be required (Marrewijk et al., 3 Apr 2024, Duong et al., 2023).
- Preference Learning: By reformulating binary preference learning as binary classification over pairs, BALD can be directly applied to preference data. The induced “difference-based” kernel in the GP preference learning model preserves anti-symmetry and allows for querying pairs that maximize model disagreement—yielding rapid information gain about underlying user or volunteer preferences (1112.5745, Walmsley et al., 2019).
- Handling Censored or Weakly-Supervised Data: When only partially observed or imprecise labels are available, as in censored regression or weak supervision, BALD has been generalized (e.g., $\mathcal{C}$-BALD) to explicitly handle joint label and censorship variables. These methods derive new entropy and mutual information formulations for censored or noisy observations and empirically show improved performance over standard BALD (Hüttel et al., 19 Feb 2024, Olmin et al., 2022).
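As referenced in the segmentation item above, the following is a minimal sketch of per-image aggregation of dense BALD maps, together with a simplified PowerBALD-style stochastic selection step; the function names and the Gumbel-perturbation formulation are illustrative assumptions rather than the exact procedures in the cited works.

```python
import numpy as np

def image_scores_from_pixel_bald(pixel_bald: np.ndarray) -> np.ndarray:
    """Aggregate dense per-pixel BALD maps of shape (N, H, W) into per-image scores (N,)."""
    return pixel_bald.reshape(pixel_bald.shape[0], -1).sum(axis=1)  # sum aggregation over pixels

def power_select(scores: np.ndarray, k: int, beta: float = 1.0, rng=None) -> np.ndarray:
    """Simplified PowerBALD-style selection: perturb scaled log-scores with Gumbel noise
    and take the top-k, which spreads the batch over more diverse images."""
    rng = rng or np.random.default_rng()
    gumbel = rng.gumbel(size=scores.shape)
    return np.argsort(-(beta * np.log(scores + 1e-12) + gumbel))[:k]

# Usage sketch (pixel_bald assumed precomputed from MC-dropout segmentation outputs):
# scores = image_scores_from_pixel_bald(pixel_bald)
# query_idx = power_select(scores, k=16, beta=1.0)
```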
4. Algorithmic Variants, Computational Scalability, and Redundancy Mitigation
Active learning in the batch setting motivates further development and approximation of BALD:
- BatchBALD and k-BALD: BatchBALD generalizes BALD to select batches using joint mutual information, but incurs significant computational cost due to high-dimensional entropy computations. The k-BALD family approximates BatchBALD via inclusion–exclusion principles up to k-wise interactions, providing significant speedups while maintaining similar sample efficiency. A dynamic approach to setting $k$ is proposed, adjusting the approximation order as model predictions become increasingly independent (Kirsch, 2023).
- Geometric Core-Set Construction (GBALD): By performing uncertainty estimation after ellipsoidal (rather than spherical) core-set construction, GBALD mitigates the effect of an uninformative prior and reduces redundant acquisitions. Experiments reveal that GBALD selects more diverse, representative acquisitions and outperforms standard BALD under challenging label imbalance or data redundancy (Cao et al., 2021).
- Balanced Entropy Acquisition (BalEntAcq): BalEntAcq aims to “balance” posterior and label uncertainties using Beta approximations of class probabilities. It offers closed-form, standalone batch scoring that reduces the selection of overly redundant or noisy samples, outperforming BALD and related criteria on standard image benchmarks (Woo, 2021).
- Distribution Disagreement in Regression (BALSA): In regression with models such as normalizing flows, BALSA adapts BALD to quantify epistemic uncertainty by computing pairwise or grid-based divergences (e.g., KL or Earth Mover's Distance) between predictive distributions obtained from MC dropout; a pairwise-divergence sketch follows this list. This enables robust detection of epistemic uncertainty in settings where standard entropy measures confound noise and model uncertainty (Werner et al., 2 Jan 2025).
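The following is a minimal sketch of the pairwise-disagreement idea behind BALSA, assuming each MC-dropout pass yields a predictive density evaluated and normalised on a shared grid of candidate outputs; the exact divergences, gridding, and normalisation used in BALSA differ in detail, so treat this as an illustration only.

```python
import numpy as np

def pairwise_kl_disagreement(densities: np.ndarray, eps: float = 1e-12) -> float:
    """Average pairwise KL divergence between T predictive densities
    evaluated on a shared grid of candidate outputs.

    densities: array of shape (T, G) -- T MC-dropout predictive densities,
               each normalised over the same G grid points.
    Returns a scalar epistemic-disagreement score for one candidate input.
    """
    T = densities.shape[0]
    p = densities + eps
    score = 0.0
    for i in range(T):        # compare every ordered pair of dropout samples
        for j in range(T):
            if i != j:
                score += np.sum(p[i] * np.log(p[i] / p[j]))
    return score / (T * (T - 1))
```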
5. Test Distribution Awareness and Predictive-Oriented Acquisition
A key criticism of classic BALD is its lack of consideration for the test-time input distribution. Acquiring data points that reduce parameter uncertainty can be suboptimal if these points are irrelevant for the ultimate predictive task, especially in the presence of distribution shift or outliers. To address this limitation, new acquisition functions have been developed:
- Expected Predictive Information Gain (EPIG): EPIG quantifies the expected reduction in predictive entropy over the input distribution of actual interest, focusing on how a candidate label will affect predictions on relevant test inputs; a Monte Carlo estimator is sketched after this list. Empirical results demonstrate that EPIG can yield stronger predictive performance than BALD, especially when pools are uncurated or heterogeneous (Smith et al., 2023, Kirsch et al., 2021, Smith et al., 26 Apr 2024).
- Joint EPIG (JEPIG): JEPIG integrates classical parameter-centric and prediction-centric uncertainty, discounting information about model parameters that is irrelevant for downstream prediction. It enables more robust acquisition under distribution shift, as confirmed by lower acquisition of out-of-distribution samples and improved accuracy in benchmark studies (Kirsch et al., 2021).
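The following is a minimal sketch of one Monte Carlo estimator consistent with the EPIG definition, assuming the same posterior (dropout) samples are used to score both pool candidates and inputs drawn from the target distribution; the cited papers discuss several estimators, so the names and shapes here are illustrative assumptions.

```python
import numpy as np

def epig_scores(pool_probs: np.ndarray, target_probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Monte Carlo EPIG estimate for classification.

    pool_probs:   (T, N, C) softmax outputs for N candidate pool points, one per posterior sample t.
    target_probs: (T, M, C) softmax outputs for M inputs drawn from the target (test-time) distribution,
                  using the same T posterior samples.
    Returns (N,) scores: expected mutual information between a candidate's label y and a
    target label y*, averaged over target inputs.
    """
    T = pool_probs.shape[0]
    # Joint predictive p(y, y* | x, x*): average over parameter samples of outer products.
    joint = np.einsum('tnc,tmk->nmck', pool_probs, target_probs) / T     # (N, M, C, C)
    marg_pool = joint.sum(axis=-1)                                       # (N, M, C)  p(y  | x, x*)
    marg_target = joint.sum(axis=-2)                                     # (N, M, C)  p(y* | x, x*)
    mi = (joint * (np.log(joint + eps)
                   - np.log(marg_pool[..., :, None] + eps)
                   - np.log(marg_target[..., None, :] + eps))).sum(axis=(-1, -2))  # (N, M)
    return mi.mean(axis=1)                                               # average over target inputs
```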
6. Impact, Limitations, and Application-Specific Insights
BALD and its variants have been shown to efficiently reduce label acquisition costs, accelerate learning, and achieve high accuracy with fewer samples in settings including image classification, scientific data acquisition, preference learning, and high-stakes medical tasks. For instance, in the Galaxy Zoo project, combining a generative model for volunteer votes with BALD selection enabled up to 60% fewer labels for equivalent classification accuracy, illustrating high practical impact (Walmsley et al., 2019).
However, limitations remain:
- Redundancy and Class Imbalance: In domains with high label redundancy or severe class imbalance (e.g., agricultural segmentation, LiDAR scans), classic BALD may select similar or uninformative samples, providing limited added value over random sampling and occasionally increasing annotation costs (Marrewijk et al., 3 Apr 2024, Duong et al., 2023).
- Computational Overhead: Full Bayesian inference and repeated MC dropout are computationally intensive, particularly in large models or when batched acquisition is needed. Approximations such as k-BALD, parallelizable analytic formulations, or fixed encoders trained with semi-supervised learning ameliorate this cost (Kirsch, 2023, Woo, 2021, Smith et al., 26 Apr 2024).
- Uninformative or Biased Priors: When the initial labeled set is biased, standard BALD may underperform, while core-set and diversity-based methods (e.g., GBALD) provide greater robustness (Cao et al., 2021).
The table below summarizes representative BALD variants and their domains:
| BALD Variant | Application Domain | Key Technical Feature |
|---|---|---|
| BALD (classic) | GP classification, DL | Predictive entropy difference |
| BatchBALD / k-BALD | Batch AL (deep learning) | Joint mutual info, inclusion–exclusion |
| GBALD | Deep AL (vision) | Geometric ellipsoidal core-set |
| BalEntAcq | Deep AL (vision) | Balanced entropy, Beta approximation |
| BALSA | Regression (flows) | Distributional divergence (KL/EMD) |
| EPIG / JEPIG | Test-shifted AL | Test-distribution-weighted info gain |
| $\mathcal{C}$-BALD | Censored regression | Censor-aware mutual information |
7. Directions, Implications, and Ongoing Research
BALD remains central in Bayesian active learning practice but is increasingly regarded as one pillar in a broader landscape of acquisition strategies. Recent work emphasizes test-distribution-aware acquisition (EPIG, JEPIG), explicitly disentangling epistemic and aleatoric uncertainty (e.g., BALSA in regression), and robust batch/diverse selection (k-BALD, GBALD, BalEntAcq). Advances in semi-supervised learning highlight the benefit of leveraging large pools of unlabeled data, and adaptation to streaming or online scenarios further broadens applicability (Huang et al., 2019, Smith et al., 26 Apr 2024, Smith et al., 2023).
Challenges include reliable uncertainty quantification under data bias, efficient large-batch computations, managing labeling cost versus precision trade-offs, and integration with domain-specific constraints (e.g., censoring, expert annotation costs). Future directions involve developing flexible acquisition functions that account for both predictive utility and computational scalability, as well as hybrid and meta-active learning strategies that adapt acquisition criteria in response to changing data and application demands.
BALD’s continued influence is manifest in frequent empirical evaluation as a baseline and in the creative extension of its fundamental principle—selecting datapoints in regions of maximal epistemic model disagreement across probabilistic models and acquisition frameworks.