BADGE: Batch Active Learning via Gradient Embeddings
- The paper introduces BADGE, a batch active learning method that leverages gradient embeddings to select diverse and uncertain samples for annotation.
- BADGE employs a k-means++ seeding strategy on hallucinated gradients to balance uncertainty and diversity in mini-batch selection.
- The method achieves label efficiency and computational scalability, outperforming traditional uncertainty-only strategies across various datasets.
Batch Active Learning by Diverse Gradient Embeddings (BADGE) is a batch-mode active learning strategy for deep neural networks that selects informative and diverse mini-batches for annotation by leveraging gradient-based embeddings. It aims to reduce the number of labeled data points required to attain high predictive performance, particularly in settings where annotation costs are significant and models are trained on large datasets.
1. Foundations and Motivation
Batch active learning departs from traditional sequential active learning by querying labels for multiple samples at once, reflecting the practical constraints of large-scale deep learning pipelines and annotation workflows. In standard uncertainty-based active learning, samples are ranked by uncertainty (e.g., softmax margin, entropy) and the top-k are selected. However, such selection often yields highly redundant mini-batches—points clustered in a small region of the input or feature space—offering limited new information per label acquired. BADGE was developed to address the challenge of composing mini-batches that are collectively diverse and individually informative, improving the efficiency of label acquisition for deep models (Ash et al., 2019).
2. Algorithmic Framework
BADGE operates iteratively, with each round consisting of the following steps:
- Training the Model: Begin with a labeled set and train a neural network classifier using the current labeled data, typically with cross-entropy loss.
- Gradient Embedding Computation: For each unlabeled candidate $x$ in the pool, BADGE generates a hallucinated gradient embedding. This is obtained by (i) assigning $x$ to the class of current highest model confidence, $\hat{y}(x) = \arg\max_i p_i(x)$, and (ii) computing the gradient of the loss with respect to the weights of the final (output) layer:
$$(g_x)_i = \big(p_i(x) - \mathbb{1}[\hat{y}(x) = i]\big)\, h(x),$$
where $p_i(x)$ is the predicted softmax probability for class $i$, $h(x)$ is the output of the penultimate layer, $\mathbb{1}[\cdot]$ is the indicator function, and $(g_x)_i$ denotes the block of the embedding corresponding to class $i$. A code sketch of this computation follows the list.
- Uncertainty Quantification: The norm $\|g_x\|$ is a lower bound on the gradient norm induced by any possible label, which acts as a conservative measure of how much knowing $x$'s label could impact the model parameters. High-norm samples are deemed more "uncertain"/"informative."
- Diversity Promotion: Rather than selecting the top-k samples by norm, BADGE uses the k-means++ seeding algorithm in the space of gradient embeddings. Centers are chosen to promote coverage: each new sample is selected with probability proportional to its squared distance from the nearest previously selected batch member. Because distances in gradient space scale with gradient magnitude, this procedure constructs batches that are both uncertain (high gradient norm) and diverse (well separated in gradient space); see the seeding sketch below.
- Label Acquisition and Retraining: The selected mini-batch of samples is sent for annotation, added to the labeled set, and the model is retrained for the next iteration.
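Because the loss is cross-entropy over a softmax output, the last-layer gradient in the equation above has a closed form and needs no backward pass. The following is a minimal NumPy sketch, assuming the penultimate activations and softmax probabilities have already been extracted from the model; the array and function names are illustrative, not taken from the BADGE reference implementation.

```python
import numpy as np

def gradient_embeddings(penultimate: np.ndarray, probs: np.ndarray) -> np.ndarray:
    """Hallucinated last-layer gradient embeddings.

    penultimate : (n, d) penultimate-layer activations h(x)
    probs       : (n, K) softmax probabilities p(x)
    returns     : (n, K*d) embeddings, one (p_i - 1[y_hat = i]) * h(x) block per class
    """
    n, d = penultimate.shape
    _, K = probs.shape
    y_hat = probs.argmax(axis=1)              # hallucinated (most confident) labels
    coeff = probs.copy()
    coeff[np.arange(n), y_hat] -= 1.0         # p_i - 1[y_hat == i]
    # Per-sample outer product: (n, K, 1) * (n, 1, d) -> (n, K, d)
    emb = coeff[:, :, None] * penultimate[:, None, :]
    return emb.reshape(n, K * d)
```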
This workflow naturally encodes a trade-off between informativeness and diversity in batch selection, without requiring manual hyperparameter tuning for diversity-uncertainty weighting (Ash et al., 2019).
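The diversity-promoting selection is then k-means++ (D²) seeding over these embeddings. Below is a compact sketch under the same assumptions (NumPy, illustrative names); here the first center is taken to be the point with the largest embedding norm, one common convention, whereas textbook k-means++ samples it uniformly at random.

```python
import numpy as np

def kmeanspp_select(emb: np.ndarray, batch_size: int, rng=None) -> list[int]:
    """k-means++-style (D^2) seeding over gradient embeddings.

    emb        : (n, D) gradient embeddings
    batch_size : number of points to select for annotation
    returns    : list of selected row indices
    """
    rng = np.random.default_rng() if rng is None else rng
    n = emb.shape[0]
    # First center: point with the largest embedding norm (a common convention;
    # plain k-means++ would sample the first center uniformly at random).
    selected = [int(np.argmax(np.linalg.norm(emb, axis=1)))]
    # Squared distance of every point to its nearest selected center.
    d2 = np.sum((emb - emb[selected[0]]) ** 2, axis=1)
    while len(selected) < batch_size:
        probs = d2 / d2.sum()
        idx = int(rng.choice(n, p=probs))
        selected.append(idx)
        d2 = np.minimum(d2, np.sum((emb - emb[idx]) ** 2, axis=1))
    return selected
```

Already-selected points have zero distance to their nearest center, so they are never re-drawn; squared distances grow with gradient magnitude, which is what ties the D² sampling back to uncertainty.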
3. Mathematical Properties of Gradient Embeddings
The "gradient embedding" for input , defined in BADGE as above, confers several important properties:
- Uncertainty Encoding: The squared norm quantifies, for a hallucinated label , the lower bound on the possible gradient norm relative to any other label. The expression
reveals that the model is less certain (higher norm) for more ambiguous predictions [Proposition 1, (Ash et al., 2019)].
- Diversity in Representation: Since depends both on the penultimate representation and the class probabilities, samples with similar input representations but different predictions, or vice versa, will be separated in this space, enabling selection of non-redundant examples.
- Suitability for Clustering: The geometry of gradient embeddings allows adaptation of k-means++/other clustering algorithms, efficiently partitioning the space for scalable batch selection.
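As a quick numerical illustration of the uncertainty-encoding property above (illustrative values only, assuming a unit-norm penultimate representation), the squared embedding norm is near zero for a confident prediction and much larger for an ambiguous one:

```python
import numpy as np

h_norm_sq = 1.0  # assume a unit-norm penultimate representation for simplicity

def badge_norm_sq(p: np.ndarray) -> float:
    """||g_x||^2 = ||h(x)||^2 * sum_i (p_i - 1[y_hat = i])^2 with the hallucinated label."""
    y_hat = int(np.argmax(p))
    coeff = p.copy()
    coeff[y_hat] -= 1.0
    return h_norm_sq * float(np.sum(coeff ** 2))

print(badge_norm_sq(np.array([0.98, 0.01, 0.01])))  # confident prediction -> ~0.0006
print(badge_norm_sq(np.array([0.40, 0.35, 0.25])))  # ambiguous prediction -> ~0.545
```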
4. Empirical Performance and Scalability
BADGE has been empirically validated on a wide variety of tasks, including image classification (MNIST, CIFAR10, SVHN), text classification, and OpenML tabular datasets, and across architectures from MLPs to CNNs (ResNet, VGG). Key findings include:
- Label Efficiency: BADGE outperforms or matches classic uncertainty-only strategies (e.g., margin sampling, entropy) and diversity-only strategies (e.g., core-set selection based on feature embeddings), across a range of batch sizes (from 100 to 10,000) (Ash et al., 2019).
- Batch Size Robustness: The method is robust to batch size variations; for small batches, uncertainty dominates, while larger batches benefit from explicit diversity enforced by clustering.
- Computation: By leveraging efficient k-means++ initialization (rather than full submodular maximization or k-DPP sampling), BADGE avoids the quadratic pairwise-distance or similarity computations typically required by submodular or core-set approaches. Gradient embedding and clustering scale linearly with both the pool size and the batch size, with manageable runtime even for deep models (Zhdanov, 2019; Ash et al., 2019).
5. Relation to Alternative Approaches
BADGE distinguishes itself from other batch active learning strategies in several important respects:
| Method | Uncertainty | Diversity Mechanism | Computational Profile |
|---|---|---|---|
| BADGE | Gradient norm (hallucinated) | k-means++ on gradients | $O(nk)$ distance evaluations per round (k-means++ seeding) |
| Minimum-Margin (Jiang et al., 2019) | Minimum margin across bootstraps | Model disagreement | Requires training multiple bootstrap models; no explicit clustering |
| DPP (Bıyık et al., 2019) | Acquisition score (e.g., entropy) | DPP kernel | Greedy selection tractable; exact mode-finding substantially more expensive |
| Core-set / Facility Location | - | Submodular function on features | Pairwise distances/similarities over the pool; submodular maximization |
| Discriminative AL (Gissin et al., 2019) | Binary classification (labeled/unlabeled split) | - | Simple; diversity implicit if mini-batched |
- Compared to Minimum-Margin Active Learning: BADGE uses unsupervised clustering of gradient embeddings to achieve diversity, whereas min-margin leverages diversity induced directly by disagreement among bootstrapped models. Both balance informativeness and diversity but differ in the source of that diversity (Jiang et al., 2019).
- Compared to Determinantal Point Processes: DPP-based methods define explicit kernels incorporating informativeness and similarity, but can require approximate inference or greedy selection due to computational hardness. BADGE sidesteps this with faster clustering-based proxy objectives (Bıyık et al., 2019).
- Compared to Discriminative Active Learning: DAL frames selection as a domain adaptation problem via binary discrimination. BADGE remains expressly tied to the uncertainty/diversity view, and—unlike DAL—directly clusters in a supervised (gradient) representation rather than generic latent spaces (Gissin et al., 2019).
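For concreteness on the diversity-only baseline in the table, a k-center-greedy core-set selection can be sketched as follows; this is a minimal illustration over penultimate-layer features with assumed inputs, not the exact procedure of any of the cited papers. Each new point is the one farthest from everything already labeled or selected, with no uncertainty signal at all.

```python
import numpy as np

def kcenter_greedy(features: np.ndarray, labeled_idx: list[int], batch_size: int) -> list[int]:
    """Greedy k-center (core-set style) selection over penultimate-layer features:
    repeatedly pick the pool point farthest from all labeled/selected points."""
    # Distance of every point to its nearest already-covered point.
    d = np.full(features.shape[0], np.inf)
    for i in labeled_idx:
        d = np.minimum(d, np.linalg.norm(features - features[i], axis=1))
    selected = []
    for _ in range(batch_size):
        idx = int(np.argmax(d))  # if labeled_idx is empty, the first pick is arbitrary
        selected.append(idx)
        d = np.minimum(d, np.linalg.norm(features - features[idx], axis=1))
    return selected
```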
6. Limitations and Challenges
While BADGE offers robust improvements, certain limitations and operational caveats are acknowledged (Ash et al., 2019):
- Reliance on Hallucinated Labels: Because gradients are computed with model-predicted labels, the resulting norm is only a lower bound on the gradient norm that the true label would induce. Where the model is poorly calibrated or far from the data manifold, the embeddings may not fully capture informativeness.
- Representational Reliability: When representations in the penultimate layer are not mature (e.g., early in training or with underparameterized models), the embedding space may not accurately reflect latent diversity, potentially impacting batch quality.
- Lower-Diversity Scenarios: In low-margin regions or highly imbalanced datasets, the clustering in gradient space may yield less-than-ideal spread, particularly if input diversity is not reflected in output gradients.
7. Practical Implications and Extensions
BADGE is well-suited for real-world scenarios where batch-mode annotation is logistically desirable and cost efficiency is paramount, such as in medical imaging or human-in-the-loop labeling of large-scale datasets. The approach is inherently architecture-agnostic, requiring only access to penultimate layers and output probabilities.
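Since only penultimate activations and output probabilities are required, BADGE-style selection can be bolted onto an existing classifier without modifying it. A hedged PyTorch sketch using a forward hook is shown below; the attribute name `model.fc` (the final linear layer of a ResNet-style network) and the loader format are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_embedding_inputs(model, loader, device="cpu"):
    """Collect penultimate activations h(x) and softmax probabilities p(x) for a pool.

    Assumes `model` is already on `device`, `model.fc` is its final linear layer,
    and `loader` yields (inputs, ...) tuples.
    """
    feats = []
    # The hook receives the *inputs* to the final linear layer, i.e. h(x).
    hook = model.fc.register_forward_hook(
        lambda module, inputs, output: feats.append(inputs[0].detach().cpu())
    )
    probs = []
    model.eval()
    for x, *_ in loader:
        logits = model(x.to(device))
        probs.append(F.softmax(logits, dim=1).cpu())
    hook.remove()
    return torch.cat(feats), torch.cat(probs)
```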
Recent research has further adapted BADGE to regression via kernel approximations (e.g., predictive covariance kernels for black-box models (Kirsch, 2023)), to structured tasks in NLP (with weighted clustering and length normalization (Kim, 2020)), and to settings allowing for adaptive batch sizes via probabilistic numerics (Adachi et al., 2023).
A plausible implication is that expanding BADGE with richer representational or model-derived information (such as derivative information in Gaussian process regression, or calibration of pseudolabels (Venkatesh et al., 2020)) can yield substantive gains in both informativeness and batch diversity, suggesting avenues for further hybrid or principled extensions.
References
- (Ash et al., 2019) Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds
- (Zhdanov, 2019) Diverse mini-batch Active Learning
- (Jiang et al., 2019) Minimum-Margin Active Learning
- (Bıyık et al., 2019) Batch Active Learning Using Determinantal Point Processes
- (Gissin et al., 2019) Discriminative Active Learning
- (Kirsch, 2023) Black-Box Batch Active Learning for Regression
- (Adachi et al., 2023) Adaptive Batch Sizes for Active Learning: A Probabilistic Numerics Approach
- (Kim, 2020) Deep Active Learning for Sequence Labeling Based on Diversity and Uncertainty in Gradient
- (Venkatesh et al., 2020) Ask-n-Learn: Active Learning via Reliable Gradient Representations for Image Classification
- (Yu et al., 2024) Batch Active Learning in Gaussian Process Regression using Derivatives