Sequential and Batch Active Learning
- Sequential and batch active learning are methodologies that select unlabeled data to maximize model performance while reducing annotation costs.
- Sequential active learning updates models with one label at a time for high information gain, whereas batch methods leverage parallelism to improve throughput.
- Batch techniques use strategies like DPP, submodular optimization, and clustering to ensure diversity and mitigate redundancy in query selection.
Sequential and batch active learning are methodologies for selecting unlabeled data points to be labeled so as to maximize the performance of a statistical model with minimal annotation effort. Both paradigms underpin the design of annotation pipelines for machine learning where label acquisition is expensive, with the sequential regime emphasizing maximal information per label and batch approaches focusing on parallelism and runtime efficiency. Advances in batch active learning have produced diverse algorithmic and theoretical frameworks—most notably determinantal point processes, submodular optimization, clustering-based heuristics, and generative flow networks—that address both informativeness and diversity in batch query selection. The following sections synthesize principled methods, theoretical and empirical tradeoffs, and contemporary developments in both sequential and batch modes.
1. Fundamental Paradigms and Definitions
Sequential active learning selects a single unlabeled point at each round, queries its label, updates the model, and continues. The classic strategy is to maximize an informativeness criterion, exemplified by uncertainty sampling (max-entropy, margin-based) or expected error reduction, querying points that are expected to maximize the model's information gain per label (Azimi et al., 2012). Each query can depend fully on the up-to-date model and previously acquired labels.
Batch active learning generalizes this by selecting a batch of $b$ points to be labeled in parallel per round. The objective for batch selection may combine informativeness and diversity, addressing the redundancy that can arise from naively choosing the top-$b$ points by informativeness (Zhdanov, 2019). Definitions for batch selection often take the form:

$$B^* = \arg\max_{B \subseteq \mathcal{U},\, |B| = b} f(B),$$

where $f$ could represent joint information gain, diversity-penalized uncertainty, or the determinant of a kernel matrix for DPP-based models (Bıyık et al., 2019).
Key dimensions:
- Sequential: fully adaptive and information-optimal per label, but expensive in wall-clock time and compute (the model is retrained after every label) and inherently serial.
- Batch: reduced wall-clock time through parallelism and fewer retrains per label, but at risk of within-batch redundancy, so batch-aware diversity constraints are required.
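The contrast between the two regimes can be sketched on a toy pool with a simple margin-based learner. Everything here is illustrative: the Gaussian-blob data, the ridge least-squares classifier, and the batch size are assumptions for the sketch, not a method from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pool: two Gaussian blobs in 2-D (hypothetical data for illustration).
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

def fit(Xl, yl):
    """Ridge least-squares classifier: w = (X'X + I)^{-1} X'(2y - 1)."""
    Xb = np.hstack([Xl, np.ones((len(Xl), 1))])          # append bias column
    return np.linalg.solve(Xb.T @ Xb + np.eye(3), Xb.T @ (2 * yl - 1))

def margin(w, Xp):
    """Absolute score |x.w|: small values = near the boundary = uncertain."""
    Xb = np.hstack([Xp, np.ones((len(Xp), 1))])
    return np.abs(Xb @ w)

labeled = [int(i) for i in rng.choice(len(X), 4, replace=False)]  # seed set
pool = [i for i in range(len(X)) if i not in labeled]

# Sequential regime: one query, then a full retrain, per round.
for _ in range(5):
    w = fit(X[labeled], y[labeled])
    i = pool[int(np.argmin(margin(w, X[pool])))]         # most uncertain point
    labeled.append(i); pool.remove(i)

# Batch regime: b queries per retrain (naive top-b, no diversity control).
b = 5
w = fit(X[labeled], y[labeled])
order = np.argsort(margin(w, X[pool]))                   # ascending margin
batch = [pool[j] for j in order[:b]]
labeled += batch
pool = [i for i in pool if i not in batch]

print(len(labeled))  # 4 seed + 5 sequential + 5 batch = 14
```

The naive top-$b$ step at the end is exactly where redundancy enters: the five smallest-margin points often sit next to each other, which is what the diversity mechanisms of Section 2 are designed to counteract.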
2. Algorithmic Frameworks for Batch Active Learning
Determinantal Point Processes (DPP):
DPPs offer a probabilistic framework for selecting diverse and informative batches by assigning each $b$-subset a probability proportional to the determinant of a positive semidefinite kernel matrix $L$. Quality and diversity are encoded via:
- $q_i$: informativeness score of point $i$ (e.g., model uncertainty)
- $x_i$: embedding vector used for similarity
- kernel hyperparameters, which jointly control the quality-diversity trade-off

The batch query set is selected as

$$B^* = \arg\max_{|B| = b} \det(L_B), \qquad L_{ij} = q_i \, k(x_i, x_j) \, q_j,$$

with $k$ a kernel-based similarity (e.g., a Gaussian kernel) (Bıyık et al., 2019).
Efficient approximations include:
- Greedy determinant maximization
- Convex relaxation followed by coordinate rounding

Both come with provable multiplicative approximation guarantees (Bıyık et al., 2019).
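A minimal sketch of the quality-diversity construction and greedy log-determinant maximization, assuming a Gaussian similarity kernel with synthetic embeddings and scores (the kernel form and all parameter values here are illustrative, not those of the cited paper):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))              # candidate embeddings (toy)
q = rng.uniform(0.5, 1.5, size=50)        # informativeness scores (toy)

# Quality-diversity L-ensemble: L_ij = q_i * k(x_i, x_j) * q_j,
# with a Gaussian similarity kernel; gamma sets the diversity bandwidth.
gamma = 1.0
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
L = q[:, None] * np.exp(-gamma * sq) * q[None, :]

def greedy_map_dpp(L, b):
    """Greedy MAP: at each step add the item giving the largest
    log-determinant of the selected principal submatrix."""
    selected = []
    for _ in range(b):
        best, best_val = None, -np.inf
        for i in range(len(L)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best, best_val = i, logdet
        selected.append(best)
    return selected

batch = greedy_map_dpp(L, b=5)
print(batch)
```

Because $\det(L_B)$ shrinks toward zero when two selected rows are nearly parallel (i.e., two points are similar), the greedy step automatically trades a slightly less uncertain point for a more diverse one.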
Submodular and Matching-based Methods:
Submodular functions naturally capture diminishing returns for information or diversity. For example, the bounded coordinated matching (BCM) framework leverages simulations of a sequential policy to guide batch selection: samples of possible $b$-step sequential paths are aggregated to build a batch matching distribution over $b$-sets, with the batch set chosen by minimizing the summed matching cost to these simulated $b$-sets, a supermodular minimization problem (Azimi et al., 2012).
Greedy and Clustering-based Batch Selection:
Weighted $k$-means (or $k$-medoids) clustering in feature or embedding space is an efficient heuristic. By clustering the most uncertain points, batch selection picks the cluster centers or their nearest pool representatives, ensuring diversity while focusing on informative regions (Zhdanov, 2019). In large-scale scenarios, hierarchical agglomerative clustering (HAC) is leveraged within cluster-margin or core-set selection frameworks, allowing batch sizes of up to hundreds of thousands of points per round (Citovsky et al., 2021).
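A self-contained sketch of the pre-filter-then-cluster heuristic in the spirit of Zhdanov (2019), with a hand-rolled weighted $k$-means; the pool, the uncertainty scores, and the pre-filter factor `beta` are all hypothetical:

```python
import numpy as np

def weighted_kmeans(X, k, w, iters=20, seed=0):
    """Weighted k-means (Lloyd's algorithm) on rows of X with weights w."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]           # initial centers
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        a = d.argmin(1)                                   # assignments
        for j in range(k):
            m = a == j
            if m.any():                                   # skip empty clusters
                C[j] = np.average(X[m], axis=0, weights=w[m])
    return C

# Toy pool with per-point uncertainty scores (hypothetical values).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
uncertainty = rng.uniform(size=200)

# Pre-filter: cluster only the beta*b most uncertain points, then query
# the pool point nearest each uncertainty-weighted cluster center.
b, beta = 10, 5
top = np.argsort(-uncertainty)[: beta * b]
C = weighted_kmeans(X[top], b, uncertainty[top])
batch = [int(top[((X[top] - c) ** 2).sum(1).argmin()]) for c in C]
print(sorted(set(batch)))
```

The `beta` pre-filter keeps the batch informative (only uncertain points are clustered) while the clustering step keeps it diverse (one representative per region).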
Stochastic Generative Policies:
Batch selection via generative flow networks (GFN) uses learned policies to directly sample size-$b$ batches in proportion to a batch-level reward, typically the joint mutual information between batch labels and model parameters. This compositional generation avoids explicit combinatorial search and supports amortized training via lookahead updates (Malik et al., 2023).
Adaptive Batch Size Mechanisms:
Probabilistic numerics (PN) reframes batch selection as kernel-quadrature (integration) over the acquisition function, using precision constraints on the integration error to adaptively determine the required batch size per round. This enables a principled trade-off between exploration (larger batches) and exploitation (smaller batches), without fixing batch size a priori (Adachi et al., 2023).
3. Informativeness, Diversity, and Redundancy Mitigation
A central challenge in batch active learning is balancing informativeness and diversity to avoid batch redundancy:
- Informativeness is typically measured using predictive uncertainty (entropy, margin), expected model change, or information gain.
- Diversity is enforced by penalizing within-batch similarity (distance in embedding space, kernel repulsion in DPP). DPPs natively realize this by suppressing batches of mutually similar points (Bıyık et al., 2019). Greedy facility-location, $k$-means heuristics, and explicit diversity penalties are employed to similar effect (Zhdanov, 2019, Citovsky et al., 2021).
- Redundancy in batch—arising from selecting similar points—directly reduces information gain per label, especially for large batch sizes. Methods such as successive elimination, boundary medoids, and core-set strategies are deployed to enhance batch-level coverage and mitigate information overlap (Bıyık et al., 2018, Chapman et al., 2023).
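Core-set style coverage can be illustrated with greedy $k$-center selection, which repeatedly queries the pool point farthest from everything already labeled or selected; the data and the squared-Euclidean distance are illustrative assumptions:

```python
import numpy as np

def k_center_greedy(X, labeled_idx, b):
    """Core-set style selection: repeatedly add the point farthest from
    the current selected set, maximizing batch-level coverage."""
    # d[i] = min squared distance from point i to any selected point
    d = np.full(len(X), np.inf)
    for i in labeled_idx:
        d = np.minimum(d, ((X - X[i]) ** 2).sum(1))
    batch = []
    for _ in range(b):
        i = int(np.argmax(d))          # farthest point = largest coverage gap
        batch.append(i)
        d = np.minimum(d, ((X - X[i]) ** 2).sum(1))   # its own d drops to 0
    return batch

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
batch = k_center_greedy(X, labeled_idx=[0, 1], b=8)
print(batch)
```

Since a selected point's distance to the set becomes zero, the rule can never pick near-duplicates, which is precisely the redundancy-mitigation property discussed above.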
Implicit mechanisms, such as using submodular or mutual information criteria, further ensure that the marginal gain of correlated points is minimized (Li et al., 2022).
4. Empirical and Theoretical Comparison: Sequential vs. Batch
Empirical evaluations consistently show nuanced trade-offs:
- Sequential active learning achieves the highest label efficiency (fewest queries to reach a target accuracy) but suffers from high retraining and wall-clock costs (Bıyık et al., 2019, Azimi et al., 2012, Pessemier et al., 2022).
- Batch active learning, when equipped with effective diversity mechanisms (DPPs, matching, clustering), can approach or match the performance of sequential methods in terms of downstream accuracy, while providing substantial gains in annotation throughput (Citovsky et al., 2021, Chapman et al., 2023).
- Parallelization gains dominate when annotation time, not computation, is the primary bottleneck.
- In domains with strong label adaptivity (e.g., dense recommender data), sequential modes retain an accuracy edge due to immediate model updates. In sparse or low-information settings, batch and sequential modes converge in performance (Pessemier et al., 2022).
Theoretical guarantees are available for various classes of batch algorithms:
- Submodular maximization (greedy) yields a $(1 - 1/e)$ approximation (Li et al., 2022)
- Supermodular minimization (reverse greedy) provides class-optimality under steepness conditions (Azimi et al., 2012)
- DPP-based batch selection admits multiplicative approximation guarantees for both the greedy and the convex-relaxation solvers (Bıyık et al., 2019)
- In linear separator settings, diversity-aware batch samplers can achieve provably improved label complexity over classical bounds on $d$-dimensional product spaces (Citovsky et al., 2021)
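The greedy submodular guarantee above is the classical Nemhauser, Wolsey, and Fisher bound: for a monotone nondecreasing submodular set function $f$ with $f(\emptyset) = 0$ under a cardinality constraint $b$,

```latex
f(B_{\text{greedy}}) \;\ge\; \left(1 - \frac{1}{e}\right) \max_{|B| \le b} f(B),
```

which follows from diminishing returns: each greedy step closes at least a $1/b$ fraction of the remaining gap to the optimum, and $(1 - 1/b)^b \le 1/e$.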
5. Domain-Specific Innovations and Extensions
Multi-fidelity and Budget-Constrained Batch AL:
Batch selection under budget and fidelity constraints optimizes the mutual information between acquired data and the high-fidelity target variable, using a weighted greedy strategy to maximize information per unit cost. This extends the batch AL paradigm to settings such as physics-informed machine learning or engineering design, where query costs and correlations are nonuniform (Li et al., 2022).
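A toy sketch of benefit-per-cost greedy selection under a budget; the gains, costs, decay factor, and budget are hypothetical stand-ins for marginal mutual information and fidelity-dependent query costs, and the multiplicative decay is a crude proxy for the redundancy of correlated queries:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
gain = rng.uniform(0.1, 1.0, n)     # stand-in for marginal MI of each query
cost = rng.uniform(1.0, 5.0, n)     # per-query (fidelity-dependent) cost

def cost_weighted_greedy(gain, cost, budget, decay=0.9):
    """Weighted greedy: repeatedly pick argmax gain_i / cost_i while the
    budget allows. After each pick, remaining gains shrink by `decay`
    (a crude stand-in for diminishing returns; hypothetical)."""
    gain = gain.copy()
    chosen, spent = [], 0.0
    mask = np.zeros(len(gain), dtype=bool)       # True = already chosen
    while True:
        ratio = np.where(mask, -np.inf, gain / cost)
        i = int(np.argmax(ratio))
        # Stop when everything is chosen or the best-ratio item is unaffordable.
        if ratio[i] == -np.inf or spent + cost[i] > budget:
            break
        chosen.append(i)
        mask[i] = True
        spent += cost[i]
        gain *= decay
    return chosen, spent

batch, spent = cost_weighted_greedy(gain, cost, budget=10.0)
print(len(batch), round(spent, 2))
```

Selecting by the ratio rather than by raw gain is what makes the strategy budget-aware: a cheap low-fidelity query can beat an expensive high-fidelity one whenever its information per unit cost is higher.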
Preference-Based and User-Adaptive Batch AL:
In human-in-the-loop reward function learning and cold-start recommendation, batch active learning is reinterpreted as querying trajectory or item sets in parallel, balancing human throughput with information gain. Batch strategies leveraging joint entropy and diverse feature set selection (e.g., medoid, boundary medoid, successive elimination) enable theoretically controlled overheads in query count, with wall-time reductions proportional to batch size (Bıyık et al., 2018, Pessemier et al., 2022).
Large-Scale, Graph, and Application-Driven AL:
Batch techniques are tightly integrated with graph convolutional architectures, transfer learning embeddings, and core-set selection for application domains such as remote sensing, medical imaging, and computer vision (Chapman et al., 2023, Caramalau et al., 2020). Graph-based selection (e.g., LocalMax, DAC) exploits metric structure for efficient batch query covering and diversity.
Adaptive and Time-Varying Settings:
Sequentially adaptive frameworks with bounded-drift MLE targets design batch sizes and importance-sampling distributions on the fly, guaranteeing uniform excess risk bounds over unconstrained time horizons (Bu et al., 2018). Probabilistic numerics-driven methods dynamically reduce batch sizes as exploitation dominates late-stage learning (Adachi et al., 2023).
6. Practical Trade-offs and Method Selection Guidelines
A distillation of practical implications across the literature:
- Batch size selection: Larger batches speed up annotation and reduce retraining frequency, but increase risk of query redundancy; batch size can be adaptively reduced as the model approaches convergence (Adachi et al., 2023, Bu et al., 2018).
- Choice of method: For settings where annotation latency dominates, batch methods with DPP, matching, or clustering-based diversity are justified. At small data or when label budgets are extremely tight, sequential methods may be preferred (Bıyık et al., 2019, Bıyık et al., 2018).
- Computational considerations: Batch AL methods scale to massive pools via single-stage clustering or approximate search, offering batch selection on pools of millions of points with very large batch sizes in realistic runtimes (Citovsky et al., 2021, Malik et al., 2023). Core-set and submodular-based methods may become impractical at such scale unless combined with filtering or approximation heuristics.
- Label allocation: In cost-sensitive, multi-fidelity, or constrained-algorithmic settings, batch mutual information per unit-cost delivers strong benefit/cost efficiency and natural batch-level diversification (Li et al., 2022).
7. Open Questions and Future Directions
- Theory-practice gap: Sharp theoretical guarantees exist for special batch objective classes (submodular, DPP), but real-world batch selection often uses more heuristic strategies; analysis of embedding-based or domain-adaptive diversity criteria remains open.
- Dynamic and nonstationary environments: While adaptation to changing targets via online drift estimation is established, few batch methods natively account for rapid concept drift or heterogeneous pools (Bu et al., 2018, Adachi et al., 2023).
- Human-in-the-loop optimization: For scenarios with explicit human annotation bottlenecks, batch methods with empirical or provable information-retention properties per unit human time are especially relevant, but practical tradeoffs between wall-time and total query count require further investigation (Bıyık et al., 2018, Chapman et al., 2023).
- Extending batch AL to structured, graph-based, or multimodal domains: Scalability, diversity, and informativeness remain challenging, particularly as the space of possible queries becomes combinatorially structured (Caramalau et al., 2020).
In conclusion, both sequential and batch active learning are supported by a mature algorithmic ecosystem. Sequential policies retain maximal adaptivity and label efficiency, often at the cost of throughput and computation. Batch approaches, when equipped with principled diversity control—via DPPs, submodular optimization, matching, or robust clustering—can closely approach or even match sequential performance while offering substantial gains in parallel query acquisition, human throughput, and computational efficiency in domains where wall-clock, retraining, or cost constraints are predominant (Bıyık et al., 2019, Azimi et al., 2012, Malik et al., 2023, Citovsky et al., 2021).