Bandit-Based Cluster Sampling
- Bandit-based cluster sampling is a framework that employs multi-armed bandit feedback to adaptively explore and recover latent clustering structures in sequential observation environments.
- It leverages sequential probing with GLR tests and LUCB-style confidence bounds, guided by information-theoretic lower bounds dependent on cluster separations and sizes.
- Applications span online clustering, federated best-arm identification, graph signal recovery, and robotic motion planning, achieving notable sample savings.
Bandit-based cluster sampling encompasses a class of algorithms and theoretical tools for efficient, adaptive identification of group structure in sequential observation environments, where data sources can only be probed one noisy observation at a time, as in multi-armed bandit (MAB) feedback. The paradigm appears across online clustering, sublinear graph signal recovery, pure-exploration federated learning, reward-dependent planning, and active unsupervised learning; the core methodological challenge is to minimize the sample or communication budget required to discover latent partitions while satisfying high-probability correctness constraints.
1. Formal Frameworks: Bandit-Based Clustering Paradigms
Bandit-based cluster sampling formulations are typically characterized by:
- Unknown Latent Partition: A set of arms/items/sources, each associated with a (possibly high-dimensional) mean parameter, are grouped into disjoint or overlapping clusters, determined either by equality of means or proximity under a distance metric (Thuot et al., 2024, Yang et al., 2022, Chandran et al., 20 Jan 2025).
- Sequential, Adaptive Probing: At each round, the learner selects a source (and possibly sub-source or feature), obtains a noisy observation, and adaptively updates sampling strategies.
- Fixed-Confidence Objectives: The learner must, with probability at least $1-\delta$ (“$\delta$-PAC”), output the true clustering (up to label permutation) or cluster-specific structure (e.g., best arm per cluster) using the fewest samples/arm-pulls (Yang et al., 2022, Yash et al., 15 May 2025, Thuot et al., 2024).
- Information-Theoretic Lower Bounds: The minimal achievable sample complexity is governed by intrinsic cluster separations, dimensionality, and minimal cluster size parameters.
Three canonical models are prominent:
- Bandit Feedback Clustering: Arms indexed by $i \in [n]$, with means $\mu_i \in \mathbb{R}^d$ and an unknown partition into $K$ clusters; pulling arm $i$ yields a noisy observation $X_t = \mu_i + \eta_t$ (Yang et al., 2022, Chandran et al., 20 Jan 2025, Thuot et al., 2024).
- Federated Clustered Bandits: $M$ agents assigned to clusters, each cluster facing its own i.i.d. bandit problem; the assignment is unknown to the learner, and the goal is best-arm identification for every agent (Yash et al., 15 May 2025).
- Feature-Selective Bandit Clustering: Items with feature vectors in $\mathbb{R}^d$, partitioned by prototype equality; at each step the learner chooses an item and a feature to probe, and the aim is to identify the partition with minimal queries (Graf et al., 14 Mar 2025).
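The bandit-feedback clustering model above can be sketched as a small environment (a minimal illustration; the Gaussian noise model and all names here are assumptions, not a specific paper's API):

```python
import numpy as np

class BanditClusterEnv:
    """n arms with d-dimensional means; pulling an arm returns mean + Gaussian noise.

    The latent partition (labels) is hidden from the learner, which only
    observes noisy pulls, one arm at a time.
    """
    def __init__(self, means, labels, sigma=1.0, seed=0):
        self.means = np.asarray(means, dtype=float)   # shape (n, d)
        self.labels = np.asarray(labels)              # latent partition
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.pulls = 0

    def pull(self, arm):
        self.pulls += 1
        return self.means[arm] + self.sigma * self.rng.standard_normal(self.means.shape[1])

# two clusters of arms separated by Delta = 2 in the first coordinate
means = np.array([[0.0, 0.0], [0.0, 0.1], [2.0, 0.0], [2.0, -0.1]])
env = BanditClusterEnv(means, labels=[0, 0, 1, 1], sigma=0.5, seed=1)
obs = env.pull(2)
```

A learner interacts only through `pull`, which makes the sample count (`env.pulls`) the natural complexity measure.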
2. Information-Theoretic Lower Bounds on Sample Complexity
Distinct lower bounds govern the sample budget for bandit-based cluster sampling, typically derived via change-of-measure arguments and minimax rates over covering alternatives:
- Separation- and Cluster-Size-Dependent Bounds: For $n$ arms, $K$ clusters, minimal inter-cluster center separation $\Delta$, and smallest cluster fraction $\gamma$, the expected sample complexity for discovering the partition with error probability at most $\delta$ scales (up to logarithmic factors in $1/\delta$, with additional dependence on $1/\gamma$) as (Thuot et al., 2024, Yang et al., 2022):
$$\mathbb{E}[\tau_\delta] \;=\; \tilde{\Theta}\!\left(\frac{n}{\Delta^2} \;+\; \frac{n\sqrt{d}}{\Delta^4}\right).$$
The first term corresponds to the one-dimensional best-arm budget; the second accounts for high-dimensional discrimination.
- Alternative-Based Characterization: For a general mean matrix $\mu$, sampling proportions $w$ in the simplex $\Sigma_n$, and alternative configurations $\lambda \in \mathrm{Alt}(\mu)$ violating the true clustering, the lower bound is (Yang et al., 2022, Chandran et al., 20 Jan 2025):
$$\mathbb{E}[\tau_\delta] \;\ge\; T^*(\mu)\,\log\frac{1}{2.4\,\delta}, \qquad \big(T^*(\mu)\big)^{-1} \;=\; \sup_{w \in \Sigma_n}\; \inf_{\lambda \in \mathrm{Alt}(\mu)}\; \sum_{i=1}^{n} w_i\,\frac{\|\mu_i - \lambda_i\|^2}{2\sigma^2}.$$
- Federated Clustered Bandit Lower Bounds: Pure-exploration best-arm identification with unknown agent-to-bandit assignments incurs a lower bound of order
$$\Omega\!\left(\frac{M}{\Delta^2}\,\log\frac{1}{\delta}\right),$$
where $M$ is the number of agents and $\Delta$ is a typical arm-gap parameter (Yash et al., 15 May 2025).
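For intuition about the alternative-based characterization, the inner infimum admits a closed form when the alternative merges a single pair of arms $i$ and $j$ from different clusters, a standard Gaussian transportation-cost computation:
$$\inf_{\lambda:\,\lambda_i=\lambda_j}\;\left[\, w_i\,\frac{\|\mu_i-\lambda_i\|^2}{2\sigma^2} + w_j\,\frac{\|\mu_j-\lambda_j\|^2}{2\sigma^2} \,\right] \;=\; \frac{w_i\,w_j}{w_i+w_j}\,\frac{\|\mu_i-\mu_j\|^2}{2\sigma^2},$$
attained at $\lambda_i=\lambda_j=(w_i\mu_i+w_j\mu_j)/(w_i+w_j)$. The most confusable pair of arms therefore governs the optimal sampling proportions.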
3. Principal Algorithmic Approaches
A diversity of algorithmic blueprints has emerged, exploiting both fixed cluster structures and adaptive, feedback-driven strategies.
3.1. Adaptive Cluster-Splitting via GLR/Test Statistics
- BOC/ATBOC: These algorithms maintain plug-in or confidence-based cluster estimates, repeatedly select most-informative arms using convex minimax optimization, and stop with a generalized likelihood ratio (GLR) criterion when discriminatory evidence surpasses an explicit threshold (Yang et al., 2022, Chandran et al., 20 Jan 2025).
- Average-Tracking and D-Tracking: Adaptive sampling ensures the per-arm sample allocation converges to the information-theoretic optimum for distinguishing the true partition from worst-case alternatives (Yang et al., 2022, Chandran et al., 20 Jan 2025).
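The GLR stopping criterion can be sketched for a single pair of Gaussian arms (a minimal illustration assuming known noise level $\sigma$; the threshold schedule `beta` is a common $\log((1+\log t)/\delta)$-style choice and is not the calibrated threshold of the cited papers):

```python
import numpy as np

def glr_separated(sum_i, n_i, sum_j, n_j, sigma, threshold):
    """GLR statistic for H0: mu_i == mu_j under Gaussian noise with known sigma.

    sum_i / sum_j are running sums of observations, n_i / n_j pull counts.
    Returns (statistic, decision); decision True => evidence the arms differ.
    """
    mu_i, mu_j = sum_i / n_i, sum_j / n_j
    stat = (n_i * n_j) / (n_i + n_j) * np.sum((mu_i - mu_j) ** 2) / (2 * sigma ** 2)
    return stat, stat > threshold

def beta(t, delta):
    # illustrative threshold schedule; real analyses use tighter calibrations
    return np.log((1 + np.log(max(t, 2))) / delta)

rng = np.random.default_rng(0)
sigma, n = 0.5, 200
a = rng.normal(0.0, sigma, size=(n, 2))          # arm with mean (0, 0)
b = rng.normal([2.0, 0.0], sigma, size=(n, 2))   # arm with mean (2, 0)
stat, differ = glr_separated(a.sum(0), n, b.sum(0), n, sigma, beta(2 * n, 0.05))
```

In a full algorithm this check runs over all pairs, and sampling is tracked toward the allocation that maximizes the smallest such statistic.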
3.2. Confidence Bound and LUCB-Style Methods
- LUCBBOC: Exploits LUCB-style lower and upper confidence bounds for pairwise mean differences, guiding sampling toward near-ambiguous arm pairs and critical inter/intra-cluster edges. This approach avoids costly global optimization at each step while maintaining -PAC recovery (Chandran et al., 20 Jan 2025).
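The pairwise confidence-bound test that drives LUCB-style sampling can be sketched for scalar Gaussian arms (a hedged illustration; the union-bound bookkeeping over pairs and rounds that a full analysis requires is omitted):

```python
import numpy as np

def pairwise_interval(mean_i, n_i, mean_j, n_j, sigma, delta):
    """Confidence interval on the scalar gap mu_i - mu_j (Gaussian tail bound).

    Width combines both arms' uncertainties; in a full algorithm delta is
    shared across all pairs and rounds via a union bound.
    """
    width = sigma * np.sqrt(2 * np.log(2 / delta)) * np.sqrt(1 / n_i + 1 / n_j)
    gap = mean_i - mean_j
    return gap - width, gap + width

lo, hi = pairwise_interval(2.0, 100, 0.1, 100, sigma=0.5, delta=0.01)
separated = lo > 0 or hi < 0   # interval excludes zero -> different clusters
```

Sampling is then directed at the pair whose interval is closest to straddling zero, i.e., the most ambiguous inter/intra-cluster edge.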
3.3. Federated and Multi-Agent Clustering Protocols
- Cl-BAI / BAI-Cl: Combine cluster-discovery and best-arm identification via successive elimination. Cl-BAI clusters agents by similarity in empirical means, then identifies cluster-specific best arms; BAI-Cl reverses this sequence for improved communication efficiency when the number of clusters is small (Yash et al., 15 May 2025).
- Instance-Optimal BAI-Cl++: Achieves minimax-optimality (up to polylogarithmic factors) in sample complexity in the regime where the number of agents grows large while the number of clusters remains constant.
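Both Cl-BAI and BAI-Cl build on successive elimination as their core subroutine. A minimal single-agent sketch (not the federated protocol itself; the confidence-radius schedule is an illustrative standard choice):

```python
import numpy as np

def successive_elimination(pull, k, sigma, delta, max_rounds=10_000):
    """Return the index of the best arm with probability >= 1 - delta.

    pull(arm) yields one noisy reward; an arm is eliminated once its UCB
    falls below the current leader's LCB.
    """
    active = list(range(k))
    sums = np.zeros(k)
    t = 0
    while len(active) > 1 and t < max_rounds:
        t += 1
        for a in active:
            sums[a] += pull(a)
        means = sums[active] / t
        radius = sigma * np.sqrt(2 * np.log(4 * k * t * t / delta) / t)
        best = means.max()
        active = [a for a, m in zip(active, means) if m + radius >= best - radius]
    return active[0]

rng = np.random.default_rng(3)
mu = np.array([0.2, 0.5, 0.9])
best = successive_elimination(lambda a: rng.normal(mu[a], 0.3), 3, 0.3, 0.05)
```

In the federated variants the same elimination logic runs per cluster, with empirical means exchanged between agents to amortize exploration.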
3.4. Feature-Selective and Sparsity-Exploiting Methods
- BanditClustering: Leverages sequential halving to adaptively discover discriminative features and outlier representatives, resulting in algorithms with worst-case optimality up to polylogarithmic terms in settings with sparse cluster-inducing features (Graf et al., 14 Mar 2025).
- ACB: Employs sequential search for cluster representatives using unbiased squared-distance tests, then assigns remaining items by adapted nearest-center rules, closing the batch-vs-active computation gap (Thuot et al., 2024).
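The representative-search step rests on an unbiased squared-distance test: splitting each item's pulls into two independent halves makes the cross product of the two difference vectors an unbiased estimate of $\|\mu_i-\mu_j\|^2$, cancelling the noise-variance bias of the naive plug-in $\|\bar{x}_i-\bar{x}_j\|^2$. A minimal sketch (the sample sizes and the half-split are illustrative):

```python
import numpy as np

def sq_dist_estimate(xs_i, xs_j):
    """Unbiased estimate of ||mu_i - mu_j||^2 from paired independent pulls.

    Splits each item's samples into two halves; because the two difference
    vectors are independent, their dot product has expectation exactly
    ||mu_i - mu_j||^2, with no additive noise-variance term.
    """
    half = len(xs_i) // 2
    d1 = xs_i[:half].mean(axis=0) - xs_j[:half].mean(axis=0)
    d2 = xs_i[half:2 * half].mean(axis=0) - xs_j[half:2 * half].mean(axis=0)
    return float(d1 @ d2)

rng = np.random.default_rng(0)
mu_i, mu_j, sigma = np.zeros(20), np.full(20, 0.5), 1.0
xs_i = rng.normal(mu_i, sigma, size=(400, 20))
xs_j = rng.normal(mu_j, sigma, size=(400, 20))
est = sq_dist_estimate(xs_i, xs_j)   # true value: 20 * 0.25 = 5.0
```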
3.5. Clustered Bandit Regret Minimization
- Clus-UCB: Integrates cluster structure into KL-UCB indices via intra-cluster pooling, matching information-theoretic regret lower bounds under known cluster assignments and widths (Gore et al., 4 Aug 2025).
- Multi-Level (Hierarchical) Thompson Sampling: Implements a layered TS architecture, sampling at cluster and arm levels, yielding regret bounds that shrink with cluster quality and strong dominance (Carlsson et al., 2021).
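The layered architecture can be sketched as a two-level posterior-sampling step (Gaussian posteriors with a unit-variance prior are an illustrative choice here, not the cited construction):

```python
import numpy as np

def two_level_ts_step(rng, cluster_stats, arm_stats, clusters, sigma=1.0):
    """One round of hierarchical Thompson sampling.

    cluster_stats / arm_stats map ids to (sum_of_rewards, pull_count);
    posteriors are Gaussian, shrunk toward a zero-mean unit-variance prior.
    """
    def posterior_draw(sum_r, n):
        mean = sum_r / (n + 1)              # prior acts as one pseudo-pull at 0
        var = sigma ** 2 / (n + 1)
        return rng.normal(mean, np.sqrt(var))

    # level 1: pick the cluster with the highest posterior sample
    c = max(clusters, key=lambda c: posterior_draw(*cluster_stats[c]))
    # level 2: pick an arm inside that cluster the same way
    a = max(clusters[c], key=lambda a: posterior_draw(*arm_stats[a]))
    return c, a

rng = np.random.default_rng(0)
clusters = {0: [0, 1], 1: [2, 3]}
cluster_stats = {0: (5.0, 10), 1: (30.0, 10)}   # cluster 1 looks much better
arm_stats = {0: (2.0, 5), 1: (3.0, 5), 2: (12.0, 5), 3: (18.0, 5)}
c, a = two_level_ts_step(rng, cluster_stats, arm_stats, clusters)
```

When clusters are well separated, the level-1 draw concentrates quickly, so exploration cost is paid mostly inside the best cluster.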
A summary table of major algorithmic families:
| Algorithm | Model/Setting | Asymptotic Guarantee |
|---|---|---|
| BOC/ATBOC | Online clustering (Gaussian arms) | Tight sample bound |
| LUCBBOC | Same as above (LUCB-style) | Sample complexity within 2x of lower bound |
| Cl-BAI/BAI-Cl | Federated (multi-agent) clustering | Minimax-optimal up to polylog factors |
| BanditClustering | Feature-selective item clustering | Optimal up to log-factors |
| ACB | Active $K$-means with bandit feedback | Matches lower bound |
| Clus-UCB, TSC | Bandits with known clusters | Asymptotic regret optimality |
4. Impact of Structure, Dimensionality, and Clustering Objectives
- Intra-Cluster Variation: Algorithms such as ATBOC and LUCBBOC generalize beyond the strictly homogeneous cluster model (i.e., arm means need not be identical within each cluster) and provide fixed-confidence recovery in general multidimensional contexts (Chandran et al., 20 Jan 2025).
- Dimension-Dependent Regimes: A key transition occurs when the dimension grows large relative to the separation (roughly when the $\sqrt{d}/\Delta^4$ term overtakes the $1/\Delta^2$ term), shifting the dominant term in the minimax lower bound from one determined by sample-based thresholding to one governed by high-dimensional geometry (Thuot et al., 2024).
- Sparsity and “Right Feature” Considerations: In high-dimensional clustering with structured differences, e.g., only a small subset of features separating clusters, adaptive feature selection critically improves efficiency (Graf et al., 14 Mar 2025).
5. Applications and Empirical Results
Bandit-based cluster sampling underpins several practical domains:
- Adaptive Sampling for Graph Signals: MAB-based sampling policies, trained via gradient ascent, robustly outperform uniform or random-walk strategies in recovering piecewise-constant signals on graphs (Abramenko et al., 2018).
- Online Market Segmentation and Virus Variant Discovery: Adaptive querying uncovers latent user segments or viral strains using orders-of-magnitude fewer observations than uniform sampling (Yang et al., 2022, Chandran et al., 20 Jan 2025).
- Robotic Motion Planning: Sampling over clustered transition spaces in RRT planners, with rewards estimated via region clustering, produces faster path-cost minimization and higher execution success under uncertainty (Faroni et al., 2023).
- Federated Learning: Clustering agents by empirical reward profiles enables efficient, high-confidence best-arm identification while balancing sample and communication complexity (Yash et al., 15 May 2025).
- MovieLens and Yelp Benchmarks: On real-world recommendation datasets, federated clustered bandit methods yield up to 72% sample savings relative to naive baselines (Yash et al., 15 May 2025, Chandran et al., 20 Jan 2025).
6. Theoretical and Practical Considerations
Several key theoretical and implementation points emerge:
- Active vs. Passive (Batch) Sampling: Active, adaptive querying closes the computation-information gap found in batch clustering; polynomial-time algorithms achieve the information-theoretic minimum under mild assumptions (Thuot et al., 2024).
- Order-Optimality and Instance Adaptivity: ATBOC is order-optimal (factor-2) relative to the minimax lower bound; LUCBBOC matches empirical performance at greatly reduced computational cost (Chandran et al., 20 Jan 2025).
- Scalability: Deployment is efficient—the main bottlenecks are K-means-type optimization for cluster center updates ($O(nKd)$ cost per Lloyd iteration) and small convex subproblems for sample allocation or confidence-bound computation (Yang et al., 2022, Chandran et al., 20 Jan 2025).
- Hyperparameter Selection: Sample allocation, thresholds, and confidence parameters are typically scheduled so as to logarithmically control the failure probability; doubling and sequential halving subroutines provide practical adaptivity to unknown instance parameters (Graf et al., 14 Mar 2025, Thuot et al., 2024).
- Generalizations and Open Problems: Further directions include extensions to an unknown number of clusters $K$ (model selection), robustness to heavy-tailed/heteroscedastic noise, hierarchical and overlapping cluster structures, and integration with contextual or non-stationary reward models (Gore et al., 4 Aug 2025, Carlsson et al., 2021, Thuot et al., 2024).
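The doubling/halving adaptivity mentioned above can be sketched as follows: guess the unknown separation, spend a budget tuned to the guess, and halve the guess on failure; the total cost telescopes geometrically, so it stays within a constant factor of the budget that the true separation would have required. (The $1/\mathrm{gap}^2$ cost model and the cutoffs below are illustrative assumptions.)

```python
def with_doubling(test_at_gap, gap_max=1.0, gap_min=1e-3):
    """Run test_at_gap(gap_guess) with geometrically shrinking guesses.

    test_at_gap returns (success, cost). Because costs grow geometrically
    down the schedule, the total is dominated by the final (successful) call.
    """
    gap, total = gap_max, 0
    while gap >= gap_min:
        success, cost = test_at_gap(gap)
        total += cost
        if success:
            return gap, total
        gap /= 2
    return None, total

# toy instance: the test succeeds once the guess drops below the true
# separation 0.1, at a cost scaling like 1/gap^2
true_gap = 0.1
gap, total = with_doubling(lambda g: (g <= true_gap, int(1 / g ** 2)))
```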