
SOMMAB: Semi-Overlapping Multi-Armed Bandit

Updated 2 January 2026
  • SOMMAB is a framework that enables efficient pure-exploration in multi-bandit problems by using semi-overlapping evaluation groups.
  • It generalizes the GapE algorithm to leverage shared feedback, which decreases overall sample complexity and improves error bounds.
  • It finds practical use in multi-task learning, federated learning, and multi-agent systems by ensuring candidate consistency via role-swapping.

The semi-overlapping multi-(multi-armed) bandit (SOMMAB) framework provides a pure-exploration model for problems in which multiple entities (such as tasks, clients, or agents) seek to identify their optimal partners, actions, or configurations, based on shared—but typically asymmetric and computationally intensive—evaluations. Distinct from classical multi-armed or multi-bandit formulations, SOMMAB allows a single evaluation to yield feedback for several bandits due to explicit structural overlap among their arms. This makes SOMMAB an efficient abstraction for sequential identification tasks in domains where composite trials inform multiple learning subproblems, including multi-task learning, federated learning, and coalition formation in multi-agent systems (Antos et al., 31 Dec 2025).

1. Model Specification

The SOMMAB model consists of $M$ entities, each of which induces a local pure-exploration bandit problem. Each entity $m$ is associated with a sparse candidate list $C_m \subset 2^{V \setminus \{m\}}$ of admissible support sets, with each $S \in C_m$ treated as a distinct arm for bandit $m$. Arm rewards are distributed as $\nu_{m,k}$, independently sampled, bounded on $[0, b]$, and with unknown mean $\mu_{m,k}$.

Arms are structured into semi-overlapping evaluation groups $G \subseteq \{(m,k)\}$, where each group $G = \{(m_1, k_1), \ldots, (m_r, k_r)\}$ represents a set of arms "pulled together" in a single trial. Pulling a group at unit cost yields a sample for each arm in $G$; with $n$ denoting the total number of group pulls, the effective samples collected satisfy $\sum_{m,k} T_{m,k}(n) \geq rn$. For each bandit $m$, the recommended arm at the conclusion of the $n$ trials is $J_m(n) \in \arg\max_k \hat{\mu}_{m,k}(n)$. A strong duality condition on candidate sets enforces that, for any $S \in C_a$ and $b \in S$, the "role-swapped" set $S' = (S \setminus \{b\}) \cup \{a\}$ must belong to $C_b$, ensuring structural coherence.

The overlap degree $r$ characterizes the minimum size of a group $G$; $r = 1$ recovers the classical, non-overlapping multi-bandit problem. The semi-overlapping design is the defining feature: each group pull simultaneously updates statistics for several arms across bandits.
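
The candidate-list structure and the role-swap condition are straightforward to state in code. The following Python sketch is illustrative only; the function name satisfies_strong_duality and the toy candidate family are assumptions chosen for this example, not taken from the paper.

```python
# Minimal sketch of SOMMAB candidate lists and the strong-duality check.
# C maps each entity m to its candidate list: a set of frozensets, where
# each frozenset S is an admissible support set, i.e. one arm of bandit m.

def satisfies_strong_duality(C):
    """Check the role-swap closure: for every S in C[a] and b in S,
    the swapped set (S - {b}) | {a} must appear in C[b]."""
    for a, candidates in C.items():
        for S in candidates:
            for b in S:
                swapped = (S - {b}) | {a}
                if swapped not in C.get(b, set()):
                    return False
    return True

# Toy example: 3 entities, each allowed to pick any single partner.
V = {0, 1, 2}
C = {m: {frozenset({v}) for v in V - {m}} for m in V}
assert satisfies_strong_duality(C)
```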

2. Generalized GapE Algorithm

The GapE algorithm, originally developed by Gabillon et al. for non-overlapping multi-bandit settings, is generalized for SOMMAB to exploit the structured overlap. The algorithm proceeds as follows (see Algorithm 3.1 in (Antos et al., 31 Dec 2025)):

  1. Initialization: Every arm is pulled $l$ times by repeatedly sampling its full evaluation group, yielding $T_{m,k} = l$ for all $(m, k)$.
  2. Adaptive Exploration: For rounds $t = lMK + 1, \ldots, n$:

    • Compute, for each arm $(m, k)$, a gap-based index

    $$B_{m,k}(t) = -\hat{\Delta}_{m,k}(t-1) + b \sqrt{\frac{a}{T_{m,k}(t-1)}}$$

    where $\hat{\Delta}_{m,k}(t)$ is the empirical gap and $a$ is the exploration parameter.
    • Select the arm with maximal index and pull its associated semi-overlapping group $G$, incrementing $T_{p,j}$ and updating means and gaps for every $(p, j) \in G$.

  3. Recommendation: Each bandit $m$ outputs $J_m(n) \in \arg\max_k \hat{\mu}_{m,k}(n)$ as its recommended arm.

This structure allows each trial to propagate information to several arms and, consequently, several bandits, significantly reducing overall sample complexity.
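
The loop above can be sketched compactly in code. The following Python implementation is a minimal illustration under assumptions not in the source: evaluation groups partition the arms, each bandit has at least two arms, budget accounting is simplified, and pull_group is a hypothetical user-supplied callback performing one composite trial.

```python
import math
from collections import defaultdict

def generalized_gape(groups, pull_group, n, l, a, b=1.0):
    """Illustrative generalized GapE for SOMMAB (not the authors' code).

    groups:     disjoint evaluation groups, each an iterable of arms (m, k)
    pull_group: one composite trial at unit cost -> {(m, k): reward in [0, b]}
    n:          total budget of group pulls; l: initialization pulls per group
    a:          exploration parameter; b: reward range bound
    """
    group_of = {arm: tuple(g) for g in groups for arm in g}
    T = defaultdict(int)      # sample counts T_{m,k}
    S = defaultdict(float)    # reward sums

    def record(samples):
        for arm, x in samples.items():
            T[arm] += 1
            S[arm] += x

    # 1. Initialization: l pulls of every group gives T_{m,k} = l for all arms.
    for g in groups:
        for _ in range(l):
            record(pull_group(tuple(g)))

    # 2. Adaptive exploration over the remaining budget (simplified accounting).
    for _ in range(n - l * len(groups)):
        mu = {arm: S[arm] / T[arm] for arm in group_of}

        def gap(arm):
            # Empirical gap: |mu_hat(arm) - best rival mean in the same bandit|.
            rival = max(mu[x] for x in mu if x[0] == arm[0] and x != arm)
            return abs(mu[arm] - rival)

        # Gap-based index B_{m,k}(t) = -gap + b * sqrt(a / T_{m,k}).
        B = {arm: -gap(arm) + b * math.sqrt(a / T[arm]) for arm in mu}
        target = max(B, key=B.get)
        record(pull_group(group_of[target]))  # pull the whole group

    # 3. Recommendation: per-bandit empirical best arm J_m(n).
    best = {}
    for arm in group_of:
        m = arm[0]
        if m not in best or S[arm] / T[arm] > S[best[m]] / T[best[m]]:
            best[m] = arm
    return best
```

Note that each adaptive step updates $r$ counters at once via record, which is precisely where the factor-$r$ budget saving discussed in Section 4 originates.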

3. Theoretical Results and Error Bounds

The primary theoretical advance consists of new exponential error bounds for best-arm identification in the SOMMAB setting, extending and sharpening those for classical multi-bandit pure-exploration.

Let $H = \sum_{m=1}^{M} \sum_{k=1}^{K_m} \frac{b^2}{\Delta_{m,k}^2}$ be the global complexity parameter, where $\Delta_{m,k} = \mu_{m,*} - \mu_{m,k}$. For any initialization $l \in [1, 152]$ and an appropriate exploration parameter $a$, if $n \geq l \sum_m K_m$, the worst-case error probability satisfies:

$$\ell(n) = \max_m \mathbb{P}\left[J_m(n) \neq k_m^*\right] \leq 2 \left(\sum_m K_m\right) n \exp\{-2ac^2\}$$

where the explicit computation of $a, c$ depends on $l, r, H$, as detailed in [(Antos et al., 31 Dec 2025), Theorem 4.2]. For $l = 152$,

$$\ell(n) < 2 \left(\sum_m K_m\right) n \exp\left(-\frac{rn - \sum_m K_m + 1}{41H - 36}\right)$$

which reveals that the exponent's dependence on $r$ is linear: each group evaluation supplies $r$ samples, so higher overlap yields sharper guarantees per unit cost.

Comparison to prior work demonstrates a factor of $2\times$ to $3.5\times$ improvement in the constant in front of $1/H$ relative to Gabillon et al. (2011), where, for $r = 1$ and $l = 1$, the exponent was $-(n - MK)/(144H)$ (Antos et al., 31 Dec 2025).
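
To make the $l = 152$ bound tangible, the short script below evaluates it on an invented toy instance; every numeric value here is an assumption chosen for illustration, not a figure from the paper.

```python
import math

# Invented instance: M = 4 bandits, K = 5 arms each, rewards in [0, 1],
# every gap taken as 0.2 for simplicity, overlap degree r = 4.
M, K, b, gap, r = 4, 5, 1.0, 0.2, 4
H = M * K * b**2 / gap**2        # global complexity parameter, here H = 500
sum_K = M * K                    # sum_m K_m = 20

def error_bound(n):
    # l = 152 bound: 2 (sum_m K_m) n exp(-(r n - sum_m K_m + 1) / (41 H - 36))
    return 2 * sum_K * n * math.exp(-(r * n - sum_K + 1) / (41 * H - 36))

for n in (50_000, 100_000, 150_000):
    print(f"n = {n:>7}: bound = {error_bound(n):.3g}")
```

The bound is vacuous at small budgets and then decays exponentially; in this toy instance with $r = 4$, roughly $1.5 \times 10^5$ group pulls suffice for a guarantee on the order of $10^{-6}$.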

4. Sample Complexity and Budget Trade-offs

Improved error constants translate to practical reductions in sampling cost. To guarantee error probability $\ell(n) \leq \delta$, the sample complexity satisfies

$$rn \gtrsim 41H \left[\log(2MKn) + \log(1/\delta)\right] + MK$$

The effective number of group pulls $n$ is thus reduced by a factor of $r$ compared to the non-overlapping MMAB setting, as each group evaluation aids $r$ arms: $n \approx (H/r) \cdot O(\log(MK/\delta))$. In typical applications, when $r \approx M$, SOMMAB approaches the efficiency of a single-bandit problem despite the multi-bandit structure, representing substantial computational and statistical gains in sparsely coupled settings (Antos et al., 31 Dec 2025).
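
As a sanity check on this budget relation, one can solve the display above for $n$ by fixed-point iteration. The sketch below reuses the toy instance from Section 3 and is illustrative only; the helper required_pulls is an assumption, not an artifact from the paper.

```python
import math

def required_pulls(H, M, K, r, delta, iters=60):
    """Approximate smallest n with
    r n >= 41 H [log(2 M K n) + log(1/delta)] + M K, via fixed-point iteration
    (the log term varies slowly in n, so the iteration contracts)."""
    n = float(M * K)
    for _ in range(iters):
        n = (41 * H * (math.log(2 * M * K * n) + math.log(1 / delta)) + M * K) / r
    return math.ceil(n)

# Toy instance from above (H = 500, M = 4, K = 5): the budget shrinks ~1/r,
# up to the slowly varying log factor.
for r in (1, 2, 4):
    print(f"r = {r}: n ~ {required_pulls(H=500, M=4, K=5, r=r, delta=0.01):,}")
```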

5. Applications Across Learning Paradigms

SOMMAB formalizes and unifies several families of learning problems where shared evaluations with asymmetric feedback are intrinsic. The mapping between specific domains and SOMMAB’s primitives is as follows:

Domain | Entity/Bandit Interpretation | Arm Interpretation
Multi-task/auxiliary-task learning | Task | Subset of auxiliary tasks
Federated learning | Client | Set of collaborating clients
Multi-agent coalition formation | Agent | Set of partner agents
  • In multi-task and auxiliary-task learning, group trials correspond to jointly training various target and auxiliary sets; overlap arises as multi-target configurations simultaneously update relevant statistics for all involved tasks.
  • For federated learning, each entity is a client and arms are coalitions; practical semi-overlap emerges due to joint rounds of communication and computation, and the algorithmic structure informs orchestrated coalition selection at the cloud level.
  • In multi-agent systems, the model identifies optimal asymmetric help-sets via stochastic cooperative task execution, leveraging the overlap from coalitional execution to improve identification efficiency.

The ability to re-use a single composite evaluation across multiple interlocking bandits is the source of SOMMAB’s statistical and computational savings, as quantified explicitly by the dependence on rr in the core bounds (Antos et al., 31 Dec 2025).

6. Structural Properties and Practical Implications

Strong duality and role-swapping enforce that the candidate structures across all entities remain consistent, which is essential for meaningful structural sharing. SOMMAB is especially well-suited to scenarios with sparse candidate lists but high potential for evaluation overlap, making it particularly beneficial in large-scale networked or multi-agent contexts with asymmetric partner selection needs.

A plausible implication is that in environments where practical constraints enforce frequent joint evaluations (due to cost, privacy, or communication limitations), SOMMAB’s sample-complexity benefits may be essential for scalable, data-efficient support network identification.

Further formal definitions, model details, lemmas, proofs, and algorithmic subtleties are provided in [(Antos et al., 31 Dec 2025), Sections 2–4].
