
Adversarial Bandit over Bandits (ABoB)

Updated 9 November 2025
  • Adversarial Bandit over Bandits (ABoB) is a hierarchical framework that uses meta-learning and Lipschitz-aware methods to address complex adversarial bandit problems.
  • Its instantiations include an online-within-online meta-learning approach that adapts hyperparameters across episodes and a two-level hierarchical structure for managing configurations in large action spaces.
  • Theoretical and empirical results demonstrate significant regret improvements over flat baselines, benefiting applications like online configuration management.

Adversarial Bandit over Bandits (ABoB) denotes a class of hierarchical algorithms for online learning in repeated adversarial bandit environments, characterized by a two-level organization that either leverages meta-learning for hyper-parameter adaptation across episodes or exploits metric structures in large action spaces for configuration management. Distinct instantiations have been introduced in contexts ranging from meta-learned adversarial multi-armed bandits (Osadchiy et al., 2022) to hierarchical Lipschitz-aware adversarial bandits for online configuration management in metric spaces (Avin et al., 25 May 2025).

1. Formal Problem Setting

1.1. Online-within-Online Meta-Bandit

Consider a learner that sequentially solves $S$ adversarial bandit episodes (tasks). Within each episode $s \in \{1,\ldots,S\}$, the learner faces a $d$-armed adversarial bandit problem with $T$ rounds. The per-round loss of arm $i$ at round $t$ in episode $s$ is $\ell_{s,t}(i) \in [0,1]$. The learner selects arm $I_{s,t}$ and suffers loss $\ell_{s,t}(I_{s,t})$, with only the chosen arm's loss observed.

The per-episode best arm is defined as

$$j^*_s := \arg\min_{i \in \{1,\ldots,d\}} \sum_{t=1}^T \ell_{s,t}(i),$$

yielding the notion of total regret

$$R_{S,T} := \sum_{s=1}^S \sum_{t=1}^T \left[\ell_{s,t}(I_{s,t}) - \ell_{s,t}(j^*_s)\right].$$

This "online-within-online" setup captures meta-learning for adversarial bandit sequences, where outer-level adaptation is possible if the empirical distribution of the best arms across episodes, $p^*_i := (1/S)\,|\{s : j^*_s = i\}|$, is non-uniform (Osadchiy et al., 2022).
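To make these quantities concrete, the following minimal sketch computes the per-episode best arms $j^*_s$, the empirical best-arm distribution $p^*$, and the total regret $R_{S,T}$, assuming hindsight access to the losses as a NumPy array of shape $S \times T \times d$ and the played arms as an $S \times T$ index array (both inputs are illustrative, not part of the original formulation):

import numpy as np

def regret_and_best_arm_distribution(losses, played):
    """losses: array (S, T, d) of per-round losses; played: array (S, T) of chosen arm indices."""
    S, T, d = losses.shape
    cum = losses.sum(axis=1)                        # (S, d): per-episode cumulative loss of each arm
    j_star = cum.argmin(axis=1)                     # j*_s: best arm of each episode in hindsight
    p_star = np.bincount(j_star, minlength=d) / S   # empirical distribution of best arms p*
    incurred = np.take_along_axis(
        losses, played[:, :, None], axis=2).sum()   # total loss of the arms actually played
    regret = incurred - cum[np.arange(S), j_star].sum()
    return regret, p_star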

1.2. Metric Bandit with Lipschitz Adversaries

In large-scale online configuration management, the learner operates on a finite action set $K = \{1,\ldots,k\}$ with a predefined metric $D : K \times K \to \mathbb{R}_{\ge 0}$. The adversary is oblivious and Lipschitz: for all $a, a' \in K$, $|c_t(a) - c_t(a')| \le D(a, a')$ for each round $t$, where $c_t(a)$ is the expected reward set in advance (Avin et al., 25 May 2025). The objective is to minimize the regret

$$R(T) = \max_{a^* \in K} \sum_{t=1}^T c_t(a^*) - \sum_{t=1}^T c_t(a_t).$$
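As an illustration of the Lipschitz condition, the sketch below verifies $|c_t(a) - c_t(a')| \le D(a, a')$ for every pair of arms and every round, assuming hindsight access to an expected-reward table of shape $T \times k$ and a $k \times k$ metric matrix (both inputs are hypothetical placeholders):

import numpy as np

def is_lipschitz_adversary(rewards, metric, tol=1e-9):
    """rewards: array (T, k) with rewards[t, a] = c_t(a); metric: array (k, k) with metric[a, b] = D(a, b)."""
    # Pairwise reward gaps |c_t(a) - c_t(a')| for every round, shape (T, k, k).
    gaps = np.abs(rewards[:, :, None] - rewards[:, None, :])
    # The adversary is Lipschitz w.r.t. D iff every gap is bounded by the corresponding distance.
    return bool(np.all(gaps <= metric[None, :, :] + tol))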

2. Algorithmic Structure and Methodologies

The ABoB meta-learning framework comprises an inner learner (Tsallis-INF with $q = 1/2$) executed independently within each episode and two outer meta-learners that respectively select the inner learner's learning rate $\eta_s$ and its starting distribution $w_s$. The algorithmic loop for each episode is as follows:

  • Initialize $x_{s,1} := w_s$.
  • For $t = 1$ to $T$:
    • Sample $I_{s,t} \sim x_{s,t}$.
    • Observe $\ell_{s,t}(I_{s,t})$.
    • Construct an unbiased estimator

    $$\hat{\ell}_{s,t}(i) = \frac{\ell_{s,t}(I_{s,t})\,\mathbf{1}\{I_{s,t} = i\}}{x_{s,t}(i)}.$$

    • Update via OMD with the Tsallis regularizer (one concrete form of this update is sketched after this list):

    $$x_{s,t+1} = \arg\min_{x \in \Delta_\delta} \left\{ \eta_s \langle \hat{\ell}_{s,t}, x \rangle + D_{1/2}(x \,\Vert\, x_{s,t}) \right\},$$

    where $\Delta_\delta = \{x \in \Delta : x_i \ge \delta_{\min}\}$ and $D_{1/2}$ is the $\beta$-divergence.

  • Output $\hat{j}_s := \arg\min_i \sum_{t=1}^T \hat{\ell}_{s,t}(i)$.
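For intuition only, here is a minimal sketch of one common way to compute a Tsallis-$1/2$ update, written in the FTRL form over cumulative loss estimates with the regularizer $-(2/\eta)\sum_i \sqrt{x_i}$; the normalizing multiplier is found by binary search, and the final clipping toward $\Delta_\delta$ is a crude stand-in for the exact OMD step above, which may differ in detail:

import numpy as np

def tsallis_half_weights(cum_losses, eta, delta_min=0.0, iters=60):
    """Solve x_i = 1 / (eta * (L_i + lam))^2 with lam chosen so the weights sum to 1."""
    L = np.asarray(cum_losses, dtype=float)
    mass = lambda lam: np.sum(1.0 / (eta * (L + lam)) ** 2)
    lo = -L.min() + 1e-12          # keep L_i + lam > 0 for every arm
    step = 1.0
    while mass(lo + step) > 1.0:   # grow the upper bracket until total mass <= 1
        step *= 2.0
    hi = lo + step
    for _ in range(iters):         # binary search: mass(lam) is decreasing in lam
        mid = 0.5 * (lo + hi)
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    x = 1.0 / (eta * (L + hi)) ** 2
    x = np.maximum(x, delta_min)   # rough clipping toward the truncated simplex
    return x / x.sum()

In the episode loop, such an update would be applied with the cumulative estimates $\sum_{\tau \le t} \hat{\ell}_{s,\tau}$ and the episode's learning rate $\eta_s$.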

Meta-learners update $\eta_s$ via a continuous exponentiated-weights method on a surrogate loss and $w_s$ via Follow-The-Leader (FTL) on the cumulative $\beta$-divergence (Osadchiy et al., 2022).

In the hierarchical instantiation, consider $k$ arms partitioned into $p$ clusters $P = \{P^1, \ldots, P^p\}$. The hierarchical architecture operates as:

  1. Cluster-Level (Parent Bandit): A parent learner (e.g., EXP3, Tsallis-INF) samples a cluster index $I_t$ using maintained weights $w^{\mathrm{(cl)}}_i$.

  2. Arm-Level (Child Bandits): Within the selected cluster $P^{I_t}$, a local learner selects an arm $a_t \in P^{I_t}$.

  3. Arm $a_t$ is played; reward $r_t$ is observed.

  4. The child bandit for $P^{I_t}$ is updated with $(a_t, r_t)$.

  5. The parent learner is updated with an unbiased cluster-reward estimate using importance weighting,

$$\hat{R} = \frac{r_t}{\pi_{\mathrm{cl}}(I_t)\,\pi_{I_t}(a_t)},$$

where $\pi_{\mathrm{cl}}, \pi_{I_t}$ are the sampling distributions.

This modular construction allows any adversarial bandit algorithm to be utilized at both levels.
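As one illustration of this modularity, a generic EXP3 sketch (not the paper's implementation) with the sample/update/probabilities interface assumed by the pseudocode in Section 4.2 could look as follows; rewards fed to update are assumed to lie in $[0,1]$ for the internal importance weighting to behave as in standard EXP3:

import numpy as np

class EXP3:
    """Minimal EXP3 learner exposing sample(), update(), and probabilities."""
    def __init__(self, n_actions, gamma=0.05):
        self.n = n_actions
        self.gamma = gamma                      # exploration rate
        self.log_w = np.zeros(n_actions)        # log-weights for numerical stability

    @property
    def probabilities(self):
        w = np.exp(self.log_w - self.log_w.max())
        return (1 - self.gamma) * w / w.sum() + self.gamma / self.n

    def sample(self):
        return int(np.random.choice(self.n, p=self.probabilities))

    def update(self, action, reward):
        # Importance-weighted reward estimate for the played action only.
        x_hat = reward / self.probabilities[action]
        self.log_w[action] += self.gamma * x_hat / self.n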

3. Regret Analysis and Theoretical Guarantees

Assuming a minimal per-episode gap $\delta > 0$ and a suitable $\delta_{\min}$ such that $T\delta_{\min} \gtrsim \log d / \delta^2$ to guarantee correct identification of $j^*_s$, the expected regret satisfies

$$E[R_{S,T}] = O\!\left(S\sqrt{T}\,d^{1/4}\sqrt{H_{1/2}(p^*)} + S\,d^{1/4 + 1/14}(T\delta)^{-4/7}\sqrt{T} + S^{6/7}\sqrt{T}\,\log^{1/3} S\right),$$

where $H_{1/2}(p^*)$ is the Tsallis entropy of the best-arm distribution. In the favorable regime $H_{1/2}(p^*) \ll \log d$, the result improves upon the standard $O(S\sqrt{Td})$ adversarial bound.
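As an extreme illustration (using the standard Tsallis-entropy convention $H_{1/2}(p) = 2\left(\sum_i \sqrt{p_i} - 1\right)$, which may differ from the paper's normalization by a constant), suppose the same arm is best in every episode, so $p^*$ is a point mass:

$$H_{1/2}(p^*) = 2\left(\sum_i \sqrt{p^*_i} - 1\right) = 2(1 - 1) = 0,$$

so the leading $S\sqrt{T}\,d^{1/4}\sqrt{H_{1/2}(p^*)}$ term vanishes entirely, whereas a uniform $p^*$ gives $H_{1/2}(p^*) = 2(\sqrt{d} - 1)$ and recovers the usual $\sqrt{d}$ dependence.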

Key bound components:

  • The learning-rate meta-learner achieves excess $\sum_s E[L_s(\eta_s)] - \sum_s L_s(v) = O\!\left(S \min\{\epsilon, v\} + \frac{\log S}{\epsilon^2 (1 - d\epsilon)^{3/2} \delta_{\min}^{3/4}}\right)$ for all $v$.

  • FTL on the initialization yields $\sum_s E[D_{1/2}(e_{\hat j_s}^{\delta} \Vert w_s)] - \min_{w} \sum_s D_{1/2}(e_{\hat j_s}^{\delta} \Vert w) = O(\sqrt{d/\delta_{\min}}\,\log S)$.

  • The horizon-in-hindsight term is bounded by $S\,H_{1/2}(p^*) + O(d\epsilon S/\sqrt{\delta_{\min}})$.

For hierarchical ABoB without additional structure (worst case, $p \approx k$), the cluster-level and arm-level (internal) regrets together scale as

$$E[R(T)] = O\!\left(\sqrt{p T \ln p}\right) + O\!\left(\sqrt{k T \ln \tfrac{k}{p}}\right).$$

For $p = k$, this recovers flat EXP3's $O(\sqrt{k T \ln k})$.

Under the additional structure that rewards within each cluster $P^i$ are $\ell$-Lipschitz (with $\ell \ll 1$),

$$\forall i,\; \forall a, b \in P^i,\; \forall t: \quad |c_t(a) - c_t(b)| \le \ell,$$

and with an appropriately “Lipschitz-aware” child algorithm, regret decomposes as

$$E[R(T)] = O\!\left(\sqrt{p T \ln p}\right) + O\!\left(\ell \sqrt{k T \ln (k/p)}\right).$$

Choosing $p = \sqrt{k}$ and $\ell \le k^{-1/4}$ yields $O(k^{1/4}\sqrt{T \ln k})$.
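The substitution is a one-line check: with $p = \sqrt{k}$, and hence $\ln p = \tfrac{1}{2}\ln k$ and $\ln(k/p) = \tfrac{1}{2}\ln k$,

$$\sqrt{pT\ln p} = k^{1/4}\sqrt{\tfrac{T\ln k}{2}}, \qquad \ell\sqrt{kT\ln(k/p)} \le k^{-1/4}\sqrt{k}\,\sqrt{\tfrac{T\ln k}{2}} = k^{1/4}\sqrt{\tfrac{T\ln k}{2}},$$

so both terms are $O(k^{1/4}\sqrt{T\ln k})$.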

4. Implementation Details

4.1. Meta-Learning ABoB (Algorithmic Schema)

Inner Tsallis-INF with OMD Update:

import numpy as np

def inner_INF(T, eta_s, w_s, delta_min):
    """One episode of the inner Tsallis-INF learner, started from w_s with learning rate eta_s."""
    x = w_s.copy()                        # starting point; assumed to lie in the clipped simplex Δ_δ
    sum_hat_ell = np.zeros(len(x))        # running sum of importance-weighted loss estimates
    for t in range(1, T + 1):
        I_t = sample_discrete(x)          # draw arm I_t ~ x_{s,t}
        loss_I = observe_loss(I_t)        # bandit feedback: only the chosen arm's loss
        hat_ell = np.zeros(len(x))
        hat_ell[I_t] = loss_I / x[I_t]    # unbiased importance-weighted estimator
        sum_hat_ell += hat_ell
        x = OMD_update(x, hat_ell, eta_s, delta_min)   # OMD step with the Tsallis regularizer
    return int(np.argmin(sum_hat_ell))    # empirical best arm ĵ_s, passed to the meta-learners

Learning-rate meta-learner (continuous EWOO):

  • Maintain a density $p_s(v) \propto \exp\!\left(-\gamma \sum_{\tau < s} L_\tau(v)\right)$ over $v \in [\epsilon, E]$.

  • Set $\eta_s = \int v\,p_s(v)\,dv$, the mean of this density (a discretized sketch follows below).
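A minimal discretized sketch of this step, assuming the surrogate losses $L_\tau(\cdot)$ can be evaluated on a grid covering $[\epsilon, E]$ (the grid and the helper name are illustrative):

import numpy as np

def ewoo_learning_rate(surrogate_losses, grid, gamma):
    """surrogate_losses: list of callables L_tau(v) from past episodes; grid: 1-D float array over [eps, E]."""
    cum = np.zeros_like(grid)
    for L_tau in surrogate_losses:                 # accumulate past surrogate losses on the grid
        cum += np.array([L_tau(v) for v in grid])
    logw = -gamma * cum
    w = np.exp(logw - logw.max())                  # unnormalized density, numerically stabilized
    p = w / np.trapz(w, grid)                      # normalize so the density integrates to 1
    return float(np.trapz(grid * p, grid))         # eta_s = ∫ v p_s(v) dv (trapezoidal approximation)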

Initialization-point meta-learner (FTL):

  • At episode $s$, set $w_s = \arg\min_{w \in \Delta_\delta} \sum_{\tau=1}^{s-1} D_{1/2}(e_{\hat j_\tau}^{\delta} \Vert w)$ (a numerical sketch follows below).
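One way to compute this argmin numerically is a generic constrained solve over the clipped simplex; the sketch below assumes the divergence is supplied as a callable and that $e_{\hat j}^{\delta}$ denotes the one-hot vector of $\hat j$ clipped into $\Delta_\delta$ (both the clipping rule and the solver choice are illustrative, not the paper's procedure):

import numpy as np
from scipy.optimize import minimize

def ftl_initialization(past_best_arms, d, delta_min, divergence):
    """Return argmin_{w in Δ_δ} of sum_τ divergence(e^δ_{ĵ_τ}, w) via SLSQP."""
    def clipped_one_hot(j):
        e = np.full(d, delta_min)
        e[j] = 1.0 - (d - 1) * delta_min            # assumed clipping of the one-hot vector into Δ_δ
        return e
    targets = [clipped_one_hot(j) for j in past_best_arms]
    objective = lambda w: sum(divergence(e, w) for e in targets)
    w0 = np.full(d, 1.0 / d)                        # start from the uniform distribution
    res = minimize(objective, w0, method="SLSQP",
                   bounds=[(delta_min, 1.0)] * d,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x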

4.2. Hierarchical ABoB (Algorithmic Schema)

Hierarchical pseudocode:

for t in range(1, T + 1):
    I = parent_bandit.sample()              # select cluster index I_t
    a = child_bandits[I].sample()           # select arm a_t within cluster P^{I_t}
    reward = play(a)                        # play the arm and observe reward r_t
    child_bandits[I].update(a, reward)      # update the child bandit with (a_t, r_t)
    pi_cl = parent_bandit.probabilities     # parent sampling distribution π_cl
    pi_c = child_bandits[I].probabilities   # child sampling distribution π_{I_t}
    hat_R = reward / (pi_cl[I] * pi_c[a])   # importance-weighted cluster-reward estimate
    parent_bandit.update(I, hat_R)          # update the parent with the unbiased estimate
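For concreteness, a hypothetical wiring of this loop, reusing the EXP3 sketch from Section 2 (whether each learner consumes raw rewards or pre-weighted estimates in its update must be kept consistent with the estimator above; this is only a structural illustration):

clusters = [[0, 1, 2, 3], [4, 5, 6, 7]]                       # hypothetical partition of k=8 arms into p=2 clusters
parent_bandit = EXP3(n_actions=len(clusters))                 # cluster-level learner
child_bandits = [EXP3(n_actions=len(c)) for c in clusters]    # one arm-level learner per cluster
# Note: child_bandits[I].sample() returns a local index; play() should map it to the
# global arm id, e.g. play(clusters[I][a]).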
Clustering is typically performed via $k$-means on normalized configuration features, with $p = \sqrt{k}$ clusters yielding both practical and theoretical benefit (Avin et al., 25 May 2025).
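A minimal sketch of this preprocessing step using scikit-learn, assuming the configurations are described by a feature matrix (the feature representation, normalization, and distance used in the paper may differ):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_configurations(features, seed=0):
    """features: array (k, n_features) describing the k configurations; returns ⌈√k⌉ clusters of arm indices."""
    k = features.shape[0]
    p = int(np.ceil(np.sqrt(k)))                      # p = √k clusters, per the analysis above
    X = StandardScaler().fit_transform(features)      # normalize configuration features
    labels = KMeans(n_clusters=p, n_init=10, random_state=seed).fit_predict(X)
    return [np.flatnonzero(labels == i).tolist() for i in range(p)]   # arms grouped by cluster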

5. Empirical Evaluation

Empirical validation of both meta-learning and hierarchical ABoB is reported in their respective domains.

  • Meta-learning ABoB achieves problem-dependent improvements in multi-episode adversarial bandit scenarios, with regret scaling $O(S\sqrt{T}\,d^{1/4}\sqrt{H_{1/2}(p^*)})$ when $p^*$ is non-uniform (Osadchiy et al., 2022).

  • Hierarchical ABoB demonstrates substantial regret reduction in both synthetic and real-world configuration management, e.g., with $k = 256$ and $T = 10^5$–$10^6$:

    • 1D stochastic: ABoB(Tsallis-INF) achieves 67% regret reduction.
    • 2D Lipschitz adversarial: Reductions up to 91%.
    • Storage-cluster tuning: flat Tsallis-INF regret ≈ 7,584; with ABoB ($p = 16$) regret ≈ 5,543 (27–49% reduction, $p < 10^{-14}$).

Nearest-neighbor empirical reward differences confirm the Lipschitz assumption in real system traces (Avin et al., 25 May 2025).

6. Applications and Extensions

ABoB’s modular design allows it to be deployed in any repeated bandit setting where inter-episode or inter-arm structure exists.

  • Configuration management: Large-scale distributed storage systems with parameter sets exhibiting local smoothness, where ABoB dynamically adapts to shifting workload optima.
  • Meta-learning in adversarial environments: Tasks where non-uniform best-arm distributions can be exploited for improved cumulative performance.
  • Algorithm-agnostic hierarchy: Any adversarial bandit algorithm (EXP3, Tsallis-INF, etc.) is compatible, permitting further advances as base algorithms improve.

This suggests that hierarchical and meta-learned ABoB are especially effective in scenarios combining large action spaces, underlying regularity, and adversarial environment shifts, substantially reducing regret compared with flat baselines while retaining worst-case performance guarantees.
