
Adversarial Bandit over Bandits (ABoB)

Updated 9 November 2025
  • Adversarial Bandit over Bandits (ABoB) is a hierarchical framework that uses meta-learning and Lipschitz-aware methods to address complex adversarial bandit problems.
  • Its instantiations include an online-within-online meta-learning approach that adapts hyperparameters across episodes and a two-level hierarchical structure for managing configurations in large action spaces.
  • Theoretical and empirical results demonstrate significant regret improvements over flat baselines, benefiting applications like online configuration management.

Adversarial Bandit over Bandits (ABoB) denotes a class of hierarchical algorithms for online learning in repeated adversarial bandit environments, characterized by a two-level organization that either leverages meta-learning for hyper-parameter adaptation across episodes or exploits metric structures in large action spaces for configuration management. Distinct instantiations have been introduced in contexts ranging from meta-learned adversarial multi-armed bandits (Osadchiy et al., 2022) to hierarchical Lipschitz-aware adversarial bandits for online configuration management in metric spaces (Avin et al., 25 May 2025).

1. Formal Problem Setting

1.1. Online-within-Online Meta-Bandit

Consider a learner that sequentially solves $S$ adversarial bandit episodes (tasks). Within each episode $s \in \{1,\ldots,S\}$, the learner faces a $d$-armed adversarial bandit problem with $T$ rounds. The per-round loss of arm $i$ at round $t$ in episode $s$ is $\ell_{s,t}(i) \in [0,1]$. The learner selects arm $I_{s,t}$ and suffers loss $\ell_{s,t}(I_{s,t})$, with only the chosen arm's loss observed.

The per-episode best arm is defined as

$$j^*_s := \arg\min_{i \in \{1,\ldots,d\}} \sum_{t=1}^T \ell_{s,t}(i),$$

yielding the notion of total regret

$$R_{S,T} := \sum_{s=1}^S \sum_{t=1}^T \left[\ell_{s,t}(I_{s,t}) - \ell_{s,t}(j^*_s)\right].$$

This "online-within-online" setup captures meta-learning for adversarial bandit sequences, where outer-level adaptation is possible if the empirical distribution of the best arms across episodes, $p^*_i := (1/S)\,|\{s : j^*_s = i\}|$, is non-uniform (Osadchiy et al., 2022).
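To make these quantities concrete, the following minimal sketch computes the per-episode best arms $j^*_s$, the empirical best-arm distribution $p^*$, and the total regret $R_{S,T}$, assuming hindsight access to the losses as a NumPy array of shape $S \times T \times d$ and the played arms as an $S \times T$ index array (both inputs are illustrative, not part of the original formulation):

import numpy as np

def regret_and_best_arm_distribution(losses, played):
    """losses: array (S, T, d) of per-round losses; played: array (S, T) of chosen arm indices."""
    S, T, d = losses.shape
    cum = losses.sum(axis=1)                        # (S, d): per-episode cumulative loss of each arm
    j_star = cum.argmin(axis=1)                     # j*_s: best arm of each episode in hindsight
    p_star = np.bincount(j_star, minlength=d) / S   # empirical distribution of best arms p*
    incurred = np.take_along_axis(
        losses, played[:, :, None], axis=2).sum()   # total loss of the arms actually played
    regret = incurred - cum[np.arange(S), j_star].sum()
    return regret, p_star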

1.2. Metric Bandit with Lipschitz Adversaries

In large-scale online configuration management, the learner operates on a finite action set $K = \{1,\ldots,k\}$ with a predefined metric $D : K \times K \to \mathbb{R}_{\ge 0}$. The adversary is oblivious and Lipschitz: for all $a, a' \in K$, $|c_t(a) - c_t(a')| \le D(a, a')$ for each round $t$, where $c_t(a)$ is the expected reward set in advance (Avin et al., 25 May 2025). The objective is to minimize the regret

$$R(T) = \max_{a^* \in K} \sum_{t=1}^T c_t(a^*) - \sum_{t=1}^T c_t(a_t).$$
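As an illustration of the Lipschitz condition, the sketch below verifies $|c_t(a) - c_t(a')| \le D(a, a')$ for every pair of arms and every round, assuming hindsight access to an expected-reward table of shape $T \times k$ and a $k \times k$ metric matrix (both inputs are hypothetical placeholders):

import numpy as np

def is_lipschitz_adversary(rewards, metric, tol=1e-9):
    """rewards: array (T, k) with rewards[t, a] = c_t(a); metric: array (k, k) with metric[a, b] = D(a, b)."""
    # Pairwise reward gaps |c_t(a) - c_t(a')| for every round, shape (T, k, k).
    gaps = np.abs(rewards[:, :, None] - rewards[:, None, :])
    # The adversary is Lipschitz w.r.t. D iff every gap is bounded by the corresponding distance.
    return bool(np.all(gaps <= metric[None, :, :] + tol))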

2. Algorithmic Structure and Methodologies

The ABoB meta-learning framework comprises an inner learner (Tsallis-INF with $q = 1/2$) executed independently within each episode and two outer meta-learners that respectively select the inner learner's learning rate $\eta_s$ and its starting distribution $w_s$. The algorithmic loop for each episode is as follows:

  • Initialize $x_{s,1} := w_s$.
  • For $t = 1$ to $T$:
    • Sample $I_{s,t} \sim x_{s,t}$.
    • Observe $\ell_{s,t}(I_{s,t})$.
    • Construct an unbiased estimator

    $$\hat{\ell}_{s,t}(i) = \frac{\ell_{s,t}(I_{s,t})\,\mathbf{1}\{I_{s,t} = i\}}{x_{s,t}(i)}.$$

    • Update via OMD with the Tsallis regularizer (one concrete form of this update is sketched after this list):

    $$x_{s,t+1} = \arg\min_{x \in \Delta_\delta} \left\{ \eta_s \langle \hat{\ell}_{s,t}, x \rangle + D_{1/2}(x \,\Vert\, x_{s,t}) \right\},$$

    where $\Delta_\delta = \{x \in \Delta : x_i \ge \delta_{\min}\}$ and $D_{1/2}$ is the $\beta$-divergence.

  • Output $\hat{j}_s := \arg\min_i \sum_{t=1}^T \hat{\ell}_{s,t}(i)$.
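For intuition only, here is a minimal sketch of one common way to compute a Tsallis-$1/2$ update, written in the FTRL form over cumulative loss estimates with the regularizer $-(2/\eta)\sum_i \sqrt{x_i}$; the normalizing multiplier is found by binary search, and the final clipping toward $\Delta_\delta$ is a crude stand-in for the exact OMD step above, which may differ in detail:

import numpy as np

def tsallis_half_weights(cum_losses, eta, delta_min=0.0, iters=60):
    """Solve x_i = 1 / (eta * (L_i + lam))^2 with lam chosen so the weights sum to 1."""
    L = np.asarray(cum_losses, dtype=float)
    mass = lambda lam: np.sum(1.0 / (eta * (L + lam)) ** 2)
    lo = -L.min() + 1e-12          # keep L_i + lam > 0 for every arm
    step = 1.0
    while mass(lo + step) > 1.0:   # grow the upper bracket until total mass <= 1
        step *= 2.0
    hi = lo + step
    for _ in range(iters):         # binary search: mass(lam) is decreasing in lam
        mid = 0.5 * (lo + hi)
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    x = 1.0 / (eta * (L + hi)) ** 2
    x = np.maximum(x, delta_min)   # rough clipping toward the truncated simplex
    return x / x.sum()

In the episode loop, such an update would be applied with the cumulative estimates $\sum_{\tau \le t} \hat{\ell}_{s,\tau}$ and the episode's learning rate $\eta_s$.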

Meta-learners update $\eta_s$ via a continuous exponentiated-weights method on a surrogate loss and $w_s$ via Follow-The-Leader (FTL) on the cumulative $\beta$-divergence (Osadchiy et al., 2022).

In the hierarchical instantiation, consider $k$ arms partitioned into $p$ clusters $P = \{P^1, \ldots, P^p\}$. The hierarchical architecture operates as:

  1. Cluster-Level (Parent Bandit): A parent learner (e.g., EXP3, Tsallis-INF) samples a cluster index $I_t$ using maintained weights $w^{\mathrm{(cl)}}_i$.

  2. Arm-Level (Child Bandits): Within the selected cluster $P^{I_t}$, a local learner selects an arm $a_t \in P^{I_t}$.

  3. Arm $a_t$ is played; reward $r_t$ is observed.

  4. The child bandit for $P^{I_t}$ is updated with $(a_t, r_t)$.

  5. The parent learner is updated with an unbiased cluster-reward estimate using importance weighting,

$$\hat{R} = \frac{r_t}{\pi_{\mathrm{cl}}(I_t)\,\pi_{I_t}(a_t)},$$

where $\pi_{\mathrm{cl}}, \pi_{I_t}$ are the sampling distributions.

This modular construction allows any adversarial bandit algorithm to be utilized at both levels.
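As one illustration of this modularity, a generic EXP3 sketch (not the paper's implementation) with the sample/update/probabilities interface assumed by the pseudocode in Section 4.2 could look as follows; rewards fed to update are assumed to lie in $[0,1]$ for the internal importance weighting to behave as in standard EXP3:

import numpy as np

class EXP3:
    """Minimal EXP3 learner exposing sample(), update(), and probabilities."""
    def __init__(self, n_actions, gamma=0.05):
        self.n = n_actions
        self.gamma = gamma                      # exploration rate
        self.log_w = np.zeros(n_actions)        # log-weights for numerical stability

    @property
    def probabilities(self):
        w = np.exp(self.log_w - self.log_w.max())
        return (1 - self.gamma) * w / w.sum() + self.gamma / self.n

    def sample(self):
        return int(np.random.choice(self.n, p=self.probabilities))

    def update(self, action, reward):
        # Importance-weighted reward estimate for the played action only.
        x_hat = reward / self.probabilities[action]
        self.log_w[action] += self.gamma * x_hat / self.n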

3. Regret Analysis and Theoretical Guarantees

Assuming a minimal per-episode gap $\delta > 0$ and a suitable $\delta_{\min}$ such that $T\delta_{\min} \gtrsim \log d / \delta^2$ to guarantee correct identification of $j^*_s$, the expected regret satisfies

$$E[R_{S,T}] = O\!\left(S\sqrt{T}\,d^{1/4}\sqrt{H_{1/2}(p^*)} + S\,d^{1/4 + 1/14}(T\delta)^{-4/7}\sqrt{T} + S^{6/7}\sqrt{T}\,\log^{1/3} S\right),$$

where $H_{1/2}(p^*)$ is the Tsallis entropy of the best-arm distribution. In the favorable regime $H_{1/2}(p^*) \ll \log d$, the result improves upon the standard $O(S\sqrt{Td})$ adversarial bound.
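As an extreme illustration (using the standard Tsallis-entropy convention $H_{1/2}(p) = 2\left(\sum_i \sqrt{p_i} - 1\right)$, which may differ from the paper's normalization by a constant), suppose the same arm is best in every episode, so $p^*$ is a point mass:

$$H_{1/2}(p^*) = 2\left(\sum_i \sqrt{p^*_i} - 1\right) = 2(1 - 1) = 0,$$

so the leading $S\sqrt{T}\,d^{1/4}\sqrt{H_{1/2}(p^*)}$ term vanishes entirely, whereas a uniform $p^*$ gives $H_{1/2}(p^*) = 2(\sqrt{d} - 1)$ and recovers the usual $\sqrt{d}$ dependence.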

Key bound components:

  • The learning-rate meta-learner achieves excess $\sum_s E[L_s(\eta_s)] - \sum_s L_s(v) = O\!\left(S \min\{\epsilon, v\} + \frac{\log S}{\epsilon^2 (1 - d\epsilon)^{3/2} \delta_{\min}^{3/4}}\right)$ for all $v$.

  • FTL on the initialization yields $\sum_s E[D_{1/2}(e_{\hat j_s}^{\delta} \Vert w_s)] - \min_{w} \sum_s D_{1/2}(e_{\hat j_s}^{\delta} \Vert w) = O(\sqrt{d/\delta_{\min}}\,\log S)$.

  • The horizon-in-hindsight term is bounded by $S\,H_{1/2}(p^*) + O(d\epsilon S/\sqrt{\delta_{\min}})$.

For hierarchical ABoB without additional structure (worst case, $p \approx k$), the cluster-level and arm-level (internal) regrets together scale as

$$E[R(T)] = O\!\left(\sqrt{p T \ln p}\right) + O\!\left(\sqrt{k T \ln \tfrac{k}{p}}\right).$$

For $p = k$, this recovers flat EXP3's $O(\sqrt{k T \ln k})$.

Under the additional structure that rewards within each cluster $P^i$ are $\ell$-Lipschitz (with $\ell \ll 1$),

$$\forall i,\; \forall a, b \in P^i,\; \forall t: \quad |c_t(a) - c_t(b)| \le \ell,$$

and with an appropriately “Lipschitz-aware” child algorithm, regret decomposes as

$$E[R(T)] = O\!\left(\sqrt{p T \ln p}\right) + O\!\left(\ell \sqrt{k T \ln (k/p)}\right).$$

Choosing $p = \sqrt{k}$ and $\ell \le k^{-1/4}$ yields $O(k^{1/4}\sqrt{T \ln k})$.
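The substitution is a one-line check: with $p = \sqrt{k}$, and hence $\ln p = \tfrac{1}{2}\ln k$ and $\ln(k/p) = \tfrac{1}{2}\ln k$,

$$\sqrt{pT\ln p} = k^{1/4}\sqrt{\tfrac{T\ln k}{2}}, \qquad \ell\sqrt{kT\ln(k/p)} \le k^{-1/4}\sqrt{k}\,\sqrt{\tfrac{T\ln k}{2}} = k^{1/4}\sqrt{\tfrac{T\ln k}{2}},$$

so both terms are $O(k^{1/4}\sqrt{T\ln k})$.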

4. Implementation Details

4.1. Meta-Learning ABoB (Algorithmic Schema)

Inner Tsallis-INF with OMD Update:

import numpy as np

def inner_INF(T, eta_s, w_s, delta_min):
    """One episode of the inner Tsallis-INF learner, started from w_s with learning rate eta_s."""
    x = w_s.copy()                        # starting point; assumed to lie in the clipped simplex Δ_δ
    sum_hat_ell = np.zeros(len(x))        # running sum of importance-weighted loss estimates
    for t in range(1, T + 1):
        I_t = sample_discrete(x)          # draw arm I_t ~ x_{s,t}
        loss_I = observe_loss(I_t)        # bandit feedback: only the chosen arm's loss
        hat_ell = np.zeros(len(x))
        hat_ell[I_t] = loss_I / x[I_t]    # unbiased importance-weighted estimator
        sum_hat_ell += hat_ell
        x = OMD_update(x, hat_ell, eta_s, delta_min)   # OMD step with the Tsallis regularizer
    return int(np.argmin(sum_hat_ell))    # empirical best arm ĵ_s, passed to the meta-learners

Learning-rate meta-learner (continuous EWOO):

  • Maintain a density $p_s(v) \propto \exp\!\left(-\gamma \sum_{\tau < s} L_\tau(v)\right)$ over $v \in [\epsilon, E]$.

  • Set $\eta_s = \int v\,p_s(v)\,dv$, the mean of this density (a discretized sketch follows below).
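A minimal discretized sketch of this step, assuming the surrogate losses $L_\tau(\cdot)$ can be evaluated on a grid covering $[\epsilon, E]$ (the grid and the helper name are illustrative):

import numpy as np

def ewoo_learning_rate(surrogate_losses, grid, gamma):
    """surrogate_losses: list of callables L_tau(v) from past episodes; grid: 1-D float array over [eps, E]."""
    cum = np.zeros_like(grid)
    for L_tau in surrogate_losses:                 # accumulate past surrogate losses on the grid
        cum += np.array([L_tau(v) for v in grid])
    logw = -gamma * cum
    w = np.exp(logw - logw.max())                  # unnormalized density, numerically stabilized
    p = w / np.trapz(w, grid)                      # normalize so the density integrates to 1
    return float(np.trapz(grid * p, grid))         # eta_s = ∫ v p_s(v) dv (trapezoidal approximation)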

Initialization-point meta-learner (FTL):

  • At episode $s$, set $w_s = \arg\min_{w \in \Delta_\delta} \sum_{\tau=1}^{s-1} D_{1/2}(e_{\hat j_\tau}^{\delta} \Vert w)$ (a numerical sketch follows below).
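One way to compute this argmin numerically is a generic constrained solve over the clipped simplex; the sketch below assumes the divergence is supplied as a callable and that $e_{\hat j}^{\delta}$ denotes the one-hot vector of $\hat j$ clipped into $\Delta_\delta$ (both the clipping rule and the solver choice are illustrative, not the paper's procedure):

import numpy as np
from scipy.optimize import minimize

def ftl_initialization(past_best_arms, d, delta_min, divergence):
    """Return argmin_{w in Δ_δ} of sum_τ divergence(e^δ_{ĵ_τ}, w) via SLSQP."""
    def clipped_one_hot(j):
        e = np.full(d, delta_min)
        e[j] = 1.0 - (d - 1) * delta_min            # assumed clipping of the one-hot vector into Δ_δ
        return e
    targets = [clipped_one_hot(j) for j in past_best_arms]
    objective = lambda w: sum(divergence(e, w) for e in targets)
    w0 = np.full(d, 1.0 / d)                        # start from the uniform distribution
    res = minimize(objective, w0, method="SLSQP",
                   bounds=[(delta_min, 1.0)] * d,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x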

4.2. Hierarchical ABoB (Algorithmic Schema)

Hierarchical pseudocode:

for t in range(1, T + 1):
    I = parent_bandit.sample()              # select cluster index I_t
    a = child_bandits[I].sample()           # select arm a_t within cluster P^{I_t}
    reward = play(a)                        # play the arm and observe reward r_t
    child_bandits[I].update(a, reward)      # update the child bandit with (a_t, r_t)
    pi_cl = parent_bandit.probabilities     # parent sampling distribution π_cl
    pi_c = child_bandits[I].probabilities   # child sampling distribution π_{I_t}
    hat_R = reward / (pi_cl[I] * pi_c[a])   # importance-weighted cluster-reward estimate
    parent_bandit.update(I, hat_R)          # update the parent with the unbiased estimate
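For concreteness, a hypothetical wiring of this loop, reusing the EXP3 sketch from Section 2 (whether each learner consumes raw rewards or pre-weighted estimates in its update must be kept consistent with the estimator above; this is only a structural illustration):

clusters = [[0, 1, 2, 3], [4, 5, 6, 7]]                       # hypothetical partition of k=8 arms into p=2 clusters
parent_bandit = EXP3(n_actions=len(clusters))                 # cluster-level learner
child_bandits = [EXP3(n_actions=len(c)) for c in clusters]    # one arm-level learner per cluster
# Note: child_bandits[I].sample() returns a local index; play() should map it to the
# global arm id, e.g. play(clusters[I][a]).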
Clustering is typically performed via $k$-means on normalized configuration features, with $p = \sqrt{k}$ clusters yielding both practical and theoretical benefit (Avin et al., 25 May 2025).
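A minimal sketch of this preprocessing step using scikit-learn, assuming the configurations are described by a feature matrix (the feature representation, normalization, and distance used in the paper may differ):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_configurations(features, seed=0):
    """features: array (k, n_features) describing the k configurations; returns ⌈√k⌉ clusters of arm indices."""
    k = features.shape[0]
    p = int(np.ceil(np.sqrt(k)))                      # p = √k clusters, per the analysis above
    X = StandardScaler().fit_transform(features)      # normalize configuration features
    labels = KMeans(n_clusters=p, n_init=10, random_state=seed).fit_predict(X)
    return [np.flatnonzero(labels == i).tolist() for i in range(p)]   # arms grouped by cluster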

5. Empirical Evaluation

Empirical validation of both meta-learning and hierarchical ABoB is reported in their respective domains.

  • Meta-learning ABoB achieves problem-dependent improvements in multi-episode adversarial bandit scenarios, with regret scaling $O(S\sqrt{T}\,d^{1/4}\sqrt{H_{1/2}(p^*)})$ when $p^*$ is non-uniform (Osadchiy et al., 2022).

  • Hierarchical ABoB demonstrates substantial regret reduction in both synthetic and real-world configuration management, e.g., with $k = 256$ and $T = 10^5$–$10^6$:

    • 1D stochastic: ABoB(Tsallis-INF) achieves 67% regret reduction.
    • 2D Lipschitz adversarial: Reductions up to 91%.
    • Storage-cluster tuning: flat Tsallis-INF regret ≈ 7,584; with ABoB ($p = 16$) regret ≈ 5,543 (27–49% reduction, $p < 10^{-14}$).

Nearest-neighbor empirical reward differences confirm the Lipschitz assumption in real system traces (Avin et al., 25 May 2025).

6. Applications and Extensions

ABoB’s modular design allows it to be deployed in any repeated bandit setting where inter-episode or inter-arm structure exists.

  • Configuration management: Large-scale distributed storage systems with parameter sets exhibiting local smoothness, where ABoB dynamically adapts to shifting workload optima.
  • Meta-learning in adversarial environments: Tasks where non-uniform best-arm distributions can be exploited for improved cumulative performance.
  • Algorithm-agnostic hierarchy: Any adversarial bandit algorithm (EXP3, Tsallis-INF, etc.) is compatible, permitting further advances as base algorithms improve.

This suggests that hierarchical and meta-learned ABoB are especially effective in scenarios combining large action spaces, underlying regularity, and adversarial environment shifts, substantially reducing regret compared with flat baselines while retaining worst-case performance guarantees.
