Adversarial Bandit over Bandits (ABoB)
- Adversarial Bandit over Bandits (ABoB) is a hierarchical framework that uses meta-learning and Lipschitz-aware methods to address complex adversarial bandit problems.
- It combines an online-within-online meta-learning approach with a two-level structure to adapt hyperparameters and manage configurations in large action spaces.
- Theoretical and empirical results demonstrate significant regret improvements over flat baselines, benefiting applications like online configuration management.
Adversarial Bandit over Bandits (ABoB) denotes a class of hierarchical algorithms for online learning in repeated adversarial bandit environments, characterized by a two-level organization that either leverages meta-learning for hyper-parameter adaptation across episodes or exploits metric structures in large action spaces for configuration management. Distinct instantiations have been introduced in contexts ranging from meta-learned adversarial multi-armed bandits (Osadchiy et al., 2022) to hierarchical Lipschitz-aware adversarial bandits for online configuration management in metric spaces (Avin et al., 25 May 2025).
1. Formal Problem Setting
1.1. Online-within-Online Meta-Bandit
Consider a learner that sequentially solves $S$ adversarial bandit episodes (tasks). Within each episode $s \in [S]$, the learner faces a $K$-armed adversarial bandit problem with $T$ rounds. The per-round loss for arm $i$ at round $t$ in episode $s$ is $\ell^s_t(i) \in [0,1]$. The learner selects arm $I^s_t$, suffering loss $\ell^s_t(I^s_t)$, with only the chosen arm's loss observed.
The per-episode best arm is defined as
$$i^\star_s = \arg\min_{i \in [K]} \sum_{t=1}^{T} \ell^s_t(i),$$
yielding the notion of total regret
$$\mathcal{R}(S,T) = \mathbb{E}\left[\sum_{s=1}^{S}\sum_{t=1}^{T} \ell^s_t(I^s_t)\right] - \sum_{s=1}^{S}\sum_{t=1}^{T} \ell^s_t(i^\star_s).$$
This "online-within-online" setup captures meta-learning for adversarial bandit sequences, where outer-level adaptation is possible if the empirical distribution of the best arms $\{i^\star_s\}_{s=1}^{S}$ across episodes is non-uniform (Osadchiy et al., 2022).
1.2. Metric Bandit with Lipschitz Adversaries
In large-scale online configuration management, the learner operates on a finite action set $\mathcal{A}$ with $|\mathcal{A}| = k$ and a predefined metric $d : \mathcal{A} \times \mathcal{A} \to \mathbb{R}_{\ge 0}$. The adversary is oblivious and Lipschitz: $|r_t(a) - r_t(a')| \le L \cdot d(a, a')$ for all $a, a' \in \mathcal{A}$ and each round $t$, where $r_t$ is the expected reward function set in advance (Avin et al., 25 May 2025). The objective is to minimize the regret
$$\mathcal{R}(T) = \max_{a \in \mathcal{A}} \sum_{t=1}^{T} r_t(a) - \mathbb{E}\left[\sum_{t=1}^{T} r_t(a_t)\right].$$
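As a toy illustration of this setting (an assumption for exposition, not an experiment from the cited paper), the sketch below builds a reward sequence over a 1D grid of configurations that satisfies the Lipschitz condition by construction, and verifies it numerically.

```python
import numpy as np
import itertools

def lipschitz_rewards(k, T, L=1.0, seed=0):
    """Rewards r_t(a) on arms a = 0..k-1 with metric d(a, a') = |a - a'| / k.

    Each round's reward curve is a smooth function of the arm index, so
    |r_t(a) - r_t(a')| <= L * d(a, a') holds by construction.
    """
    rng = np.random.default_rng(seed)
    grid = np.arange(k) / k
    phases = rng.uniform(0, 2 * np.pi, size=T)   # adversary fixed in advance
    return 0.5 + (L / (4 * np.pi)) * np.sin(2 * np.pi * grid[None, :] + phases[:, None])

def max_lipschitz_ratio(r, k):
    """Largest observed |r_t(a) - r_t(a')| / d(a, a') over all pairs and rounds."""
    d = lambda a, b: abs(a - b) / k
    return max(np.abs(r[:, a] - r[:, b]).max() / d(a, b)
               for a, b in itertools.combinations(range(k), 2))

r = lipschitz_rewards(k=50, T=200, L=1.0)
print(max_lipschitz_ratio(r, k=50))  # stays well below L = 1.0
```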
2. Algorithmic Structure and Methodologies
2.1. Meta-Learning ABoB Framework (Osadchiy et al., 2022)
The ABoB meta-learning framework comprises an inner learner (Tsallis-INF with $\alpha = 1/2$) executed independently within each episode and two outer meta-learners that respectively select the inner learner's learning rate $\eta_s$ and its starting distribution $w_s$. The algorithmic loop for each episode $s$ is as follows:
- Initialize $x_1 = w_s$.
- For $t = 1$ to $T$:
  - Sample $I_t \sim x_t$.
  - Observe $\ell^s_t(I_t)$.
  - Construct an unbiased estimator
    $$\hat{\ell}_t(i) = \frac{\ell^s_t(i)}{x_t(i)}\,\mathbb{1}\{I_t = i\}.$$
  - Update $x_t$ via OMD with the Tsallis regularizer:
    $$x_{t+1} = \arg\min_{x \in \Delta_\delta}\; \eta_s \langle \hat{\ell}_t, x\rangle + D_\psi(x, x_t),$$
    where $\psi(x) = -2\sum_i \sqrt{x_i}$, $D_\psi$ is its Bregman divergence, and $\Delta_\delta$ is the $\delta$-truncated simplex.
- Output the estimated per-episode best arm $\hat{i}_s = \arg\min_i \sum_{t=1}^{T} \hat{\ell}_t(i)$.
The learning-rate meta-learner updates $\eta_s$ via a continuous exponentiated-weights method on a surrogate loss, and the initialization meta-learner updates $w_s$ via Follow-The-Leader (FTL) on the cumulative Bregman divergence to past estimated best arms (Osadchiy et al., 2022).
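For reference, the OMD step above admits a simple coordinate-wise form. The derivation below is a standard consequence of the first-order optimality conditions for the $1/2$-Tsallis regularizer (a reconstruction for exposition; $\lambda$ is the Lagrange multiplier for the simplex constraint):

```latex
% First-order condition of the OMD step with \psi(x) = -2\sum_i \sqrt{x_i}:
\eta_s \hat{\ell}_t(i) + \nabla\psi(x_{t+1})_i - \nabla\psi(x_t)_i + \lambda = 0,
\qquad \nabla\psi(x)_i = -x_i^{-1/2},
% which gives the coordinate-wise closed form
x_{t+1}(i) = \Bigl(x_t(i)^{-1/2} + \eta_s \hat{\ell}_t(i) + \lambda\Bigr)^{-2},
% with \lambda chosen (e.g., by one-dimensional root finding) so that \sum_i x_{t+1}(i) = 1.
```

This is the update implemented by the `OMD_update` routine in the schema of Section 4.1 below.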
2.2. Hierarchical ABoB Structure (Avin et al., 25 May 2025)
Consider $k$ arms partitioned into $N$ clusters $C_1, \dots, C_N$. The hierarchical architecture operates as follows:
Cluster-Level (Parent Bandit): A parent learner (e.g., EXP3, Tsallis-INF) samples a cluster index $I_t \sim \pi^{\mathrm{cl}}_t$ using its maintained weights over the $N$ clusters.
Arm-Level (Child Bandits): Within the selected cluster $C_{I_t}$, a local learner selects an arm $a_t \sim \pi^{c}_t(\cdot \mid I_t)$.
Arm $a_t$ is played; reward $R_t(a_t)$ is observed.
The child bandit for cluster $I_t$ is updated with the observed reward $R_t(a_t)$.
The parent learner is updated with an unbiased cluster-reward estimate using importance weighting,
$$\hat{R}_t = \frac{R_t(a_t)}{\pi^{\mathrm{cl}}_t(I_t)\,\pi^{c}_t(a_t \mid I_t)},$$
where $\pi^{\mathrm{cl}}_t$ and $\pi^{c}_t(\cdot \mid I_t)$ are the parent and child sampling distributions, respectively.
This modular construction allows any adversarial bandit algorithm to be utilized at both levels.
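The partition into clusters is where the metric structure enters. As an illustrative sketch (the specific partitioning scheme is a design choice, not prescribed here), nearby configurations can be grouped by running k-means on their parameter vectors:

```python
import numpy as np

def metric_kmeans(points, N, iters=50, seed=0):
    """Group k configurations (rows of `points`) into N clusters by Lloyd's k-means.

    Returns a list of N index arrays; arms in the same cluster are close under
    the Euclidean metric, which is what the Lipschitz assumption exploits.
    """
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=N, replace=False)]
    for _ in range(iters):
        # Assign each configuration to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers (keep the old center if a cluster goes empty).
        for j in range(N):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return [np.flatnonzero(labels == j) for j in range(N)]

# Example: 100 two-parameter configurations grouped into 10 clusters.
clusters = metric_kmeans(np.random.rand(100, 2), N=10)
```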
3. Regret Analysis and Theoretical Guarantees
3.1. Meta-Learning Regret Bounds (Osadchiy et al., 2022)
Assuming a minimal per-episode gap between the best and second-best arm, and a truncation parameter $\delta$ small enough to guarantee correct identification of the per-episode best arm $i^\star_s$, the expected regret scales with the Tsallis entropy $H_{1/2}(\hat{\pi})$ of the empirical best-arm distribution $\hat{\pi}$ in place of the number of arms $K$. In the favorable regime where $\hat{\pi}$ concentrates on few arms, so that $H_{1/2}(\hat{\pi}) \ll K$, the result improves upon the standard $O(S\sqrt{KT})$ bound obtained by running an independent adversarial bandit in each episode (see the worked example after the list below).
Key bound components:
- an excess-regret term for the learning-rate meta-learner, holding uniformly over episodes;
- an excess-regret term for FTL on the initialization point;
- a horizon-in-hindsight term bounded separately.
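As a worked illustration of the favorable regime, using the standard $\alpha = 1/2$ Tsallis entropy $H_{1/2}(\pi) = 2\big(\sum_i \sqrt{\pi_i} - 1\big)$ (the normalization in the cited bound may differ by constants):

```latex
% Uniform best-arm distribution over K arms: no improvement is possible,
H_{1/2}(\mathrm{Unif}[K]) = 2\Bigl(K \cdot \tfrac{1}{\sqrt{K}} - 1\Bigr) = 2(\sqrt{K} - 1).
% Best arms concentrated uniformly on m << K arms: the entropy shrinks accordingly,
H_{1/2}(\mathrm{Unif}[m]) = 2(\sqrt{m} - 1) \ll 2(\sqrt{K} - 1),
% so a bound driven by H_{1/2} of the best-arm distribution instead of K helps exactly
% when the per-episode best arms recur within a small subset of the K arms.
```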
3.2. Hierarchical ABoB Regret Bounds (Avin et al., 25 May 2025)
In the worst case (no exploitable metric structure), the cluster-level and in-cluster regrets combine to $O(\sqrt{kT\log k})$, so the hierarchy recovers flat EXP3's guarantee over all $k$ arms up to constant factors.
Under the additional structure that rewards within each cluster are $L$-Lipschitz and clusters have small intra-cluster diameter, and with an appropriately “Lipschitz-aware” child algorithm, the regret decomposes into a cluster-level term governed by the number of clusters $N$ and an in-cluster term governed by the cluster size $k/N$. Choosing $N = \sqrt{k}$ clusters of size $\sqrt{k}$ each yields a regret of order $\tilde{O}(k^{1/4}\sqrt{T})$, improving on the flat $\tilde{O}(\sqrt{kT})$ rate (see the stylized calculation below).
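The source of the $k^{1/4}$ factor can be seen from a stylized balancing argument (an illustration, not the formal proof): if each level pays a flat-bandit price of roughly the square root of its number of choices times $T$, then

```latex
% Cluster level: N choices; in-cluster level: k/N choices per cluster.
\underbrace{\sqrt{N T}}_{\text{parent}} + \underbrace{\sqrt{(k/N)\, T}}_{\text{child}}
\;\text{ is minimized at } N = \sqrt{k},
\qquad \sqrt{\sqrt{k}\, T} = k^{1/4}\sqrt{T}.
% The Lipschitz condition is what lets the in-cluster term avoid paying for the full
% action set: within a small-diameter cluster, nearby arms have nearby rewards.
```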
4. Implementation Details
4.1. Meta-Learning ABoB (Algorithmic Schema)
Inner Tsallis-INF with OMD Update:
```python
import numpy as np

def inner_INF(T, eta_s, w_s, delta_min):
    """One episode of Tsallis-INF via OMD; returns the estimated best arm."""
    x = w_s.copy()                        # meta-learned starting point, x ∈ Δ_δ
    sum_hat_ell = np.zeros(len(x))        # cumulative importance-weighted loss estimates
    for t in range(T):
        I_t = sample_discrete(x)          # draw arm I_t ~ x
        loss_I = observe_loss(I_t)        # bandit feedback: chosen arm's loss only
        hat_ell = np.zeros(len(x))
        hat_ell[I_t] = loss_I / x[I_t]    # unbiased importance-weighted estimator
        sum_hat_ell += hat_ell
        x = OMD_update(x, hat_ell, eta_s, delta_min)  # Tsallis-regularized OMD step
    return int(np.argmin(sum_hat_ell))    # estimated per-episode best arm
```
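The schema leaves `OMD_update` abstract. A minimal sketch of one such step, assuming the $1/2$-Tsallis regularizer $\psi(x) = -2\sum_i \sqrt{x_i}$ and resolving the simplex constraint by bisection on the Lagrange multiplier (the final clipping to the truncated simplex $\Delta_\delta$ is a simplification):

```python
import numpy as np

def OMD_update(x, hat_ell, eta, delta_min, iters=60):
    """One OMD step with the 1/2-Tsallis regularizer psi(x) = -2*sum(sqrt(x)).

    Solves x'_i = (1/sqrt(x_i) + eta*hat_ell_i + lam)^(-2) with lam chosen by
    bisection so that x' sums to one, then clips to the truncated simplex.
    """
    x = np.asarray(x, dtype=float)
    c = 1.0 / np.sqrt(x) + eta * np.asarray(hat_ell, dtype=float)
    # Bracket the multiplier: at lam_lo one coordinate equals 1 (total mass >= 1),
    # at lam_hi every coordinate is below 1/K (total mass < 1).
    lam_lo = 1.0 - c.min()
    lam_hi = np.sqrt(len(x))
    for _ in range(iters):
        lam = 0.5 * (lam_lo + lam_hi)
        mass = np.sum((c + lam) ** -2)
        if mass > 1.0:
            lam_lo = lam
        else:
            lam_hi = lam
    x_new = (c + 0.5 * (lam_lo + lam_hi)) ** -2
    # Crude projection onto the delta-truncated simplex (floor and renormalize).
    x_new = np.maximum(x_new, delta_min)
    return x_new / x_new.sum()
```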
Learning-rate meta-learner (continuous exponentiated weights):
- Maintain a density over a bounded range of candidate learning rates.
- Draw $\eta_s$ from this density at the start of episode $s$, and reweight the density multiplicatively according to the surrogate loss after the episode.
Initialization-point meta-learner (FTL):
- At episode $s$, set $w_s$ to the Follow-The-Leader minimizer of the cumulative Bregman divergence with respect to the estimated best arms $\hat{i}_1, \dots, \hat{i}_{s-1}$ of previous episodes.
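Putting the pieces together, a compact sketch of the outer loop. The discretized exponentiated-weights update over learning rates and the smoothed FTL-style rule shown here are illustrative simplifications of the continuous versions described above, and `inner_INF_with_loss` is a hypothetical variant of `inner_INF` that also reports the episode's surrogate loss.

```python
import numpy as np

def meta_abob(S, T, K, etas, delta_min, meta_lr=0.1):
    """Outer loop: pick eta_s by exponentiated weights, w_s by an FTL-style rule."""
    log_w = np.zeros(len(etas))            # weights over a grid of candidate rates
    best_arm_counts = np.zeros(K)          # history of estimated best arms
    for s in range(S):
        # Learning-rate meta-learner: sample eta_s from the exponentiated-weights density.
        p = np.exp(log_w - log_w.max()); p /= p.sum()
        j = np.random.choice(len(etas), p=p)
        # Initialization meta-learner: smoothed empirical best-arm distribution.
        w_s = (best_arm_counts + 1.0) / (best_arm_counts.sum() + K)
        w_s = np.maximum(w_s, delta_min); w_s /= w_s.sum()
        # Inner learner: one full adversarial bandit episode (hypothetical variant of
        # inner_INF above that also returns a surrogate loss for the sampled rate).
        i_hat, surrogate_loss = inner_INF_with_loss(T, etas[j], w_s, delta_min)
        best_arm_counts[i_hat] += 1
        log_w[j] -= meta_lr * surrogate_loss   # reweight only the sampled rate (simplified)
    return best_arm_counts
```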
4.2. Hierarchical ABoB (Algorithmic Schema)
Hierarchical pseudocode:
```python
for t in range(T):
    I = parent_bandit.sample()                    # parent samples a cluster index
    a = child_bandits[I].sample()                 # child samples an arm within cluster I
    p_cluster = parent_bandit.probabilities[I]    # P(cluster I) at sampling time
    p_arm = child_bandits[I].probabilities[a]     # P(arm a | cluster I) at sampling time
    reward = play(a)                              # play the arm, observe the reward
    child_bandits[I].update(a, reward)            # update only the selected child bandit
    hat_R = reward / (p_cluster * p_arm)          # importance-weighted cluster-reward estimate
    parent_bandit.update(I, hat_R)                # update the parent with the unbiased estimate
```
5. Empirical Evaluation
Empirical validation of both meta-learning and hierarchical ABoB is reported in their respective domains.
Meta-learning ABoB achieves problem-dependent improvements in multi-episode adversarial bandit scenarios, with regret scaling with the Tsallis entropy of the best-arm distribution rather than with the number of arms when that distribution is non-uniform (Osadchiy et al., 2022).
Hierarchical ABoB demonstrates substantial regret reduction in both synthetic and real-world configuration management, for example:
- 1D stochastic: ABoB(Tsallis-INF) achieves a 67% regret reduction over the flat baseline.
- 2D Lipschitz adversarial: reductions of up to 91%.
- Storage-cluster tuning: flat Tsallis-INF incurs regret ≈7,584, while ABoB reduces it to ≈5,543, with reductions of 27–49% across settings.
Nearest-neighbor empirical reward differences confirm the Lipschitz assumption in real system traces (Avin et al., 25 May 2025).
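The kind of check referenced above can be reproduced on any trace of configuration-to-reward measurements; a minimal sketch, where the data layout and the Euclidean metric are assumptions for illustration:

```python
import numpy as np

def nearest_neighbor_lipschitz(configs, rewards):
    """Empirical ratios |r(a) - r(a')| / d(a, a') between nearest-neighbor configurations.

    configs: (n, d) array of configuration parameter vectors.
    rewards: (n,) array of average observed rewards per configuration.
    Small ratios for nearest neighbors are evidence of local smoothness.
    """
    dists = np.linalg.norm(configs[:, None, :] - configs[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # exclude self-distances
    nn = dists.argmin(axis=1)                # nearest neighbor of each configuration
    return np.abs(rewards - rewards[nn]) / dists[np.arange(len(configs)), nn]

# Example on synthetic data: a smooth reward surface over a 2D configuration grid.
cfgs = np.random.rand(200, 2)
rews = np.sin(cfgs[:, 0]) + 0.5 * cfgs[:, 1]   # Lipschitz by construction
print(np.percentile(nearest_neighbor_lipschitz(cfgs, rews), [50, 95]))
```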
6. Applications and Extensions
ABoB’s modular design allows it to be deployed in any repeated bandit setting where inter-episode or inter-arm structure exists.
- Configuration management: large-scale distributed storage systems with parameter sets exhibiting local smoothness, where ABoB dynamically adapts to shifting workload optima.
- Meta-learning in adversarial environments: Tasks where non-uniform best-arm distributions can be exploited for improved cumulative performance.
- Algorithm-agnostic hierarchy: Any adversarial bandit algorithm (EXP3, Tsallis-INF, etc.) is compatible, permitting further advances as base algorithms improve.
This suggests that hierarchical and meta-learned ABoB are especially effective in scenarios combining large action spaces, underlying regularity, and adversarial environment shifts, substantially reducing regret compared with flat baselines while retaining worst-case performance guarantees.