
Bi-Level Contextual Bandit Framework

Updated 20 November 2025
  • Bi-Level Contextual Bandit Framework is a dual-layer online learning architecture that combines a contextual bandit operating on encoded inputs with a meta-level optimizing task-specific parameters.
  • It integrates adaptive feature extraction, hyperparameter auto-tuning, multi-task sharing, and policy arbitration to manage non-stationarity, fairness, and resource constraints.
  • Empirical studies demonstrate superior performance over flat, single-level models, with lower cumulative regret and robust adaptation in dynamic environments.

A bi-level contextual bandit framework is a class of online learning algorithms that orchestrate decision-making at two interdependent layers, each serving a distinct statistical or strategic role. At the lower (base) level, a contextual bandit selects arms given appropriately encoded contexts; at the upper (meta) level, another procedure adaptively optimizes a meta-decision process such as feature representation, hyperparameter selection, allocation across populations, or policy arbitration. The bi-level architecture is motivated by settings where the optimal choice of bandit model, configuration, or resource allocation is itself context- or history-dependent, or where fairness, adaptivity, or heterogeneity across sub-populations must be addressed. Recent works formalize, implement, and empirically validate bi-level frameworks for adaptive representation learning (Lin et al., 2018), dynamic tuning (Ding et al., 2021; Kang et al., 2023), multi-task sharing (Jiang et al., 30 Oct 2025), hybrid arbitration (Galozy et al., 2020), and resource allocation with delay-aware fairness (Almasi et al., 13 Nov 2025).

1. Bi-Level Problem Formalization

A typical bi-level contextual bandit system divides the learning process into two nested, interacting optimization problems:

  • Lower Level (Inner): A contextual bandit policy operates on a context-embedded or parameterized space, selecting arms to maximize cumulative (possibly delayed) reward.
  • Upper Level (Meta): An external controller (meta-policy) either tunes parameters, selects representations, allocates resources, switches policies, or sets constraints that affect the operative base bandit.

Representative formalisms include (a generic code sketch of the nested loop follows this list):

  • Adaptive Feature Extraction: The context $x_t \in \mathbb{R}^N$ is mapped to a latent $z_t \in \mathbb{R}^{d_i}$ via a data-driven adaptive encoder $e_i$, with the meta-decision governing the choice and adaptation of $e_i$ (Lin et al., 2018).
  • Auto-Tuning Hyperparameters: The inner bandit $B(\theta)$ depends on a hyperparameter $\theta$, which is adaptively tuned online at the meta-level by a multi-armed (EXP3-type) or continuum-armed (Zooming-TS) outer loop (Ding et al., 2021; Kang et al., 2023).
  • Multi-Task Sharing: Top-level hierarchical priors or empirical Bayes procedures share information across tasks by estimating global parameters, which inform per-task contextual bandits (Jiang et al., 30 Oct 2025).
  • Policy Arbitration: A referee mechanism arbitrates between heterogeneous policies (e.g., standard contextual bandit and state-based MAB) based on relative observed performance (Galozy et al., 2020).
  • Resource Allocation: A meta-layer budgets resource allocation across subgroups under fairness, delay, and operational constraints, with the base-layer bandit targeting the highest-responding individuals within each cell (Almasi et al., 13 Nov 2025).
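
Across these instantiations, the nesting has the same shape: the meta-layer fixes a configuration, the base bandit acts under it, and both levels are updated from the observed reward. The following is a minimal, hypothetical Python sketch of that generic loop; names such as `MetaPolicy`, `BaseBandit`, and `env` are illustrative placeholders, not APIs from any cited paper.

```python
class BaseBandit:
    """Placeholder contextual bandit operating under a given meta-configuration."""
    def select_arm(self, context, config):
        # e.g., a Thompson Sampling or UCB index computed on the encoded context
        raise NotImplementedError
    def update(self, context, arm, reward, config):
        raise NotImplementedError


class MetaPolicy:
    """Placeholder meta-level controller (encoder choice, hyperparameter, allocation, ...)."""
    def select_config(self, history):
        raise NotImplementedError
    def update(self, config, reward):
        raise NotImplementedError


def bilevel_loop(env, meta, base, horizon):
    history = []
    for t in range(horizon):
        config = meta.select_config(history)        # upper level: meta-decision
        context = env.observe_context()
        arm = base.select_arm(context, config)      # lower level: arm selection
        reward = env.pull(arm)
        base.update(context, arm, reward, config)   # inner feedback
        meta.update(config, reward)                 # outer feedback
        history.append((context, config, arm, reward))
    return history
```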

2. Mathematical Objectives and Layerwise Optimization

The dual objectives in bi-level frameworks respectively target (a schematic joint formulation follows the list):

  • Inner Level: Maximize cumulative reward, typically via Bayesian or frequentist contextual bandit estimators (e.g., Thompson Sampling, UCB, GLM-based indices), conditioned on the current meta-setting (encoder, hyperparameter, allocation, or policy switch variable).
  • Outer Level: Optimize over representation (minimize reconstruction or clustering loss), hyperparameter (minimize tuning or calibration regret), group allocation (maximize utility under fairness), or policy mixture (arbitrate with minimum meta-regret).
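
One schematic way to express this coupling, abstracting over the specific instantiations, is as a nested optimization (this generic form is shorthand for exposition, not a formulation taken verbatim from any single cited work):

$$\min_{m \in \mathcal{M}} \; \mathcal{L}_{\text{outer}}\big(m;\, \pi^{*}_{m}\big) \quad \text{subject to} \quad \pi^{*}_{m} \in \arg\max_{\pi} \; \mathbb{E}\Big[\sum_{t=1}^{T} r_t\big(a^{\pi}_t, x_t;\, m\big)\Big]$$

where $m$ denotes the meta-decision (encoder, hyperparameter, allocation, or policy-mixture variable) and $\pi^{*}_{m}$ is the base bandit policy induced by that choice.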

As a concrete example, in adaptive representation learning (Lin et al., 2018), the outer optimization minimizes the per-cluster autoencoder reconstruction loss

$$L_j(\theta_j) = \mathbb{E}_{x\in C_j}\big\Vert x - d_j\big(e_j(x;\theta_j);\, \phi_j\big)\big\Vert^2$$

and the inner level minimizes cumulative contextual bandit regret using the encoded $z_t$.
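
A minimal NumPy sketch of this outer loss, using linear maps in place of trained autoencoders purely for illustration (the shapes, names, and random data below are hypothetical, not from the cited work):

```python
import numpy as np

def cluster_autoencoder_loss(X_cluster, W_enc, W_dec):
    """Mean squared reconstruction error L_j over one cluster C_j.

    X_cluster : (n, N) contexts assigned to cluster j
    W_enc     : (N, d) linear encoder e_j (stand-in for theta_j)
    W_dec     : (d, N) linear decoder d_j (stand-in for phi_j)
    """
    Z = X_cluster @ W_enc            # latent codes z = e_j(x)
    X_hat = Z @ W_dec                # reconstructions d_j(e_j(x))
    return np.mean(np.sum((X_cluster - X_hat) ** 2, axis=1))

# Example usage with synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))       # 100 contexts in R^32
W_e = rng.normal(size=(32, 8))       # encode to R^8
W_d = rng.normal(size=(8, 32))
print(cluster_autoencoder_loss(X, W_e, W_d))
```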

In auto-tuned bandits (Ding et al., 2021), the meta-objective is to minimize

$$R_{\text{total}} = R_{\text{warm-up}} + R_{\text{inner}}(\theta^{*}) + R_{\text{tune}}$$

where $R_{\text{tune}}$ is the regret induced by outer-loop hyperparameter sampling.

3. Algorithmic Structures and Pseudocode Schematics

Bi-level contextual bandit algorithms are implemented through explicit nested loops or coupled updates, typically comprising:

  • Pre-training and Initialization: Offline or warm-up phases for cluster formation or parameter estimation.
  • Online Meta-Decision: At each round or mini-batch, the meta-layer selects or updates the meta-decision (e.g., best encoder, hyperparameter, policy probability, allocation vector).
  • Context Embedding / Parameterization: The meta-decision determines the mapping or operational configuration.
  • Base Bandit Optimization: Arm selection and reward observation using the current context representation/parameterization.
  • Feedback and Adaptivity: Both levels are updated using observed rewards, with periodic or event-driven meta-parameter adjustment.

Adaptive feature extraction with clustered autoencoders (Lin et al., 2018):

1. Pre-train: cluster the offline dataset D into k groups; train one autoencoder per cluster.
2. For each mini-batch:
    For each context x_t:
        Assign x_t to cluster j_t (nearest centroid).
        Encode z_t = e_{j_t}(x_t).
        Run Thompson Sampling on z_t to select arm a_t.
        Update bandit statistics with the observed reward.
    At batch end:
        Re-cluster (mini-batch variant) or update centroids online.
        Retrain the encoders on the new cluster data.
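
A compact runnable sketch of this schematic, using nearest-centroid assignment, fixed random linear "encoders" in place of trained autoencoders, and linear Thompson Sampling as the base bandit; it is a simplified stand-in for the pipeline of Lin et al. (2018), with synthetic data and hypothetical dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k, K, T = 32, 8, 3, 5, 2000     # context dim, latent dim, clusters, arms, horizon

# Synthetic environment: per-arm linear reward on the raw context
theta_true = rng.normal(size=(K, N))

# "Pre-training": random centroids and random linear encoders (autoencoder stand-ins)
centroids = rng.normal(size=(k, N))
encoders = [rng.normal(size=(N, d)) / np.sqrt(N) for _ in range(k)]

# Linear Thompson Sampling state: one ridge model per arm on the d-dim latent space
A = [np.eye(d) for _ in range(K)]      # precision matrices
b = [np.zeros(d) for _ in range(K)]    # response vectors

total_reward = 0.0
for t in range(T):
    x = rng.normal(size=N)                                   # observe context
    j = np.argmin(np.linalg.norm(centroids - x, axis=1))     # nearest-centroid cluster
    z = encoders[j].T @ x                                     # encode context

    # Thompson Sampling: sample a parameter per arm, pick the best predicted reward
    scores = []
    for a in range(K):
        mu = np.linalg.solve(A[a], b[a])
        theta_s = rng.multivariate_normal(mu, np.linalg.inv(A[a]))
        scores.append(theta_s @ z)
    arm = int(np.argmax(scores))

    reward = theta_true[arm] @ x + rng.normal(scale=0.1)
    total_reward += reward

    # Base-bandit update on the latent representation
    A[arm] += np.outer(z, z)
    b[arm] += reward * z

    # Meta-level adaptation: drift the selected centroid toward the new context
    centroids[j] += 0.01 * (x - centroids[j])

print("average reward:", total_reward / T)
```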

Hyperparameter auto-tuning with an EXP3 meta-layer (Ding et al., 2021):

1. For each candidate hyperparameter θ_j, initialize an EXP3 weight.
2. At each round:
    Sample θ_j according to the EXP3 probabilities.
    The inner bandit B(θ_j) selects an arm and observes the reward.
    Update the EXP3 weights with the importance-weighted reward.
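
A runnable sketch of this schematic, with LinUCB as the inner bandit and EXP3 over a small grid of exploration parameters α; the environment, candidate grid, dimensions, and reward rescaling are illustrative choices, not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, T = 5, 4, 3000
alphas = [0.1, 0.5, 1.0, 2.0]          # candidate exploration hyperparameters
gamma = 0.1                            # EXP3 exploration/mixing rate

theta_true = rng.normal(size=(K, d))   # synthetic per-arm reward parameters

# Shared LinUCB state (one ridge model per arm); alpha only rescales the bonus
A = [np.eye(d) for _ in range(K)]
b = [np.zeros(d) for _ in range(K)]

weights = np.ones(len(alphas))         # EXP3 weights over candidate hyperparameters

for t in range(T):
    # Meta level: sample a hyperparameter index via EXP3 probabilities
    probs = (1 - gamma) * weights / weights.sum() + gamma / len(alphas)
    j = rng.choice(len(alphas), p=probs)
    alpha = alphas[j]

    # Base level: LinUCB arm selection under the sampled alpha
    x = rng.normal(size=d)
    ucb = []
    for a in range(K):
        A_inv = np.linalg.inv(A[a])
        mu = A_inv @ b[a]
        ucb.append(mu @ x + alpha * np.sqrt(x @ A_inv @ x))
    arm = int(np.argmax(ucb))

    reward = theta_true[arm] @ x + rng.normal(scale=0.1)
    A[arm] += np.outer(x, x)
    b[arm] += reward * x

    # Meta update: importance-weighted EXP3 step on a [0, 1]-clipped reward
    r01 = float(np.clip((reward + 3.0) / 6.0, 0.0, 1.0))
    weights[j] *= np.exp(gamma * r01 / (probs[j] * len(alphas)))
    weights /= weights.max()           # rescale to avoid numerical overflow

print("learned hyperparameter preference:", weights / weights.sum())
```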

4. Statistical Guarantees and Regret Analysis

Regret analysis in bi-level frameworks decomposes along the dual hierarchy:

  • Representation/Meta Tuning Regret: The additional regret incurred due to the adaptation or selection process at the upper level.
  • Base Bandit Regret: Conditional regret under a (possibly history-dependent) meta-configuration.

Standard regret rates are:

  • $O(d\sqrt{T \log T})$ for contextual Thompson Sampling with fixed or adaptively learned encodings (Lin et al., 2018).
  • $E[R(T)] = \tilde{O}(T^{2/3}) + \tilde{O}(\sqrt{nT \ln n})$ for single-parameter two-layer tuning, and $E[R(T)] = \tilde{O}(T^{2/3}) + \sum_\ell \tilde{O}(\sqrt{n_\ell T \ln n_\ell})$ for $L$-parameter syndicated bandits (Ding et al., 2021).
  • For continuous hyperparameter search with zooming dimension $p_z$, dynamic regret $R_{\text{outer}}(T) \leq \tilde{O}(T^{(p_z+2)/(p_z+3)})$ (Kang et al., 2023).
  • For multi-task hierarchical Bayes, a frequentist regret bound of $O\!\left( d \sqrt{n \log n \,[\sum_j \log n_j + \log N]/\delta} \right)$ for ebmUCB, with an extra $\sqrt{\log n}$ factor for ebmTS (Jiang et al., 30 Oct 2025).

No new regret bounds are derived for adaptive representation learning; standard finite-time bounds of the downstream bandit algorithm are invoked (Lin et al., 2018).

5. Representative Empirical Results and Scenario-Specific Behaviors

Empirical results across frameworks consistently demonstrate that bi-level structures outperform flat or non-adaptive baselines, especially in non-stationary or heterogeneous regimes.

Adaptive Feature Extraction (Lin et al., 2018):

  • On stationary MNIST, the embedding-based contextual bandit converges within roughly 3000–5000 steps, whereas the vanilla contextual bandit still lags beyond 5000 steps.
  • Under context drift or label non-stationarity, the online embedding adapts rapidly and secures superior cumulative reward, while the static contextual bandit collapses to near-random performance.

Hyperparameter Auto-Tuning (Ding et al., 2021, Kang et al., 2023):

  • Syndicated Bandits tuning both exploration and regularization achieves 10–30% lower regret than single-parameter or grid search methods.
  • Continuous dynamic tuning (CDT) with Zooming-TS achieves regret scaling $T^{(p+2)/(p+3)}$ for $p$ hyperparameters, robustly outperforming grid discretization.

Multi-Task/Bayesian Transfer (Jiang et al., 30 Oct 2025):

  • Empirical Bayesian multi-bandit methods achieve lower cumulative and per-instance regret than unshared or non-hierarchical baselines, especially for data-poor tasks.

Policy Arbitration (Galozy et al., 2020):

  • In an mHealth-inspired simulation, the referee mechanism dynamically concentrates on the contextual bandit or the state-based MAB depending on context corruption and state persistence, uniformly reducing regret versus static policies.

Resource Allocation with Delayed Feedback (Almasi et al., 13 Nov 2025):

  • The meta-level resource-allocation bandit (MetaCUB) achieves 20–40% lower cumulative regret than flat bandits, and fairness metrics (allocation-rate parity) remain tightly controlled near 1.00 for all subgroups, outperforming baselines by large margins under delay and operational constraints.

6. Practical Implementation and Adaptivity Considerations

Bi-level bandit frameworks are characterized by:

  • Modular Decomposition: The separation of meta and base-bandit logic is critical, enabling plug-and-play adaptation of bandit strategies within meta-policies.
  • Efficient Feedback Integration: Both levels typically admit low per-round complexity ($O(d^2)$ for inner learning, $O(\log T)$ or $O(n)$ for meta-updates).
  • Adaptivity to Non-Stationarity: Online or mini-batch re-clustering, rolling parameter estimates, and meta-level restarts are integrated to mitigate non-stationarity.
  • Constraints Management: Population-level constraints (fairness, budgets, cooldowns) are systematically encoded at the meta level, often via surrogate utility maximization or UCB over allocations (Almasi et al., 13 Nov 2025); a small sketch of this pattern follows the list.
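
As a purely illustrative sketch of this constraints-management pattern (not the MetaCUB algorithm itself), the meta-level can maintain UCB estimates of per-subgroup response rates and allocate a fixed per-round budget greedily subject to a simple fairness floor; all names, thresholds, and numbers below are hypothetical.

```python
import numpy as np

def allocate_budget(pulls, successes, budget, min_share=0.1, t=1):
    """Meta-level allocation of `budget` units across subgroups via UCB scores,
    with a fairness floor reserving `min_share` of the budget for every subgroup.

    pulls, successes : per-subgroup allocation counts and observed responses
    """
    n_groups = len(pulls)
    means = successes / np.maximum(pulls, 1)
    bonus = np.sqrt(2 * np.log(max(t, 2)) / np.maximum(pulls, 1))
    ucb = means + bonus

    # Fairness floor: every subgroup receives a minimum share of the budget
    alloc = np.full(n_groups, int(min_share * budget))
    remaining = budget - alloc.sum()

    # Spend the remaining budget greedily on the highest-UCB subgroups
    for g in np.argsort(-ucb):
        if remaining <= 0:
            break
        extra = min(remaining, int(0.5 * budget))   # per-group top-up cap (illustrative)
        alloc[g] += extra
        remaining -= extra
    return alloc

# Example: 4 subgroups, 100 units available this round
pulls = np.array([50, 20, 5, 80])
successes = np.array([10, 8, 1, 12])
print(allocate_budget(pulls, successes, budget=100, t=155))
```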

A plausible implication is that as application domains demand greater adaptivity, interpretability, and fairness, bi-level contextual bandit architectures will become foundational, offering a principled route to operationalizing meta-decision control in sequential online environments.

7. Connections to Existing and Emerging Literature

The bi-level contextual bandit methodology subsumes a diversity of existing approaches:

  • Representation Learning: Adaptive clustering and autoencoder-based encodings (Lin et al., 2018).
  • Meta-Learning and Hyperparameter Optimization: Multi-armed and continuum-armed bandits at the meta-policy level for tuning algorithmic parameters (Ding et al., 2021, Kang et al., 2023).
  • Hierarchical Multi-Task Learning: Empirical Bayes and transfer learning for joint estimation across multiple tasks with instance-specific heterogeneity (Jiang et al., 30 Oct 2025).
  • Policy Ensembles and Arbitration: Mixture-of-expert and referee-based approaches for resolving uncertainty regarding the reliability of context vs. state-based knowledge (Galozy et al., 2020).
  • Resource Allocation and Fairness: Explicit meta-layer handling sub-budgeting, fairness, and delay-aware optimization in societal-scale resource allocation (Almasi et al., 13 Nov 2025).

The bi-level framework is thus a unifying abstraction, and recent empirical and theoretical work establishes its statistical and operational benefits. Ongoing research extends this paradigm to reinforcement learning, structured combinatorial bandits, nonparametric transfer, and domains with richer feedback and constraints.
