Meta-Learning Bandit Policies

Updated 19 November 2025
  • Meta-learning bandit policies are approaches that leverage shared task structures to reduce exploration costs and accelerate adaptation.
  • They employ an outer meta-learner to infer priors and representations, which dynamically guide bandit action selection in new tasks.
  • Empirical studies reveal significant performance gains in applications such as recommendation systems, Bayesian inference, and contextual decision-making.

Meta-learning bandit policies constitute a research frontier addressing how agents can leverage experience across related bandit problems to achieve superior performance—including lower regret, faster adaptation, and robust exploration/exploitation trade-offs—compared to classical bandit algorithms that treat each new instance from scratch. The meta-learning paradigm operationalizes the sharing of statistical strength: an outer meta-learner infers prior knowledge, structural invariants, or representations by accumulating data from previous tasks, and this knowledge is dynamically incorporated to bias or initialize the inner loop (policy, model, or belief) controlling bandit action selection within novel tasks. The field encompasses algorithmic, theoretical, and applied advances, with performance gains established in settings ranging from neural collaborative filtering to Bayesian prior inference, low-dimensional representation discovery, hierarchical clustering, and meta-optimization over compact policy classes.

1. Problem Settings and Formalizations

Meta-learning bandit policies are formalized as two-level stochastic processes. An environment samples a task (bandit instance) from an unknown or partially known distribution (the "meta-level" prior), and the agent interacts with this task via an inner loop (the "instance-level") for a finite or infinite horizon employing contextual or non-contextual bandit protocols. Representative formulations include:

  • Contextual multi-armed bandits with user- or item-level structure: In neural collaborative filtering bandits, each round delivers an arriving user $u_t$ with $k$ candidate arms $X_t=\{x_{t,1},\ldots,x_{t,k}\}$, and rewards $r_t$ are generated by unknown functions $h_{u_t}(x_{t,i})$, possibly nonlinear (Ban et al., 2022).
  • Hierarchical Bayesian task distributions: A meta-prior $Q$ generates a prior $P_*$ on instance parameters, which then generate task-specific reward functions; e.g., Gaussian priors over arm means in Meta-Thompson Sampling (Kveton et al., 2021).
  • Shared or low-dimensional latent structure: Collections of bandit tasks are assumed to share an affine subspace or representation in $\mathbb{R}^d$; this is learned via online PCA or multitask regularization, and exploited for dimension reduction and improved regret (Bilaj et al., 31 Mar 2024, Cella et al., 2022).
  • Mixture of environments: In multi-environment linear bandits, tasks are sampled from a finite mixture of distributions $\rho_1,\ldots,\rho_m$ over $\mathbb{R}^d$, with environment labels sometimes latent and necessitating meta-classification before adaptation (Moradipari et al., 2022).
  • Meta-learning for exploration policies: The meta-learner parameterizes policy classes (neural or symbolic) and directly optimizes a regret or expected reward objective over a sample of related bandit problems (Kveton et al., 2020, Maes et al., 2012).
  • Meta-learning for simple regret minimization: When only the best arm after $n$ rounds matters (simple regret), meta-learning aggregates exploration patterns or prior estimates over repeated tasks (Azizi et al., 2022).

The main performance metric is meta-regret (or Bayes regret) over the sequence of tasks, contrasted with classical per-task regret or regret relative to an oracle with prior knowledge of all environment statistics.
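
As a concrete reference point, one common way to write the Bayes meta-regret over $m$ tasks of horizon $n$ each is sketched below; the notation (a generic mean-reward map $\mu_{\theta_s}$ and arm set $\mathcal{A}$) is illustrative rather than taken verbatim from any single cited formulation.

```latex
% Illustrative Bayes meta-regret over m tasks of horizon n:
% P_* ~ Q (meta-prior), theta_s ~ P_* (task s), A_{s,t} = arm pulled at round t of task s.
\mathcal{R}(m, n) \;=\;
\mathbb{E}\!\left[\, \sum_{s=1}^{m} \sum_{t=1}^{n}
  \Bigl( \max_{a \in \mathcal{A}} \mu_{\theta_s}(a) \;-\; \mu_{\theta_s}(A_{s,t}) \Bigr) \right]
```

The expectation runs over the meta-prior $Q$, the sampled priors and tasks, and the algorithm's own randomness; per-task regret conditions on a single $\theta_s$, while the oracle benchmark conditions on knowledge of $P_*$.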

2. Principal Meta-Learning Algorithms

Meta-learning bandit policies are instantiated through several algorithmic templates, each emphasizing different forms of knowledge transfer and adaptation:

  • Bayesian Meta-Thompson Sampling (Meta-TS and Meta-TSLB): These maintain a Bayesian meta-posterior on the task-distribution's parameters (e.g., arm means, prior mean/covariance) and, at the start of each new task, sample a prior from this meta-posterior to initialize Thompson Sampling within the task. Meta-update steps analytically integrate (or sample) over observed task histories. In contextual linear bandits, matrix-variate meta-posterior updates generalize Meta-TS to Meta-TSLB, achieving improved regret scaling in both task and time (Kveton et al., 2021, Li et al., 10 Sep 2024, Thornton et al., 2021); a simplified sketch of this template follows the list.
  • Biased-regularized algorithms (Meta-OFUL frameworks): To exploit statistical alignment across linear bandit tasks, a meta-learned bias vector (e.g., population mean or cluster-specific mean for multi-environment settings) replaces the zero vector in ridge-regression regularization for OFUL updates. Pure-exploration phases or clustering algorithms can be employed to classify new tasks into environments before transfer (Cella et al., 2020, Moradipari et al., 2022).
  • Meta-learning surrogate models: Neural Process (NP) models act as flexible surrogates capturing uncertainty over context-arm–reward mappings. The NP is meta-trained across tasks to encode and decode context–reward relationships, with Thompson sampling or information-gain acquisition guiding exploration in the inner loop (Galashov et al., 2019).
  • Gradient-based meta-optimization over differentiable policy classes: Parameterized exploration strategies—softmax elimination, RNNs, or parameterized UCB—are optimized over a sample of bandit tasks by explicit reward-gradient ascent. Differentiable bandit policies are trained by gradient estimators tailored to the bandit feedback, with variance reduction via control variates (optimal-arm, self-play) (Boutilier et al., 2020, Kveton et al., 2020).
  • Imitation learning for exploration/exploitation policy transfer: Meta-learners imitate Bayes-optimal or oracle policies on simulated (or real) bandit histories using aggregation of offline data and DAgger-style updates. This approach is effective in the contextual bandit setting for automated learning of complex exploration heuristics (Sharaf et al., 2019).
  • Automated meta-learning over Q-functions: Architecture and hyperparameter meta-optimization are offloaded to AutoML frameworks: blocks of bandit data are processed sequentially as tasks, and a general meta-learner orchestrates feature engineering, model selection, and exploration/exploitation scheduling (Dutta et al., 2019).
  • Hybrid model-selection/nearest-neighbor meta-policies: In specialized ecological or behavioral settings, such as bee decision making, meta-policies dynamically select among a pool of reference bandit policies (UCB, $\epsilon$-greedy, LinUCB, etc.) by trajectory similarity, minimizing behavioral regret in a windowed fashion (Claeys et al., 18 Oct 2025).
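
The following is a minimal sketch of the Bayesian template above for $K$-armed Gaussian bandits with a Gaussian meta-prior on the shared prior mean. The variable names, the factorized Gaussian meta-posterior, and the closed-form meta-update are simplifying assumptions for exposition; the cited papers maintain the exact meta-posterior over full task histories.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n_rounds, n_tasks = 5, 200, 20
sigma, sigma0, sigma_q = 1.0, 0.5, 1.0         # reward noise, task-level std, meta-prior std

mu_star = rng.normal(0.0, sigma_q, K)          # unknown shared prior mean (simulation only)
meta_mean, meta_var = np.zeros(K), np.full(K, sigma_q**2)  # meta-posterior over mu_star

def thompson_task(theta, prior_mean, prior_var):
    """Thompson sampling for one Gaussian task under a given per-arm Gaussian prior."""
    post_mean, post_var = prior_mean.copy(), prior_var.copy()
    counts, sums = np.zeros(K), np.zeros(K)
    for _ in range(n_rounds):
        a = int(np.argmax(rng.normal(post_mean, np.sqrt(post_var))))  # posterior sampling
        r = rng.normal(theta[a], sigma)
        counts[a] += 1.0
        sums[a] += r
        # Conjugate Gaussian update for the pulled arm.
        post_var[a] = 1.0 / (1.0 / prior_var[a] + counts[a] / sigma**2)
        post_mean[a] = post_var[a] * (prior_mean[a] / prior_var[a] + sums[a] / sigma**2)
    return post_mean, counts

for s in range(n_tasks):
    theta_s = rng.normal(mu_star, sigma0)                     # task drawn from the shared prior
    sampled_prior = rng.normal(meta_mean, np.sqrt(meta_var))  # Meta-TS step: sample a prior mean
    post_mean, counts = thompson_task(theta_s, sampled_prior, np.full(K, sigma0**2))
    # Approximate meta-update: treat each pulled arm's in-task posterior mean as a
    # noisy observation of mu_star (exact Meta-TS integrates over the full history).
    pulled = counts > 0
    obs_var = sigma0**2 + sigma**2 / np.maximum(counts, 1.0)
    new_var = 1.0 / (1.0 / meta_var + 1.0 / obs_var)
    meta_mean = np.where(pulled, new_var * (meta_mean / meta_var + post_mean / obs_var), meta_mean)
    meta_var = np.where(pulled, new_var, meta_var)
```

The transfer mechanism is that later tasks start from a sampled prior whose mean already concentrates near the shared prior mean, so the exploration cost paid early in each new task shrinks as more tasks are observed.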

3. Structural and Statistical Principles

Meta-learning bandit policies leverage statistical structure across tasks to reduce exploration burden and accelerate adaptation. Key principles include:

  • Hierarchical and cluster structure: Designs such as Meta-Ban (Ban et al., 2022) and Bayesian hierarchical frameworks (Wan et al., 2022, Thornton et al., 2021) exploit user/item clusters, environment partitions, or dynamic groupings to transfer inductive biases. Collapsing per-arm or per-user estimation into cluster-level inference (dynamic or static) tightens uncertainty quantification and regret bounds.
  • Representation learning and subspace meta-learning: When task parameters are assumed to lie (approximately) in a shared low-rank subspace, meta-learning recovers this structure (via trace-norm regularization or online PCA) and restricts the per-task policy class to this subspace, yielding dimensionally reduced confidence sets and tighter theoretical guarantees (Bilaj et al., 31 Mar 2024, Cella et al., 2022); see the sketch after this list.
  • Model-based meta-learning with predictive uncertainty: Surrogate models such as Neural Processes encode posterior uncertainty about reward functions, enabling Thompson-style and information-gain driven action selection that natively balances exploration and exploitation across a distribution of tasks (Galashov et al., 2019).
  • Parameterization and policy gradient meta-optimization: Meta-learning frameworks parameterize policies in a family expressive enough to encode sophisticated exploration schedules, then explicitly differentiate the Bayes reward averaged over problem instances (Kveton et al., 2020). Baselines designed for low-variance estimation are crucial for stability and convergence.
  • Imitation of oracle or task-specific strategies: Algorithms such as MELEE (Sharaf et al., 2019) and the MAYA framework (Claeys et al., 18 Oct 2025) rely on access to oracles that simulate the Bayes-optimal or agent-specific policies, then aggregate this advice across synthetic or real tasks to distill a strong exploration policy.
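
A sketch of the subspace idea in the second bullet follows, assuming (purely for illustration) tasks whose parameters lie exactly in a shared $r$-dimensional subspace, SVD-based recovery from per-task ridge estimates, and a LinUCB-style inner loop in the projected space; none of these choices are tied to a specific cited algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, n_meta_tasks, n_samples, noise = 30, 3, 50, 100, 0.1

# --- Meta-training: recover a shared r-dimensional subspace from past tasks. ---
B_true = np.linalg.qr(rng.normal(size=(d, r)))[0]            # unknown shared basis (simulation only)
theta_hats = []
for _ in range(n_meta_tasks):
    theta = B_true @ rng.normal(size=r)                      # task parameter lies in the subspace
    X = rng.normal(size=(n_samples, d))
    y = X @ theta + noise * rng.normal(size=n_samples)
    theta_hats.append(np.linalg.solve(X.T @ X + np.eye(d), X.T @ y))  # per-task ridge estimate

# Top-r left singular vectors of the stacked estimates approximate the shared subspace.
U, _, _ = np.linalg.svd(np.stack(theta_hats, axis=1), full_matrices=False)
B_hat = U[:, :r]                                             # d x r estimated basis

# --- New task: LinUCB-style updates in the projected r-dimensional space. ---
theta_new = B_true @ rng.normal(size=r)
A, b = np.eye(r), np.zeros(r)                                # r x r design matrix, not d x d
for t in range(200):
    contexts = rng.normal(size=(10, d))                      # 10 candidate arms per round
    z = contexts @ B_hat                                     # project contexts onto the subspace
    w_hat = np.linalg.solve(A, b)
    width = np.sqrt(np.sum(z * np.linalg.solve(A, z.T).T, axis=1))
    a = int(np.argmax(z @ w_hat + 0.5 * width))              # optimistic action selection
    reward = contexts[a] @ theta_new + noise * rng.normal()
    A += np.outer(z[a], z[a])
    b += reward * z[a]
```

The point of the projection is that the confidence set and design matrix live in $r$ dimensions rather than $d$, which is the source of the dimension-reduced regret discussed in Section 4.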

4. Theoretical Regret Bounds and Guarantees

Meta-learning bandit policies admit regret bounds that improve over classical worst-case rates by exploiting the meta-level structure:

  • Meta-Ban (neural collaborative filtering bandit): $O(\sqrt{T\log T})$ cumulative regret, eliminating a $\sqrt{\log T}$ factor present in prior linear or neural collaborative-clustering bandit algorithms (Ban et al., 2022).
  • Meta-TSLB for contextual linear bandits: $O((m+\log m)\sqrt{n\log n})$ Bayes regret, where $m$ is the number of tasks and $n$ the steps per task (Li et al., 10 Sep 2024). The dependence is improved compared to prior Meta-TS for Gaussian bandits, which yielded $O(m\sqrt{n\log n} + \sqrt{m}\,n^2\sqrt{\log n})$.
  • Meta-OFUL and regularized greedy meta-policies: In single-environment settings, transfer regret decays as $O(\sqrt{Td\lambda}\,E\|\theta-\mu\|)$, with further improvements via averaging and multitask regularization if environment structure permits (Cella et al., 2020, Moradipari et al., 2022). Multi-environment meta-learning boosts performance by environment clustering and assigning per-cluster biases (Moradipari et al., 2022); a sketch of the bias-regularized estimator appears after this list.
  • Shared subspace and representation learners: Regret per new task scales as $O(r\sqrt{N}(1\vee\sqrt{d/T}))$, where $r\ll d$ is the intrinsic dimension, $d$ the ambient dimension, and $T$ the number of meta-tasks; this interpolates between $d\sqrt{N}$ when $T\ll d$ and the optimal $r\sqrt{N}$ rate as $T\gg d$ (Cella et al., 2022).
  • Bayesian meta-simple regret: For simple regret objectives, meta-learning with access to the meta-prior achieves $O(m/\sqrt{n})$ meta-simple regret; the frequentist approach attains $O(\sqrt{m}\,n + m/\sqrt{n})$, optimal except for a higher constant in the exploration phase (Azizi et al., 2022).
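
The bias-regularized idea behind the Meta-OFUL bound above can be made concrete with the following sketch, which regularizes ridge regression toward a meta-learned center $\mu$ instead of zero. The data-generating setup, the `biased_ridge` helper, and the parameter values are illustrative assumptions, not the cited algorithms.

```python
import numpy as np

def biased_ridge(X, y, mu, lam):
    """Ridge regression regularized toward a meta-learned bias mu instead of zero:
    argmin_theta ||X theta - y||^2 + lam * ||theta - mu||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * mu)

rng = np.random.default_rng(2)
d, lam, noise = 20, 1.0, 0.1

# Meta-learned bias: e.g., the mean of estimates from previously solved tasks.
past_estimates = [rng.normal(0.5, 0.1, size=d) for _ in range(30)]  # stand-in for earlier tasks
mu = np.mean(past_estimates, axis=0)

# New task drawn close to the population mean, so ||theta - mu|| is small.
theta = mu + 0.05 * rng.normal(size=d)

X = rng.normal(size=(15, d))                         # few observations: transfer matters most here
y = X @ theta + noise * rng.normal(size=15)

theta_biased = biased_ridge(X, y, mu, lam)           # regularized toward the meta-learned center
theta_plain = biased_ridge(X, y, np.zeros(d), lam)   # standard ridge (zero center)

print("error with meta bias:", np.linalg.norm(theta_biased - theta))
print("error with zero bias:", np.linalg.norm(theta_plain - theta))
```

When $\|\theta-\mu\|$ is small, the biased estimator localizes $\theta$ from far fewer in-task samples than zero-centered ridge, which is exactly the quantity controlling the transfer-regret term above.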

These analyses combine information-theoretic tools (mutual information, information ratios), online convex optimization, and high-dimensional geometry (empirical Fisher or projected covariance control).

5. Empirical Evaluations and Application Domains

Meta-learning bandit algorithms have demonstrated empirical superiority against classical baselines across a range of domains:

  • Personalized recommendation and collaborative filtering: Meta-Ban outperforms strong baselines (CLUB, NeuUCB-ONE, NeuUCB-IND, etc.) by 10–20% relative cumulative regret on MovieLens, Yelp, and synthetic bandit tasks (Ban et al., 2022).
  • Contextual bandit adaptation in recommendation and control: Neural Processes meta-learned on MovieLens achieve lower RMSE in prediction vs. multitask MLP or MAML, with uncertainty-driven querying further improving data efficiency (Galashov et al., 2019).
  • Imitation learning for exploration: MELEE demonstrates 7–10% normalized cumulative improvement over LinUCB, Bootstrapped UCB, and others on 300 contextual-bandit datasets (Sharaf et al., 2019).
  • Automated Q-learning pipelines: AutoML-based bandit meta-learners achieve 25–40% lower regret on real-world classification datasets (CoverType, Chess, Gamma, etc.) relative to strong online bandit baselines (Dutta et al., 2019).
  • Ecological and behavioral models: The MAYA meta-bandit framework achieves the lowest mean and absolute error in cumulative regret when imitating bee behavior across diverse regimes, outperforming IRL, behavioral cloning, and GLM baselines, with high interpretability (Claeys et al., 18 Oct 2025).
  • Radar and adversarial applications: In waveform-agile radar target tracking, Bayesian Meta-TS reduces the lost-tracking rate by approximately 25% compared to uninformed Thompson Sampling and approaches the “oracle prior” limit after 10 tasks (Thornton et al., 2021).

6. Open Problems and Future Directions

Open challenges and directions include:

  • Scalability and practical deployment: Theoretical guarantees for frameworks like Meta-Ban (Ban et al., 2022) rely on heavy over-parameterization regimes and extensive SGD, raising concerns about scalability for large $n$ or $d$.
  • Task heterogeneity and nonstationarity: Fast-changing or nonstationary group structure (e.g., rapid cluster splitting/merging beyond the $\gamma$-gap) or abrupt environment shifts call for more adaptive meta-inference (Ban et al., 2022, Claeys et al., 18 Oct 2025).
  • Generalization and robustness: Quantifying the degradation in regret when meta-learned priors or representations are misspecified or meta-distributions are shifted (e.g., by out-of-sample $\|\epsilon\|$ perturbation) is critical for robust deployment (Li et al., 10 Sep 2024).
  • Reinforcement learning integration: Extending neural collaborative or subspace meta-bandit ideas to full RL settings or continuous action spaces remains an open area, particularly with respect to credit assignment and delayed rewards (Ban et al., 2022).
  • Automated model selection and interpretability: As shown in ecological domains, explicit model bank selection (e.g., in MAYA) or symbolic meta-policy discovery (as in (Maes et al., 2012)) remains promising for interpretability, but the scaling properties and theoretical guarantees of such approaches need further analysis.

Overall, meta-learning of bandit policies continues to develop as a theoretically grounded, empirically validated, and increasingly practical approach to accelerating learning in sequential decision-making across related domains. The cross-pollination of advances in meta-optimization, Bayesian inference, and representation learning is driving the emergence of policies that adapt near-optimally from the first interaction on new tasks by leveraging the available prior structure.
