
Meta-Learning Bandit Policies by Gradient Ascent

Published 9 Jun 2020 in cs.LG and stat.ML | (2006.05094v2)

Abstract: Most bandit policies are designed to either minimize regret in any problem instance, making very few assumptions about the underlying environment, or in a Bayesian sense, assuming a prior distribution over environment parameters. The former are often too conservative in practical settings, while the latter require assumptions that are hard to verify in practice. We study bandit problems that fall between these two extremes, where the learning agent has access to sampled bandit instances from an unknown prior distribution $\mathcal{P}$ and aims to achieve high reward on average over the bandit instances drawn from $\mathcal{P}$. This setting is of particular importance because it lays foundations for meta-learning of bandit policies and reflects more realistic assumptions in many practical domains. We propose the use of parameterized bandit policies that are differentiable and can be optimized using policy gradients. This provides a broadly applicable framework that is easy to implement. We derive reward gradients that reflect the structure of bandit problems and policies, for both non-contextual and contextual settings, and propose a number of interesting policies that are both differentiable and have low regret. Our algorithmic and theoretical contributions are supported by extensive experiments that show the importance of baseline subtraction, learned biases, and the practicality of our approach on a range of problems.

Citations (9)

Summary

  • The paper introduces a gradient ascent framework that meta-learns bandit policies to minimize Bayes regret and maximize expected reward.
  • It employs advanced baseline subtraction techniques to significantly reduce gradient estimation variance and ensure efficient convergence.
  • Empirical results demonstrate up to a 95% regret reduction in contextual bandit settings, validating its robustness and scalability.

Meta-Learning Bandit Policies by Gradient Ascent — Expert Summary

Problem Formulation and Meta-Learning Paradigm

The paper presents a meta-learning framework for bandit policy optimization situated between traditional minimax and Bayesian approaches. Instead of worst-case instance analysis or a known prior, the setting assumes access to bandit problem instances sampled from an unknown distribution $\mathcal{P}$. The central objective is to learn parameterized, differentiable policies that minimize Bayes regret, and equivalently maximize Bayes reward, in expectation over $\mathcal{P}$. This two-level optimization problem, in which policies adapt efficiently within each instance while their parameters are tuned meta-optimally over the prior, naturally motivates policy-gradient-based meta-learning.
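
Written out, the meta-objective takes the following form (a notational sketch consistent with the setup above; the symbols $\pi_w$, $Y_t$, and $\mu_*(\theta)$ are introduced here for illustration):

$$
\max_{w}\; r(w), \qquad
r(w) \;=\; \mathbb{E}_{\theta \sim \mathcal{P}}\!\left[\, \mathbb{E}\!\left[\sum_{t=1}^{n} Y_t \;\middle|\; \pi_w,\ \theta \right] \right],
$$

where $\theta$ is a bandit instance drawn from $\mathcal{P}$, $\pi_w$ is a policy with parameters $w$, and $Y_t$ is the reward collected at round $t$ over a horizon of $n$ rounds. The Bayes regret is $\mathbb{E}_{\theta \sim \mathcal{P}}[\, n\,\mu_*(\theta)\,] - r(w)$, with $\mu_*(\theta)$ the mean of the best arm in instance $\theta$, so maximizing Bayes reward and minimizing Bayes regret are equivalent in $w$.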

Gradient-Based Policy Optimization Algorithm

The proposed algorithm ("GradBand") performs batch gradient ascent on policy parameters $w$ by generating simulated bandit trajectories on instances sampled from $\mathcal{P}$. Per-iteration complexity is $O(Kmn)$, where $K$ is the number of arms, $m$ is the batch size, and $n$ is the horizon. Crucially, the algorithm is agnostic to the explicit functional form of $\mathcal{P}$: all required statistics are estimated empirically from sampled instances. The reward gradient derivation leverages the additivity and sequential independence properties of the bandit process (Figure 1).

Figure 1: The Bayes regret and reward gradients, showing gradient estimation variance for different baseline choices.
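
The following is a minimal, illustrative sketch of such a batch gradient-ascent loop with a generic score-function (REINFORCE-style) estimator. The `Policy` interface, the `sample_instance` callable, and the mean-return baseline are assumptions for the example, not the paper's exact estimator (which exploits the bandit structure more directly):

```python
import numpy as np

def grad_band(policy, sample_instance, m=100, n=200, iters=50, lr=0.1):
    """Batch gradient ascent on policy parameters w (illustrative sketch).

    policy          -- object with: w (np.ndarray), act(history) -> arm,
                       grad_log_prob(history, arm) -> d/dw log pi_w(arm | history)
    sample_instance -- callable returning a bandit instance with .pull(arm) -> reward
    m               -- batch size (instances per iteration)
    n               -- horizon
    """
    for _ in range(iters):
        grads, returns = [], []
        for _ in range(m):
            instance = sample_instance()              # theta ~ P (empirical access to the prior)
            history, score, total = [], np.zeros_like(policy.w), 0.0
            for _t in range(n):
                arm = policy.act(history)
                reward = instance.pull(arm)
                score += policy.grad_log_prob(history, arm)   # accumulate score function
                total += reward
                history.append((arm, reward))
            grads.append(score)
            returns.append(total)
        returns = np.array(returns)
        baseline = returns.mean()                     # simple control variate; the paper studies stronger baselines
        g = np.mean([(R - baseline) * s for R, s in zip(returns, grads)], axis=0)
        policy.w = policy.w + lr * g                  # ascent step on the estimated Bayes reward
    return policy
```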

To address high gradient estimation variance, the methodology employs advanced forms of baseline subtraction, including both "optimal" (oracle) and "self" (run-based) baselines, yielding substantial variance reduction and more stable gradient trajectories under stochastic sampling.
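
Schematically, for any baseline $b$ that does not depend on the sampled trajectory, the score-function gradient remains unbiased (the notation here is a sketch, not the paper's exact derivation):

$$
\nabla_w r(w) \;=\; \mathbb{E}\!\left[\,\big(R - b\big)\,\nabla_w \log p_w(\text{trajectory})\,\right],
\qquad
\mathbb{E}\!\left[\,b\,\nabla_w \log p_w(\text{trajectory})\,\right] = 0,
$$

where $R$ is the return of the trajectory. The "optimal" baseline is the variance-minimizing choice of $b$ (an oracle quantity), while the "self" baseline estimates $b$ from the same batch of simulated runs.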

Differentiable Policy Classes and Their Analysis

Non-Contextual Setting

Three differentiable policy families are introduced:

  • EXP3: A softmax-based randomized policy parameterized by its exploration rate, suitable for adversarial bandits but generally over-explorative in stochastic regimes. The policy is analytically differentiable via a closed-form score function.
  • SoftElim: A novel softmax policy that progressively "soft eliminates" arms with large empirical gaps and frequent sampling, modulated by a learnable exploration parameter. The regret bound for SoftElim is $O(\sum_{i \neq 1} (16/\Delta_i)\log n)$ for $K$ arms, matching the gap- and time-dependence of classic UCB algorithms but tunable via $w$ (Figure 2; a code sketch follows this list).

    Figure 2: The Bayes regret as a function of gradient ascent iterations in a Bernoulli bandit.

  • RNN Policies: History-dependent policies parameterized by the weights of a recurrent neural network and meta-learned via gradients. The sequence encoding and softmax output enable instance-adaptive exploration strategies that are not tractable in analytic policies.
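
A minimal sketch of a SoftElim-style arm distribution, as referenced above: arms with large empirical gaps that have already been sampled often are exponentially down-weighted, with a single learnable parameter $w$ controlling exploration. The exact score used in the paper may differ; the pull-count-times-squared-gap form below is an illustrative assumption:

```python
import numpy as np

def softelim_probs(pulls, means, w):
    """SoftElim-style sampling distribution (illustrative form).

    pulls -- array of pull counts T_i per arm
    means -- array of empirical mean rewards mu_hat_i per arm
    w     -- learnable exploration parameter (w > 0); larger w explores more
    """
    gaps = means.max() - means          # empirical gaps to the best arm
    scores = -pulls * gaps ** 2 / w     # heavily sampled, clearly worse arms are "soft eliminated"
    scores -= scores.max()              # numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Example: arm 1 looks worse and has been pulled often, so it receives little probability.
print(softelim_probs(np.array([20, 20, 3]), np.array([0.6, 0.4, 0.5]), w=1.0))
```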

Contextual Setting

The framework generalizes to contextual bandits by learning a context projection $W$ to an informative subspace and leveraging linear bandit estimators in the projected space. Differentiable policies include:

  • Contextual SoftElim: Extends SoftElim to context-projected reward gaps and confidence widths, with regret scaling as $\tilde{O}(K^2 d \sqrt{n})$ (a code sketch follows this list).
  • Contextual Thompson Sampling (TS): TS policies are differentiated by including sampled posterior means as explicit variables in the gradient, with the derived reward gradient involving expectations over posterior samples, leading to higher estimator variance.
  • $\epsilon$-greedy: Included as a baseline, with gradients taken only with respect to the scalar exploration rate.
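
A minimal sketch of the contextual variant referenced above, assuming a learned projection $W$, a ridge-regression estimate in the projected space, and the same soft-elimination softmax; confidence-width modulation is omitted for brevity, and all names are illustrative rather than the paper's implementation:

```python
import numpy as np

def contextual_softelim_probs(W, X, history, w, reg=1.0):
    """Contextual SoftElim-style sketch: project contexts with a learned W,
    fit ridge regression in the projected space, then soft-eliminate arms.

    W       -- learned projection matrix, shape (k, d), with k << d
    X       -- contexts of the K arms this round, shape (K, d)
    history -- list of (context, reward) pairs observed so far
    w       -- learnable exploration parameter
    """
    Z = X @ W.T                                          # project arm contexts to k dimensions
    if history:
        Zh = np.array([x @ W.T for x, _ in history])     # projected past contexts
        y = np.array([r for _, r in history])
        A = Zh.T @ Zh + reg * np.eye(W.shape[0])
        theta = np.linalg.solve(A, Zh.T @ y)             # ridge estimate in the projected space
    else:
        theta = np.zeros(W.shape[0])
    est = Z @ theta                                      # estimated rewards per arm
    gaps = est.max() - est
    scores = -gaps ** 2 / w                              # soft elimination on estimated gaps
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()
```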

Projection matrices learned by meta-gradient optimization reliably recover the relevant task subspace, outperforming method-of-moments baseline subspace estimators and simple bias-based regularization schemes (Figure 3).

Figure 3: Covariance matrices $\Sigma_\theta$ of synthetic contextual bandit problems, visualizing learned task subspaces.

Empirical Results: Regret Minimization and Robustness

Extensive experiments across simulated, synthetic, and real-world multi-class classification problems demonstrate:

  1. Substantial improvements in Bayes regret—optimized SoftElim and contextual policies outperform UCB, Thompson Sampling, and Gittins index baselines in most regimes.
  2. Baseline subtraction (notably $b^{\text{self}}$) drastically reduces reward gradient variance, enabling efficient convergence within a few dozen gradient steps.
  3. Learned policies are robust to batch size, horizon length, and moderate prior misspecification. Regret scaling as $\log n$ corroborates theoretical results in the main text.
  4. RNN policies display competitive performance, exhibiting instance-specialized exploration strategies and robustness to distractor arms in high-dimensional problems (Figure 4).

    Figure 4: The Bayes regret of RNN policies, showing average and median performance over multiple optimization runs.

  5. On real-world datasets (e.g., from the UCI ML Repository), meta-learned contextual policies achieve up to 95% regret reduction versus untuned baselines, with empirical projection matrices recovering task-relevant features (Figure 5).

Figure 5: The Bayes regret of the meta-learned and $\epsilon$-greedy policies on classification benchmarks after meta-gradient optimization.

Theoretical Considerations and Strong Claims

The analysis proves concavity of the Bayes reward with respect to the exploration horizon in explore-then-commit policies, thus establishing global convergence guarantees for gradient ascent in this restricted setting. Furthermore:

  • The paper derives, for the first time, the reward gradient for Thompson sampling policies in non-contextual and contextual settings.
  • Empirical evidence suggests near-concavity and unimodality of regret landscapes in all tested differentiable policies.
  • The paper claims that baseline-subtracted gradient estimates allow near-optimal policy learning with orders of magnitude less computational expense than classical dynamic programming (Gittins), and greater stability than deep RL methods (DQN).

Limitations, Computational Tradeoffs, and Scaling

The most critical limitation is variance in empirical gradient estimation, particularly acute in RNN policies, which motivates further research in scalable control variate methods. Although baseline subtraction and curriculum learning yield practical improvements, theoretical convergence guarantees outside of the established concave cases remain open. Moreover, the framework assumes the ability to simulate complete trajectories on sampled instances, which may not always be possible in applied settings.

Computationally, the approach is parallelizable and feasible for problems with large $K$, $d$, and $n$, provided memory constraints on the $Kmn$ statistics are observed. Comparing Gittins index computation (days of compute) to meta-gradient policy learning (seconds to minutes), the method presents attractive scalability characteristics.
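
As a concrete, hedged illustration of the explore-then-commit concavity claim above (a two-armed sketch with notation introduced here, not the paper's exact derivation):

$$
r(h) \;=\; h\,(\mu_1 + \mu_2) \;+\; (n - 2h)\,\Big[\mu_1 - \Delta\,\mathbb{P}\big(\hat{\mu}_{2,h} > \hat{\mu}_{1,h}\big)\Big],
\qquad \Delta = \mu_1 - \mu_2 > 0,
$$

where the policy pulls each arm $h$ times and commits to the empirically better arm for the remaining $n - 2h$ rounds, and $\hat{\mu}_{i,h}$ is the empirical mean of arm $i$ after exploration. The Bayes reward is the expectation of $r(h)$ over instances drawn from $\mathcal{P}$; the paper's result is that this objective is concave in the (suitably relaxed) exploration horizon, so gradient ascent attains the global optimum within this restricted policy class.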

Implications, Practical Impact, and Future Directions

The meta-learning approach enables:

  • Automated, data-driven tuning of bandit algorithms for specific application regimes (clinical trials, adaptive recommendations, etc.) where accurate prior modeling is not feasible.
  • Instance-adaptive exploration guided by empirical prior characteristics as opposed to the worst-case or strictly Bayesian assumptions.
  • Integration of deep representation learning (e.g., RNNs, context projection) within a theoretically motivated framework.
  • Immediate extension to complex settings such as generalized linear bandits, combinatorial actions, and partial monitoring, by designing differentiable policy classes and reward structures (Figure 6).

    Figure 6: The Bayes regret of different policies under varying meta-learned subspace configurations.

Research directions include: advancing variance reduction for high-dimensional policies, extending concavity results, and analyzing global convergence for softmax-based meta-learning in broader settings.

Conclusion

This work establishes a rigorous, scalable mechanism for meta-learning bandit policies by gradient ascent over sampled task priors. The integration of differentiable policy classes, theoretical analysis of reward gradients, and empirically validated variance reduction positions the approach as a practical alternative to conventional bandit and RL tuning. The framework's adaptability and performance gains suggest promising avenues for meta-learning exploration strategies in sequential decision-making under uncertainty.
