
Meta-Learning Bandit Policies by Gradient Ascent

Published 9 Jun 2020 in cs.LG and stat.ML | (2006.05094v2)

Abstract: Most bandit policies are designed to either minimize regret in any problem instance, making very few assumptions about the underlying environment, or in a Bayesian sense, assuming a prior distribution over environment parameters. The former are often too conservative in practical settings, while the latter require assumptions that are hard to verify in practice. We study bandit problems that fall between these two extremes, where the learning agent has access to sampled bandit instances from an unknown prior distribution $\mathcal{P}$ and aims to achieve high reward on average over the bandit instances drawn from $\mathcal{P}$. This setting is of particular importance because it lays foundations for meta-learning of bandit policies and reflects more realistic assumptions in many practical domains. We propose the use of parameterized bandit policies that are differentiable and can be optimized using policy gradients. This provides a broadly applicable framework that is easy to implement. We derive reward gradients that reflect the structure of bandit problems and policies, for both non-contextual and contextual settings, and propose a number of interesting policies that are both differentiable and have low regret. Our algorithmic and theoretical contributions are supported by extensive experiments that show the importance of baseline subtraction, learned biases, and the practicality of our approach on a range of problems.

Citations (9)

Summary

  • The paper introduces a gradient ascent framework that meta-learns bandit policies to minimize Bayes regret and maximize expected reward.
  • It employs advanced baseline subtraction techniques to significantly reduce gradient estimation variance and ensure efficient convergence.
  • Empirical results demonstrate up to a 95% regret reduction in contextual bandit settings, validating its robustness and scalability.

Meta-Learning Bandit Policies by Gradient Ascent — Expert Summary

Problem Formulation and Meta-Learning Paradigm

The paper presents a meta-learning framework for bandit policy optimization situated between traditional minimax and Bayesian approaches. Instead of worst-case instance analysis or a known prior, the setting assumes access to bandit problem instances sampled from an unknown distribution $\mathcal{P}$. The central objective is to learn parameterized, differentiable policies that minimize Bayes regret, and equivalently maximize Bayes reward, in expectation over $\mathcal{P}$. This two-level optimization problem, in which policies adapt efficiently within each instance while their parameters are tuned meta-optimally over the prior, naturally motivates policy-gradient-based meta-learning.
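
Written out, the meta-objective takes the following form (a notational sketch consistent with the setup above; the symbols $\pi_w$, $Y_t$, and $\mu_*(\theta)$ are introduced here for illustration):

$$
\max_{w}\; r(w), \qquad
r(w) \;=\; \mathbb{E}_{\theta \sim \mathcal{P}}\!\left[\, \mathbb{E}\!\left[\sum_{t=1}^{n} Y_t \;\middle|\; \pi_w,\ \theta \right] \right],
$$

where $\theta$ is a bandit instance drawn from $\mathcal{P}$, $\pi_w$ is a policy with parameters $w$, and $Y_t$ is the reward collected at round $t$ over a horizon of $n$ rounds. The Bayes regret is $\mathbb{E}_{\theta \sim \mathcal{P}}[\, n\,\mu_*(\theta)\,] - r(w)$, with $\mu_*(\theta)$ the mean of the best arm in instance $\theta$, so maximizing Bayes reward and minimizing Bayes regret are equivalent in $w$.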

Gradient-Based Policy Optimization Algorithm

The proposed algorithm ("GradBand") performs batch gradient ascent on policy parameters $w$ by generating simulated bandit trajectories on instances sampled from $\mathcal{P}$. Per-iteration complexity is $O(Kmn)$, where $K$ is the number of arms, $m$ is the batch size, and $n$ is the horizon. Crucially, the algorithm is agnostic to the explicit functional form of $\mathcal{P}$: all required statistics are estimated empirically from sampled instances. The reward gradient derivation leverages the additivity and sequential independence properties of the bandit process (Figure 1).

Figure 1: The Bayes regret and reward gradients, showing gradient estimation variance for different baseline choices.
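
The following is a minimal, illustrative sketch of such a batch gradient-ascent loop with a generic score-function (REINFORCE-style) estimator. The `Policy` interface, the `sample_instance` callable, and the mean-return baseline are assumptions for the example, not the paper's exact estimator (which exploits the bandit structure more directly):

```python
import numpy as np

def grad_band(policy, sample_instance, m=100, n=200, iters=50, lr=0.1):
    """Batch gradient ascent on policy parameters w (illustrative sketch).

    policy          -- object with: w (np.ndarray), act(history) -> arm,
                       grad_log_prob(history, arm) -> d/dw log pi_w(arm | history)
    sample_instance -- callable returning a bandit instance with .pull(arm) -> reward
    m               -- batch size (instances per iteration)
    n               -- horizon
    """
    for _ in range(iters):
        grads, returns = [], []
        for _ in range(m):
            instance = sample_instance()              # theta ~ P (empirical access to the prior)
            history, score, total = [], np.zeros_like(policy.w), 0.0
            for _t in range(n):
                arm = policy.act(history)
                reward = instance.pull(arm)
                score += policy.grad_log_prob(history, arm)   # accumulate score function
                total += reward
                history.append((arm, reward))
            grads.append(score)
            returns.append(total)
        returns = np.array(returns)
        baseline = returns.mean()                     # simple control variate; the paper studies stronger baselines
        g = np.mean([(R - baseline) * s for R, s in zip(returns, grads)], axis=0)
        policy.w = policy.w + lr * g                  # ascent step on the estimated Bayes reward
    return policy
```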

To address high gradient estimation variance, the methodology employs advanced forms of baseline subtraction, including both "optimal" (oracle) and "self" (run-based) baselines, yielding substantial variance reduction and more stable gradient trajectories under stochastic sampling.
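
Schematically, for any baseline $b$ that does not depend on the sampled trajectory, the score-function gradient remains unbiased (the notation here is a sketch, not the paper's exact derivation):

$$
\nabla_w r(w) \;=\; \mathbb{E}\!\left[\,\big(R - b\big)\,\nabla_w \log p_w(\text{trajectory})\,\right],
\qquad
\mathbb{E}\!\left[\,b\,\nabla_w \log p_w(\text{trajectory})\,\right] = 0,
$$

where $R$ is the return of the trajectory. The "optimal" baseline is the variance-minimizing choice of $b$ (an oracle quantity), while the "self" baseline estimates $b$ from the same batch of simulated runs.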

Differentiable Policy Classes and Their Analysis

Non-Contextual Setting

Three differentiable policy families are introduced:

  • EXP3: A softmax-based randomized policy parameterized by its exploration rate, suitable for adversarial bandits but generally over-explorative in stochastic regimes. The policy is analytically differentiable via a closed-form score function.
  • SoftElim: A novel softmax policy that progressively "soft eliminates" arms with large empirical gaps and frequent sampling, modulated by a learnable exploration parameter. The regret bound for SoftElim is $O(\sum_{i \neq 1} (16/\Delta_i)\log n)$ for $K$ arms, matching the gap- and time-dependence of classic UCB algorithms but tunable via $w$ (Figure 2; a code sketch follows this list).

    Figure 2: The Bayes regret as a function of gradient ascent iterations in a Bernoulli bandit.

  • RNN Policies: History-dependent policies parameterized by the weights of a recurrent neural network and meta-learned via gradients. The sequence encoding and softmax output enable instance-adaptive exploration strategies that are not tractable in analytic policies.
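
A minimal sketch of a SoftElim-style arm distribution, as referenced above: arms with large empirical gaps that have already been sampled often are exponentially down-weighted, with a single learnable parameter $w$ controlling exploration. The exact score used in the paper may differ; the pull-count-times-squared-gap form below is an illustrative assumption:

```python
import numpy as np

def softelim_probs(pulls, means, w):
    """SoftElim-style sampling distribution (illustrative form).

    pulls -- array of pull counts T_i per arm
    means -- array of empirical mean rewards mu_hat_i per arm
    w     -- learnable exploration parameter (w > 0); larger w explores more
    """
    gaps = means.max() - means          # empirical gaps to the best arm
    scores = -pulls * gaps ** 2 / w     # heavily sampled, clearly worse arms are "soft eliminated"
    scores -= scores.max()              # numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Example: arm 1 looks worse and has been pulled often, so it receives little probability.
print(softelim_probs(np.array([20, 20, 3]), np.array([0.6, 0.4, 0.5]), w=1.0))
```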

Contextual Setting

The framework generalizes to contextual bandits by learning a context projection $W$ to an informative subspace and leveraging linear bandit estimators in the projected space. Differentiable policies include:

  • Contextual SoftElim: Extends SoftElim to context-projected reward gaps and confidence widths, with regret scaling as $\tilde{O}(K^2 d \sqrt{n})$ (a code sketch follows this list).
  • Contextual Thompson Sampling (TS): TS policies are differentiated by including sampled posterior means as explicit variables in the gradient, with the derived reward gradient involving expectations over posterior samples, leading to higher estimator variance.
  • $\epsilon$-greedy: Included as a baseline, with gradients taken only with respect to the scalar exploration rate.
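
A minimal sketch of the contextual variant referenced above, assuming a learned projection $W$, a ridge-regression estimate in the projected space, and the same soft-elimination softmax; confidence-width modulation is omitted for brevity, and all names are illustrative rather than the paper's implementation:

```python
import numpy as np

def contextual_softelim_probs(W, X, history, w, reg=1.0):
    """Contextual SoftElim-style sketch: project contexts with a learned W,
    fit ridge regression in the projected space, then soft-eliminate arms.

    W       -- learned projection matrix, shape (k, d), with k << d
    X       -- contexts of the K arms this round, shape (K, d)
    history -- list of (context, reward) pairs observed so far
    w       -- learnable exploration parameter
    """
    Z = X @ W.T                                          # project arm contexts to k dimensions
    if history:
        Zh = np.array([x @ W.T for x, _ in history])     # projected past contexts
        y = np.array([r for _, r in history])
        A = Zh.T @ Zh + reg * np.eye(W.shape[0])
        theta = np.linalg.solve(A, Zh.T @ y)             # ridge estimate in the projected space
    else:
        theta = np.zeros(W.shape[0])
    est = Z @ theta                                      # estimated rewards per arm
    gaps = est.max() - est
    scores = -gaps ** 2 / w                              # soft elimination on estimated gaps
    scores -= scores.max()
    p = np.exp(scores)
    return p / p.sum()
```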

Projection matrices learned by meta-gradient optimization reliably recover the relevant task subspace, outperforming method-of-moments baseline subspace estimators and simple bias-based regularization schemes (Figure 3).

Figure 3: Covariance matrices $\Sigma_\theta$ of synthetic contextual bandit problems, visualizing learned task subspaces.

Empirical Results: Regret Minimization and Robustness

Extensive experiments across simulated, synthetic, and real-world multi-class classification problems demonstrate:

  1. Substantial improvements in Bayes regret—optimized SoftElim and contextual policies outperform UCB, Thompson Sampling, and Gittins index baselines in most regimes.
  2. Baseline subtraction (notably $b^{\text{self}}$) drastically reduces reward gradient variance, enabling efficient convergence within a few dozen gradient steps.
  3. Learned policies are robust to batch size, horizon length, and moderate prior misspecification. Regret scaling as $\log n$ corroborates theoretical results in the main text.
  4. RNN policies display competitive performance, exhibiting instance-specialized exploration strategies and robustness to distractor arms in high-dimensional problems (Figure 4).

    Figure 4: The Bayes regret of RNN policies, showing average and median performance over multiple optimization runs.

  5. On real-world datasets (e.g., from the UCI ML Repository), meta-learned contextual policies achieve up to 95% regret reduction versus untuned baselines, with empirical projection matrices recovering task-relevant features (Figure 5).

Figure 5: The Bayes regret of the meta-learned and $\epsilon$-greedy policies on classification benchmarks after meta-gradient optimization.

Theoretical Considerations and Strong Claims

The analysis proves concavity of the Bayes reward with respect to the exploration horizon in explore-then-commit policies, thus establishing global convergence guarantees for gradient ascent in this restricted setting. Furthermore:

  • The paper derives, for the first time, the reward gradient for Thompson sampling policies in non-contextual and contextual settings.
  • Empirical evidence suggests near-concavity and unimodality of regret landscapes in all tested differentiable policies.
  • The paper claims that baseline-subtracted gradient estimates allow near-optimal policy learning with orders of magnitude less computational expense than classical dynamic programming (Gittins), and greater stability than deep RL methods (DQN).

Limitations, Computational Tradeoffs, and Scaling

The most critical limitation is variance in empirical gradient estimation, particularly acute in RNN policies, which motivates further research in scalable control variate methods. Although baseline subtraction and curriculum learning yield practical improvements, theoretical convergence guarantees outside of the established concave cases remain open. Moreover, the framework assumes the ability to simulate complete trajectories on sampled instances, which may not always be possible in applied settings.

Computationally, the approach is parallelizable and feasible for problems with large $K$, $d$, and $n$, provided memory constraints on the $Kmn$ statistics are observed. Comparing Gittins index computation (days of compute) to meta-gradient policy learning (seconds to minutes), the method presents attractive scalability characteristics.
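
As a concrete, hedged illustration of the explore-then-commit concavity claim above (a two-armed sketch with notation introduced here, not the paper's exact derivation):

$$
r(h) \;=\; h\,(\mu_1 + \mu_2) \;+\; (n - 2h)\,\Big[\mu_1 - \Delta\,\mathbb{P}\big(\hat{\mu}_{2,h} > \hat{\mu}_{1,h}\big)\Big],
\qquad \Delta = \mu_1 - \mu_2 > 0,
$$

where the policy pulls each arm $h$ times and commits to the empirically better arm for the remaining $n - 2h$ rounds, and $\hat{\mu}_{i,h}$ is the empirical mean of arm $i$ after exploration. The Bayes reward is the expectation of $r(h)$ over instances drawn from $\mathcal{P}$; the paper's result is that this objective is concave in the (suitably relaxed) exploration horizon, so gradient ascent attains the global optimum within this restricted policy class.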

Implications, Practical Impact, and Future Directions

The meta-learning approach enables:

  • Automated, data-driven tuning of bandit algorithms for specific application regimes (clinical trials, adaptive recommendations, etc.) where accurate prior modeling is not feasible.
  • Instance-adaptive exploration guided by empirical prior characteristics as opposed to the worst-case or strictly Bayesian assumptions.
  • Integration of deep representation learning (e.g., RNNs, context projection) within a theoretically motivated framework.
  • Immediate extension to complex settings such as generalized linear bandits, combinatorial actions, and partial monitoring, by designing differentiable policy classes and reward structures (Figure 6).

    Figure 6: The Bayes regret of different policies under varying meta-learned subspace configurations.

Research directions include: advancing variance reduction for high-dimensional policies, extending concavity results, and analyzing global convergence for softmax-based meta-learning in broader settings.

Conclusion

This work establishes a rigorous, scalable mechanism for meta-learning bandit policies by gradient ascent over sampled task priors. The integration of differentiable policy classes, theoretical analysis of reward gradients, and empirically validated variance reduction positions the approach as a practical alternative to conventional bandit and RL tuning. The framework's adaptability and performance gains suggest promising avenues for meta-learning exploration strategies in sequential decision-making under uncertainty.
