Policy-Improvement Bandits

Updated 27 February 2026

Policy-improvement bandits are a class of algorithms that frame online learning as iterative policy refinement by directly optimizing policy parameters rather than just maximizing rewards.
They employ techniques such as differentiable policy gradients, f-divergence constraints, and variance reduction to achieve provable regret and safety guarantees in dynamic and contextual environments.
These methods bridge classical bandit frameworks and modern reinforcement learning, enabling meta-learning of exploration strategies and the adaptive improvement of existing policies.

Policy-improvement bandits are a class of bandit algorithms that explicitly frame online learning as iterative policy refinement: they systematically optimize either an exploration policy or improvements to a base policy, with direct, theoretically principled connections to regret minimization and stability guarantees. Unlike standard reward-maximizing bandits, they target the learning of policies themselves, whether by meta-learning exploration strategies, optimizing policy updates under information constraints, or adaptively improving legacy policies with provable regret or safety guarantees. To do so they leverage differentiability, divergence constraints, contextual embedding, and other structural techniques. The approach unifies a range of methodologies, from differentiable policy-gradient meta-bandits and f-divergence policy iteration to contextual policy wrappers and conservative bandits, and so bridges classical online learning, reinforcement learning, and real-world deployment under practical or ethical constraints.

1. Framing Bandit Problems as Policy-Improvement

In classical bandit settings, the objective is to maximize cumulative reward or, equivalently, to minimize regret with respect to the best fixed action in hindsight. Policy-improvement bandits generalize this paradigm by directly parameterizing and optimizing the policy class, often via differentiable or constrained updates, and sometimes by optimizing relative to a baseline or already-deployed policy.

For Bayesian bandits, this meta-learning perspective emerges naturally: one fixes a prior $\mathcal{P}$ over bandit problem instances, defines a parameterized policy $\pi_\theta$, and seeks to optimize the expected reward or minimize Bayes regret over $\mathcal{P}$ (Boutilier et al., 2020). More generally, the approach translates to structured policy learning in contextual or dynamic environments, as in f-divergence-based policy iteration (Belousov et al., 2017) and constrained improvement over baseline policies (Garcelon et al., 2020).

Key aspects:

Meta-learning policies: Rather than hand-crafting exploration bonuses or heuristics, one optimizes policy parameters directly, potentially across draws from a distribution of bandit problems (Boutilier et al., 2020).
Constraint-driven updates: Policy updates are shaped by structural constraints such as divergence bounds (trust regions), safety margins with respect to baselines, or equilibrium conditions (Belousov et al., 2017, Garcelon et al., 2020, Foster et al., 2023).
Policy-improvement objective: The guiding metric is improvement in policy performance, either globally (regret minimization), locally (per-action improvement), or relative to a baseline.

2. Differentiable Policy-Improvement and Meta-Learning

Differentiability enables the use of policy gradient methods for bandit exploration. In the meta-learning setting, the policy $\pi_\theta$ is parameterized in a differentiable manner (e.g., softmax over statistics, neural networks), and optimized by stochastic gradient ascent on Bayes reward (Boutilier et al., 2020).

Principal elements:

Policy classes: Soft-EXP3, Soft-Elimination (SoftElim), and recurrent neural-network policies allow effective gradient estimation. Closed-form gradients are derived for these families, enabling efficient updating (Boutilier et al., 2020).
Variance reduction: The policy-gradient estimator is enhanced with tailored baselines, such as the optimal-in-hindsight and self-play baselines, which yield large variance reductions; the self-play baseline does so without needing knowledge of the optimal arm (Boutilier et al., 2020).
Meta-learning algorithm: By sampling problem instances and performing Monte-Carlo gradient steps, the method adapts to the structure of the unknown problem distribution without requiring explicit analytical modeling (Boutilier et al., 2020).
Regret guarantees: For policies such as SoftElim, classical $O(\log n)$ regret bounds are shown to hold, while neural policies can approach or exceed Gittins-index performance in complex regimes (Boutilier et al., 2020).
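
As a rough illustration of the meta-learning loop described above, the following Python sketch tunes a single inverse-temperature parameter of a softmax-over-empirical-means policy by Monte-Carlo policy gradient with a self-play baseline. The policy class, the uniform prior over two Bernoulli arms, the optimistic initial mean, the reward normalization, and all names (`softmax_probs`, `score`, `run_episode`, `meta_learn`) are assumptions made for illustration; this is not Soft-EXP3, SoftElim, or the authors' implementation.

```python
import math
import random


def softmax_probs(theta, means):
    """Softmax over empirical means with inverse temperature theta."""
    logits = [theta * m for m in means]
    top = max(logits)  # shift for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def score(theta, means, arm):
    """d/dtheta of log pi_theta(arm): chosen mean minus policy-averaged mean."""
    probs = softmax_probs(theta, means)
    return means[arm] - sum(p * m for p, m in zip(probs, means))


def run_episode(theta, arm_probs, horizon, rng):
    """Play one Bernoulli instance; return (total reward, summed score)."""
    k = len(arm_probs)
    pulls, sums = [0] * k, [0.0] * k
    total_reward, total_score = 0.0, 0.0
    for _ in range(horizon):
        # Unpulled arms get an optimistic mean of 1.0 (an assumption).
        means = [sums[i] / pulls[i] if pulls[i] else 1.0 for i in range(k)]
        probs = softmax_probs(theta, means)
        arm = rng.choices(range(k), weights=probs)[0]
        total_score += score(theta, means, arm)
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return total_reward, total_score


def meta_learn(theta=1.0, steps=200, batch=20, horizon=50, lr=0.05, seed=0):
    """Stochastic gradient ascent on (normalized) Bayes reward."""
    rng = random.Random(seed)
    for _ in range(steps):
        grad = 0.0
        for _ in range(batch):
            instance = [rng.random(), rng.random()]  # draw from the prior
            reward, total_score = run_episode(theta, instance, horizon, rng)
            # Self-play baseline: an independent run on the same instance.
            baseline, _ = run_episode(theta, instance, horizon, rng)
            grad += total_score * (reward - baseline) / horizon
        theta += lr * grad / batch
    return theta
```

Calling `meta_learn()` returns the tuned inverse temperature. Because the baseline run is independent of the scored run given the instance, subtracting it leaves the gradient estimate unbiased while removing much of the instance-to-instance reward variation.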

3. f-Divergence Constrained Policy-Improvement

Stability and generalization in policy improvement are addressed by regularizing policy updates with information-theoretic constraints, particularly f-divergences (Belousov et al., 2017). The f-divergence-constrained problem is:

$\max_\pi \mathbb{E}_{a\sim\pi}[r(a)]\quad\text{s.t.}\quad D_f(\pi\;\|\;\pi_{\text{old}})\leq\delta$

where various choices of $f$ (e.g., Kullback-Leibler, Pearson $\chi^2$) yield a spectrum of update schemes:

Alpha-divergence spectrum: Softmax (log-linear) updates emerge for KL, reverse-ratio updates for reverse KL ($\mathrm{KL}^{\mathrm{R}}$), and linear updates for Pearson $\chi^2$. The f-conjugate yields unified closed-form updates for all cases. Empirical evidence indicates that the softmax (KL) parameterization achieves the best exploration-exploitation trade-off in stationary bandits (Belousov et al., 2017).
Compatible critic/objective: Each f-divergence admits a natural compatibility with a policy evaluation step (Bellman error, log-sum-exp log-likelihood, ratio-matching) (Belousov et al., 2017).
Asymptotic and design properties: As update steps shrink, all such algorithms reduce to natural-gradient ascent on the policy simplex, with step size controlled by the curvature $f''(1)$, highlighting a principled connection between information geometry and policy-updating dynamics (Belousov et al., 2017).
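
For the KL case, the constrained problem above has the well-known closed-form solution $\pi(a) \propto \pi_{\text{old}}(a)\exp(r(a)/\eta)$, where the temperature $\eta$ is the Lagrange multiplier of the constraint. The Python sketch below finds $\eta$ by bisection so that the constraint is (approximately) tight. It assumes a finite action set, known rewards, and an old policy with full support; the function names and the search bracket are illustrative choices, not taken from the cited paper.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def tilted(old, rewards, eta):
    """Softmax update: new_i is proportional to old_i * exp(r_i / eta)."""
    top = max(rewards)  # shift rewards so every exponent is <= 0
    weights = [o * math.exp((r - top) / eta) for o, r in zip(old, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_constrained_update(old, rewards, delta, iters=200):
    """Return the tilted policy whose KL to `old` is as close to delta as possible."""
    lo, hi = 1e-6, 1e6  # search bracket for eta (an assumption)
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisection on a log scale
        if kl(tilted(old, rewards, mid), old) > delta:
            lo = mid  # step too large: raise the temperature
        else:
            hi = mid
    return tilted(old, rewards, hi)
```

A small $\delta$ drives $\eta$ up and keeps the new policy close to the old one; a large $\delta$ drives $\eta$ toward zero and the update toward the greedy policy. If $\delta$ exceeds the KL divergence of the greedy policy from the old one, the constraint is inactive and the sketch returns a near-greedy policy.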

4. Policy-Improvement Bandits for Contextual and Adaptive Policies

Policy-improvement bandits can be used to wrap or enhance existing policies by introducing bandit-driven “tweaks” at carefully chosen decision points. This framework is formalized via equilibrium policy concepts and contextually defined bandit subroutines (Foster et al., 2023).

Features:

Equilibrium definition: A policy is an $\epsilon$-equilibrium if, in hindsight, no alternative that changes only a sparse subset of actions achieves a substantial cumulative reward increase.
Contextual bandit wrapper: Policy improvement is driven by a (possibly model-based) contextual bandit applied only at specific intervals, with the rest of the time following the base policy.
Regret-optimal subroutines: Regret-optimal contextual bandit algorithms ensure that tweaks cannot be systematically improved upon, with regret scaling $O(\sqrt{n})$ in the number of “bandit rounds” (Foster et al., 2023).
Empirical outcomes: This mechanism detects when deployed policies are close to equilibrium and automatically improves policies that are suboptimal, yielding large gains in pathological cases but minimal (certifiable) departures when policies are already good (Foster et al., 2023).
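
The wrapper idea can be pictured with a deliberately simplified Python sketch: a base policy is followed on most steps, and at every `interval`-th step a bandit picks an additive tweak to the base action and learns from the resulting reward. To stay short, the sketch drops the context and uses plain UCB1 as a stand-in for the regret-optimal contextual subroutine; the class `TweakBandit`, the function `run_wrapped`, the additive form of the tweak, and rewards in [0, 1] are all assumptions rather than details of Foster et al. (2023).

```python
import math
import random


class TweakBandit:
    """UCB1 over a small, fixed set of tweaks to the base action."""

    def __init__(self, tweaks):
        self.tweaks = list(tweaks)
        self.counts = [0] * len(self.tweaks)
        self.sums = [0.0] * len(self.tweaks)

    def select(self):
        for i, c in enumerate(self.counts):
            if c == 0:
                return i  # try every tweak once before using UCB indices
        t = sum(self.counts)
        ucb = [s / c + math.sqrt(2.0 * math.log(t) / c)
               for s, c in zip(self.sums, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, index, reward):
        self.counts[index] += 1
        self.sums[index] += reward


def run_wrapped(base_policy, reward_fn, steps, interval, tweaks, seed=0):
    """Follow base_policy, except on every interval-th step (a bandit round)."""
    rng = random.Random(seed)
    bandit = TweakBandit(tweaks)
    for t in range(steps):
        state = rng.random()  # hypothetical one-dimensional state
        action = base_policy(state)
        if t % interval == 0:
            i = bandit.select()
            reward = reward_fn(state, action + bandit.tweaks[i])
            bandit.update(i, reward)
        else:
            reward_fn(state, action)  # base policy, left unchanged
    return bandit
```

If the zero tweak ends up with the highest empirical mean, that is weak evidence that the base policy is already near equilibrium with respect to this tweak set; a clearly better nonzero tweak points to a systematic improvement.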

5. Conservative and Safe Policy-Improvement in Bandits

Deployments in risk-sensitive environments require policy-improvement methods that guarantee safety relative to a trusted baseline policy. Conservative bandit algorithms, such as CLUCB2, enforce high-probability constraints so that the learner's cumulative reward does not fall far below that of the baseline (Garcelon et al., 2020).

Technical subtleties:

Conservative constraint: With high probability, the learner’s cumulative reward stays at or above a fixed fraction $(1-\alpha)$ of the baseline policy’s cumulative reward at every round.
Safety set construction: Action selection is restricted to those that, under worst-case estimation uncertainty, maintain the safety threshold.
Checkpointing: Constraints can be relaxed to hold only at pre-specified checkpoints, smoothly interpolating between classical unconstrained bandit and strictly conservative regimes.
Regret bounds: Conservative overhead scales as $O(d^2/\alpha^2)$ and becomes negligible in the checkpointed setting for sufficiently large intervals, recovering the $O(d\sqrt{n})$ regret of standard LinUCB (Garcelon et al., 2020).
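
The conservative constraint and safety check above can be sketched as a single test made before each exploratory play. The Python function below is a simplified multi-armed version with a known baseline mean, not CLUCB2 itself; the name `is_safe_to_explore` and its arguments are hypothetical.

```python
def is_safe_to_explore(played_lcb_sum, candidate_lcb, baseline_mean, t, alpha):
    """Return True if the exploratory arm may be played at round t (1-indexed).

    played_lcb_sum: lower confidence bound on reward collected in rounds 1..t-1
    candidate_lcb:  lower confidence bound on the exploratory arm's mean reward
    baseline_mean:  mean reward of the baseline policy (assumed known here)
    alpha:          fraction of baseline reward the learner may give up
    """
    # Worst-case cumulative reward if the exploratory arm is played now.
    worst_case_total = played_lcb_sum + candidate_lcb
    # Reward the constraint requires after t rounds.
    required = (1.0 - alpha) * t * baseline_mean
    return worst_case_total >= required
```

When the check fails, the learner plays the baseline action instead, which builds up slack for later exploration. Evaluating the check only at checkpoint rounds, rather than at every `t`, gives the relaxed variant described above.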

6. Policy-Improvement Bandits in Non-Stationary and Large-Scale RL

Modern reinforcement learning and large-model post-training motivate curriculum learning schemes that cast data selection or environment design itself as a policy-improvement bandit problem. The Actor–Curator framework models curriculum selection as a non-stationary bandit with utilities defined by expected policy improvement (Gu et al., 2026).

Key constructs:

Performance-difference-driven utility: At each time step, the utility of a candidate (e.g., a training problem) is defined as its expected policy improvement, precisely quantifiable via the Kakade–Langford performance difference identity.
Non-stationary bandit with partial feedback: Only selected arms produce feedback, and utilities evolve as the actor policy changes.
Online stochastic mirror descent (OSMD): The curation policy is optimized using OSMD with dynamic regret bounds scaling as $O(T^{2/3}V_T^{1/3})$ where $V_T$ is the variation budget.
Function approximation: Scaling to massive problem sets is achieved by neural-policy parameterizations for curation (Gu et al., 2026).

Empirical results show that direct optimization for expected improvement can accelerate convergence and improve final policy performance relative to legacy curriculum heuristics (Gu et al., 2026).
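
One way to picture the curation update: with a negative-entropy regularizer, online stochastic mirror descent on the probability simplex reduces to a multiplicative-weights update driven by importance-weighted utility estimates. The Python sketch below shows a single such step for a small, tabular set of arms. The regularizer, the estimator, the uniform mixing, and the function names are assumptions, not details taken from the Actor-Curator paper, and `utility` stands in for an estimated policy improvement of the selected training problem.

```python
import math


def osmd_step(probs, arm, utility, lr):
    """One entropic mirror-descent step from feedback on a single arm."""
    # Importance-weighted estimate: only the played arm is observed, and
    # dividing by its probability keeps the estimate unbiased for every arm.
    estimate = utility / probs[arm]
    logits = [math.log(p) for p in probs]
    logits[arm] += lr * estimate
    top = max(logits)  # shift for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def mix_uniform(probs, gamma):
    """Keep each probability at least gamma / K so estimates stay bounded."""
    k = len(probs)
    return [(1.0 - gamma) * p + gamma / k for p in probs]
```

In a training loop, one would sample an arm from `mix_uniform(probs, gamma)`, measure the resulting utility, and pass that same sampling distribution to `osmd_step`. The uniform mixing also keeps every arm reachable, which matters when utilities drift as the actor policy changes.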

7. Connections to Offline and Contextual Policy-Improvement

Offline policy-improvement bandits address settings where data is collected adaptively or from an existing system, necessitating robust estimation and optimization methods to recover near-optimal policies from historically biased data.

Highlights:

Variance-optimal AIPW estimation: Adaptively weighted doubly-robust estimators provide finite-sample minimax-optimal regret guarantees, even under diminishing exploration rates (Zhan et al., 2021).
Pessimistic regularization: Oracle-efficient pessimistic methods for offline policy optimization (OPO) add explicit regularization penalties (e.g., in cost-sensitive classification or regression oracles) to ensure safe improvements over logging policies with sub-Gaussian concentration guarantees (Wang et al., 2023).
Empirical validation: These methods outperform unregularized baseline estimators under data scarcity and high distribution shift, reinforcing the principle that structured policy-improvement and variance control are critical in real-world deployments.
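
To make the doubly-robust construction concrete, the Python sketch below computes augmented inverse-propensity-weighted (AIPW) scores from logged rounds and averages them to estimate the value of a candidate policy. The log format `(context, arm, reward, propensity)`, the function names, and the default uniform weights are assumptions; the adaptive weighting scheme of Zhan et al. (2021) is not reproduced here and would be supplied through `weights`.

```python
def aipw_score(outcome_model, x, logged_arm, reward, propensity, arm):
    """Doubly-robust score for `arm` at one logged round."""
    prediction = outcome_model(x, arm)
    if arm != logged_arm:
        return prediction  # no observed reward for this arm: model only
    # Correct the model's prediction with the inverse-propensity residual.
    return prediction + (reward - prediction) / propensity


def policy_value(logs, outcome_model, policy, weights=None):
    """Weighted average of AIPW scores for the arms `policy` would choose."""
    if weights is None:
        weights = [1.0] * len(logs)  # uniform weights (an assumption)
    total = sum(
        w * aipw_score(outcome_model, x, a, r, e, policy(x))
        for w, (x, a, r, e) in zip(weights, logs)
    )
    return total / sum(weights)
```

Policy learning then amounts to maximizing `policy_value` over a policy class. Small logging propensities inflate the residual term, which is why the weighting of rounds matters when exploration diminishes over time.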

References

Differentiable Bandit Exploration (Boutilier et al., 2020)
f-Divergence constrained policy improvement (Belousov et al., 2017)
Contextual Bandits for Evaluating and Improving Inventory Control Policies (Foster et al., 2023)
Improved Algorithms for Conservative Exploration in Bandits (Garcelon et al., 2020)
Actor-Curator: Co-adaptive Curriculum Learning via Policy-Improvement Bandits for RL Post-Training (Gu et al., 2026)
Policy Learning with Adaptively Collected Data (Zhan et al., 2021)
Oracle-Efficient Pessimism: Offline Policy Optimization in Contextual Bandits (Wang et al., 2023)
Bandit Algorithms for Policy Learning: Methods, Implementation, and Welfare-performance (Kitagawa et al., 2024)