Bootstrapped Policy Learning & KL-Regularized Updates
- The paper presents a unified RL framework that integrates reward-driven policy optimization with KL divergence regularization for stability and principled improvement.
- It leverages bootstrapping techniques to reuse value predictions and policy outputs, accelerating learning and reducing variance.
- KL regularization keeps the policy from deviating too far from expert, planner, or previous-policy references, balancing reward-driven improvement against fidelity to the prior.
Bootstrapped policy learning and KL-regularized updates constitute a unified family of reinforcement learning (RL) algorithms that blend reward-driven policy optimization with information-theoretic regularization. This integration enables principled policy improvement, stability, and adaptability across both value-based and policy-gradient approaches. KL-regularization binds the learned policy to a prior (which may be a previous policy, expert demonstration, planner, or reward-aligned reference) through a Kullback–Leibler (KL) divergence penalty, while bootstrapping—reusing value predictions or policy outputs—accelerates credit assignment and improves sample efficiency. These methodologies underpin contemporary algorithms for policy customization, RL from human feedback, model-based planning, and stable deep RL training.
1. Foundations: Entropy Regularization and Soft Policy Gradients
Entropy regularization in RL augments the standard return with an entropy bonus, promoting exploration and avoiding premature policy collapse. Given policy $\pi$, entropy coefficient $\alpha$, and discounted rewards, the entropy-regularized RL objective is

$$J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t}\big(r(s_t,a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot\mid s_t)\big)\big)\Big].$$

This can be rewritten as

$$J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t}\big(r(s_t,a_t) - \alpha \log \pi(a_t\mid s_t)\big)\Big].$$

The soft policy gradient is derived as

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t\mid s_t)\,\hat{A}_t\Big],$$

where $\hat{A}_t$ incorporates both reward and entropy-adjusted value targets. Clipped PPO-style surrogates are standard for numerical stability (Wang et al., 14 Mar 2025).
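A minimal sketch of such a step-wise entropy-regularized, clipped PPO-style surrogate, assuming PyTorch tensors of per-step log-probabilities and advantages (tensor names and the default `alpha` are illustrative, not from the paper):

```python
import torch

def soft_ppo_loss(logp_new, logp_old, advantages, alpha=0.01, clip_eps=0.2):
    """Clipped PPO surrogate with a step-wise entropy bonus.

    logp_new:   log pi_theta(a_t|s_t) under the current policy (requires grad)
    logp_old:   log pi_old(a_t|s_t) from the rollout policy (no grad)
    advantages: entropy-adjusted advantage estimates A_hat_t
    """
    ratio = torch.exp(logp_new - logp_old)               # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped).mean()      # clipped surrogate
    entropy_bonus = -logp_new.mean()                      # sample-based entropy estimate
    return -(surrogate + alpha * entropy_bonus)           # minimize the negative objective
```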
This foundation extends naturally to trust-region and KL-regularized frameworks, connecting maximum-entropy RL, regularized policy iteration, and generalized actor–critic methods (Belousov et al., 2019).
2. KL-Regularized Policy Optimization: Formulations, Theory, and Duality
KL-regularized objectives introduce an explicit penalty for deviation from a reference policy $\pi_{\mathrm{ref}}$, yielding

$$J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t}\, r(s_t,a_t)\Big] - \beta\, \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t}\, D_{\mathrm{KL}}\big(\pi(\cdot\mid s_t)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid s_t)\big)\Big].$$

Different KL directions (reverse vs. forward) and normalizations lead to a spectrum of regularized objectives, impacting diversity–mode-seeking tradeoffs and optimization properties (Zhang et al., 23 May 2025). The optimal policy under reverse KL admits a Boltzmann form:

$$\pi^{*}(a\mid s) \propto \pi_{\mathrm{ref}}(a\mid s)\,\exp\big(Q^{*}(s,a)/\beta\big),$$

where $Q^{*}$ is the soft Q-function for the composite reward (Wang et al., 14 Mar 2025). Primal–dual analyses confirm that the "softened" Bellman operator and exponential-weighted policy improvement steps are direct consequences of the KL penalty (Belousov et al., 2019).
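For a discrete action space, this closed form can be evaluated directly from a soft Q-vector and the reference policy. A small numerical sketch (array names are illustrative):

```python
import numpy as np

def boltzmann_policy(q_values, pi_ref, beta):
    """pi*(a|s) proportional to pi_ref(a|s) * exp(Q(s,a)/beta), normalized over actions.

    q_values: shape (num_actions,) -- soft Q-values at a state
    pi_ref:   shape (num_actions,) -- reference policy probabilities
    beta:     KL-regularization temperature (> 0)
    """
    logits = np.log(pi_ref + 1e-12) + q_values / beta
    logits -= logits.max()                    # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Example: the reference policy is pulled toward the action with higher soft Q.
print(boltzmann_policy(np.array([1.0, 2.0]), np.array([0.7, 0.3]), beta=0.5))
```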
Mirror Descent Value Iteration (MDVI) and other KL-proximal methods further ground this framework. In MDVI, alternating bootstrapped (soft) value updates and KL-proximal mirror policy steps provably yield near-minimax sample complexity in the generative-model setting (Kozuno et al., 2022).
3. Bootstrapping: Value, Policy, and Planning
Bootstrapping refers to propagating value predictions or policy statistics forward to accelerate learning and reduce variance. In Q-learning, bootstrapped Bellman updates take the form

$$Q_{k+1}(s,a) \leftarrow r(s,a) + \gamma\, \mathbb{E}_{s'}\big[V_k(s')\big], \qquad V_k(s') = \beta \log \sum_{a'} \pi_{\mathrm{ref}}(a'\mid s')\, \exp\big(Q_k(s',a')/\beta\big),$$

with $V_k$ using a softmax/log-sum-exp value backup under KL regularization (Kozuno et al., 2022).
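A tabular sketch of this KL-regularized backup, assuming known dynamics `P` and rewards `R` (shapes and names are illustrative):

```python
import numpy as np

def kl_regularized_backup(Q, R, P, pi_ref, beta, gamma=0.99):
    """One bootstrapped backup: Q <- R + gamma * P @ V, with a log-sum-exp V.

    Q:      (S, A) current soft Q-values
    R:      (S, A) rewards
    P:      (S, A, S) transition probabilities
    pi_ref: (S, A) reference policy (e.g., the previous policy in MDVI)
    """
    # Soft state value under KL regularization:
    #   V(s) = beta * log sum_a pi_ref(a|s) * exp(Q(s,a)/beta)
    logits = np.log(pi_ref + 1e-12) + Q / beta
    m = logits.max(axis=1, keepdims=True)
    V = beta * (m.squeeze(1) + np.log(np.exp(logits - m).sum(axis=1)))
    return R + gamma * np.einsum("sat,t->sa", P, V)
```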
Policy-gradient methods bootstrap advantage estimates using value networks trained by (potentially KL-regularized) TD or generalized advantage estimation. In token-level RLHF, KL-regularized Q-learning (KLQ) leverages bootstrapped $\lambda$-returns with Q-networks parameterized so that the induced policy takes the Boltzmann form $\pi_\theta(a\mid s) \propto \pi_{\mathrm{ref}}(a\mid s)\exp\big(Q_\theta(s,a)/\beta\big)$, enabling implicit policy improvement and stable learning (Brown et al., 23 Aug 2025).
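A sketch of bootstrapped λ-return targets with a per-step reverse-KL penalty folded into the reward, in the spirit of token-level RLHF; the exact penalty form and the names here are assumptions, not the paper's recipe:

```python
import numpy as np

def kl_penalized_lambda_returns(rewards, values, logp_pi, logp_ref,
                                beta=0.1, gamma=1.0, lam=0.95):
    """TD(lambda) targets where each step's reward is penalized by
    beta * (log pi(a_t|s_t) - log pi_ref(a_t|s_t)), a sample estimate
    of the per-step reverse KL to the reference policy.

    rewards, logp_pi, logp_ref: arrays of length T
    values: array of length T+1 (bootstrap value appended at the end)
    """
    shaped = rewards - beta * (logp_pi - logp_ref)
    T = len(rewards)
    targets = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = shaped[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        targets[t] = gae + values[t]          # lambda-return = advantage + V
    return targets
```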
In model-based settings, such as the PO-MPC framework, the learned policy is repeatedly bootstrapped from a planner distribution (e.g., an MPPI output) via KL-distillation, followed by regularized RL updates (Serra-Gomez et al., 5 Oct 2025). This process supports adaptation and long-horizon credit assignment while maintaining proximity to a dynamically improving planning anchor.
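A minimal sketch of such a distillation step: the policy is pulled toward the planner's action distribution through a KL/cross-entropy term added to a regularized RL loss (the interface and the weight `lmbda` are assumptions, not the PO-MPC implementation):

```python
import torch
import torch.nn.functional as F

def planner_distillation_loss(policy_logits, planner_probs, rl_loss, lmbda=1.0):
    """Combine an RL objective with KL(planner || policy) distillation.

    policy_logits: (B, A) logits of the learned policy
    planner_probs: (B, A) action distribution from the planner (e.g., MPPI weights)
    rl_loss:       scalar regularized RL loss for the same batch
    lmbda:         weight trading off RL improvement vs. proximity to the planner
    """
    log_pi = F.log_softmax(policy_logits, dim=-1)
    # KL(planner || policy) = cross-entropy minus planner entropy; the entropy
    # term is constant w.r.t. policy parameters, so cross-entropy suffices.
    distill = -(planner_probs * log_pi).sum(dim=-1).mean()
    return rl_loss + lmbda * distill
```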
4. Policy Customization and Residual Policy Gradient
Residual Q-Learning (RQL) addresses policy customization by shaping new rewards as a sum of basic (prior-policy) rewards and additive modifications. Residual Policy Gradient (RPG) generalizes this to policy-gradient methods, producing augmented rewards of the form

$$\hat{r}(s_t,a_t) = r_{\mathrm{add}}(s_t,a_t) + \omega\,\alpha \log \pi_{\mathrm{prior}}(a_t\mid s_t).$$

The RPG update is

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t\mid s_t)\,\hat{A}_t\Big],$$

where $\hat{A}_t$ uses the above reward. Under specific conditions, RPG recovers the KL-regularized fine-tuning objective and its closed-form Boltzmann policy (Wang et al., 14 Mar 2025). The bootstrapped $\log \pi_{\mathrm{prior}}$ term ensures smooth interpolation between mimicking the original policy and optimizing for the new objective.
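Under this formulation, the augmented reward needs only the prior policy's log-probabilities; a toy sketch (the names `r_add`, `omega`, and `alpha` are illustrative):

```python
import math

def rpg_augmented_reward(r_add, logp_prior, omega=1.0, alpha=0.1):
    """r_hat_t = r_add_t + omega * alpha * log pi_prior(a_t|s_t).

    Large omega keeps behavior close to the prior policy;
    omega -> 0 optimizes the additive modification alone.
    """
    return r_add + omega * alpha * logp_prior

# A step with add-on reward 1.0 taken where the prior assigns probability 0.8:
print(rpg_augmented_reward(1.0, math.log(0.8)))   # ~0.978
```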
5. Algorithmic Realizations and Implementation Practices
The following table summarizes key algorithmic instantiations of bootstrapped policy learning and KL-regularized updates:
| Algorithm/Framework | Bootstrapping Mechanism | KL Regularization Role |
|---|---|---|
| MDVI (Kozuno et al., 2022) | Soft Bellman backup, policy iteration | Mirror descent KL to previous policy (stability, bias) |
| Soft-PPO (Wang et al., 14 Mar 2025) | Critic-based advantage estimation | Step-wise entropy, optional KL to prior |
| RPG/Residual PPO (Wang et al., 14 Mar 2025) | Critic, prior-policy bootstrapping | KL penalty via residual reward shaping |
| KLQ (RLHF) (Brown et al., 23 Aug 2025) | λ-returns in Q, Boltzmann Q-param | Per-step reverse KL to SFT reference |
| PO-MPC (Serra-Gomez et al., 5 Oct 2025) | Planner-policy distillation, Q targets | KL to planner or distilled prior, variable λ |
| KL-A2C/Actor–Critic (Belousov et al., 2019) | Weighted ML with bootstrapped critic | KL proximal, adjustable divergence family |
Typical pseudocode integrates three steps: collect rollouts under the current policy, compute regularized value/advantage targets with reference to a prior or planner, and update the policy using a trust-region or clipped surrogate loss with KL or entropy penalties. In off-policy settings, correct importance weighting and surrogates (including stop-gradient variants and dual-clip mechanisms) ensure unbiased gradients with respect to the intended regularized objective (Zhang et al., 23 May 2025).
6. Empirical Properties and Applications
KL-regularized, bootstrapped algorithms demonstrate robust empirical performance across continuous control, LLM fine-tuning, model-based planning, and human-aligned decision-making tasks. Experimental results in MuJoCo show that step-wise entropy regularization in Soft PPO matches or exceeds other entropy variants, and that Residual PPO achieves Pareto-efficient tradeoffs between reward maximization and prior-policy fidelity (Wang et al., 14 Mar 2025). In RLHF, KLQ ties or outperforms PPO in reward and human-judged outputs, with consistent improvement in win-rate metrics (Brown et al., 23 Aug 2025). PO-MPC shows significant gains in sample efficiency and final returns versus unregularized or planner-only methods, with the regularization parameter λ providing controlled tuning of the RL–planning tradeoff (Serra-Gomez et al., 5 Oct 2025).
KL-proximal search and regret-minimization frameworks improve human-likeness and coordination in multi-agent games while preserving or exceeding the strength of self-play policies (Jacob et al., 2021).
7. Theoretical Guarantees and Future Directions
Under standard assumptions, KL-regularized bootstrapped RL algorithms guarantee monotonic improvement of the regularized objective, bounded per-update divergence, and convergence to stationary points of the penalized reward functional. Mirror Descent steps with bootstrapped targets yield near-minimax sample complexity without explicit variance-reduction—even in fully model-free regimes (Kozuno et al., 2022).
Choice of KL direction, reference-updating schedule, and bootstrapping mechanism impact convergence, variance, and expressivity. Emerging research continues to explore alternative $f$-divergences, off-policy and nonstationary extensions, and hierarchical or cross-modal priors. Modular KL-regularized, bootstrapped frameworks are now foundational in adaptive policy transfer, safe RL, RLHF, and hybrid planning–learning systems.
References:
- Belousov et al., 2019
- Jacob et al., 2021
- Kozuno et al., 2022
- Wang et al., 14 Mar 2025
- Zhang et al., 23 May 2025
- Brown et al., 23 Aug 2025
- Serra-Gomez et al., 5 Oct 2025