KL-Regularized Soft Update in Reinforcement Learning
- The paper introduces a KL divergence penalty in soft policy updates, which yields smoother and more stable improvements in reinforcement learning.
- It employs a hybrid of forward and reverse KL projections, combining explicit moment matching with gradient descent to optimize policy refinement.
- The approach extends to various domains by integrating trust-region constraints and adaptive regularization, demonstrating practical gains in sample efficiency and performance.
KL-regularized soft update is a foundational paradigm in modern reinforcement learning, optimal control, and related areas, wherein a policy is iteratively improved not only to increase expected reward but also to remain close, as measured by Kullback-Leibler (KL) divergence, to a reference or prior policy. The presence of the KL penalty leads to a “soft” update, meaning that policy changes at each iteration are smoothly blended with prior behavior, yielding substantial gains in stability, sample efficiency, and optimization tractability. Recent developments clarify the distinction and synergy between forward and reverse KL regularization, and establish connections to mirror descent, Bayesian updating, and proximal point methods.
1. KL-Regularized Soft Update: Analytical Formulation
In the discrete case, the KL-regularized soft update typically emerges from maximizing a surrogate objective over policies:

$$\max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\,\big(r(s_t,a_t) - \beta\,\mathrm{KL}\big(\pi(\cdot\mid s_t)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid s_t)\big)\big)\Big].$$

Here $\pi_{\mathrm{ref}}$ is a reference or prior policy, $r$ is the task reward, $\gamma$ the discount, and $\beta$ the KL-regularization coefficient. Grouping the KL penalty per state, one can equivalently maximize, at each state $s$,

$$\big\langle \pi(\cdot\mid s),\, Q(s,\cdot)\big\rangle - \beta\,\mathrm{KL}\big(\pi(\cdot\mid s)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid s)\big).$$

This soft-regularized objective admits a closed-form solution per state via variational calculus. The KL term enforces proximity between $\pi$ and $\pi_{\mathrm{ref}}$, strongly convexifying the optimization landscape and providing a natural trust-region constraint (Wang et al., 14 Mar 2025, Zhao et al., 2024).
A corresponding update for the policy at each state is given, for temperature parameter $\beta$, by the exponentiated-gradient "soft" policy improvement:

$$\pi_{k+1}(a\mid s) = \frac{\pi_{k}(a\mid s)\,\exp\big(Q^{\pi_k}(s,a)/\beta\big)}{Z_{k}(s)},$$

where $Q^{\pi_k}$ denotes the state-action value and $Z_{k}(s)$ is the normalizing partition function (Zhao et al., 2024, Bhole et al., 5 Dec 2025).
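The closed-form per-state update above can be sketched directly for a discrete action space. The following is a minimal NumPy illustration (the prior, Q-values, and temperatures are toy assumptions, not taken from any cited paper):

```python
import numpy as np

def soft_update(pi_ref, q, beta):
    """Closed-form KL-regularized soft update for one discrete state.

    Maximizes <pi, q> - beta * KL(pi || pi_ref) over the simplex;
    the optimum is pi_ref * exp(q / beta), renormalized.
    """
    logits = np.log(pi_ref) + q / beta
    logits -= logits.max()           # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

# Toy example: 3 actions, uniform prior.
pi_ref = np.ones(3) / 3
q = np.array([1.0, 0.0, -1.0])

pi_small_beta = soft_update(pi_ref, q, beta=0.1)   # near-greedy
pi_large_beta = soft_update(pi_ref, q, beta=10.0)  # stays near the prior
```

Small $\beta$ recovers a near-greedy policy, while large $\beta$ keeps the update close to the prior — the "soft" blending described above.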
2. Reverse and Forward KL: Projections and Moment Matching
The regularized soft update can be instantiated via either reverse KL ($\mathrm{KL}(\pi\,\|\,\pi^{*})$) or forward KL ($\mathrm{KL}(\pi^{*}\,\|\,\pi)$) projections, depending on whether the policy is projected onto a target (Boltzmann) distribution in a mode-seeking or mass-covering direction. For a Boltzmann backup target

$$\pi^{*}(a\mid s) \propto \pi_{\mathrm{ref}}(a\mid s)\,\exp\big(Q(s,a)/\beta\big),$$

the two projections are defined as

$$\pi_{\mathrm{rev}} = \arg\min_{\pi}\,\mathrm{KL}\big(\pi\,\|\,\pi^{*}\big), \qquad \pi_{\mathrm{fwd}} = \arg\min_{\pi}\,\mathrm{KL}\big(\pi^{*}\,\|\,\pi\big).$$

For parametric Gaussian policies, forward KL admits a closed-form, explicit solution by moment matching: the projected mean and covariance are those of the target, $\mu = \mathbb{E}_{\pi^{*}}[a]$ and $\Sigma = \mathrm{Cov}_{\pi^{*}}[a]$. Reverse KL, lacking a closed form, is approached via stochastic gradient steps and is associated with the standard Soft Actor-Critic (SAC) loss

$$J(\theta) = \mathbb{E}_{s,\; a\sim\pi_{\theta}}\big[\beta\,\log\pi_{\theta}(a\mid s) - Q(s,a)\big].$$

Forward KL is advantageous for explicit projections (stability, rapid progress), while reverse KL retains a theoretical policy improvement guarantee (Zhang et al., 2 Jun 2025).
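The forward-KL moment-matching projection can be approximated by self-normalized importance sampling: draw actions from the prior, weight them by $\exp(Q/\beta)$, and take the weighted mean and variance. A 1-D sketch, with a toy quadratic critic assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.2

def q_value(a):
    # Toy 1-D critic, assumed for illustration: peak at a = 0.5.
    return -(a - 0.5) ** 2

# Samples from a reference Gaussian prior pi_ref = N(0, 1).
samples = rng.normal(0.0, 1.0, size=10_000)

# Since samples come from pi_ref, the self-normalized importance
# weights for the Boltzmann target pi* ∝ pi_ref * exp(Q/beta)
# reduce to exp(Q/beta).
logw = q_value(samples) / beta
w = np.exp(logw - logw.max())
w /= w.sum()

# Forward-KL projection onto a Gaussian = moment matching.
mu = np.sum(w * samples)
var = np.sum(w * (samples - mu) ** 2)
```

For this Gaussian-prior/quadratic-critic case the target is itself Gaussian with mean $1/(\beta+2)$ and variance $\beta/(\beta+2)$ (a product of two Gaussian factors), so the estimate can be checked analytically.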
3. Iterative Algorithms and Two-Phase Updates
The “KL-regularized soft update” is typically operationalized as an iterative procedure, alternating between value estimation and policy improvement. In state-of-the-art RL, such as Bidirectional SAC (Zhang et al., 2 Jun 2025), this manifests in a two-phase update:
- Forward-KL Projection: For each state in a minibatch, compute the explicit moment-matching Gaussian parameters by integrating against the Boltzmann target.
- Reverse-KL Refinement: Refine the policy parameters via (potentially several) gradient steps on the reverse-KL SAC loss.
This hybrid scheme yields both stability (due to explicit initialization close to the target) and monotonic improvement guarantees (from reverse-KL minimization). Empirically, Bidirectional SAC achieves substantially higher final returns and 2–3× faster convergence on continuous control benchmarks versus standard SAC (Zhang et al., 2 Jun 2025).
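The two-phase structure can be sketched end-to-end in 1-D with a fixed-variance Gaussian policy. The critic, prior, learning rate, and step counts below are toy assumptions; the point is the control flow — an explicit forward-KL moment match followed by reverse-KL gradient refinement:

```python
import numpy as np

rng = np.random.default_rng(1)
beta, sigma = 0.2, 0.3   # temperature and (fixed) policy std, assumed

def q_value(a):              # toy critic, assumed: peak at a = 0.5
    return -(a - 0.5) ** 2

def grad_q(a):
    return -2.0 * (a - 0.5)

def grad_log_prior(a):       # pi_ref = N(0, 1)  =>  d/da log pi_ref = -a
    return -a

# Phase 1 (forward KL): moment-match against the Boltzmann target
# via importance sampling from the prior, giving a stable initial mean.
samples = rng.normal(0.0, 1.0, size=20_000)
logw = q_value(samples) / beta
w = np.exp(logw - logw.max()); w /= w.sum()
mu = np.sum(w * samples)

# Phase 2 (reverse KL): reparameterized gradient steps on
# KL(pi_theta || pi*). For fixed sigma the entropy term is constant
# in mu, so the gradient reduces to E[-d/da log pi_ref(a) - Q'(a)/beta].
for _ in range(200):
    a = mu + sigma * rng.normal(size=1024)
    grad_mu = np.mean(-grad_log_prior(a) - grad_q(a) / beta)
    mu -= 0.05 * grad_mu
```

In this toy setting both phases agree on the fixed point $\mu^{*} = 1/(\beta+2)$, so phase 2 mainly illustrates the refinement loop; in practice it corrects the approximation error of the explicit projection.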
This principle extends to model-based RL with planners as stochastic priors. In Policy Optimization-MPC (PO-MPC), a KL-regularized soft update projects the sampling policy toward a softened planner prior via a softmax-weighted combination of the two, where $p$ denotes the planner-induced prior and a regularization weight controls how strongly the policy is pulled toward it (Serra-Gomez et al., 5 Oct 2025).
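A generic way to realize such a soft projection for discrete distributions is a log-space (geometric) blend: minimizing $(1-\lambda)\,\mathrm{KL}(\pi\,\|\,\pi_{\mathrm{cur}}) + \lambda\,\mathrm{KL}(\pi\,\|\,p)$ over the simplex yields the normalized geometric mean. The sketch below illustrates this generic form; the parameter `lam` and the toy distributions are assumptions, not the exact PO-MPC weighting:

```python
import numpy as np

def blend_with_planner(pi, prior, lam):
    """Geometric (log-space) blend of the sampling policy with a
    planner-induced prior; lam in [0, 1] controls the regularization
    strength. A generic sketch, not the paper's exact rule.
    """
    logits = (1.0 - lam) * np.log(pi) + lam * np.log(prior)
    logits -= logits.max()           # numerical stability
    w = np.exp(logits)
    return w / w.sum()

pi = np.array([0.7, 0.2, 0.1])       # current sampling policy
prior = np.array([0.1, 0.3, 0.6])    # softened planner prior

pi_new = blend_with_planner(pi, prior, lam=0.5)
```

At `lam=0` the policy is unchanged; at `lam=1` it collapses onto the prior, matching the trust-region intuition of the KL penalty.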
4. Optimization Geometry, Sample Complexity, and Theoretical Analysis
The addition of the KL penalty fundamentally alters the curvature properties of the policy learning problem. It induces strong convexity on the per-state policy simplex, allowing for sharper sample-complexity bounds: KL-regularized contextual bandits and RLHF methods achieve $\varepsilon$-optimality with $O(\varepsilon^{-1})$ samples when the KL strength (temperature) is sufficiently large, improving over the standard $O(\varepsilon^{-2})$ rate in unregularized settings. This sharpened complexity is a result of the objective's improved curvature and is realizable in practical settings via two-stage mixed-sampling algorithms (Zhao et al., 2024).
5. Extensions, Applications, and Special Cases
KL-regularized soft updates have been systematically adapted to a broad array of settings:
- LLM Reasoning and Off-Policy Learning: The Regularized Policy Gradient (RPG) framework provides precise, exact-gradient surrogates for both normalized and unnormalized KL forms and details appropriate importance weighting and clipping for stable off-policy RL in LLMs (Zhang et al., 23 May 2025).
- Preference Alignment and Diffusion Policies: In diffusion models for policy optimization, forward-KL-regularized updates prevent out-of-distribution drift and improve alignment with preference data, outperforming both reverse-KL and unregularized approaches for tasks demanding both adherence and adaptation (Shan et al., 2024).
- Trust-Region and Mirror Descent Connections: Entropy-regularized mirror descent with a KL-Bregman divergence forms a direct correspondence with Bayesian belief updates, allowing for “trust-decayed” adaptation to distribution shift. The update blends current loss, KL-penalty, and stress-aware tilting (Raj, 17 Oct 2025).
- Unified Optimal Control Theory: By separating KL penalties for policies and transition dynamics, recent unifying frameworks analytically derive both soft-optimal control and its convergence to classical (hard) control in the limit of vanishing regularization, admitting path-integral representations and majorization-minimization convergence guarantees (Bhole et al., 5 Dec 2025).
| Algorithmic Setting | KL Type | Projection/Update |
|---|---|---|
| SAC / actor-critic | forward / reverse | moment matching / gradient descent |
| PO-MPC | reverse | policy-to-planner soft projection |
| LLM RPG | reverse / forward | exact-gradient, truncated-IS surrogates |
| Diffusion policy | forward | D-MSE / DPO with forward KL penalty |
6. Empirical Consequences and Practical Tuning
Empirical analyses consistently demonstrate that intermediate KL penalty values offer the best trade-off: high enough to avoid instability and out-of-distribution behaviors, low enough to allow for substantive policy adaptation. Settings that apply explicit forward-KL (moment-matching) projections before reverse-KL refinement yield the best sample efficiency and final policy quality, as observed in Bidirectional SAC (markedly higher final return on continuous control) and PO-MPC (Zhang et al., 2 Jun 2025, Serra-Gomez et al., 5 Oct 2025). In high-dimensional settings (e.g., LLMs), soft reference-policy updates, gradient-equivalent surrogates, and explicit importance clipping further stabilize optimization and enhance downstream task performance (Zhang et al., 23 May 2025).
Careful selection of the KL penalty coefficient is crucial. Too large a coefficient makes updates overly conservative (barely deviating from the prior), while too small sacrifices stability and can lead to catastrophic deviation or overfitting. Adaptive tuning and hybrid strategies combining explicit projections, trust regions, and stress-aware tilting are effective approaches (Serra-Gomez et al., 5 Oct 2025, Raj, 17 Oct 2025).
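One widely used heuristic for adaptive tuning is the PPO-style rule that grows the coefficient when the observed policy divergence overshoots a target and shrinks it when updates become too timid. A sketch (the thresholds and factor are common defaults, assumed rather than taken from the cited papers):

```python
def adapt_kl_coef(beta, observed_kl, target_kl, factor=1.5):
    """PPO-style adaptive KL coefficient: a common heuristic, not the
    specific rule from the cited papers. Increase beta when the policy
    moves too far from the reference; decrease it when updates become
    overly conservative.
    """
    if observed_kl > 2.0 * target_kl:
        return beta * factor      # policy drifted too far: tighten
    if observed_kl < 0.5 * target_kl:
        return beta / factor      # update too conservative: loosen
    return beta

beta = 0.1
beta = adapt_kl_coef(beta, observed_kl=0.05, target_kl=0.01)  # tightens
```

The doubling/halving thresholds make the rule robust to noisy KL estimates, since the coefficient only moves when the divergence is clearly out of band.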
7. Theoretical and Practical Significance
KL-regularized soft update unifies maximum entropy RL, modern imitative algorithms, trust-region policy optimization, and Bayesian adaptive control under a single operator-theoretic and variational-principled schema. The framework provides not only practical benefits in terms of sample efficiency, stability, and interpretability, but also connects with deep convex analytical theory (sharp sample complexity bounds, monotone majorization-minimization, convex duality, and linear-solvable path-integral solutions) (Bhole et al., 5 Dec 2025, Zhao et al., 2024). When instantiated with forward and reverse KL projections in tandem, as in Bidirectional SAC, it synthesizes stability with guaranteed improvement—yielding state-of-the-art empirical results across a spectrum of domains (Zhang et al., 2 Jun 2025).