KL Regularization in RL
- KL Regularization in RL is a method that penalizes divergence from a reference policy to ensure stable and robust policy optimization.
- Key estimators such as on-policy Monte Carlo, importance-weighted, and top-k techniques balance unbiased gradient estimation with computational efficiency.
- Practical implementations in RLHF, demonstration-based methods, and hierarchical RL improve sample complexity and drive robust, real-world applications.
Kullback-Leibler (KL) regularization estimators are central to contemporary reinforcement learning (RL), particularly in settings involving expert demonstrations, policy-optimization stability, generalization, and safe exploration. By penalizing divergence from a reference policy (often an expert, prior, or behavior policy), KL-based regularizers impose a constraint that shapes the optimization landscape: they affect sample complexity and estimator design, and they enable robust, theoretically grounded algorithms for both online and offline RL, including reinforcement learning from human feedback (RLHF).
1. Mathematical Foundations and Core Estimators
The KL-regularized RL objective augments the expected return with a penalty term controlling the divergence between the current policy $\pi_\theta$ and a fixed reference policy $\pi_{\mathrm{ref}}$ (for example an expert, prior, or behavior-cloned policy):

$$J_\beta(\pi_\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\Big[\textstyle\sum_{t} r(s_t, a_t)\Big] \;-\; \beta\, \mathrm{KL}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),$$

where $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}) = \mathbb{E}_{\pi_\theta}\!\big[\log \tfrac{\pi_\theta(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}\big]$ and $\beta > 0$ is the regularization strength.
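As a deliberately minimal illustration of this objective, the following sketch evaluates $J_\beta$ for a one-step discrete policy and compares the reference policy against the Gibbs-tilted optimum $\pi^{*} \propto \pi_{\mathrm{ref}}\, e^{r/\beta}$; all probabilities, rewards, and variable names here are toy placeholders rather than values from any cited work.

```python
import numpy as np

def kl_regularized_objective(pi, pi_ref, rewards, beta):
    """J_beta(pi) = E_pi[r] - beta * KL(pi || pi_ref) for a one-step discrete policy."""
    return np.dot(pi, rewards) - beta * np.sum(pi * np.log(pi / pi_ref))

pi_ref = np.array([0.5, 0.3, 0.2])        # reference (e.g., SFT / behavior-cloned) policy
rewards = np.array([1.0, 0.2, -0.5])      # toy per-action rewards
beta = 0.5                                # regularization strength

# The maximizer of J_beta over the simplex is the Gibbs tilt of the reference policy:
# pi_star(a) ∝ pi_ref(a) * exp(r(a) / beta).
pi_star = pi_ref * np.exp(rewards / beta)
pi_star /= pi_star.sum()

print(kl_regularized_objective(pi_ref, pi_ref, rewards, beta))   # KL term is zero here
print(kl_regularized_objective(pi_star, pi_ref, rewards, beta))  # strictly higher value
```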
In practice, the KL is computed at the sequence, trajectory, or per-decision level, either exactly (tabular settings and policy iteration) or via estimators in deep RL and sequence models:
- On-policy Monte Carlo estimator: Averages the per-sample log-ratio $\log \tfrac{\pi_\theta(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}$ over trajectories rolled out by the current policy. This is unbiased but can have high variance, especially for long horizons (Korbak et al., 2022).
- Importance-weighted off-policy estimator: Used when samples come from an off-policy behavior distribution $\mu$; corrects the expectation via likelihood ratios $\tfrac{\pi_\theta(a \mid s)}{\mu(a \mid s)}$ (Zhang et al., 23 May 2025).
- Density-ratio/classifier-based estimator: Approximates the likelihood ratio via discriminators, but introduces additional estimation bias (Korbak et al., 2022).
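A minimal numpy sketch of the first two estimators above, evaluated on a toy categorical distribution so that the exact KL is available for comparison; the probabilities, sample sizes, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete policies over 5 actions (illustrative probabilities).
pi     = np.array([0.40, 0.25, 0.15, 0.15, 0.05])  # current policy
pi_ref = np.array([0.30, 0.30, 0.20, 0.10, 0.10])  # reference policy
mu     = np.array([0.20, 0.20, 0.20, 0.20, 0.20])  # off-policy behavior distribution

exact_kl = np.sum(pi * np.log(pi / pi_ref))

# On-policy Monte Carlo (log-ratio) estimator: average log(pi/pi_ref) over samples from pi.
a_on = rng.choice(5, size=100_000, p=pi)
mc_estimate = np.mean(np.log(pi[a_on] / pi_ref[a_on]))

# Importance-weighted off-policy estimator: samples from mu, corrected by pi/mu.
a_off = rng.choice(5, size=100_000, p=mu)
iw_estimate = np.mean((pi[a_off] / mu[a_off]) * np.log(pi[a_off] / pi_ref[a_off]))

print(exact_kl, mc_estimate, iw_estimate)  # all three should roughly agree
```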
For large action spaces (e.g., LLMs), exact KL computation over the full output space is infeasible. Top-$k$ KL estimators balance unbiasedness and tractability by summing explicitly over the $k$ most probable actions and sampling a sparse correction term for the tail, maintaining unbiased value and gradient estimation for any $k$ (Zhang et al., 4 Feb 2026).
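The sketch below illustrates the general top-$k$-plus-sampled-tail construction on a single categorical next-token distribution: the head is summed exactly and the residual tail mass is corrected with one sample, which keeps the value estimate unbiased for any $k$. It is an illustrative reconstruction of the idea, not the exact estimator of Zhang et al. (4 Feb 2026); the vocabulary size and distributions are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

def topk_kl_estimate(pi, pi_ref, k):
    """Exact sum over the k most probable actions under pi, plus an unbiased
    single-sample correction for the remaining tail mass."""
    order = np.argsort(pi)[::-1]
    head, tail = order[:k], order[k:]
    log_ratio = np.log(pi / pi_ref)

    head_term = np.sum(pi[head] * log_ratio[head])      # computed exactly
    tail_mass = pi[tail].sum()
    if tail_mass <= 0.0:
        return head_term
    # Sample one tail action proportionally to pi restricted to the tail;
    # tail_mass * log_ratio is then an unbiased estimate of the tail contribution.
    a = rng.choice(tail, p=pi[tail] / tail_mass)
    return head_term + tail_mass * log_ratio[a]

vocab = 1000
logits = rng.normal(size=vocab)
pi = np.exp(logits) / np.exp(logits).sum()
ref_logits = logits + 0.3 * rng.normal(size=vocab)
pi_ref = np.exp(ref_logits) / np.exp(ref_logits).sum()

exact = np.sum(pi * np.log(pi / pi_ref))
estimates = [topk_kl_estimate(pi, pi_ref, k=64) for _ in range(2000)]
print(exact, np.mean(estimates))  # the estimator is unbiased for any k
```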
2. KL Regularization in RL Algorithms and Sample Complexity
KL-regularized algorithms modify value iteration, policy gradients, and actor-critic methods to include the penalty in either their policy improvement or value evaluation steps:
- Demonstration-Regularized RL (Tiapkin et al., 2023): Leverages KL regularization toward a behavior-cloned reference $\pi^{\mathrm{BC}}$ estimated from expert demonstrations to guide exploration and accelerate policy identification. Formally, per-step KL regularization is integrated into the Bellman operators, $V(s) = \max_{\pi}\big\{\mathbb{E}_{a\sim\pi}[Q(s,a)] - \beta\,\mathrm{KL}\big(\pi(\cdot\mid s)\,\|\,\pi^{\mathrm{BC}}(\cdot\mid s)\big)\big\} = \beta \log \sum_{a} \pi^{\mathrm{BC}}(a\mid s)\, e^{Q(s,a)/\beta}$, with the global objective $\max_{\pi}\ \mathbb{E}_{\pi}\big[\sum_{t} r_t\big] - \beta\,\mathbb{E}_{\pi}\big[\sum_{t}\mathrm{KL}\big(\pi(\cdot\mid s_t)\,\|\,\pi^{\mathrm{BC}}(\cdot\mid s_t)\big)\big]$.
- Sample Complexity: KL-regularized estimators (demonstration or reference-policy-based) can yield substantially improved rates:
- $\widetilde{O}\!\big(\mathrm{poly}(S,A,H)/(\varepsilon^{2} N^{\mathrm{E}})\big)$ in tabular MDPs and $\widetilde{O}\!\big(\mathrm{poly}(d,H)/(\varepsilon^{2} N^{\mathrm{E}})\big)$ for linear MDPs, where $N^{\mathrm{E}}$ is the number of expert demonstrations and $S$, $A$, $H$, $d$ denote the numbers of states and actions, the horizon, and the feature dimensionality (Tiapkin et al., 2023).
- In RLHF/contextual bandits, KL regularization gives $\mathcal{O}(1/\epsilon)$ sample complexity in small-$\epsilon$ regimes, in contrast to the $\mathcal{O}(1/\epsilon^{2})$ rate of unregularized RL (Zhao et al., 2024).
- Density-Ratio or Bandit-based KL-UCB/LSVI Algorithms: Incorporate the KL term into an optimism-based Gibbs posterior, e.g., $\pi(a \mid s) \propto \pi_{\mathrm{ref}}(a \mid s)\, \exp\!\big(\hat{r}(s,a)/\beta\big)$ with an optimistic reward estimate $\hat{r}$, yielding logarithmic regret in online settings (Zhao et al., 11 Feb 2025).
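As a self-contained sketch of how per-step KL regularization enters the Bellman backup, the following runs KL-regularized (soft) value iteration toward a reference policy on a synthetic tabular MDP; the MDP, the stand-in $\pi^{\mathrm{BC}}$, and all constants are assumptions for illustration, not the algorithm of any cited paper.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, beta = 6, 3, 0.9, 0.5

# Synthetic tabular MDP and a placeholder behavior-cloned reference policy.
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a] = next-state distribution
r = rng.uniform(size=(S, A))                    # rewards
pi_bc = rng.dirichlet(np.ones(A), size=S)       # reference policy pi_bc(a|s)

V = np.zeros(S)
for _ in range(500):
    Q = r + gamma * P @ V                       # Q[s, a]
    # KL-regularized backup: V(s) = beta * log sum_a pi_bc(a|s) * exp(Q(s,a)/beta)
    V = beta * np.log(np.sum(pi_bc * np.exp(Q / beta), axis=1))

# Greedy KL-regularized policy: pi(a|s) ∝ pi_bc(a|s) * exp(Q(s,a)/beta)
pi = pi_bc * np.exp(Q / beta)
pi /= pi.sum(axis=1, keepdims=True)
print(V.round(3))
```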
3. Policy Gradient, Mirror Descent, and RPG Estimators
Design choices for KL estimators and their integration into policy gradient methods are critical:
- REINFORCE-style with Stop-Gradient: Using the token-/trajectory-level KL estimate as a reward penalty, with gradients flowing only through the policy's log-probability (stop-gradient on the KL term itself), yields unbiased reverse-KL gradients (Shah et al., 26 Dec 2025, Zhang et al., 23 May 2025).
- "k1/k2/k3" Penalties: In LLM RL, standard forms include plug-in KL (), forward KL (), and the Schulman estimator (). Only in reward yields unbiased reverse KL gradients in on-policy settings. Using in loss produces biased (distillation-like) gradients (Shah et al., 26 Dec 2025).
- Off-policy Correction and Importance Weights: For unbiased off-policy policy gradients, the KL estimator must be weighted by the likelihood ratio $\tfrac{\pi_\theta(a \mid s)}{\mu(a \mid s)}$ of the current policy to the behavior policy $\mu$. Misweighting (as in prior implementations of GRPO) results in policy collapse or under-regularization (Zhang et al., 23 May 2025).
- Top-$k$ KL Estimator: Efficient and unbiased for both the KL value and its gradient at any $k$, with $k$ controlling the computation-variance trade-off (Zhang et al., 4 Feb 2026).
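The following PyTorch sketch ties these design choices together for a toy one-step problem: the $k_1$ log-ratio is folded into the reward with a stop-gradient, and off-policy samples from a behavior distribution $\mu$ are reweighted by $\pi_\theta/\mu$. It is a hedged illustration of the estimator placement discussed above, not the reference implementation of any cited method; all sizes and hyperparameters are arbitrary.

```python
import torch

torch.manual_seed(0)
A, beta = 8, 0.1

logits = torch.zeros(A, requires_grad=True)      # policy parameters
ref_logits = torch.randn(A)                      # fixed reference policy
mu_probs = torch.full((A,), 1.0 / A)             # off-policy behavior distribution
rewards = torch.randn(A)                         # toy per-action rewards

opt = torch.optim.Adam([logits], lr=0.05)
for step in range(500):
    log_pi = torch.log_softmax(logits, dim=-1)
    log_ref = torch.log_softmax(ref_logits, dim=-1)

    a = torch.multinomial(mu_probs, num_samples=256, replacement=True)  # samples from mu
    k1 = (log_pi[a] - log_ref[a]).detach()       # stop-gradient KL (log-ratio) estimate
    adv = rewards[a] - beta * k1                 # KL penalty folded into the reward
    w = (log_pi[a].exp() / mu_probs[a]).detach() # importance weight pi_theta / mu

    loss = -(w * adv * log_pi[a]).mean()         # REINFORCE surrogate
    opt.zero_grad()
    loss.backward()
    opt.step()

# The learned policy should tilt the reference toward high-reward actions.
print(torch.softmax(logits, -1))
```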
4. Extensions: Hierarchy, Differential Privacy, and Geometric Regularization
- KL-Regularized Hierarchical RL: Extends the regularizer to multi-level policies (latent variables), enabling the transfer of structure across tasks; the KL becomes a mixture of KLs over high- and low-level controllers (Tirumala et al., 2019).
- Differential Privacy (DP) in KL-Regularized RLHF: Under $\varepsilon$-local differential privacy (LDP), both offline and online KL-regularized RLHF methods achieve suboptimality-gap and regret bounds that degrade gracefully with the privacy budget, with the KL penalty computed exactly because the reference policy is public (Wu et al., 15 Oct 2025).
- Adaptive and Dynamic KL Penalties: Geometric Value Iteration dynamically adapts the KL penalty coefficient inversely with an estimate of the update error, dampening high-variance steps and yielding weighted-average error guarantees (Kitamura et al., 2021).
5. Theoretical Perspectives: Variational Inference and Control
- Variational/Bayesian Perspective: KL-regularized RL can be interpreted as performing reverse-KL variational inference toward the "Boltzmann posterior" $\pi^{*}(x) \propto \pi_{\mathrm{ref}}(x)\, \exp\!\big(R(x)/\beta\big)$, preserving diversity and preventing mode collapse relative to standard RL (Korbak et al., 2022).
- Mirror Descent and Regularized Bellman Backups: Mirror descent value iteration (MDVI) alternates KL/entropy-regularized policy improvement with sample-based Bellman steps, providing minimax-optimal sample complexity in generative model settings (Kozuno et al., 2022).
- Information-Geometric KL Generalizations: Replacing the Fisher-Rao geometry of KL with Wasserstein or Kalman-Wasserstein metrics yields regularizers that avoid KL's singularity under support mismatch or degenerate noise, retaining well-posedness and control robustness in the low-noise limit (Stein et al., 2 Feb 2026).
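A minimal tabular sketch in the spirit of the mirror-descent bullet above: each improvement step multiplies the previous policy by $e^{\eta Q}$, which is the closed-form solution of an update regularized by KL to the previous policy. It omits MDVI's entropy term and sample-based Bellman backups, and the MDP here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma, eta = 5, 3, 0.9, 1.0

P = rng.dirichlet(np.ones(S), size=(S, A))       # transition kernel P[s, a] -> next state
r = rng.uniform(size=(S, A))                     # rewards
pi = np.full((S, A), 1.0 / A)                    # initial (uniform) policy

for _ in range(100):
    # Policy evaluation: solve V = r_pi + gamma * P_pi V for the current policy.
    r_pi = np.sum(pi * r, axis=1)
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V
    # Mirror-descent improvement: pi_{k+1}(a|s) ∝ pi_k(a|s) * exp(eta * Q(s,a)),
    # i.e. the KL(pi_{k+1} || pi_k)-regularized greedy step.
    pi = pi * np.exp(eta * (Q - Q.max(axis=1, keepdims=True)))
    pi /= pi.sum(axis=1, keepdims=True)

print(V.round(3))
```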
6. Practical Design Choices and Empirical Effects
| Estimator | Unbiased Value | Unbiased Gradient | Off-policy Correction Needed | Computational Cost |
|---|---|---|---|---|
| Full Trajectory MC | Yes | Yes | Yes (importance weights) | High |
| Tokenwise ("in reward") | Yes | Yes | Yes | Moderate |
| Top-$k$ KL | Yes | Yes | Optional (see (Zhang et al., 4 Feb 2026)) | Tunable ($k$) |
| $k_3$/'Schulman' in loss | Yes | No | Yes | Moderate |
| Density ratio/classifier | No | No | Yes | Varies |
- Stable On-policy Updates: Use a REINFORCE-style gradient with the log-ratio ($k_1$) estimator in the reward; adding the KL estimator directly to the loss without a stop-gradient leads to instability and performance collapse (Shah et al., 26 Dec 2025).
- Sample Efficiency: KL regularization compresses expert information (when used with demonstrations or SFT policies), turns one-step improvements into quadratic, gap-based bounds, and, under RLHF, allows two-stage strategies to attain additive (rather than multiplicative) coverage dependence (Tiapkin et al., 2023, Zhao et al., 2024).
- Exploration and Robustness: Adaptive (statewise, error-aware) KL weighting, as well as planner-informed KL priors (e.g., in model-based RL), stabilizes learning and accelerates convergence in high-dimensional domains (Serra-Gomez et al., 5 Oct 2025, Cai et al., 2022).
- Limited Mode Coverage: The "mode-seeking" vs. "mass-covering" dichotomy is oversimplified; in practice, both forward and reverse KL typically induce unimodal targets unless reward scaling or explicit multimodal augmentation is performed, such as MARA (mode-anchored reward augmentation) (GX-Chen et al., 23 Oct 2025).
7. Connections and Broader Implications
- KL Regularization in RLHF/LLMs: In RLHF for LLMs, reverse-KL regularization to the SFT/reference model is critical for alignment and distributional support control. Correct estimator choice and placement (per-token, in-reward, importance weighting) directly affect sample efficiency, generalization to out-of-distribution tasks, and overall training stability (Shah et al., 26 Dec 2025, Zhang et al., 23 May 2025).
- Off-policy and Large-batch Regimes: Large-scale distributed RL for LMs frequently relies on off-policy batching, which makes exact correction and unbiased KL estimation essential for scalable and robust learning (Zhang et al., 23 May 2025, Zhang et al., 4 Feb 2026).
- Unified Policy Gradient View: Regularized Policy Gradient (RPG) unifies forward/reverse, normalized/unnormalized KL penalties and their estimator implementations, specifying importance weighting and clipping strategies for unbiased and stable training at scale (Zhang et al., 23 May 2025).
KL-regularized estimators inform practically all state-of-the-art algorithms for RL with demonstrations, RLHF, robust policy optimization, and sample-efficient exploration, and are central to both the current theoretical landscape and emerging scalable applications.