KL-Regularized Reinforcement Learning

Updated 27 October 2025
  • KL-Regularized RL is a reinforcement learning paradigm that adds a KL divergence penalty to align the learned policy with a reference policy.
  • It promotes sample efficiency and transfer by integrating prior knowledge, stabilizing training, and enabling hierarchical, modular policy representations.
  • This approach underpins advances in robotic control, multi-task learning, and language model fine-tuning while inspiring scalable off-policy research.

KL-regularized reinforcement learning (KL-RL) refers to a paradigm in which the learning agent’s policy is explicitly regularized by a Kullback–Leibler (KL) divergence term with respect to a reference or default policy. This approach extends the standard reinforcement learning (RL) framework by incorporating prior knowledge, promoting stability, and facilitating transfer, while also enabling new forms of hierarchical and structured policy representations. KL-regularized RL is used as a foundational component in numerous recent advances across robotic control, transfer/multi-task reinforcement learning, RL from human feedback (RLHF), and large-scale LLM fine-tuning.

1. The KL-Regularized RL Objective and Theoretical Underpinnings

In KL-regularized RL, the standard expected return objective is augmented with a trajectory-level or per-step KL divergence penalty between the agent’s learned (or online) policy π and a reference, prior, or default policy π₀. The generic objective is of the form:

\mathcal{L}(\pi, \pi_0) = \mathbb{E}_{\tau \sim \pi}\left[\,\sum_{t=0}^\infty \gamma^t\, r(s_t, a_t)\, -\, \alpha \gamma^t\, \mathrm{KL}\big(\pi(a_t\,|\,x_t) \;\|\; \pi_0(a_t\,|\,x_t)\big)\,\right]

where γ ∈ [0, 1) is the discount factor and α > 0 governs the trade-off between maximizing reward and remaining close to the reference policy. The expectation is over trajectories generated by π and environment dynamics.
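As a concrete illustration, the following minimal NumPy sketch (not from the source; the toy trajectory and probabilities are invented) estimates this objective for a single sampled trajectory with discrete actions by accumulating the discounted reward and subtracting the discounted per-step KL penalty:

```python
import numpy as np

def kl_discrete(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def kl_regularized_return(rewards, pi_probs, pi0_probs, gamma=0.99, alpha=0.1):
    """Single-trajectory estimate of
    sum_t gamma^t * [ r_t - alpha * KL(pi(.|x_t) || pi0(.|x_t)) ]."""
    total = 0.0
    for t, (r, p, p0) in enumerate(zip(rewards, pi_probs, pi0_probs)):
        total += gamma**t * (r - alpha * kl_discrete(p, p0))
    return total

# Toy trajectory: 3 steps, 2 discrete actions, uniform reference policy.
rewards   = [1.0, 0.0, 2.0]
pi_probs  = [[0.9, 0.1], [0.7, 0.3], [0.6, 0.4]]   # learned policy pi(.|x_t)
pi0_probs = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]   # reference policy pi0(.|x_t)
print(kl_regularized_return(rewards, pi_probs, pi0_probs))
```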

The optimal policy under this objective admits the structure

\pi^\ast(a_t \,|\, x_t) \propto \pi_0(a_t\,|\,x_t)\, \exp\left(\frac{1}{\alpha} Q^\ast(x_t, a_t)\right)

so the agent’s policy “tilts” π₀ toward actions with high value, recovering the softmax (Boltzmann) policy when π₀ is uniform. This regularization recovers entropy-regularized RL as a special case and generalizes to more sophisticated settings when π₀ is nontrivial.
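For discrete action spaces this closed form is easy to check numerically. The sketch below (illustrative only; the Q-values and reference probabilities are made up) tilts π₀ by exp(Q/α) and normalizes, which reduces to an ordinary softmax over Q/α when π₀ is uniform:

```python
import numpy as np

def kl_optimal_policy(q_values, pi0, alpha=1.0):
    """pi*(a|x) proportional to pi0(a|x) * exp(Q(x,a) / alpha)."""
    logits = np.log(np.asarray(pi0, float)) + np.asarray(q_values, float) / alpha
    logits -= logits.max()                      # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

q = np.array([1.0, 1.5, 0.2])                   # Q*(x, a) for 3 actions
uniform = np.ones(3) / 3
skewed  = np.array([0.7, 0.2, 0.1])             # non-uniform reference policy

print(kl_optimal_policy(q, uniform))            # equals softmax(q / alpha)
print(kl_optimal_policy(q, skewed))             # tilted toward the reference
```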

KL regularization introduces strong convexity into the policy update (for the reverse KL), stabilizes training by contracting each update toward the reference policy, and provides both theoretical and empirical benefits in sample efficiency, safety, and generalization.

2. Hierarchical KL-Regularized Policies and Inductive Biases

A salient methodological advancement is the extension of KL-regularized RL to hierarchical policies parameterized by latent variables, yielding a two-level factorization:

\pi(a_t, z_t \,|\, x_t) = \pi^H(z_t \,|\, x_t)\,\pi^L(a_t \,|\, z_t, x_t)

where z_t is a high-level (HL) abstract action sampled from πH and a_t is generated conditionally by a low-level (LL) controller πL. The default or prior policy π₀ is decomposed analogously.

This hierarchical composition introduces several structural inductive biases:

  • Separation of Planning and Skill: HL accesses broad task context, choosing abstract actions (e.g., goals, targets), while LL acts based on limited proprioceptive information, enabling plug-and-play transfer of motor skills.
  • Information Asymmetry: The HL policy receives full state or goal information while the LL policy receives only body-centric signals; this asymmetry yields LL controllers that are more task-agnostic and transferable.
  • KL Decomposition: The total KL can be decomposed and upper bounded by the sum of HL and expected LL divergences:

\mathrm{KL}\big(\pi(a_t\,|\,x_t) \,\|\, \pi_0(a_t\,|\,x_t)\big) \leq \mathrm{KL}\big(\pi^H(z_t\,|\,x_t) \,\|\, \pi_0^H(z_t\,|\,x_t)\big) + \mathbb{E}_{\pi^H}\left[\,\mathrm{KL}\big(\pi^L(a_t\,|\,z_t,x_t) \,\|\, \pi_0^L(a_t\,|\,z_t,x_t)\big)\,\right]

enabling modular, layer-wise regularization.

Hierarchical KL-regularized structures allow modular training and transfer, with reusable LL primitives across bodies and HL policies easily retargeted to new contexts or morphologies.
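The layer-wise bound above can be estimated directly from the HL and LL policy heads. The following sketch assumes diagonal-Gaussian HL and LL distributions and a single-context Monte Carlo estimate of the expectation over π^H; the toy linear heads stand in for learned networks and are not the source's implementation:

```python
import numpy as np

def kl_diag_gaussian(mu_p, std_p, mu_q, std_q):
    """KL(N(mu_p, diag(std_p^2)) || N(mu_q, diag(std_q^2)))."""
    var_p, var_q = std_p ** 2, std_q ** 2
    return float(np.sum(np.log(std_q / std_p)
                        + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q) - 0.5))

def hierarchical_kl_bound(hl, hl0, ll, ll0, x, n_samples=8):
    """Upper bound KL_HL + E_{z ~ pi_HL}[KL_LL(z)] on the per-step KL(pi || pi0).
    hl/hl0 map x -> (mu, std) over latents z; ll/ll0 map (z, x) -> (mu, std) over actions."""
    mu_h, std_h = hl(x)
    mu_h0, std_h0 = hl0(x)
    bound = kl_diag_gaussian(mu_h, std_h, mu_h0, std_h0)
    ll_terms = []
    for _ in range(n_samples):                  # Monte Carlo over z ~ pi_HL(.|x)
        z = mu_h + std_h * np.random.randn(*mu_h.shape)
        mu_l, std_l = ll(z, x)
        mu_l0, std_l0 = ll0(z, x)
        ll_terms.append(kl_diag_gaussian(mu_l, std_l, mu_l0, std_l0))
    return bound + float(np.mean(ll_terms))

# Toy linear "heads" standing in for learned networks.
hl  = lambda x: (0.5 * x, np.ones_like(x))
hl0 = lambda x: (np.zeros_like(x), np.ones_like(x))
ll  = lambda z, x: (z + 0.1 * x, 0.5 * np.ones_like(x))
ll0 = lambda z, x: (z, 0.5 * np.ones_like(x))
print(hierarchical_kl_bound(hl, hl0, ll, ll0, x=np.array([1.0, -2.0])))
```

Because the bound splits into an HL term and an expected LL term, each layer can be regularized (or frozen and reused) independently.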

3. Empirical Performance: Learning Speed, Transfer, and Modularity

Experimentally, KL-regularized hierarchical RL outperforms both flat entropy-regularized agents and simpler KL methods (such as Distral) on a suite of challenging continuous control benchmarks spanning locomotion (e.g., “go to one of three targets,” gap-jumping) and manipulation (e.g., box gathering, object relocation).

Noteworthy empirical findings include:

  • Learning Acceleration: Hierarchical agents trained with AR(1) or learned HL priors learn faster and reach higher final reward.
  • Task Transfer: Pre-trained default policies encoding LL motor knowledge yield rapid adaptation on related tasks with no retraining of HL controllers.
  • Body Transfer: In simulated morphological transfer (e.g., from Ant to Quadruped or Ball), transferring the HL controller together with KL regularization imposes a shaping signal, improving “body transfer” with dense reward—even when the LL controller must adapt to new actuation dynamics.
  • Partial Parameter Sharing: Sharing LL network parameters between π and π₀ during training accelerates convergence, particularly in manipulation domains, validating the modularity hypothesis.

Figure 1 in the source paper shows learning curves illustrating these speed-ups and improvements relative to baselines.

4. Connections to Information Bottleneck, Variational EM, and Structured Transfer

KL-regularized objectives, particularly when the default policy is learned with restricted information or “bottlenecked” state inputs, connect to broader information-theoretic principles:

  • Limiting the KL divergence between agent and default policies can be interpreted as imposing a channel capacity constraint on the information flow from goal/task variable to action, closely aligned with the conditional mutual information:

\mathrm{MI}[G_t; a_t \,|\, D_t] \leq \mathbb{E}\left[\log \frac{\pi(a_t\,|\,G_t,D_t)}{\pi(a_t\,|\,D_t)}\right].

This bound encourages cross-task generalization and skill learning in the default policy.

  • Alternating updates for π and π₀ mirror the E- and M-steps of variational EM, with the agent as “posterior” and the default as “prior”: the default is compelled to extract reusable structure, while goal-specific adaptation is reserved for the main policy.

These concepts provide a formal foundation for KL-regularized RL as both RL and joint generative modeling of structured behavior (Galashov et al., 2019).
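To make the alternation concrete, here is a tabular sketch (illustrative only, not the algorithm of the cited papers): the E-step sets each task policy to its KL-regularized optimum against the current default, and the M-step refits the default by minimizing the summed KL from the task policies, which in the tabular case is just their per-state average. The Q-tables are assumed fixed and given, whereas in practice they would be estimated by RL, and the default's information asymmetry (conditioning on less information) is omitted for brevity:

```python
import numpy as np

def alternating_em(q_tables, alpha=1.0, n_rounds=5):
    """E-step: pi_k(a|s) proportional to pi0(a|s) * exp(Q_k(s,a)/alpha) (agent as "posterior").
    M-step: pi0 = argmin_q sum_k KL(pi_k || q), i.e. the per-state mean of the
    task policies (default as "prior")."""
    n_tasks, n_states, n_actions = q_tables.shape
    pi0 = np.full((n_states, n_actions), 1.0 / n_actions)
    for _ in range(n_rounds):
        logits = np.log(pi0)[None] + q_tables / alpha       # E-step tilt
        logits -= logits.max(axis=-1, keepdims=True)
        pis = np.exp(logits)
        pis /= pis.sum(axis=-1, keepdims=True)
        pi0 = pis.mean(axis=0)                              # M-step refit
    return pis, pi0

# Three toy tasks, two states, two actions: the tasks agree in state 0 and
# disagree in state 1, so the learned default becomes sharp where behaviour
# is shared and stays broad where it is task-specific.
q = np.array([[[2.0, 0.0], [2.0, 0.0]],
              [[2.0, 0.0], [0.0, 2.0]],
              [[2.0, 0.0], [1.0, 1.0]]])
task_policies, default_policy = alternating_em(q)
print(default_policy)
```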

5. Applications and Broader Implications

KL-regularized RL frameworks with hierarchical and structured default policies are applicable in:

  • Robotic Motor Control: Modular skill libraries can be encoded in LL controllers, yielding improved robustness and adaptability when faced with novel tasks or robot morphologies.
  • Multi-Task and Lifelong Learning: Default policies capture recurring behavioral structure, providing scaffolding for rapid learning and transfer across tasks.
  • Adaptive Systems: In modular robots or wearables, high-level policies may be retained across physical changes while learning new LL control.
  • Sparse Reward Domains: KL-regularized methods (when the default is correctly structured) grant a dense shaping signal, expediting learning in sparse or delayed-reward environments.

By capturing and enforcing behavioral regularities, these approaches improve sample efficiency, reduce catastrophic policy drift, and support safer exploration and deployment.

6. Limitations, Challenges, and Future Research Directions

Despite empirical success, several avenues for substantial development remain open:

  • Richer Latent Structures: Current hierarchies use relatively simple parametric or autoregressive priors. Exploring non-Gaussian, structured, or learnable latent processes is a promising direction for complex behavior modeling.
  • Off-Policy and Large-Scale Integration: Developing scalable off-policy algorithms that handle hierarchical latent variables and their credit assignment is an unresolved challenge.
  • Scaling and Generalization: Systematic benchmarking of transfer properties in large-scale, lifelong, or multi-agent settings remains open.
  • Information Asymmetry: More deliberate study of the types and levels of information hidden from LL controllers, and of their effects on transferability and compositionality, is needed.
  • Compositional Regularization: Integration with alternative regularization frameworks such as distillation and information bottleneck methods may produce further generalization benefits.
  • Theory–Practice Gap: The KL regularizer’s weighting parameter α (or β, as used in other sources) is in practice tuned empirically; how it interacts with reward scaling, policy expressivity, and convergence guarantees remains poorly understood.

7. Summary

KL-regularized reinforcement learning augments standard RL with a divergence penalty toward a (potentially structured, learned, and hierarchical) reference policy. This supports the imposition of prior knowledge, information sharing across tasks, modular policy decomposition, and improved transfer in multi-task and lifelong settings. Hierarchical latent-variable models, enabled by this regularization, empirically accelerate learning and support structured transfer across tasks and morphologies. The approach unifies ideas from probabilistic generative modeling, information bottleneck theory, and variational inference, and is broadly applicable in robotics, meta-learning, and beyond. Open challenges remain in scaling to richer latent structures, efficient off-policy learning, and fully understanding the dynamics of regularization parameters and information asymmetry in the hierarchical setting (Tirumala et al., 2019, Galashov et al., 2019).
