KL-Regularized MDPs: Foundations & Applications

Updated 12 July 2025
  • KL-Regularized MDPs are sequential decision-making models that include a KL divergence penalty to balance reward maximization and policy stability.
  • They modify traditional Bellman operators with smooth, regularized updates that enable robust policy evaluation and efficient learning.
  • These methods are applied in robotics, queuing networks, and online control, offering resilience to uncertainties in dynamics and rewards.

KL-Regularized Markov Decision Processes (MDPs) are a class of sequential decision-making models in which the optimization objective for an agent includes not only standard reward (or cost) terms but also a regularization term given by the Kullback–Leibler (KL) divergence between the controlled dynamics (or policy) and a reference measure or passive dynamics. This framework has become central in the design and analysis of modern reinforcement learning and control algorithms, where stability, robustness, and efficient exploration are critical. The KL regularization is implemented both as a penalty on deviation from natural or baseline behaviors and as a regularizing function in the modified Bellman operators used in planning and learning.

1. Foundations of KL Regularization in MDPs

In KL-regularized MDPs, the agent's decision at each state is augmented by a control cost that quantifies the divergence from a reference or passive policy/dynamics using the relative entropy (KL divergence). The single-step cost typically takes the form

c(x, u) = f(x) + D(u \,\|\, P^*(x,\cdot)),

where f(x) is the state cost, u is a chosen next-state distribution (the action), and P^*(x,\cdot) is the passive (default) transition distribution. The KL divergence is defined as

D(u \,\|\, P^*(x,\cdot)) = \sum_{y} u(y) \log \left( \frac{u(y)}{P^*(x, y)} \right),

and is always nonnegative, achieving zero only when u = P^*(x,\cdot). If u puts mass where P^*(x,\cdot) is zero, the cost is infinite, enforcing feasibility constraints on the policy (1401.3198).
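
A minimal Python sketch of this single-step cost, assuming a finite set of successor states; the state cost, the chosen distribution, and the passive kernel below are illustrative placeholders rather than quantities from the cited work.

    import numpy as np

    def kl_divergence(u, p):
        # D(u || p) for finite distributions; +inf if u puts mass where p has none.
        u, p = np.asarray(u, dtype=float), np.asarray(p, dtype=float)
        support = u > 0
        if np.any(p[support] == 0):
            return np.inf
        return float(np.sum(u[support] * np.log(u[support] / p[support])))

    def single_step_cost(f_x, u, p_passive_row):
        # c(x, u) = f(x) + D(u || P*(x, .)) for a fixed state x.
        return f_x + kl_divergence(u, p_passive_row)

    # Example: three successor states, uniform passive dynamics.
    p_star = np.array([1/3, 1/3, 1/3])
    u = np.array([0.7, 0.2, 0.1])   # controlled next-state distribution
    print(single_step_cost(f_x=1.0, u=u, p_passive_row=p_star))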

The inclusion of the KL term serves dual roles: it encourages policies close to the passive dynamics and provides a structural regularization that aids tractable computation. In many practical algorithms, the regularizer instead takes the form \mathrm{KL}(\pi(\cdot|s) \,\|\, \mu) for policy regularization relative to some reference distribution \mu (2503.21224).

2. Regularized Bellman Operators and Dynamic Programming

The KL regularization alters the classical dynamic programming recursion, replacing the hard maximization (or minimization) in the Bellman operator with a regularized, often smooth, alternative. For a generic state s, action a, and regularizer \Omega (with the KL divergence as a key instance),

T_{*,\Omega} v(s) = \max_{\pi \in \Delta_{\mathcal{A}}} \Big\{ \langle \pi, q(s, \cdot) \rangle - \Omega(\pi) \Big\},

where q(s,a) = r(s,a) + \gamma \mathbb{E}_{s'|s,a}[v(s')]. For negative Shannon entropy regularization with temperature \tau > 0, this maximization leads to the softmax (log-sum-exp) form

T_{*,\Omega} v(s) = \tau \log \sum_{a} \exp \left( \frac{q(s,a)}{\tau} \right),

and the optimal “soft” policy is

\pi(a|s) = \frac{\exp \big(q(s, a)/\tau\big)}{\sum_{b} \exp \big(q(s, b)/\tau\big)}.

This approach yields unique, smooth policies and can be generalized to any strongly convex regularizer \Omega via its Legendre–Fenchel transform (1901.11275).
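
As a concrete illustration, the following Python sketch applies the entropy-regularized backup and recovers the softmax policy for a small tabular MDP; the array shapes (R of shape states × actions, P of shape states × actions × states) and the function name are assumptions made for this example, not an interface from the cited papers.

    import numpy as np

    def soft_bellman_backup(v, R, P, gamma, tau):
        # Entropy-regularized backup: T v(s) = tau * log sum_a exp(q(s,a)/tau),
        # with q(s,a) = r(s,a) + gamma * E[v(s') | s, a].
        q = R + gamma * (P @ v)                      # shape (S, A)
        q_max = q.max(axis=1, keepdims=True)         # shift for a stable log-sum-exp
        v_new = tau * np.log(np.exp((q - q_max) / tau).sum(axis=1)) + q_max[:, 0]
        policy = np.exp((q - q_max) / tau)
        policy /= policy.sum(axis=1, keepdims=True)  # softmax policy pi(a|s)
        return v_new, policy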

When the regularization is expressed as a KL divergence relative to a baseline policy \pi_0, the Bellman update becomes

v(s) = \tau \log \sum_{a} \exp \left( \frac{q(s, a) + \tau \log \pi_0(a|s)}{\tau} \right),

preserving a direct connection to trust region and entropy-regularized RL algorithms.
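
Under the same assumed setup as the previous sketch, this update can be evaluated equivalently as v(s) = τ log Σ_a π_0(a|s) exp(q(s,a)/τ); with a uniform baseline it reduces to the entropy-regularized backup shifted by the constant -τ log|A|. A hedged sketch:

    import numpy as np

    def kl_to_baseline_backup(v, R, P, gamma, tau, pi0):
        # v(s) = tau * log sum_a pi0(a|s) * exp(q(s,a)/tau), i.e. the display above
        # with the tau * log pi0 term absorbed into a weighted log-sum-exp.
        q = R + gamma * (P @ v)
        q_max = q.max(axis=1, keepdims=True)
        weighted = pi0 * np.exp((q - q_max) / tau)   # pi0 has shape (S, A)
        return tau * np.log(weighted.sum(axis=1)) + q_max[:, 0]

    # With pi0(a|s) = 1/|A| (uniform baseline), this equals the value output of the
    # entropy-regularized backup minus the constant tau * log(|A|).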

3. Duality Between Regularization and Robustness

A significant insight in recent research is the equivalence between KL (or entropy) regularization and robustness to model uncertainty. Regularized MDPs, where a penalty is subtracted from the Bellman operator, can be shown to be equivalent to robust MDPs with uncertainty in the reward function. Specifically,

v(s) \leq T_{(P_0, r_0)}^{\pi} v(s) - \sigma_{R_s}(-\pi_s),

where \sigma_{R_s} is the support function of an uncertainty set R_s for the reward (2110.06267, 2303.06654). For regularizers like the KL divergence,

\Omega(\pi_s) = \sum_a \pi_s(a) \log \frac{\pi_s(a)}{d(a)}

corresponds to selecting an uncertainty set R_{s,a}^{(\mathrm{KL})}(\pi) = \log d(a) + [\log(1/\pi_s(a)), +\infty). When both rewards and transitions are uncertain, “twice regularized” (R²) MDPs emerge, leading to a regularization term that depends on both the policy and the value function.
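
A small numeric check of this correspondence, written under the reading that each per-action interval is R_{s,a} = [\log d(a) - \log \pi_s(a), +\infty) and that the support function is evaluated at -\pi_s; the particular numbers are arbitrary.

    import numpy as np

    pi_s = np.array([0.5, 0.3, 0.2])   # policy at state s
    d    = np.array([0.4, 0.4, 0.2])   # reference distribution d(a)

    # KL regularizer: Omega(pi_s) = sum_a pi_s(a) * log(pi_s(a) / d(a))
    omega = np.sum(pi_s * np.log(pi_s / d))

    # Support function sigma_{R_s}(-pi_s): every coordinate of -pi_s is negative, so the
    # supremum over r_a in [log d(a) - log pi_s(a), +inf) sits at the lower endpoint.
    r_lower = np.log(d) - np.log(pi_s)
    penalty = np.sum(-pi_s * r_lower)

    print(np.isclose(omega, penalty))  # True: the robust penalty equals the KL regularizer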

This duality formally connects regularized RL algorithms with robust optimal control approaches, revealing that the use of a KL-regularizer inherently provides resilience to certain model or reward perturbations (2303.06654).

4. Algorithmic Methodologies and Computational Aspects

KL-regularized MDPs have motivated efficient computational strategies that depart from classical dynamic programming. Key approaches include:

  • Policy/Value Iteration with Regularized Bellman Operators: The regularized operators retain contraction properties, ensuring geometric convergence in planning and learning (1901.11275); see the value-iteration sketch after this list.
  • Online/Regret-Minimization Algorithms: Strategies such as phase-based (“lazy”) updates use the KL cost to enable computationally efficient online learning with provable sublinear regret, as in target tracking problems (1401.3198).
  • Bi-level and Two-Timescale Algorithms: Optimization problems arising from projection onto function approximation subspaces (e.g., with linear features) are tackled by bi-level methods. Fast updates approximate Bellman backups, while slow updates adjust projections, yielding convergence rates of O(T^{-1/4}) under standard assumptions. These frameworks handle both function approximation and regularization, connecting to soft Q-learning and KL-regularized RL (2401.15196).
  • Multilevel Monte Carlo (MLMC) Methods: For high-dimensional or continuous spaces, regularized (soft) Bellman operators admit efficient Monte Carlo evaluation. MLMC techniques lower sample complexity bounds, with unbiased (randomized) estimators achieving polynomial sample complexity independent of state/action space size (2503.21224).
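
The value-iteration sketch referenced in the first bullet, in Python: it iterates the entropy-regularized (soft) Bellman operator on a toy tabular problem until the sup-norm change is small, illustrating the geometric convergence that the contraction property guarantees. The problem size, temperature, and all names are illustrative assumptions, not an algorithm from the cited papers.

    import numpy as np

    def soft_backup(v, R, P, gamma, tau):
        # One application of the entropy-regularized Bellman operator (log-sum-exp form).
        q = R + gamma * (P @ v)
        q_max = q.max(axis=1, keepdims=True)
        return tau * np.log(np.exp((q - q_max) / tau).sum(axis=1)) + q_max[:, 0]

    def regularized_value_iteration(R, P, gamma=0.9, tau=0.5, tol=1e-8, max_iter=10_000):
        # The soft operator is a gamma-contraction in the sup norm, so this loop
        # converges geometrically to its unique fixed point.
        v = np.zeros(R.shape[0])
        for _ in range(max_iter):
            v_new = soft_backup(v, R, P, gamma, tau)
            if np.max(np.abs(v_new - v)) < tol:
                break
            v = v_new
        return v_new

    # Toy problem: 3 states, 2 actions, random but row-stochastic transition kernel.
    rng = np.random.default_rng(0)
    R = rng.standard_normal((3, 2))
    P = rng.random((3, 2, 3))
    P /= P.sum(axis=-1, keepdims=True)
    v_star = regularized_value_iteration(R, P)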

A representative table contrasts computational properties:

Method | Sample/Iteration Complexity | Suitability
Tabular DP + KL regularization | Poly(states × actions) | Small finite spaces
MLMC (unbiased) | Polynomial in accuracy ε | Large/continuous spaces
Bi-level Q-learning | O(T^{-1/4}) (finite time) | Feature-based approximation

5. Empirical Demonstrations and Practical Impact

Empirical studies have validated the practical advantages of KL-regularized MDPs across a range of controlled and real-world scenarios:

  • Target Tracking on Graphs: KL-regularized online algorithms outperform sampled stationary policies in minimizing cumulative cost and exhibit sublinear regret growth (1401.3198).
  • Queuing Networks: Dual LP-based RL methods with low-dimensional feature constraints demonstrate performance improvement over standard heuristics; KL-tempered approaches provide complementary stability (1402.6763).
  • Online Shopping and Session Management: Regularized policies, especially those with relative entropic (KL) priors, generalize robustly on empirical MDPs derived from user logs, outperforming both unregularized and immediate-reward-based strategies (2208.02362).
  • Robustness to Dynamics and Reward Noise: Twice-regularized (R²) policy iteration and Q-learning maintain robust performance under adversarial changes or estimation errors, with lower computational overhead than explicit max–min robust optimization (2303.06654).
  • Kernelized MDPs: Incorporating KL regularization in GP-based RL methods in continuous domains leverages uncertainty quantification for more stable, data-efficient updates (1805.08052).

6. Theoretical Guarantees and Error Bounds

The general theory for regularized MDPs establishes that:

  • Modified Bellman operators with KL or entropy penalties remain contractive under standard conditions, ensuring existence and uniqueness of value solutions and convergence of policy iteration (1901.11275); the contraction bound is stated after this list.
  • Algorithms based on MDRL (Mirror Descent Reinforcement Learning), including trust region and proximal updates with KL divergence, have explicit error propagation bounds linked to regularization strength and approximation error.
  • MLMC estimators for soft Bellman operators provide error decay rates and complexity guarantees that are independent of the state/action space size, crucial for scalability in continuous domains (2503.21224).
  • In function approximation settings, finite-time guarantees relate the distance between learned and optimal regularized value functions to sample size, approximation class, and inherent bias from regularizer smoothness (2401.15196).
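
For reference, the contraction property behind the first bullet can be stated (in the notation of Section 2) as

\| T_{*,\Omega} v_1 - T_{*,\Omega} v_2 \|_\infty \leq \gamma \, \| v_1 - v_2 \|_\infty, \qquad 0 \leq \gamma < 1,

so iterating the operator from any v_0 converges geometrically to its unique fixed point v^*_\Omega, with \| v_k - v^*_\Omega \|_\infty \leq \gamma^k \| v_0 - v^*_\Omega \|_\infty.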

7. Limitations and Implementation Considerations

While KL regularization imparts robustness and computational tractability, several practical issues arise:

  • Choice of Reference Measure/Baseline: The effectiveness of KL regularization depends on an appropriate choice of baseline policy or dynamics; inappropriate selection can degrade policy quality or convergence properties (2110.06267).
  • Value-Dependent Regularization in R² MDPs: When both reward and transition uncertainties are present, the regularization term becomes value-dependent, complicating policy optimization and possibly necessitating algorithmic modifications (2303.06654).
  • Computational Overhead in Large-Scale Settings: While multilevel and bi-level techniques reduce complexity, practical implementation requires careful calibration of sampling and optimization parameters to realize theoretical guarantees (2503.21224).
  • Tuning of Regularization Strength (τ): Excessive regularization leads to overly conservative (or passive) policies, while too little regularization sacrifices stability; this trade-off is highlighted in both synthetic and empirical studies (2208.02362).

References Table: Key Papers

Area | Reference [arXiv] | Key Contribution
Online KL-control, regret bounds | 1401.3198 | Phase-based online learning with KL cost, sublinear regret
Large-scale RL with constraints | 1402.6763 | Low-dimensional dual LP approaches, contrasting KL regularization
ODE approach to KL-MDPs | 1605.04591 | ODE-based computation for parametric families of KL-regularized MDPs
Regularized Bellman theory, mirror descent | 1901.11275 | Unified regularization framework, error propagation, and mirror descent
Robustness-regularization equivalence | 2110.06267, 2303.06654 | R² MDPs, duality to robust control, policy/value-dependent regularization
Bayesian/prior-based regularization | 2208.02362 | Relative entropy priors, robustness to empirical model noise
Bi-level Q-learning, finite-time theory | 2401.15196 | Convergence rate for regularized Q-learning with function approximation
MLMC for KL/entropy regularization | 2503.21224 | Polynomial sample complexity for soft Bellman operator approximation

Summary

KL-regularized MDPs extend classical models by systematically penalizing deviations from a reference behavior through the KL divergence, affording both computational tractability and robustness. Modern RL algorithms widely utilize these principles to balance exploration and exploitation, stabilize policy updates, and provide resilience to estimation and model errors. Theoretical advances establishing the equivalence of regularization and robustness further unify perspectives from convex optimization, control theory, and modern reinforcement learning. Current research emphasizes scaling these concepts to high-dimensional and continuous domains using variance reduction, functional approximation, and efficient policy iteration schemes. The practical utility of KL-regularized MDPs is validated by applications in robotics, online control, and large-scale decision-making systems.