KL Divergence Control in MDPs

Updated 30 June 2025
  • KL Divergence Control is a framework that embeds KL divergence in control problems to penalize deviations from a reference process in MDPs.
  • It reformulates the Bellman equation into a linear eigenproblem, facilitating tractable computation of optimal policies.
  • The KL-learning algorithm uses local, online updates to ensure stable convergence and computational efficiency in average-cost reinforcement learning.

Kullback-Leibler (KL) Divergence Control refers to the set of theoretical and algorithmic frameworks that incorporate the KL divergence into the objective of control problems, especially Markov decision processes (MDPs). In these frameworks, the KL divergence penalizes deviations of the controlled system from an underlying (often uncontrolled) Markov process. The result is a class of control problems, known as KL control or Kullback-Leibler control, in which the trade-off between minimizing state-dependent costs and staying close to a reference (uncontrolled) process is made explicit via information-theoretic regularization.

1. Foundations of KL Divergence Control

KL divergence, defined as

$$KL(p \,\|\, q) = \sum_j p(j) \ln \frac{p(j)}{q(j)},$$

quantifies the dissimilarity between two probability distributions $p$ and $q$. In KL control, this quantity is embedded directly in the cost function to penalize departures from the reference system dynamics.

In the context of Markov decision processes with finite state space:

  • Each control action $p(\cdot|i)$ (a stochastic transition from state $i$) is penalized for diverging from the uncontrolled transition $q(\cdot|i)$ by an additive term $\frac{1}{\beta} KL(p(\cdot|i)\,\|\,q(\cdot|i))$.
  • The total per-step cost is then

$$c(j|i) + \frac{1}{\beta} KL\big(p(\cdot|i) \,\|\, q(\cdot|i)\big),$$

where $c(j|i)$ is the immediate cost of transitioning from $i$ to $j$, and $\beta > 0$ balances state costs against control costs.

This regularizes the solution, ensuring the controlled process remains "close" to the behavior of the uncontrolled system, which can aid tractability, encourage exploration, and produce smoother policies.
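
To make the per-step objective concrete, here is a minimal sketch, assuming discrete distributions stored as NumPy arrays; the function names (`kl_divergence`, `regularized_step_cost`) and the three-state example are illustrative, not taken from the source. It computes the KL divergence and the expected KL-regularized cost incurred from a state $i$.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def regularized_step_cost(c_i, p_i, q_i, beta):
    """Expected per-step cost from state i under controlled transitions p(.|i):
       E_p[c(j|i)] + (1/beta) * KL(p(.|i) || q(.|i))."""
    return float(p_i @ c_i) + kl_divergence(p_i, q_i) / beta

# Hypothetical three-state example
q_i = np.array([0.5, 0.3, 0.2])   # uncontrolled transitions from state i
p_i = np.array([0.7, 0.2, 0.1])   # controlled transitions from state i
c_i = np.array([1.0, 0.0, 2.0])   # immediate costs c(j|i)
print(regularized_step_cost(c_i, p_i, q_i, beta=1.0))
```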

2. KL-Learning Algorithm: Stochastic Approximation for KL Control

The KL-learning algorithm addresses the ergodic (average-cost) KL control problem using a stochastic approximation approach appropriate for online, sample-based scenarios.

Average-Cost Formulation and Eigenproblem

The average-cost Bellman equation for state $i$ is

$$\rho + \Phi(i) = \min_{p(\cdot|i)} \sum_j p(j|i) \left[ c(j|i) + \frac{1}{\beta} \ln\left( \frac{p(j|i)}{q(j|i)} \right) + \Phi(j) \right].$$

The optimal solution reduces to an eigenproblem for the matrix

$$h_{ij} = \exp(-\beta c(j|i))\, q(j|i),$$

i.e., finding the Perron-Frobenius eigenpair $(z^*, \lambda^*)$:

$$H z^* = \lambda^* z^*, \qquad \|z^*\|_1 = \lambda^*.$$
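
For intuition, this eigenproblem can be solved directly when the cost matrix $c(j|i)$ and reference dynamics $q(j|i)$ are fully known. The sketch below is an illustration using NumPy power iteration, not the paper's algorithm; the helper names `build_H` and `principal_eigenpair` are assumptions. It builds $H$ and iterates to the Perron-Frobenius eigenpair, rescaling at the end so that $\|z^*\|_1 = \lambda^*$ as in the normalization above.

```python
import numpy as np

def build_H(c, q, beta):
    """h_ij = exp(-beta * c(j|i)) * q(j|i); conventions: c[i, j] = c(j|i), q[i, j] = q(j|i)."""
    return np.exp(-beta * c) * q

def principal_eigenpair(H, iters=10_000, tol=1e-12):
    """Power iteration for the Perron-Frobenius eigenpair of the nonnegative matrix H."""
    n = H.shape[0]
    z = np.ones(n) / n          # positive start vector with ||z||_1 = 1
    lam = 1.0
    for _ in range(iters):
        w = H @ z
        lam_new = w.sum()       # ||Hz||_1 -> lambda*, since z >= 0 and ||z||_1 = 1
        w /= lam_new            # renormalize to ||w||_1 = 1
        if np.abs(w - z).max() < tol:
            z, lam = w, lam_new
            break
        z, lam = w, lam_new
    return lam * z, lam         # rescale so that ||z*||_1 = lambda*
```

This is essentially the power method referenced in Section 3; KL-learning replaces the full matrix-vector product with sampled, per-transition updates.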

Online Algorithm (KL-learning)

The KL-learning algorithm operates using only incremental, local updates on individual transitions:

  1. Initialize $z(i)$ for all states, initialize $\lambda$, and set the starting state.
  2. For each observed transition $x \rightarrow y$, update according to

    $$\Delta = \exp(-\beta c(y|x)) \frac{z(y)}{\lambda} - z(x),$$

    $$z(x) \leftarrow z(x) + \gamma \Delta, \qquad \lambda \leftarrow \lambda + \gamma \Delta,$$

    where $\gamma$ is a gain (step-size) parameter.

  3. Set $x \leftarrow y$ and proceed to the next transition.

This process approximates an associated ODE system, converging (almost surely under standard hypotheses) to the principal eigenpair of $H$, and thus to the optimal value function for the KL-regularized control problem.
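
The following is a minimal sketch of the update loop described above. It assumes access to a sampler `sample_next(x)` that draws $y \sim q(\cdot|x)$ and a function `cost(x, y)` returning $c(y|x)$; these interfaces, the function name `kl_learning`, and the constant step size are illustrative assumptions rather than part of the source.

```python
import numpy as np

def kl_learning(sample_next, cost, n_states, beta, gamma=0.05, n_steps=100_000, x0=0):
    """KL-learning sketch: local, online updates of the eigenvector estimate z
       and the eigenvalue estimate lambda from observed transitions."""
    z = np.ones(n_states)
    lam = 1.0
    x = x0
    for _ in range(n_steps):
        y = sample_next(x)                                    # observe x -> y under q
        delta = np.exp(-beta * cost(x, y)) * z[y] / lam - z[x]
        z[x] += gamma * delta                                 # update only the visited state
        lam += gamma * delta                                  # and the eigenvalue estimate
        x = y
    return z, lam
```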

3. Theoretical Properties and Algorithm Comparison

The main theoretical insights are:

  • The induced Bellman equation retains a linear (in $z$) structure, a consequence of the KL divergence penalty combined with an exponential transformation of the value function.
  • The ODE method offers a rigorous convergence analysis: the discrete updates approximate a deterministic ODE characterized by the invariant measure of $q$ and the control-induced dynamics.
  • The equilibrium point of the ODE corresponds to the unique positive eigenpair of $H$.

Comparison to other methods:

  • Power Method: Requires full access to $H$; each step involves a full matrix-vector multiplication. KL-learning achieves comparable convergence using only local, sample-based updates.
  • Z-learning: A stochastic approximation method for discounted/absorbing MDPs with KL cost that effectively assumes $\lambda^* = 1$; KL-learning does not require this assumption and is suited to ergodic (average-cost) problems.

4. Empirical Performance and Practical Considerations

Experiments (e.g., gridworld navigation) show that KL-learning converges rapidly and stably, matching or exceeding the value-function and policy convergence of power iteration and Z-learning, particularly when only local transition information is available or the problem is large-scale.

Practical implications:

  • Computational efficiency: Each update is $O(1)$; no global matrix storage or computation is required.
  • Model-free reinforcement learning: Supports sample-based, online RL in average-cost MDPs.
  • Parameter tuning: The step-size schedule $\gamma$ requires careful choice for convergence (see the sketch after this list).
  • Robustness: The method performs well under ergodicity assumptions alone (no absorbing state needed).
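
As one hedged illustration of the parameter-tuning point, a common choice is a Robbins-Monro style schedule whose sum diverges while its squared sum stays finite; the constants below are placeholders, not values from the source.

```python
def step_size(t, gamma0=0.1, tau=1000.0):
    """gamma_t = gamma0 / (1 + t / tau): sum(gamma_t) diverges while sum(gamma_t^2)
       stays finite, the standard stochastic-approximation conditions.
       gamma0 and tau are illustrative constants."""
    return gamma0 / (1.0 + t / tau)
```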

Limitations include the absence of a global convergence guarantee for all problem instances, though local stability is established in many cases.

5. Implications for Reinforcement Learning

KL divergence control naturally connects to key themes in reinforcement learning:

  • Exploration-exploitation: The KL term regularizes policy updates, maintaining a degree of stochasticity ("soft" control).
  • Sample efficiency: Algorithms operate in a model-free manner, using only experience samples.
  • Scalability: Stochastic and local updates permit use in large state spaces or with partially known dynamics.
  • Foundation for further research: The KL-learning framework could underpin hierarchical or deep RL schemes where regularization by divergence is a central design element.

6. Key Equations and Summary Table

The key quantities are summarized below.

  • KL cost per step: $KL(p(\cdot|i)\,\|\,q(\cdot|i)) = \sum_j p(j|i)\ln\frac{p(j|i)}{q(j|i)}$
  • Average-cost Bellman equation with KL: $\rho + \Phi(i) = \min_{p(\cdot|i)} \sum_j p(j|i)\left[c(j|i) + \frac{1}{\beta}\ln\frac{p(j|i)}{q(j|i)} + \Phi(j)\right]$
  • Optimal transition: $p^*(j|i) = q(j|i)\,\exp(-\beta c(j|i))\,\frac{z_j^*}{\lambda^* z_i^*}$
  • KL-learning update: $\Delta = \exp(-\beta c(y|x))\,\frac{z(y)}{\lambda} - z(x)$
  • Eigenproblem for $H$: $H z^* = \lambda^* z^*$ with $h_{ij} = \exp(-\beta c(j|i))\,q(j|i)$
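
Given an eigenpair $(z^*, \lambda^*)$ (computed, for instance, with either sketch above), the optimal controlled transitions from the summary can be assembled directly. The sketch below assumes the same NumPy array conventions as before ($c[i,j] = c(j|i)$, $q[i,j] = q(j|i)$); the function name `optimal_policy` is illustrative.

```python
import numpy as np

def optimal_policy(c, q, z, lam, beta):
    """p*(j|i) = q(j|i) * exp(-beta * c(j|i)) * z[j] / (lam * z[i]),
       following the optimal-transition expression in the summary above."""
    p = q * np.exp(-beta * c) * z[None, :] / (lam * z[:, None])
    # At the exact eigenpair each row already sums to 1; renormalize to absorb
    # numerical or estimation error.
    return p / p.sum(axis=1, keepdims=True)
```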

KL divergence control frameworks, and specifically the KL-learning algorithm, provide theoretically principled, sample-based methods to solve control problems in a way that incorporates information-theoretic regularization of the controller. This approach yields efficient, local, and online algorithms for average-cost ergodic MDPs, with direct applications to reinforcement learning in large and complex environments. The convergence and stability properties are supported by ODE theory, and empirical studies confirm robust and efficient learning and control in practical scenarios.
