
Entropy-Regularised MDPs: Theory & Applications

Updated 17 October 2025
  • Entropy-regularized MDPs are defined by the addition of convex penalties, such as conditional entropy, to promote robust, non-deterministic policies.
  • They reformulate Bellman equations into a softmax structure and utilize optimization techniques like mirror descent and dual averaging to unify modern RL algorithms.
  • Empirical studies demonstrate that tuning the regularization parameter balances exploration and exploitation while providing theoretical convergence guarantees.

Entropy-regularized Markov decision processes (MDPs) augment the standard MDP control framework with convex regularization—most notably, (conditional) entropy terms—to promote robust, exploratory, or otherwise non-deterministic policies. This approach yields a principled trade-off between reward maximization (exploitation) and exploration, offering theoretical justification and algorithmic unification for a broad class of contemporary reinforcement learning (RL) algorithms, including Trust-Region Policy Optimization (TRPO) and entropy-regularized policy gradient methods (Neu et al., 2017).

1. Convex-regularized Linear Programming Formulation

Classical average-reward MDPs seek a stationary joint state–action distribution $\mu^*$ by solving a linear program:

  • Feasible set: $\Delta = \{ \mu \in \Delta(\mathcal{X}\times\mathcal{A}) : \sum_b \mu(y,b) = \sum_{x,a} P(y|x,a)\mu(x,a),\ \forall y \}$
  • Objective: $\mu^* = \arg\max_{\mu\in\Delta} \sum_{x,a} \mu(x,a)\, r(x,a)$

The entropy-regularized generalization introduces a convex penalty $R(\mu)$ to obtain a smoothed objective:

$$\max_{\mu \in \Delta} \left\{ \sum_{x,a}\mu(x,a)\, r(x,a) - \frac{1}{\eta} R(\mu) \right\}$$

Two choices of $R$ are featured:

  • Relative entropy (KL divergence) to a baseline $\mu'$: $R_S(\mu) = \sum_{x,a} \mu(x,a) \log \frac{\mu(x,a)}{\mu'(x,a)}$
  • Conditional entropy of the induced policy: $R_C(\mu) = \sum_x \nu_\mu(x) \sum_a \pi_\mu(a|x) \log \pi_\mu(a|x)$, where $\nu_\mu(x) = \sum_a \mu(x,a)$ and $\pi_\mu(a|x) = \mu(x,a)/\nu_\mu(x)$

This formalism embeds the standard MDP in a convex-optimization landscape, where entropy regularizers discourage overly deterministic (“peaky”) solutions.
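As a concrete illustration, the following sketch evaluates the conditional-entropy-regularized objective for a toy joint distribution. The arrays, their shapes, and the value of $\eta$ are illustrative assumptions rather than anything specified in the paper.

```python
import numpy as np

def conditional_entropy_penalty(mu):
    """R_C(mu) = sum_x nu_mu(x) sum_a pi_mu(a|x) log pi_mu(a|x) for a joint
    state-action distribution mu of shape [num_states, num_actions]."""
    nu = mu.sum(axis=1, keepdims=True)                            # state marginal nu_mu(x)
    pi = np.divide(mu, nu, out=np.zeros_like(mu), where=nu > 0)   # induced policy pi_mu(a|x)
    log_pi = np.where(pi > 0, np.log(np.where(pi > 0, pi, 1.0)), 0.0)  # convention: 0 log 0 = 0
    return float((nu * pi * log_pi).sum())

def regularized_objective(mu, r, eta):
    """Expected reward minus (1/eta) * R_C(mu): the smoothed LP objective."""
    return float((mu * r).sum()) - conditional_entropy_penalty(mu) / eta

# Toy numbers (2 states, 3 actions), purely illustrative.
mu = np.array([[0.20, 0.10, 0.10],
               [0.30, 0.20, 0.10]])
r  = np.array([[1.0, 0.0, 0.5],
               [0.2, 0.8, 0.0]])
print(regularized_objective(mu, r, eta=1.0))
```

Because $R_C(\mu) \le 0$, subtracting $\frac{1}{\eta} R_C(\mu)$ rewards high-entropy (less deterministic) policies, with smaller $\eta$ giving the penalty more weight.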

2. Dual Formulation and Regularized Bellman Equations

When $R = R_C$ (conditional entropy regularization), the dual of the regularized LP closely parallels the Bellman optimality equations but with a “softmax” structure:

$$V(x) = \frac{1}{\eta}\log \sum_a \pi_{\mu'}(a|x) \exp \left\{ \eta\left[ r(x,a) - \lambda + \sum_y P(y|x,a)V(y)\right] \right\}$$

The optimal average reward $\rho^*_\eta$ and the optimal advantage function $A_\eta(x,a) = r(x,a) + \sum_y P(y|x,a)V(y) - V(x)$ arise naturally in this dual view. The appearance of log-sum-exp (“softmax”) in place of the hard max operator leads to unique, stochastic optimal policies parameterized by $\eta$.

This duality is fundamental: it enables rigorous convergence analysis and generalizes dynamic programming to a broader, regularized context.
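The log-sum-exp backup can be made concrete with a short value-iteration sketch. The code below is a discounted-case analogue of the average-reward equation above (the offset $\lambda$ is replaced by a discount factor $\gamma$), assumes a uniform baseline policy $\pi_{\mu'}$, and uses hypothetical tensor shapes; it is a minimal sketch, not the paper's exact algorithm.

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(P, r, eta, gamma=0.95, iters=1000):
    """Softmax (log-sum-exp) Bellman backups, discounted-case sketch.

    P: transition tensor of shape [S, A, S] with P[x, a, y] = P(y | x, a)
    r: reward table of shape [S, A]
    eta: regularization temperature (larger eta -> closer to the hard max)
    """
    S, A, _ = P.shape
    log_prior = np.full((S, A), -np.log(A))           # uniform baseline pi'(a|x)
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * np.einsum("xay,y->xa", P, V)  # r(x,a) + gamma * E[V(y)]
        V = (1.0 / eta) * logsumexp(eta * Q + log_prior, axis=1)
    # The unique optimal stochastic policy is a softmax over the soft Q-values.
    Q = r + gamma * np.einsum("xay,y->xa", P, V)
    logits = eta * Q + log_prior
    pi = np.exp(logits - logsumexp(logits, axis=1, keepdims=True))
    return V, pi

# Usage on a made-up 3-state, 2-action MDP:
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # random row-stochastic transitions
r = rng.uniform(size=(3, 2))
V, pi = soft_value_iteration(P, r, eta=2.0)
```

As $\eta \to \infty$ the log-sum-exp backup approaches the hard max and the returned policy approaches a deterministic greedy policy.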

3. Algorithmic Interpretation: Mirror Descent and Dual Averaging

The regularized control problem can be solved via two archetypal first-order methods that are ubiquitous across optimization and RL:

  • Mirror Descent: The update is driven by the Bregman divergence $D_S$ associated with the regularizer $R$:

    $$\mu_{k+1} = \arg\max_{\mu\in\Delta} \left\{ \rho(\mu) - \frac{1}{\eta} D_S(\mu\,\|\,\mu_k) \right\}$$

    For conditional entropy regularization, this update yields:

    $$\pi_{k+1}(a|x) \propto \pi_k(a|x) \exp\left\{ \eta A_{\infty}^{(\pi_k)}(x,a) \right\}$$

    The exact mirror descent step coincides (after suitable approximation) with the policy improvement rule in TRPO and the so-called MDP-E algorithm; global convergence to the entropy-regularized optimum is guaranteed. A tabular sketch of this update appears at the end of this section.

  • Dual Averaging: Here, algorithms such as entropy-regularized policy gradient methods (e.g., A3C) update policies by optimizing a sequence of surrogate objectives:

    $$\arg\max_{\pi} \sum_x \nu_{\pi_k}(x) \sum_a \pi(a|x) \left[ A_{\infty}^{(\pi_k)}(x,a) - \frac{1}{\eta_k}\log \pi(a|x) \right]$$

    This is equivalent to “follow-the-(regularized)-leader.” However, because the objective is non-convex and the surrogate changes at each iteration, such methods may fail to converge or may become trapped in poor local optima, in contrast to the convergence guarantees of mirror descent schemes.

The formalization of these widely used RL algorithms as approximate variants of mirror descent or dual averaging provides both unified theoretical understanding and an avenue for importing algorithmic advances from convex optimization.
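For concreteness, here is a minimal tabular sketch of the exact mirror-descent update $\pi_{k+1}(a|x) \propto \pi_k(a|x)\exp\{\eta A_{\infty}^{(\pi_k)}(x,a)\}$ from the list above. The advantage table is assumed to come from a separate policy-evaluation step (not shown), and the numbers in the usage snippet are made up.

```python
import numpy as np

def mirror_descent_policy_step(pi_k, advantages, eta):
    """One exact mirror-descent (TRPO-style) update in the tabular case:
    pi_{k+1}(a|x) proportional to pi_k(a|x) * exp(eta * A(x, a)).
    pi_k, advantages: arrays of shape [num_states, num_actions]."""
    logits = np.log(pi_k) + eta * advantages
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    unnormalized = np.exp(logits)
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

# Illustrative usage (2 states, 3 actions), starting from a uniform policy:
pi_k = np.full((2, 3), 1.0 / 3.0)
A_k  = np.array([[0.5, -0.2, 0.1],
                 [0.0,  0.3, -0.4]])
pi_next = mirror_descent_policy_step(pi_k, A_k, eta=1.0)
```

Iterating this step (with exact advantages) is the idealized TRPO scheme whose global convergence is discussed below.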

4. Empirical Analysis and the Role of the Regularization Parameter

Empirical results on a grid-world MDP highlight the sensitivity of learned policies to the trade-off parameter $\eta$ (a small numerical illustration of this effect follows the list):

  • Small $\eta$ (strong regularization) produces highly diffuse, high-entropy policies that under-exploit the reward structure.
  • Large $\eta$ (weak regularization) leads to “greedy,” low-entropy, often suboptimal policies that may become myopically stuck exploiting an intermediate reward.
  • Intermediate $\eta$ enables exploration sufficient to discover optimal strategies while still concentrating probability mass on successful trajectories.
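This qualitative effect of $\eta$ can be reproduced in a few lines: for a fixed advantage vector, the softmax policy sharpens and its entropy falls as $\eta$ grows. The advantage values below are invented purely for illustration.

```python
import numpy as np

# Illustrative only: entropy of the softmax policy over a fixed, made-up
# advantage vector as the trade-off parameter eta varies.
advantages = np.array([1.0, 0.5, 0.0, -0.5])
for eta in (0.1, 1.0, 10.0):
    logits = eta * advantages
    pi = np.exp(logits - logits.max())   # numerically stable softmax
    pi /= pi.sum()
    entropy = -np.sum(pi * np.log(pi))
    print(f"eta={eta:5.1f}  policy={np.round(pi, 3)}  entropy={entropy:.3f}")
```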

Algorithmic comparisons among regularized value iteration (RegVI), dynamic policy programming (DPP), TRPO, and dual averaging methods show that, when properly implemented, dual averaging achieves comparable or slightly better performance than approximate mirror descent variants.

5. Theoretical Implications and Convergence Guarantees

The conditional entropy penalty introduces a softmax structure in the Bellman equations, yielding several important theoretical outcomes:

  • Provable convergence of exact TRPO: The exact mirror-descent interpretation establishes that (in its ideal form) TRPO is globally convergent to the optimal entropy-regularized policy, thus subsuming older methods such as MDP-E within a modern RL context.
  • Critique of policy gradient methods: The instability of nonconvex, surrogate-based dual averaging (policy gradient) approaches is made explicit—there is no general fixed-point guarantee.
  • Facilitation of RL algorithm design: Framing entropy-regularized MDPs in convex-optimization terms admits the direct application of methods from online convex optimization, such as Composite Objective Mirror Descent and Regularized Dual Averaging, into RL.
  • Clarification of exploration: The entropy penalty not only regularizes but also drives improved exploration—the dual perspective explains this as a direct byproduct of the “softmaxing” and smoothness imposed by entropy.

This synthesis establishes that conditional entropy regularization, by connecting to convex duality and mirror descent, yields robust theoretical and practical convergence properties, setting reliable guidelines for the design of modern RL algorithms.

6. Summary Table: Regularization, Algorithms, and Properties

Regularizer | Algorithmic Family | Convergence Guarantee
Conditional entropy $R_C$ | Exact TRPO / Mirror Descent | Provable (global)
Relative entropy $R_S$ | REPS, Policy Mirror Descent | Provable under assumptions
Conditional entropy (per-iteration surrogate) | Policy Gradient / Dual Averaging (e.g., A3C) | No guarantee (nonconvex)

The mapping from regularizer to both algorithm structure and theoretical guarantee is central to understanding the regularized control landscape.

7. Impact and Future Directions

The convex-optimization inspired framework for entropy-regularized MDPs unifies the analysis and design of contemporary RL algorithms. It enables rigorous understanding of convergence and exploration properties, clarifies why and how entropy regularization improves RL control, and highlights the superiority (on theoretical grounds) of mirror descent-based methods over surrogate-based policy gradient techniques.

This unified approach motivates direct translation of advances from convex optimization theory into RL practice, provides a foundation for new classes of robust RL algorithms, and elevates entropy regularization from an empirical “trick” to a mathematically grounded pillar of policy optimization (Neu et al., 2017).

References

Neu, G., Jonsson, A., & Gómez, V. (2017). A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798.
