Entropy-Regularized MDPs: Theory & Applications
- Entropy-regularized MDPs are defined by the addition of convex penalties, such as conditional entropy, to promote robust, non-deterministic policies.
- They reformulate Bellman equations into a softmax structure and utilize optimization techniques like mirror descent and dual averaging to unify modern RL algorithms.
- Empirical studies demonstrate that tuning the regularization parameter balances exploration and exploitation while providing theoretical convergence guarantees.
Entropy-regularized Markov decision processes (MDPs) augment the standard MDP control framework with convex regularization—most notably, (conditional) entropy terms—to promote robust, exploratory, or otherwise non-deterministic policies. This yields a principled trade-off between reward maximization and exploration, and it offers theoretical justification and algorithmic unification for a broad class of contemporary reinforcement learning (RL) algorithms, including Trust-Region Policy Optimization (TRPO) and entropy-regularized policy gradient methods (Neu et al., 2017).
1. Convex-regularized Linear Programming Formulation
Classical average-reward MDPs seek a stationary joint state–action distribution $\mu$ by solving a linear program:
- Feasible set: the polytope $\Delta = \{\mu \ge 0 : \sum_{a} \mu(x', a) = \sum_{x, a} P(x' \mid x, a)\,\mu(x, a)\ \ \forall x',\ \ \sum_{x, a} \mu(x, a) = 1\}$ of distributions consistent with the transition kernel $P$.
- Objective: $\max_{\mu \in \Delta}\ \langle \mu, r \rangle = \sum_{x, a} \mu(x, a)\, r(x, a)$, the long-run average reward.
The entropy-regularized generalization introduces a convex penalty $R(\mu)$ to obtain a smoothed objective:
$$\max_{\mu \in \Delta}\ \langle \mu, r \rangle - \tfrac{1}{\eta}\, R(\mu),$$
where $\eta > 0$ controls the strength of regularization. Two choices of $R$ are featured:
- Relative Entropy (KL divergence) to a baseline $\mu'$: $R(\mu) = D(\mu \,\|\, \mu') = \sum_{x, a} \mu(x, a) \log \frac{\mu(x, a)}{\mu'(x, a)}$
- Conditional Entropy of the induced policy: $R(\mu) = -H_\mu(A \mid X) = \sum_{x, a} \mu(x, a) \log \frac{\mu(x, a)}{\nu_\mu(x)}$, where $\nu_\mu(x) = \sum_{a} \mu(x, a)$ and $\pi_\mu(a \mid x) = \mu(x, a) / \nu_\mu(x)$
This formalism embeds the standard MDP in a convex-optimization landscape, where entropy regularizers discourage overly deterministic (“peaky”) solutions.
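To make the smoothed objective concrete, here is a minimal NumPy sketch that evaluates $\langle \mu, r \rangle - \tfrac{1}{\eta} R(\mu)$ for the conditional-entropy regularizer on a toy two-state, two-action occupancy measure. The function names and the toy numbers are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def conditional_entropy_penalty(mu):
    """R(mu) = sum_{x,a} mu(x,a) * log(mu(x,a) / nu(x)), i.e. the negative
    conditional entropy -H(A|X) of the policy induced by mu (always <= 0)."""
    nu = mu.sum(axis=1, keepdims=True)       # state marginal nu(x)
    return np.sum(mu * np.log(mu / nu))

def regularized_objective(mu, r, eta):
    """Smoothed LP objective <mu, r> - (1/eta) * R(mu)."""
    return np.sum(mu * r) - conditional_entropy_penalty(mu) / eta

# toy 2-state, 2-action occupancy measure (entries sum to 1) and reward table
mu = np.array([[0.30, 0.20],
               [0.25, 0.25]])
r = np.array([[1.0, 0.0],
              [0.5, 0.2]])
print(regularized_objective(mu, r, eta=1.0))
```

Because $R(\mu) \le 0$ for the conditional-entropy choice, subtracting $\tfrac{1}{\eta} R(\mu)$ rewards occupancy measures whose induced policies are closer to uniform.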
2. Dual Formulation and Regularized Bellman Equations
When $R(\mu) = -H_\mu(A \mid X)$ (conditional entropy regularization), the dual of the regularized LP closely parallels the Bellman optimality equations but with a "softmax" structure:
$$V^*(x) = \frac{1}{\eta} \log \sum_{a} \exp\Bigl(\eta \bigl(r(x, a) - \rho^* + \textstyle\sum_{x'} P(x' \mid x, a)\, V^*(x')\bigr)\Bigr).$$
The optimal average reward $\rho^*$ and the optimal advantage function $A^*(x, a)$ arise naturally in this dual view. The appearance of log-sum-exp ("softmax") in place of the hard max operator leads to a unique, stochastic optimal policy $\pi^*(a \mid x) \propto \exp\bigl(\eta\, A^*(x, a)\bigr)$.
This duality is fundamental: it enables rigorous convergence analysis and generalizes dynamic programming to a broader, regularized context.
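The soft Bellman backup can be implemented directly as a log-sum-exp fixed-point iteration. The paper works in the average-reward setting; the sketch below uses the discounted analogue for brevity, and the function name, discount factor, and random toy MDP are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(P, r, eta, gamma=0.95, iters=500):
    """Discounted analogue of the soft Bellman backup:
    V(x) = (1/eta) * log sum_a exp(eta * Q(x, a)),
    Q(x, a) = r(x, a) + gamma * sum_x' P(x, a, x') * V(x')."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        Q = r + gamma * P @ V                  # shape (states, actions)
        V = logsumexp(eta * Q, axis=1) / eta   # log-sum-exp replaces the hard max
    Q = r + gamma * P @ V
    pi = np.exp(eta * (Q - V[:, None]))        # pi(a|x) proportional to exp(eta * A(x, a))
    return V, pi / pi.sum(axis=1, keepdims=True)

# toy random MDP: P[x, a, x'] is a transition kernel, r[x, a] a reward table
rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A))
V, pi = soft_value_iteration(P, r, eta=2.0)
```

As $\eta \to \infty$ the log-sum-exp approaches the hard max and the policy becomes greedy; as $\eta \to 0$ the policy approaches uniform.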
3. Algorithmic Interpretation: Mirror Descent and Dual Averaging
The regularized control problem can be solved via two archetypal first-order methods that are ubiquitous across optimization and RL:
- Mirror Descent: This is driven by the Bregman divergence $D_R(\mu \,\|\, \mu_k)$ induced by the regularizer $R$:
$$\mu_{k+1} = \arg\max_{\mu \in \Delta}\ \langle \mu, r \rangle - \tfrac{1}{\eta}\, D_R(\mu \,\|\, \mu_k).$$
For conditional entropy regularization, this update yields the multiplicative policy update $\pi_{k+1}(a \mid x) \propto \pi_k(a \mid x)\, \exp\bigl(\eta\, A^{\pi_k}(x, a)\bigr)$ (see the sketch after this list). The exact mirror descent step coincides (after suitable approximation) with the policy improvement rule in TRPO and the so-called MDP-E algorithm; global convergence to the entropy-regularized optimum is guaranteed.
- Dual Averaging: Here, algorithms such as the entropy-regularized policy gradient methods (e.g., A3C) update policies by optimizing a sequence of surrogate objectives of the form
$$\pi_{k+1} \approx \arg\max_{\pi}\ \langle \mu_{\pi}, r \rangle - \tfrac{1}{\eta_k}\, R(\mu_\pi),$$
with a step-size (temperature) schedule $\eta_k$. This is equivalent to "follow-the-(regularized)-leader." However, because the objective is non-convex in the policy parameters and the surrogate changes at each iteration, such methods may fail to converge or may become trapped in poor local optima, in contrast with the convergence guarantees of mirror descent schemes.
The formalization of these widely used RL algorithms as approximate variants of mirror descent or dual averaging provides both unified theoretical understanding and an avenue for importing algorithmic advances from convex optimization.
4. Empirical Analysis and the Role of the Regularization Parameter
Empirical results presented on a grid-world MDP highlight the sensitivity of learned policies to the trade-off parameter $\eta$:
- Small $\eta$ (strong regularization) produces highly diffuse, high-entropy policies that under-exploit the reward structure.
- Large $\eta$ (weak regularization) leads to "greedy," low-entropy, often suboptimal policies that may become myopically stuck exploiting an intermediate reward.
- Intermediate $\eta$ enables exploration sufficient to discover optimal strategies while still concentrating probability mass on successful trajectories (a toy illustration of this temperature effect follows the list).
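A minimal illustration of this temperature effect, using a made-up advantage vector rather than the paper's grid world: as $\eta$ grows, the softmax policy moves from near-uniform (high entropy) to near-greedy (low entropy).

```python
import numpy as np

def softmax_policy(advantages, eta):
    z = eta * advantages
    z = z - z.max()                  # numerical stabilization
    p = np.exp(z)
    return p / p.sum()

adv = np.array([1.0, 0.9, 0.1])      # hypothetical advantages for three actions
for eta in (0.1, 1.0, 10.0):
    p = softmax_policy(adv, eta)
    entropy = -(p * np.log(p)).sum()
    print(f"eta={eta:5.1f}  pi={np.round(p, 3)}  H(pi)={entropy:.3f}")
```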
Algorithmic comparisons among regularized value iteration (RegVI), dynamic policy programming (DPP), TRPO, and dual averaging methods show that, when properly implemented, dual averaging achieves comparable or slightly better performance than approximate mirror descent variants.
5. Theoretical Implications and Convergence Guarantees
The conditional entropy penalty introduces a softmax structure into the Bellman equations, yielding several important theoretical outcomes:
- Provable convergence of exact TRPO: The exact mirror-descent interpretation establishes that (in its ideal form) TRPO is globally convergent to the optimal entropy-regularized policy, thus subsuming older methods such as MDP-E within a modern RL context.
- Critique of policy gradient methods: The instability of nonconvex, surrogate-based dual averaging (policy gradient) approaches is made explicit—there is no general fixed-point guarantee.
- Facilitation of RL algorithm design: Framing entropy-regularized MDPs in convex-optimization terms admits the direct application of methods from online convex optimization, such as Composite Objective Mirror Descent and Regularized Dual Averaging, into RL.
- Clarification of exploration: The entropy penalty not only regularizes but also drives improved exploration—the dual perspective explains this as a direct byproduct of the “softmaxing” and smoothness imposed by entropy.
This synthesis establishes that conditional entropy regularization, by connecting to convex duality and mirror descent, yields robust theoretical and practical convergence properties, setting reliable guidelines for the design of modern RL algorithms.
6. Summary Table: Regularization, Algorithms, and Properties
| Regularizer | Algorithmic Family | Convergence Guarantee |
|---|---|---|
| Conditional Entropy | Exact TRPO / Mirror Descent | Provable (global) |
| Relative Entropy | REPS, Policy Mirror Descent | Provable under assumptions |
| Entropy bonus on a non-convex surrogate | Policy Gradient / Dual Averaging (A3C) | No guarantee (non-convex) |
The mapping from regularizer to both algorithm structure and theoretical guarantee is central to understanding the regularized control landscape.
7. Impact and Future Directions
The convex-optimization inspired framework for entropy-regularized MDPs unifies the analysis and design of contemporary RL algorithms. It enables rigorous understanding of convergence and exploration properties, clarifies why and how entropy regularization improves RL control, and highlights the superiority (on theoretical grounds) of mirror descent-based methods over surrogate-based policy gradient techniques.
This unified approach motivates direct translation of advances from convex optimization theory into RL practice, provides a foundation for new classes of robust RL algorithms, and elevates entropy regularization from an empirical “trick” to a mathematically grounded pillar of policy optimization (Neu et al., 2017).