Entropy-Regularized Optimal Control
- Entropy-regularized optimal control is a framework that augments classical cost functions with an entropy penalty, such as KL divergence, to balance exploitation and exploration.
- It leverages analytical tools like Malliavin calculus, dynamic programming, and iterative algorithms to derive explicit feedback laws and manage stochastic dynamics.
- This approach is applied in reinforcement learning, robust control, and optimal transport, offering improvements in scalability, policy sparsity, and algorithm convergence.
An entropy-regularized optimal control problem is an extension of classical optimal control in which the objective functional combines traditional state-action costs with a penalty term that measures the divergence between a candidate (controlled) probability law and a reference law, such as the uncontrolled dynamics or a prior. The regularization is typically given by the relative entropy (Kullback–Leibler divergence), its generalizations (e.g., Tsallis entropy), or entropy-like information measures (e.g., transfer entropy). This framework, motivated by both theoretical control and reinforcement learning, enables explicit trade-offs between exploitation (cost minimization) and exploration (policy stochasticity), leading to regularized, randomized controllers with advantageous robustness, tractability, and learning properties.
1. Problem Formulation and Theoretical Structure
Given a controlled stochastic process, the entropy-regularized optimal control problem is generally formulated as

$$
\min_{\mathbb{Q} \ll \mathbb{P}} \; \mathbb{E}_{\mathbb{Q}}[C] \;+\; \lambda\, D_{\mathrm{KL}}(\mathbb{Q}\,\|\,\mathbb{P}),
$$

where $C$ is the cost random variable, $\mathbb{P}$ is a reference (typically uncontrolled) probability law, $\mathbb{Q}$ is the controlled measure (absolutely continuous w.r.t. $\mathbb{P}$), $\lambda > 0$ is a regularization weight, and $D_{\mathrm{KL}}$ is the relative entropy.
Under suitable integrability conditions—specifically, when

$$
\mathbb{E}_{\mathbb{P}}\big[e^{-C/\lambda}\big] < \infty,
$$

an explicit solution exists: the optimal $\mathbb{Q}^*$ has likelihood ratio

$$
\frac{d\mathbb{Q}^*}{d\mathbb{P}} \;=\; \frac{e^{-C/\lambda}}{\mathbb{E}_{\mathbb{P}}\big[e^{-C/\lambda}\big]},
$$

as shown in (Bierkens et al., 2012). This solution captures the minimum of a trade-off between the original cost and the informational distance from the baseline law. In the continuous-time case with Wiener reference measure, every absolutely continuous change of measure is identified via Girsanov’s theorem with a drift control $u_t$, and the entropy cost acquires the form

$$
D_{\mathrm{KL}}(\mathbb{Q}\,\|\,\mathbb{P}) \;=\; \tfrac{1}{2}\,\mathbb{E}_{\mathbb{Q}}\!\int_0^T \|u_t\|^2\, dt.
$$

The resulting control problem is equivalent to minimizing expected cost plus a quadratic control energy penalty, and the optimal feedback law admits explicit representation using Malliavin calculus or martingale techniques:

$$
u_t^* \;=\; -\tfrac{1}{\lambda}\, \mathbb{E}_{\mathbb{Q}^*}\big[D_t C \,\big|\, \mathcal{F}_t\big],
$$

where $D_t C$ is the Malliavin derivative of the cost (Bierkens et al., 2012). This explicit structure generalizes across settings—discrete-time dynamics, optimal transport, and mean-field games—by varying the entropy measure, the structure of $\mathbb{P}$, or the form of $C$.
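As a concrete illustration of the Gibbs-type change of measure, the following minimal sketch samples trajectories under the reference (uncontrolled) law and reweights them by $e^{-C/\lambda}$ to approximate expectations under the optimal tilted measure $\mathbb{Q}^*$. The one-dimensional Brownian dynamics, quadratic terminal cost, and parameter values are illustrative assumptions, not taken from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: 1-D Brownian reference dynamics, terminal cost
# C = (x_T - target)^2, temperature lam weighting the KL penalty.
T, n_steps, n_paths, lam, target = 1.0, 100, 20_000, 0.5, 1.0
dt = T / n_steps

# Sample paths under the reference (uncontrolled) law P: dx = dW, x_0 = 0.
dW = rng.normal(scale=np.sqrt(dt), size=(n_paths, n_steps))
x_T = dW.sum(axis=1)
cost = (x_T - target) ** 2                # cost random variable C

# Gibbs change of measure: dQ*/dP proportional to exp(-C / lam).
log_w = -cost / lam
w = np.exp(log_w - log_w.max())           # stabilize before normalizing
w /= w.sum()

# Expectations under the optimal tilted measure Q*.
print("E_P[C]    =", cost.mean())         # reference expectation
print("E_Q*[C]   =", np.dot(w, cost))     # tilted (lower) expected cost
print("E_Q*[x_T] =", np.dot(w, x_T))      # mass shifts toward the target
```

The same exponential reweighting underlies the sample-based path-integral methods discussed in Section 3.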
2. Regularization Types and Sparsity
While the canonical form uses Shannon (relative or differential) entropy, more general regularizers are possible:
- Transfer-entropy regularization: The objective penalizes information flow from states to controls, as in transfer-entropy-regularized MDPs (TERMDPs). This is formulated for Markov Decision Processes as
$$
\min_{\{\pi_t\}} \; \mathbb{E}\Big[\sum_{t=0}^{T-1} c_t(x_t, u_t)\Big] \;+\; \beta\, I(x^T \to u^{T-1}),
$$
where $I(x^T \to u^{T-1})$ is a multi-step transfer entropy from the state process to the control process and $\beta > 0$ controls the information cost (Tanaka et al., 2017).
- Tsallis entropy regularization: Here, the classic entropy is replaced by Tsallis entropy, parametrized by a deformation index $q$, yielding control policies with sparse support (i.e., assigning zero probability to many actions). For the Tsallis index $q=2$ and regularization parameter $\lambda$, the optimal policy becomes the sparsemax
$$
\pi^*(a \mid s) \;=\; \Big[\tfrac{Q(s,a)}{\lambda} - \tau(s)\Big]_+,
$$
where $Q(s,a)$ is the (soft) action-value function and $\tau(s)$ a normalizing threshold, with bounded support, promoting sparsity compared to the dense softmax form produced by Shannon entropy (Hashizume et al., 4 Mar 2024, Nachum et al., 2018); a numerical comparison of the two is sketched at the end of this section.
- Relative entropy in robust control and games: In robust control or order execution with market uncertainty, the problem’s min-max structure becomes
$$
\inf_{u}\, \sup_{\mathbb{Q}}\; \Big\{ \mathbb{E}_{\mathbb{Q}}\big[J(u)\big] \;-\; \tfrac{1}{\eta}\, D_{\mathrm{KL}}(\mathbb{Q}\,\|\,\mathbb{P}_0) \Big\},
$$
where $J(u)$ is the execution cost under control $u$, $\mathbb{P}_0$ is a prior over adversarial/environment actions, and $\eta > 0$ quantifies market resilience (Wang et al., 2023).
The type and strength of entropy regularization directly influence the policy’s degree of exploration, the sparsity pattern, and safety-related margin.
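To visualize the sparsity contrast referenced above, the sketch below compares a Shannon-entropy (softmax) policy with a Tsallis $q=2$ (sparsemax) policy on a single set of assumed action values; the values and temperature are illustrative, and the sparsemax projection corresponds, up to scaling conventions, to the $q=2$ Tsallis-regularized optimum.

```python
import numpy as np

def softmax_policy(q_values, lam):
    """Shannon-entropy-regularized policy: dense softmax over actions."""
    z = q_values / lam
    z = z - z.max()                       # numerical stabilization
    p = np.exp(z)
    return p / p.sum()

def sparsemax_policy(q_values, lam):
    """Tsallis (q=2) regularized policy: Euclidean projection onto the simplex,
    zeroing actions whose value falls below a data-dependent threshold."""
    z = np.sort(q_values / lam)[::-1]     # sort scaled values in descending order
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z > np.cumsum(z)    # actions kept in the support
    k_max = k[support][-1]
    tau = (np.cumsum(z)[k_max - 1] - 1) / k_max
    return np.maximum(q_values / lam - tau, 0.0)

q = np.array([2.0, 1.8, 0.5, -1.0, -3.0])   # illustrative action values
print("softmax  :", np.round(softmax_policy(q, lam=0.5), 3))
print("sparsemax:", np.round(sparsemax_policy(q, lam=0.5), 3))  # zeros on weak actions
```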
3. Methods of Solution and Algorithmic Strategies
The variational structure of entropy-regularized problems allows explicit or semi-explicit characterization of optimal policies:
- Feedback expressions via Malliavin calculus when the cost is smooth (Bierkens et al., 2012).
- Dynamic programming and variational inequalities in continuous/discrete time, leading e.g. to “soft” HJB equations whose minimizer is always (when the reward and system are quadratic/affine) a Gaussian policy:
$$
\pi^*(u \mid x) \;=\; \mathcal{N}\!\big(u;\; -R^{-1}B^{\top}\nabla_x V(x),\; \lambda R^{-1}\big),
$$
where $V$ is the value function, $B$ the input matrix, $R$ the control cost weight, and $\lambda$ is the entropy-regularization parameter (Kim et al., 2020).
- Riccati equations for LQ problems with entropy term: the value function remains quadratic but acquires extra linear terms, and the optimal policy is a Gaussian distribution whose mean and variance are explicitly determined by the solution of a modified Riccati equation (Kim et al., 2020, Ito et al., 2023, Hashizume et al., 4 Mar 2024); a minimal numerical sketch follows after this list.
- Iterative forward-backward or block coordinate algorithms: In the TERMDP setting, a forward–backward recursion generalized from the Arimoto–Blahut algorithm is employed to find stationary points of the non-convex, entropy-regularized MDP objective (Tanaka et al., 2017).
- Gradient flows in measure spaces: The optimization may be recast as a gradient flow of probability measures in the Wasserstein metric space, with entropy regularization ensuring geometric convergence under convexity (Šiška et al., 2020).
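As referenced in the LQ item above, here is a minimal numerical sketch with illustrative matrices and the $\tfrac{1}{2}(x^{\top}Qx + u^{\top}Ru)$ cost convention (not taken from the cited papers): the classical Riccati recursion supplies the value matrix, the policy mean is the usual LQR feedback, and the entropy weight sets the policy covariance.

```python
import numpy as np

# Illustrative assumptions: 2-state / 1-input discrete-time LQ problem with
# stage cost 0.5*(x'Qx + u'Ru), horizon N, and entropy weight lam.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])
lam, N = 0.5, 50

# Backward Riccati recursion; the entropy term only adds state-independent
# terms to the value function, so P follows the classical recursion.
P = Q.copy()
for _ in range(N):
    M = R + B.T @ P @ B
    K = np.linalg.solve(M, B.T @ P @ A)       # LQR feedback gain
    P = Q + A.T @ P @ A - A.T @ P @ B @ K

Sigma = lam * np.linalg.inv(R + B.T @ P @ B)  # policy covariance (exploration level)

# Entropy-regularized optimal policy: Gaussian with LQR mean and covariance Sigma.
rng = np.random.default_rng(0)
def sample_action(x):
    return rng.multivariate_normal(-K @ x, Sigma)

x0 = np.array([1.0, 0.0])
print("gain K =", K)
print("policy covariance =", Sigma)
print("sample u(x0) =", sample_action(x0))
```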
For trajectory optimization and path integral methods, entropy regularization lifts the problem from policy-space to trajectory distribution-space, allowing direct path-consistency learning, sample-based estimation (e.g., Monte Carlo, cross-entropy), or derivative-free evolutionary approaches (Lefebvre et al., 2020, Lefebvre et al., 2021).
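A minimal sample-based sketch of this trajectory-distribution view, under assumed scalar dynamics and costs: open-loop control sequences are drawn from a Gaussian proposal, trajectory costs are reweighted by $e^{-C/\lambda}$ as in Section 1, and the proposal mean is refit to the weighted samples in a cross-entropy-style iteration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative assumptions: scalar double-integrator-like dynamics, quadratic
# trajectory cost, horizon H, temperature lam, Gaussian proposal over controls.
H, n_samples, n_iters, lam, sigma, dt = 30, 500, 20, 1.0, 0.5, 0.1

def rollout_cost(u_seq):
    """Trajectory cost for one open-loop control sequence (control effort + terminal)."""
    x, v, cost = 0.0, 0.0, 0.0
    for u in u_seq:
        v += u * dt
        x += v * dt
        cost += 0.05 * u ** 2 * dt
    return cost + 10.0 * (x - 1.0) ** 2       # drive the position toward 1.0

mean = np.zeros(H)                            # proposal mean over the control sequence
for _ in range(n_iters):
    U = mean + sigma * rng.normal(size=(n_samples, H))   # sampled control sequences
    costs = np.array([rollout_cost(u) for u in U])
    w = np.exp(-(costs - costs.min()) / lam)  # Gibbs / path-integral weights
    w /= w.sum()
    mean = w @ U                              # refit proposal mean to weighted samples

print("final cost of the fitted mean sequence ~", rollout_cost(mean))
```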
4. Applications and Explicit Examples
Entropy-regularized optimal control is applicable in numerous contexts:
- Obstacle avoidance and rare-event steering: Hard constraints are imposed by infinite costs, handled by the explicit Gibbs-type change of measure (Bierkens et al., 2012).
- Mean-variance portfolio optimization with exploration: A continuous-time mean-variance problem with Lévy jumps is regularized by differential entropy of the randomized control, leading to an optimal (Gaussian) distributional control and explicit SDE dynamics, directly linked to the trade-off between learning and performance (Bender et al., 2023).
- Sample-efficient policy learning: Algorithms such as Regularized Policy Gradient (RPG) or Sample-Based RPG deploy entropy regularization to ensure global convergence of policy-gradient optimization, even in non-convex LQ control with multiplicative noise and unknown parameters (Diaz et al., 3 Oct 2025).
- Optimal transport and large-scale control: Large problems are decomposed into entropic optimal transport problems over overlapping subdomains, solved independently (e.g., via Sinkhorn), with entropy regularization guaranteeing linear convergence in KL divergence and enabling parallel/distributed algorithms (Bonafini et al., 2020); a minimal Sinkhorn sketch follows this list.
- Entropy-regularized stopping and mean field games: Classical bang-bang stopping is smoothed by cumulative residual entropy or entropy of random stopping probabilities, reformulating the problem as a singular control with “finite fuel,” suitable for policy iteration or fictitious play learning in mean-field games (Dianetti et al., 18 Aug 2024, Dianetti et al., 23 Sep 2025).
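To make the entropic-optimal-transport building block concrete, here is a minimal Sinkhorn sketch on two small assumed discrete marginals; in the domain-decomposition scheme of (Bonafini et al., 2020), such a solver would be invoked independently on each subdomain.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative assumptions: two discrete marginals mu, nu on n and m points,
# squared-distance cost, entropic regularization eps.
n, m, eps = 50, 60, 0.05
x, y = rng.random(n), rng.random(m)
mu = np.full(n, 1.0 / n)
nu = np.full(m, 1.0 / m)
C = (x[:, None] - y[None, :]) ** 2            # cost matrix
K = np.exp(-C / eps)                          # Gibbs kernel

# Sinkhorn iterations: alternate scalings to match the two marginals.
u = np.ones(n)
for _ in range(500):
    v = nu / (K.T @ u)
    u = mu / (K @ v)

P = u[:, None] * K * v[None, :]               # entropic transport plan
print("marginal errors:", np.abs(P.sum(1) - mu).max(), np.abs(P.sum(0) - nu).max())
print("regularized transport cost:", (P * C).sum())
```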
5. Regularization Parameters, Limits, and Trade-Off Analysis
The entropy regularization parameter fundamentally controls the trade-off between exploitation (cost minimization) and exploration (policy stochasticity):
- Temperature parameter role: As the regularization strength (the temperature-like weight on the entropy term) increases, the policy becomes more randomized (higher entropy, wider support); as it vanishes, the solution converges (often uniformly) to the original, unregularized optimal control, typically deterministic or bang-bang (Bierkens et al., 2012, Dianetti et al., 18 Aug 2024, Ito et al., 2023); see the numerical sketch after this list.
- Exploration-exploitation trade-off: Optimal scheduling or decay of the regularization parameter during learning can guarantee asymptotic optimality with optimal regret rates, e.g., in LQ RL (Szpruch et al., 2022).
- Sparsity vs. robustness: Tsallis entropy allows tuning between strictly sparse and fully exploratory laws depending on the deformation parameter (Hashizume et al., 4 Mar 2024).
- Information-constrained planning: Transfer-entropy or KL penalties formalize limitations of information transmission and are grounded in communication theory and thermodynamics, quantifying the “price of information” (Tanaka et al., 2017).
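The temperature limits above can be checked numerically with a simple softmax policy over assumed action values: as the temperature grows the policy approaches the uniform (maximum-entropy) law, and as it vanishes the policy concentrates on the unregularized argmax.

```python
import numpy as np

def softmax(q, lam):
    """Shannon-entropy-regularized policy for one state, temperature lam."""
    z = (q - q.max()) / lam
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 0.8, 0.1, -0.5])          # illustrative action values
for lam in (10.0, 1.0, 0.1, 0.01):
    p = softmax(q, lam)
    entropy = -(p * np.log(p)).sum()
    print(f"lam={lam:5}: policy={np.round(p, 3)}, entropy={entropy:.3f}")
# Large lam -> near-uniform (max entropy); small lam -> nearly deterministic argmax.
```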
6. Computational and Algorithmic Implications
Entropy regularization supports computational tractability and scalable algorithm design:
- Explicit feedback and sample-based methods: Many problems admit closed-form feedback, facilitating efficient simulation and Monte Carlo approximations when integrals cannot be computed analytically (Bierkens et al., 2012, Lefebvre et al., 2020).
- Grid-free and multiscale solvers: For “soft” HJB equations, generalized Hopf–Lax formulas enable grid-free dynamic programming, circumventing the curse of dimensionality (Kim et al., 2020).
- Adaptive sparsity and parallelization: Domain decomposition algorithms with entropic regularization show linear convergence, and fast solvers via adaptive sparsity and multi-scale refinement are directly transferable to entropy-regularized control (Bonafini et al., 2020).
- Convergence of RL algorithms: The presence of entropy regularization guarantees global convergence of policy iteration and policy gradient methods under appropriate regularity or convexity, even for measure-valued controls or in backward stochastic settings (Huang et al., 2022, Chen et al., 20 Nov 2024, Diaz et al., 3 Oct 2025).
- Policy iteration and fictitious play: In mean-field or multi-agent settings, entropy-regularized reformulations ensure uniqueness and stability of equilibria, underlying robust fictitious play and model-free learning schemes (Dianetti et al., 23 Sep 2025).
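As an illustration of such convergence guarantees, the sketch below runs entropy-regularized (“soft”) value iteration on a small randomly generated MDP (purely illustrative); the log-sum-exp Bellman backup is a $\gamma$-contraction, so the iterates converge geometrically to the soft-optimal value function.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative random MDP: n_s states, n_a actions, discount gamma, temperature lam.
n_s, n_a, gamma, lam = 20, 4, 0.95, 0.5
P = rng.random((n_s, n_a, n_s))
P /= P.sum(axis=2, keepdims=True)                 # row-stochastic transition kernel
r = rng.random((n_s, n_a))                        # rewards

def soft_backup(Q, lam):
    """Stable lam * log sum_a exp(Q(s,a)/lam): the entropy-regularized Bellman backup."""
    Qmax = Q.max(axis=1, keepdims=True)
    return (Qmax + lam * np.log(np.exp((Q - Qmax) / lam).sum(axis=1, keepdims=True))).squeeze(axis=1)

# Soft value iteration: a gamma-contraction, so the error decays geometrically.
V = np.zeros(n_s)
for it in range(1000):
    Q = r + gamma * (P @ V)                       # state-action values, shape (n_s, n_a)
    V_new = soft_backup(Q, lam)
    err = np.abs(V_new - V).max()
    V = V_new
    if err < 1e-10:
        break

Q = r + gamma * (P @ V)
pi = np.exp((Q - V[:, None]) / lam)               # soft-optimal (Boltzmann) policy
print(f"converged in {it + 1} sweeps; policy rows sum to {pi.sum(axis=1)[:3]}")
```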
7. Broader Impact, Generalizations, and Connections
Entropy-regularized optimal control is central to contemporary reinforcement learning (e.g., Soft Actor–Critic), robust control, stochastic games, path integral methods, optimal transport, and mean-field game theory. The formulation justifies randomization/exploration in learning agents, smooths non-convex objectives to facilitate optimization, and provides interpretable Bayesian connections (the invariant measure is a Gibbs posterior). The framework unifies variational, dynamic programming, and probabilistic inference approaches, and it suggests principled regularizations beyond Shannon entropy (e.g., Tsallis or transfer entropy) that tailor policies to safety, sparsity, or information constraints. Its theoretical tractability and practical benefits, such as explicit feedback laws, scalable algorithms, and convergence guarantees, make entropy-regularized optimal control foundational for modern stochastic control, RL, and large-scale decision-making problems.