Max Entropy Policy Optimisation
- Maximum entropy policy optimisation is a reinforcement learning framework that augments reward maximisation with an entropy term to promote robust exploration and prevent overfitting.
- It employs soft policy evaluation and policy improvement techniques that ensure convergence, accommodating high-dimensional and non-convex control environments.
- Practically implemented via algorithms like Soft Actor-Critic, it leverages temperature tuning and expressive policy classes to enhance sample efficiency and overall performance.
Maximum entropy policy optimisation refers to a family of reinforcement learning (RL), control, and planning algorithms that seek to optimise policies by explicitly maximising entropy, either as a primary objective or as a regularisation term in the optimisation problem. The core principle is to encourage the agent to act as randomly as possible while still solving its task, formalised by maximising the expected task reward jointly with the entropy of the policy. Maximum entropy objectives are associated with improved exploration, more robust solutions under function approximation or noise, and, in many cases, provable convergence or sample efficiency guarantees.
1. Formal Maximum Entropy Policy Optimisation Objective
In classical RL, the goal is to maximise the expected discounted sum of rewards:
$$
J(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t} \gamma^{t}\, r(s_t, a_t) \right].
$$
The maximum entropy reinforcement learning (MaxEnt RL) formulation augments this with an entropy regularisation term, yielding the objective:
$$
J_{\text{MaxEnt}}(\pi) \;=\; \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t} \gamma^{t}\, \Big( r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right].
$$
Here, $\mathcal{H}\big(\pi(\cdot \mid s_t)\big) = -\mathbb{E}_{a \sim \pi(\cdot \mid s_t)}\!\left[ \log \pi(a \mid s_t) \right]$ is the Shannon entropy of the policy at state $s_t$, and $\alpha > 0$ is a temperature parameter scaling the trade-off between reward and entropy. As $\alpha \to 0$ the standard RL objective is recovered; larger $\alpha$ generates more stochastic, higher-entropy policies (Haarnoja et al., 2018).
This framework generalises naturally: alternative entropy regularisers can be employed, such as Tsallis entropy for heavy-tailed $q$-Gaussian policies (Aoyama et al., 2024), or Rényi entropy, as well as entropy measures of the state or trajectory distribution (Islam et al., 2019, Hazan et al., 2018).
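A minimal numerical sketch of the entropy-augmented objective for a single sampled trajectory, assuming a discrete-action policy represented by per-state probability vectors (an illustrative helper, not drawn from any cited implementation):

```python
import numpy as np

def maxent_return(rewards, action_probs, alpha=0.1, gamma=0.99):
    # rewards: list of r(s_t, a_t); action_probs: list of pi(.|s_t) probability vectors.
    total = 0.0
    for t, (r, p) in enumerate(zip(rewards, action_probs)):
        entropy = -np.sum(p * np.log(p + 1e-12))        # Shannon entropy H(pi(.|s_t))
        total += (gamma ** t) * (r + alpha * entropy)   # reward plus entropy bonus
    return total
```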
2. Algorithmic Foundations and Policy Iteration
Soft Policy Iteration and Contraction
Maximum entropy policy optimisation is implemented via soft policy iteration, alternating between two key steps:
- Soft Policy Evaluation: For a fixed policy $\pi$, compute the soft-Q function by iterating the soft Bellman operator
$$
\mathcal{T}^{\pi} Q(s_t, a_t) \;=\; r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\!\big[ V(s_{t+1}) \big],
\qquad
V(s_t) \;=\; \mathbb{E}_{a_t \sim \pi}\!\big[ Q(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \big].
$$
Iterating this soft Bellman operator converges to the soft Q-function $Q^{\pi}$ of the policy (Haarnoja et al., 2018).
- Soft Policy Improvement: Given $Q^{\pi_{\text{old}}}$, update the policy to minimise the KL divergence from the corresponding Boltzmann distribution:
$$
\pi_{\text{new}} \;=\; \arg\min_{\pi' \in \Pi}\; \mathrm{D}_{\mathrm{KL}}\!\left( \pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\!\big(\tfrac{1}{\alpha} Q^{\pi_{\text{old}}}(s_t, \cdot)\big)}{Z^{\pi_{\text{old}}}(s_t)} \right).
$$
This is equivalent to maximising the expected soft-Q value plus the $\alpha$-weighted policy entropy.
The soft policy evaluation and soft policy improvement steps are each contractions in suitable metrics, establishing monotone improvement and convergence to the optimal maximum entropy policy within the function class (Haarnoja et al., 2018).
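A tabular sketch of these two alternating steps, assuming a small MDP with known transition tensor `P[s, a, s']` and reward matrix `R[s, a]` (shapes and function names are illustrative assumptions):

```python
import numpy as np

def softmax(logits):
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def soft_policy_iteration(P, R, alpha=0.1, gamma=0.99, iters=200):
    # P: (S, A, S) transition probabilities, R: (S, A) rewards.
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        # Soft policy improvement: Boltzmann distribution over the current soft-Q.
        pi = softmax(Q / alpha)
        # Soft state value: V(s) = E_pi[Q(s, a) - alpha * log pi(a|s)].
        V = (pi * (Q - alpha * np.log(pi + 1e-12))).sum(axis=1)
        # Soft policy evaluation (one Bellman backup): Q = R + gamma * E_{s'}[V(s')].
        Q = R + gamma * P @ V
    return Q, softmax(Q / alpha)
```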
Practical Algorithms
Gradient-based instantiations replace exact policy iteration with stochastic optimisation on parameterised networks:
- Soft Actor-Critic (SAC): Off-policy actor-critic with double-Q critics, a policy network (Gaussian with tanh for actions), and optional value function; actor is regularised by entropy and updated to fit the Boltzmann form (Haarnoja et al., 2018).
- Deep Soft Policy Gradient (DSPG): Policy gradient with soft Bellman backups, double-sampling to avoid value bias, and explicit entropy regularisation (Shi et al., 2019).
- Soft A2C/A3C, SPPO, STRPO, SIMPALA, etc.: On-policy adaptations using the soft policy gradient theorem, advantage estimation, and entropy rewards (Liu et al., 2019, Choe et al., 2024).
Maximum-entropy actor-critic methods unify or interpolate between value-based RL (soft Q-learning), classical actor-critic, and path-entropy control (Haarnoja et al., 2018, Srivastava et al., 2020).
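As one concrete instance, a hedged sketch of SAC-style entropy-regularised critic and actor losses in PyTorch; the modules `actor`, `q1`, `q2`, `q1_targ`, `q2_targ` and the batch layout are assumptions standing in for a full implementation:

```python
import torch

def sac_losses(batch, actor, q1, q2, q1_targ, q2_targ, alpha=0.2, gamma=0.99):
    s, a, r, s2, done = batch
    # Critic target: soft Bellman backup with clipped double-Q and entropy bonus.
    with torch.no_grad():
        a2, logp2 = actor.sample(s2)                      # reparameterised sample + log-prob
        q_targ = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        y = r + gamma * (1 - done) * (q_targ - alpha * logp2)
    critic_loss = ((q1(s, a) - y) ** 2).mean() + ((q2(s, a) - y) ** 2).mean()
    # Actor loss: maximise E[min Q - alpha * log pi], i.e. minimise its negation.
    a_pi, logp_pi = actor.sample(s)
    actor_loss = (alpha * logp_pi - torch.min(q1(s, a_pi), q2(s, a_pi))).mean()
    return critic_loss, actor_loss
```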
3. Policy Classes and Expressiveness
The expressiveness of the policy class used in maximum entropy RL influences the solution quality and exploration capacity:
- Gaussian Policies: Used in classical SAC and many on-policy methods; unimodal and limited in representing multimodal actions (Haarnoja et al., 2018).
- Normalizing Flows: Provide tractable densities, exact entropy, and multimodal action distributions; unify actor and critic via energy-based models (Chao et al., 2024).
- Diffusion Policies: Employ score-based generative models to capture highly complex, multimodal distributions, improving exploration in multimodal tasks (Dong et al., 17 Feb 2025).
- Polynomial Energy-Based Models: Allow analytic entropy and gradient computation with expressive approximation capacity for arbitrary densities (moment problem) (Liu et al., 19 Feb 2026).
- q-Gaussian via Tsallis Entropy: Direct variational moments yield heavy-tailed policies with adaptive variance tuned to the cost-to-go; particularly beneficial in escaping local minima (Aoyama et al., 2024).
The trend is toward increasingly expressive parameterisations that preserve efficient sampling and tractable entropy computation, thus allowing maximum entropy objectives to realise close-to-optimal exploration and robust policies in high-dimensional, non-convex domains.
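For example, a minimal sketch of the tanh-squashed Gaussian head commonly used with SAC-style methods, including the change-of-variables log-probability correction (PyTorch; `mean` and `log_std` are assumed outputs of a policy network):

```python
import torch

def sample_squashed_gaussian(mean, log_std):
    std = log_std.exp()
    normal = torch.distributions.Normal(mean, std)
    u = normal.rsample()                      # reparameterised pre-squash sample
    a = torch.tanh(u)                         # bounded action in (-1, 1)
    # Change of variables: log pi(a|s) = log N(u) - sum_i log(1 - tanh(u_i)^2).
    log_prob = normal.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)
    return a, log_prob.sum(dim=-1)
```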
4. Implementation, Temperature Tuning, and Stabilisation
Maximum entropy methods require careful attention to architecture and hyperparameters:
| Design aspect | Standard choices and findings | Source |
|---|---|---|
| Q/Value Net Architecture | 2–3 layers, 256–512 units, ReLU | (Haarnoja et al., 2018) |
| Policy Net (actor) | 2–3 layers, Gaussian or flow head, tanh | (Haarnoja et al., 2018, Chao et al., 2024) |
| Optimiser | Adam, lr $3 \times 10^{-4}$ (typ.) | (Haarnoja et al., 2018) |
| Target entropy | $-\dim(\mathcal{A})$, tuned per environment | (Haarnoja et al., 2018) |
| Temperature $\alpha$ | Fixed or learned to match a target entropy | (Haarnoja et al., 2018) |
| Batch size, Replay buffer | 256, size $10^{6}$ | (Haarnoja et al., 2018) |
Automatic tuning of the entropy coefficient $\alpha$ is achieved by introducing a dual objective and updating $\alpha$ to match the average policy entropy to a target value, further increasing robustness and removing global reward-scale sensitivity (Haarnoja et al., 2018, Liu et al., 2019).
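A hedged sketch of this dual temperature update in PyTorch; the $-\dim(\mathcal{A})$ target and the Adam learning rate follow common practice and are assumptions rather than prescriptions:

```python
import torch

def make_alpha_tuner(action_dim, lr=3e-4):
    # Optimise log(alpha) so that alpha stays positive; target entropy -dim(A)
    # is the common heuristic (tuned per environment in practice).
    log_alpha = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_alpha], lr=lr)
    target_entropy = -float(action_dim)

    def update(logp_pi):
        # alpha rises when policy entropy drops below the target, and falls otherwise.
        loss = -(log_alpha * (logp_pi.detach() + target_entropy)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        return log_alpha.exp().item()

    return update
```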
Stabilisation strategies include: double-Q critics to avoid overestimation, clipped double-Q losses (Chao et al., 2024), gradient clipping (Shi et al., 2019), and Polyak averaging of target networks.
5. Generalisations: State and Trajectory Entropy; Exploration
While classical MaxEnt RL maximises policy (action) entropy, several variants optimise entropic quantities of the induced state distribution:
- Marginalised State Distribution Entropy: Adds a regulariser proportional to the entropy $\mathcal{H}\big(d^{\pi}(s)\big)$, where $d^{\pi}$ is the discounted state-visitation distribution, to encourage uniform state coverage; this empirically improves exploration in sparse-reward and partially observed tasks (Islam et al., 2019).
- Path/Trajectory Entropy: Optimises entropy over entire paths (not just one-step policy); this recovers exploration strategies with explicit guarantees and can be solved by convex programs or Frank-Wolfe meta-algorithms (Hazan et al., 2018, Savas et al., 2018).
- Rollout-Free Steady-State Entropy Methods: Spectral algorithms (EVE) compute the maximum steady-state entropy policy via eigenvector equations, eliminating the need for iterative rollouts or explicit state distribution estimation (Adamczyk et al., 12 Mar 2026).
- Non-Parametric Exploration: Policy-gradient algorithms maximising non-parametric (e.g., k-NN) state entropy estimates scale to high-dimensional spaces and do not require probabilistic state density models (Mutti et al., 2020).
These methods highlight the flexible integration of maximum entropy objectives in intrinsic-motivation (exploration) settings, pre-training, or as general-purpose exploration mechanisms.
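As an illustration of the non-parametric approach, a rough sketch of a k-NN (Kozachenko-Leonenko-style) state-entropy estimate of the kind used as an exploration signal; additive constants are omitted and the batch-based form is an assumption:

```python
import numpy as np

def knn_entropy_estimate(states, k=5):
    # states: (N, d) array of visited states; returns an entropy estimate
    # proportional to the average log distance to each point's k-th neighbour.
    n, d = states.shape
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                  # exclude self-distances
    kth = np.sort(dists, axis=1)[:, k - 1]           # distance to k-th nearest neighbour
    return d * np.mean(np.log(kth + 1e-12))          # up to additive constants
```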
6. Applications and Empirical Performance
Maximum entropy policy optimisation is empirically validated on benchmark continuous control (MuJoCo: Hopper, Walker2d, HalfCheetah, Ant, Humanoid; Omniverse Isaac Gym) (Haarnoja et al., 2018, Chao et al., 2024, Dong et al., 17 Feb 2025, Liu et al., 19 Feb 2026), combinatorial optimisation (vehicle routing, TSP, CVRP) (Sultana et al., 2020), discrete tabular environments (FrozenLake, Gridworlds) (Hazan et al., 2018, Islam et al., 2019, Adamczyk et al., 12 Mar 2026), and real-world resource allocation (Srivastava et al., 2020). Key findings:
- SAC outperforms DDPG, TD3, PPO, and previous MaxEnt and Trust-PCL baselines in final score, sample efficiency, and stability—even on high-dimensional humanoid tasks (Haarnoja et al., 2018).
- Normalizing flow and polynomial EBMs deliver strong performance in multimodal environments and high-dimensional robotics, exceeding standard Gaussian policies (Chao et al., 2024, Liu et al., 19 Feb 2026).
- Tsallis entropy (ME-DDP) with q-Gaussian policies improves exploration and finds lower-cost solutions in trajectory optimisation with obstacles (Aoyama et al., 2024).
- Entropy regularisation on state marginals consistently improves coverage and learning rate in sparse and partially observed environments relative to pure action-entropy regularisation (Islam et al., 2019).
- On-policy maximum entropy extensions to PPO/TRPO (with explicit advantage estimation for entropy) improve generalisation and sample efficiency, lowering variance and improving robustness (Choe et al., 2024, Liu et al., 2019).
7. Theoretical Guarantees and Convergence Properties
Maximum entropy policy iteration (soft policy evaluation and improvement) admits the following properties:
- Contraction: The soft Bellman operator is a contraction in the value function space, ensuring convergence of fixed-point iterates (Haarnoja et al., 2018, Srivastava et al., 2020); a concrete statement is given after this list.
- Monotonic Improvement: Alternating policy evaluation and improvement steps increase the maximum entropy objective at each iteration (Haarnoja et al., 2018).
- Optimality in Policy Class: For exact evaluation and improvement steps (tabular, finite policy class), convergence to the optimal maximum entropy policy is guaranteed (Haarnoja et al., 2018).
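For concreteness, the contraction claim can be written out (a standard argument, with $\mathcal{T}^{\pi}$ the soft Bellman operator from Section 2; because the entropy term enters $V$ independently of $Q$, the usual $\gamma$-contraction bound carries over):
$$
\big\lVert \mathcal{T}^{\pi} Q_{1} - \mathcal{T}^{\pi} Q_{2} \big\rVert_{\infty} \;\le\; \gamma\, \big\lVert Q_{1} - Q_{2} \big\rVert_{\infty},
$$
so repeated application of $\mathcal{T}^{\pi}$ converges to the unique soft fixed point $Q^{\pi}$.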
For convex program and Frank-Wolfe approaches targeting state entropy or trajectory entropy, convergence rates and sample complexities are available, indicating efficiency comparable to classical RL with known polynomial bounds (Hazan et al., 2018).
Posterior-policy iteration and spectral (eigenvector) methods for the unregularised objective likewise admit proofs of monotonic improvement and convergence under standard irreducibility assumptions (Adamczyk et al., 12 Mar 2026).
Maximum entropy policy optimisation forms the theoretical and algorithmic underpinning of many modern deep RL methods, providing tools for improved stability, robustness, and exploration via principled inclusion of entropy terms in the policy objective. With the availability of increasingly expressive policy parameterisations and diverse entropic objectives, the framework is widely applicable across continuous, discrete, and combinatorial domains, and continues to generate state-of-the-art results (Haarnoja et al., 2018, Chao et al., 2024, Dong et al., 17 Feb 2025, Aoyama et al., 2024, Choe et al., 2024).