MaxEnt-Guided Policy Optimization (MGPO)
- MGPO is a reinforcement learning and LLM finetuning method that integrates maximum-entropy regularization to guide exploration and improve policy learning.
- It extends on-policy actor-critic and reward-weighted RL frameworks by incorporating an entropy advantage term (classical RL) and an entropy-guided example weighting that emphasizes problems near the uncertainty frontier (LLM finetuning).
- Empirical results demonstrate that MGPO achieves faster learning and higher returns with reduced variance and computational cost in both continuous control and LLM settings.
MaxEnt-Guided Policy Optimization (MGPO) is an approach to reinforcement learning (RL) and LLM finetuning that augments standard policy optimization with information-theoretic entropy guidance. It formalizes how maximum-entropy regularization can be integrated with on-policy actor-critic algorithms and reward-weighted RL finetuning, producing substantial empirical gains in exploration, stability, and sample efficiency. There are two canonical settings for MGPO: (1) classical RL for continuous control and generalization tasks, where the entropy bonus is cast as an advantage term; and (2) group-wise RL finetuning for LLMs, where example weights are dynamically adjusted based on their proximity to a maximal-uncertainty "learning frontier."
1. Theoretical Foundations and Motivation
The maximum-entropy RL objective augments the expected return with a scaled entropy bonus, promoting diverse action selection and robust exploration:

$$J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t}\big(r(s_t, a_t) + \alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\big)\Big],$$

where $\mathcal{H}(\pi(\cdot\mid s_t)) = \mathbb{E}_{a\sim\pi}[-\ln \pi(a\mid s_t)]$ is the policy entropy and $\alpha \ge 0$ dictates entropy's influence (Choe et al., 25 Jul 2024).
In RL for reasoning tasks (e.g., mathematics, scientific code), naïve on-policy sampling often allocates computational effort to trivial or intractable problems: those that the current policy nearly always solves ($p_c \approx 1$) or never solves ($p_c \approx 0$). Information theory prescribes maximal Shannon entropy at $p_c = 0.5$ for binary correctness distributions, corresponding to maximal learning potential. MGPO leverages this by measuring the KL divergence of each question's empirical correctness from the ideal maximum-entropy distribution and reweighting examples accordingly, focusing updates near the epistemic boundary (Xu et al., 9 Nov 2025).
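As a quick illustration (not taken from the cited papers), the following Python sketch evaluates the Shannon entropy of a Bernoulli correctness distribution at several solve rates, confirming that information content peaks at $p_c = 0.5$ and vanishes at the extremes:

```python
import math

def bernoulli_entropy(p: float) -> float:
    """Shannon entropy (in nats) of a binary correctness distribution."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # degenerate: the outcome is certain, nothing to learn
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"p_c = {p:4.2f}  ->  H = {bernoulli_entropy(p):.3f} nats")
# Entropy is maximal (ln 2 ≈ 0.693) at p_c = 0.5 and collapses toward 0
# for problems the policy always or never solves.
```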
2. Formulation and Formal Objectives
Classical RL (On-Policy Actor-Critic)
Entropy regularization is decoupled from the reward by introducing an entropy Q-function $Q^{H}(s_t, a_t)$ alongside the standard reward critic, with the entropy advantage centered as

$$A^{H}(s_t, a_t) = Q^{H}(s_t, a_t) - V^{H}(s_t), \qquad V^{H}(s_t) = \mathbb{E}_{a\sim\pi}\big[Q^{H}(s_t, a)\big],$$

so that the state value of the centered entropy term is identically zero, i.e. $\mathbb{E}_{a\sim\pi}[A^{H}(s_t, a)] = 0$. In the simplest one-step instantiation, $Q^{H}(s_t, a_t) = -\ln \pi(a_t\mid s_t)$ and $V^{H}(s_t) = \mathcal{H}(\pi(\cdot\mid s_t))$.
The policy gradient then becomes

$$\nabla_\theta J(\theta) = \mathbb{E}_t\Big[\nabla_\theta \ln \pi_\theta(a_t\mid s_t)\,\big(A^{R}_t + \alpha A^{H}_t\big)\Big],$$

where $A^{R}_t$ is the standard reward advantage (typically estimated via GAE).
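A minimal numerical sketch of the entropy-advantage term for a toy discrete policy (illustrative only; the array names and single-state setup are assumptions, not from Choe et al.):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy categorical policy over 4 actions at a single state.
probs = np.array([0.70, 0.15, 0.10, 0.05])
entropy = -(probs * np.log(probs)).sum()          # H̄ = E_a[-ln π(a|s)]

# Sampled actions from a rollout at this state.
actions = rng.choice(len(probs), size=6, p=probs)
surprisal = -np.log(probs[actions])               # H_t = -ln π(a_t|s_t)

# Centered entropy advantage: positive for under-sampled (rare) actions,
# negative for the dominant action, zero in expectation under π.
A_H = surprisal - entropy

# Combined advantage with a reward advantage A_R (here random, for shape only).
alpha = 0.01
A_R = rng.normal(size=actions.shape)
A_total = A_R + alpha * A_H
print(A_H.round(3), A_total.round(3))
```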
LLM Finetuning (Group-Wise Reward Policy Optimization)
Formally, let $q$ denote a question, with $G$ sampled rollouts $\{y_i\}_{i=1}^{G}$ and binary rewards $r_i \in \{0, 1\}$. Define the empirical correctness $p_c = \tfrac{1}{G}\sum_{i=1}^{G} r_i$. The maximum-entropy reference is $p^{*} = 0.5$, and the deviation is quantified via the KL divergence

$$D_{\mathrm{ME}}\big(p_c \,\|\, 0.5\big) = p_c \ln\frac{p_c}{0.5} + (1 - p_c)\ln\frac{1 - p_c}{0.5}.$$

The entropy-modulated weight is $w_{\mathrm{ME}} = \exp\big(-\lambda\, D_{\mathrm{ME}}\big)$, where $\lambda$ is an entropy-weighting hyperparameter.
For the token-level group-relative advantage (GRPO), set

$$\hat{A}_{i,t} = w_{\mathrm{ME}} \cdot \frac{r_i - \mu_G}{\sigma_G},$$

where $\mu_G$ and $\sigma_G$ are the mean and standard deviation of rewards within the group, and optimize the clipped surrogate

$$\mathcal{L}(\theta) = \mathbb{E}_{i,t}\Big[\min\big(\rho_{i,t}\,\hat{A}_{i,t},\; \mathrm{clip}(\rho_{i,t},\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_{i,t}\big)\Big],$$

with the usual policy ratio $\rho_{i,t} = \pi_\theta(y_{i,t}\mid q, y_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(y_{i,t}\mid q, y_{i,<t})$.
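The weighting above reduces to a few lines of code. The following sketch computes $w_{\mathrm{ME}}$ and the entropy-modulated group-relative advantages for one question's rollout group (the function name and the small `eps` guard are implementation assumptions):

```python
import numpy as np

def mgpo_group_advantages(rewards, lam=1.0, eps=1e-6):
    """Per-rollout MGPO advantages for one question.

    rewards: binary array of shape (G,), one entry per sampled rollout.
    lam:     entropy-weighting hyperparameter λ.
    """
    r = np.asarray(rewards, dtype=np.float64)
    p_c = r.mean()

    # KL divergence from the maximum-entropy reference p* = 0.5,
    # clipped away from 0/1 to keep the logarithms finite.
    p = np.clip(p_c, eps, 1.0 - eps)
    d_me = p * np.log(p / 0.5) + (1.0 - p) * np.log((1.0 - p) / 0.5)
    w_me = np.exp(-lam * d_me)

    # Group-relative (GRPO-style) advantages, then entropy-modulated.
    mu, sigma = r.mean(), r.std() + eps
    return w_me * (r - mu) / sigma, w_me

adv, w = mgpo_group_advantages([1, 0, 1, 1, 0, 0, 1, 0])   # p_c = 0.5 -> w = 1
adv_easy, w_easy = mgpo_group_advantages([1] * 7 + [0])    # p_c ≈ 0.9 -> w < 1
print(w, w_easy)
```

With $\lambda > 0$, groups at $p_c = 0.5$ keep full weight, while near-saturated groups are exponentially attenuated.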
3. Algorithmic Workflow and Pseudocode
RL (PPO-Style)
```
Input: policy params θ, value-critic params φ, entropy coeff α, GAE λ, clip ε, epochs K, batch size B
Initialize θ, φ
for each iteration:
    Collect rollout of T timesteps {s_t, a_t, r_t, π_old(a_t|s_t)}
    Compute V^R_φ(s_t); compute A^R_t via GAE
    For each (s_t, a_t):
        H_t   = -ln π_old(a_t|s_t)
        H̄_t   = E_{a'~π_old}[-ln π_old(a'|s_t)]
        A^H_t = H_t - H̄_t        # surprisal relative to the state entropy
    Form combined advantage: A^T_t = A^R_t + α·A^H_t
    for epoch in 1..K:
        Sample minibatch of size B
        Compute ratio ρ_t = π_θ(a_t|s_t) / π_old(a_t|s_t)
        Actor loss  L^PPO = E_t[ min(ρ_t·A^T_t, clip(ρ_t, 1−ε, 1+ε)·A^T_t) ]
        Critic loss L^V   = E_t[ (V^R_φ(s_t) − R_t)^2 ]
        θ ← θ − η·∇_θ(−L^PPO);  φ ← φ − η·∇_φ L^V
```
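A hedged PyTorch translation of the core update for a discrete-action policy (tensor names and the single-batch layout are assumptions, not specified by Choe et al.):

```python
import torch

def ppo_maxent_loss(logits_new, logits_old, actions, adv_reward,
                    alpha=0.01, clip_eps=0.2):
    """Clipped PPO surrogate on the combined advantage A^T = A^R + α·A^H."""
    logp_new = torch.log_softmax(logits_new, dim=-1)
    logp_old = torch.log_softmax(logits_old, dim=-1).detach()

    # Per-sample log-probs of the taken actions under old/new policies.
    lp_new = logp_new.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    lp_old = logp_old.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # Entropy advantage: surprisal of the taken action minus the state entropy,
    # computed under the behavior policy (zero mean under π_old).
    entropy_old = -(logp_old.exp() * logp_old).sum(-1)
    adv_entropy = (-lp_old) - entropy_old

    adv_total = adv_reward + alpha * adv_entropy

    ratio = torch.exp(lp_new - lp_old)
    unclipped = ratio * adv_total
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv_total
    return -torch.min(unclipped, clipped).mean()  # minimize the negated surrogate

# Example shapes: batch of 32 states, 6 discrete actions.
B, A = 32, 6
loss = ppo_maxent_loss(torch.randn(B, A, requires_grad=True), torch.randn(B, A),
                       torch.randint(0, A, (B,)), torch.randn(B))
loss.backward()
```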
LLM RL-Finetuning (MGPO) (Xu et al., 9 Nov 2025)
```
Input: SFT model θ, rollout size G, clip ε, entropy weight λ
repeat for N RL updates:
    θ_old ← θ
    collect minibatch of M questions {q_j}
    for q_j in minibatch:
        sample G answers {y_{j,i} ~ π_{θ_old}(·|q_j)}
        compute rewards r_{j,i} ∈ {0, 1}
        p_c  = (1/G)·Σ_i r_{j,i}
        D_ME = p_c·ln(p_c/0.5) + (1−p_c)·ln((1−p_c)/0.5)
        w_ME = exp(−λ·D_ME)
        μ_G, σ_G ← mean, std of {r_{j,i}}
        for each rollout i, each token t in y_{j,i}:
            GRPO advantage  A_{j,i,t}  = (r_{j,i} − μ_G)/σ_G
            weighted adv.   A'_{j,i,t} = w_ME·A_{j,i,t}
            ratio r_{j,i,t}(θ) = π_θ(y_{j,i,t}|…) / π_{θ_old}(…)
            L_{j,i,t} = min(r_{j,i,t}·A'_{j,i,t}, clip(r_{j,i,t}, 1−ε, 1+ε)·A'_{j,i,t})
    θ ← θ + α·∇_θ E[ Σ_{j,i,t} L_{j,i,t} ]
```
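For completeness, a sketch of how the per-rollout weighted advantage from the previous step is broadcast to token level and fed into the clipped surrogate (the mask handling and tensor layout are assumptions, not prescribed by Xu et al.):

```python
import torch

def mgpo_token_loss(logp_new, logp_old, adv_rollout, mask, clip_eps=0.2):
    """Token-level MGPO/GRPO surrogate for one question's rollout group.

    logp_new, logp_old: (G, T) per-token log-probs under π_θ and π_θ_old.
    adv_rollout:        (G,)  entropy-weighted group-relative advantages.
    mask:               (G, T) 1 for answer tokens, 0 for padding.
    """
    adv = adv_rollout.unsqueeze(-1)                       # broadcast to (G, T)
    ratio = torch.exp(logp_new - logp_old.detach())
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    return -(surrogate * mask).sum() / mask.sum()         # mean over answer tokens

# Example: G = 4 rollouts of T = 16 tokens each.
G, T = 4, 16
loss = mgpo_token_loss(torch.randn(G, T, requires_grad=True),
                       torch.randn(G, T),
                       torch.tensor([0.8, -0.8, 0.8, -0.8]),
                       torch.ones(G, T))
loss.backward()
```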
4. Practical Implementation Details
In continuous control (e.g., MuJoCo), the entropy coefficient is the most sensitive setting, with the best trade-off typically at 0.01; the remaining PPO hyperparameters (GAE parameter, discount factor, clip range, rollout length, minibatch size, number of epochs, Adam step size) follow standard practice for these benchmarks. TRPO-style variants use conjugate gradient (10 iterations) with a KL constraint (Choe et al., 25 Jul 2024).
For LLM RL-finetuning, the entropy weight $\lambda$ is tuned empirically, often yielding orders-of-magnitude improvement when moving from $\lambda = 0$ ("pure" GRPO) to moderate positive values. The method assumes upstream SFT initialization with broad solution diversity, as produced by "Two-Stage Diversity-Exploring Distillation" and "Expert Model Fusion". The RL phase (MGPO) then efficiently reallocates probability mass from low-certainty rollouts toward correct, high-information solution modes (Xu et al., 9 Nov 2025).
5. Empirical Results and Benchmarks
On MuJoCo tasks (Hopper, Walker2d, HalfCheetah, Ant), MGPO with PPO achieves up to 1.5× faster learning and 10–20% higher final returns compared to vanilla PPO, with reduced variance and enhanced stability across seeds. In Procgen (CoinRun, Jumper, Heist, Fruitbot, Maze), MGPO yields 5–15% absolute test-time gains and smaller generalization gaps, with final level completion rising from ∼60% (PPO) to ∼70% (MGPO) (Choe et al., 25 Jul 2024).
In the LLM regime, the VibeThinker-1.5B model (trained with MGPO after SFT) achieves:
| Benchmark | VibeThinker-1.5B (MGPO), % | DeepScaleR, % | Δ (pts) |
|---|---|---|---|
| AIME24 | 80.3 | 43.1 | +37.2 |
| AIME25 | 74.4 | 31.5 | +42.9 |
| MATH500 | 95.0 | 87.8 | +7.2 |
| HMMT25 | 50.4 | 19.0 | +31.4 |
| LiveCodeBench v5 | 55.9 | 16.3 | +39.6 |
| LiveCodeBench v6 | 51.1 | 12.8 | +38.3 |
The total RL cost for VibeThinker-1.5B is $7.8K, substantially lower than comparable models such as DeepSeek-R1 ($294K) or MiniMax-M1 ($535K). Ablation confirms that $\lambda = 0$ (pure GRPO) underperforms $\lambda > 0$, supporting the entropy-guided weighting (Xu et al., 9 Nov 2025).
6. Interpretation and Significance
MGPO concentrates gradient signal on questions whose empirical correctness lies near maximal uncertainty ($p_c \approx 0.5$), aligning policy updates with the "learning frontier." The KL-based weighting decays with deviation from this frontier, adapting the effective curriculum online. This suggests that MGPO instantiates an information-theoretic curriculum-learning mechanism, focusing capacity and RL compute on the most instructive, policy-improvable examples.
A plausible implication is that MGPO's efficiency stems from its ability to identify and amplify statistically rare, high-signal trajectories within a diverse solution reservoir—a process that is especially critical for smaller models extracted from broad, multi-expert SFT initializations.
7. Connections, Limitations, and Broader Impact
MGPO builds upon and generalizes classic maximum-entropy RL and policy optimization frameworks by operationalizing entropy not as a global bonus but as a targeted, advantage-shaped intervention or sample weight. It is compatible with standard on-policy algorithms (PPO, TRPO) and reward-weighted LLM RL settings (GRPO), requiring minimal additional computational overhead.
Empirical evidence demonstrates competitive or superior performance at a fraction of the training cost for both continuous control agents and moderately sized LLMs. Deployment hinges on robust SFT/behavioral cloning for solution diversity, and on careful tuning of entropy or KL-weighting hyperparameters. While gains are pronounced in settings with high solution diversity and reward sparsity, performance in deterministic or low-entropy domains may be less differentiated from classical approaches.
MGPO thus provides a principled, computationally efficient path for exploiting maximum-entropy principles within both RL and large-scale finetuning, with broad applicability to exploration-challenging, generalizability-critical domains (Choe et al., 25 Jul 2024, Xu et al., 9 Nov 2025).