
HEPO: Heuristic Enhanced Policy Optimization

Updated 20 November 2025
  • HEPO is a reinforcement learning framework that integrates heuristic rewards while enforcing a constraint to outperform a given heuristic baseline.
  • It reformulates reward augmentation as a constrained maximization problem using a Lagrange multiplier to balance task and heuristic rewards.
  • HEPO demonstrates enhanced robustness and superior empirical performance by preventing reward hacking and ensuring policy improvement in challenging benchmarks.

Heuristic Enhanced Policy Optimization (HEPO) is a reinforcement learning (RL) framework designed to leverage heuristic reward functions while robustly preventing reward hacking and underperformance relative to heuristic policies. Rather than assuming that the optimal solution is invariant to the presence of heuristic rewards, HEPO casts reward augmentation as a constrained maximization problem that prioritizes policy improvement over the heuristic baseline. This approach enables both efficient learning from human priors and performance guarantees relative to pre-designed or concurrently optimized heuristic policies, leading to improved robustness in challenging benchmarks and with non-expert heuristic design (Lee et al., 7 Jul 2025).

1. Formulation of the HEPO Objective

HEPO is defined by introducing a constraint that the learned policy $\pi$ should never underperform compared to a heuristic policy $\pi_H$ with respect to the true task reward. The base objectives are:

  • True task objective: $J(\pi)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^t r(s_t, a_t)\right]$
  • Heuristic objective: $H(\pi)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^t h(s_t, a_t)\right]$

Given a heuristic policy $\pi_H$ (pre-trained or trained concurrently), HEPO formulates the optimization problem as:

\begin{align*}
&\text{maximize}_{\pi} \quad J(\pi) + H(\pi) \\
&\text{subject to} \quad J(\pi) \geq J(\pi_H)
\end{align*}

This constraint is operationalized using a Lagrangian relaxation, resulting in an unconstrained min-max objective:

\min_{\alpha \geq 0} \max_{\pi} \,\, \mathcal{L}(\pi, \alpha) = J(\pi) + H(\pi) + \alpha \left[ J(\pi) - J(\pi_H) \right]

The Lagrange multiplier $\alpha$ adaptively re-weights the influence of the primary task reward in response to whether the current policy meets or exceeds the performance of the heuristic policy on the true reward.
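Because $J(\pi_H)$ does not depend on $\pi$, the inner maximization for a fixed $\alpha$ is equivalent to maximizing a single discounted return with a reweighted reward, which is the reduction used in the next section:

\arg\max_{\pi} \, \mathcal{L}(\pi, \alpha) = \arg\max_{\pi} \, \big[ (1+\alpha)\, J(\pi) + H(\pi) \big] = \arg\max_{\pi} \, \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^t \big( (1+\alpha)\, r(s_t, a_t) + h(s_t, a_t) \big) \right]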

2. Policy and Multiplier Update Mechanisms

HEPO’s optimization alternates between policy updates and Lagrange multiplier updates, leveraging the following structure:

  • Policy ($\pi$) update: For a fixed $\alpha$, optimizing $\pi$ reduces to standard RL with a “mixed” reward function:

\tilde{r}(s_t,a_t) = (1+\alpha)\, r(s_t,a_t) + h(s_t,a_t)

Thus, any standard on-policy RL algorithm (e.g., Proximal Policy Optimization, PPO) can be used directly with the modified rewards; a minimal code sketch of the reward mixing and the multiplier update follows this list.

  • Heuristic policy ($\pi_H$) update: Trained on the heuristic reward, with possible cross-pollination of off-policy advantages from $\pi$ to $\pi_H$ and vice versa, increasing sample efficiency.
  • Multiplier (α) update: Adjusts as

\alpha \leftarrow \left[ \alpha - \eta \cdot \big( J(\pi) - J(\pi_H) \big) \right]_+

where $[\cdot]_+$ denotes projection onto $\mathbb{R}_+$, and $\eta$ is a step size. If the candidate policy outperforms the baseline, $\alpha$ is reduced, decreasing emphasis on $r$; otherwise, $\alpha$ is increased, up-weighting $r$ to push $J(\pi)$ above $J(\pi_H)$.
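A minimal Python sketch of these two update primitives, assuming empirical estimates of $J(\pi)$ and $J(\pi_H)$ are available from recent rollouts (function names are illustrative, not the authors' implementation):

```python
def mixed_reward(r, h, alpha):
    """Mixed reward fed to the RL subroutine: (1 + alpha) * r + h."""
    return (1.0 + alpha) * r + h

def update_alpha(alpha, J_pi, J_piH, eta=1e-3):
    """Projected dual gradient step on the Lagrange multiplier.

    alpha shrinks when the learned policy already beats the heuristic
    policy on the true return, and grows otherwise.
    """
    return max(0.0, alpha - eta * (J_pi - J_piH))
```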

3. High-Level Algorithmic Workflow

The practical implementation of HEPO involves the following iterative process (with on-policy subroutines):

  1. Trajectory collection: Half the trajectories are sampled from the current policy $\pi$, half from the heuristic policy $\pi_H$.
  2. Advantage computation: Advantages with respect to both $r$ and $h$ are estimated for both policies using Generalized Advantage Estimation ($\lambda$-GAE), as illustrated in the sketch at the end of this section.
  3. Policy update: Both $\pi$ and $\pi_H$ are updated according to mixed-advantage objectives, leveraging importance sampling ratios for off-policy rollouts.
  4. Lagrange multiplier update: $\alpha$ is updated via stochastic gradient descent using estimated returns from the most recent rollouts.

This structure enables stable improvement over heuristic-only or invariance-based approaches, particularly when off-policy value estimation is feasible due to shared data collection.
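As a self-contained illustration of step 2, the sketch below computes $\lambda$-GAE advantages under the mixed reward for a single rollout; the reward arrays and zeroed critic values are placeholders for demonstration only:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout.

    rewards: shape (T,); values: shape (T + 1,), where the last entry
    is the bootstrap value of the final state.
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv

# Mixed-reward advantages for the pi update (step 2 of the workflow above).
r = np.random.randn(8)    # placeholder per-step task rewards
h = np.random.randn(8)    # placeholder per-step heuristic rewards
values = np.zeros(9)      # placeholder critic values (T + 1 entries)
alpha = 1.0
adv_mixed = gae_advantages((1 + alpha) * r + h, values)
```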

4. Theoretical Considerations and Empirical Behavior

Traditional approaches, such as potential-based reward shaping, design heuristic augmentation to satisfy a policy invariance property—yielding identical optima for the primary task as training with only the primary reward. However, these methods offer no guarantee of improved performance relative to a heuristic policy and may underutilize informative heuristics, especially if the optimal primary reward policy is difficult to discover.
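For reference, the classical sufficient condition for such invariance (Ng et al., 1999) restricts the heuristic to a potential difference over states:

h(s_t, a_t, s_{t+1}) = \gamma \, \Phi(s_{t+1}) - \Phi(s_t)

for some potential function $\Phi$; any heuristic of this form leaves the optimal policy for the primary reward unchanged.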

HEPO, by contrast, applies a policy improvement constraint $J(\pi) \geq J(\pi_H)$ at each iteration. This is analogous to conservative policy iteration, where the optimization is tethered to a reference policy. This ensures the updated policy does not degrade below the heuristic baseline and often substantially surpasses it, although no guarantee is made about global optimality with respect to $J$ alone under unconstrained updates (Lee et al., 7 Jul 2025).

5. Experimental Evaluation

HEPO was evaluated on extensive RL benchmarks, including IsaacGym locomotion (9 tasks) and Bi-Dexterous-Manipulation (20 tasks), using both well-engineered “expert” heuristic reward functions and heuristics authored by non-expert humans.

Key empirical findings:

  • Normalized return metric: $(J_X - J_{\text{random}}) / (J_{\text{H-only}} - J_{\text{random}})$.
  • Aggregation statistics: Interquartile mean (IQM) of normalized return and probability of improvement (PI) vs. the heuristic-only policy across all tasks (see the sketch after this list).
  • Performance summary: HEPO achieved the highest IQM and PI ($\approx 62\%$) over the heuristic baseline (lower-bound confidence interval $> 50\%$) across 29 tasks. In contrast, “J-only” significantly underperformed, “J+H” methods were highly sensitive to tuning, and other invariance- or constraint-based competitors also fell short.
  • Human heuristics: For a challenging manipulation task (FrankaCabinet) with 12 human-designed heuristics, PPO (H-only) failed in 3 cases, whereas HEPO improved upon all H-only baselines, yielding IQM $\approx 0.94$ and PI $\approx 73\%$.
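A minimal sketch of how these statistics can be computed from per-task returns; the IQM here is the mean of the middle 50% of scores, and the PI is simplified to a per-task comparison (both are assumptions about the exact aggregation procedure):

```python
import numpy as np

def normalized_return(J_X, J_random, J_H_only):
    """Per-task normalized return: (J_X - J_random) / (J_H_only - J_random)."""
    return (J_X - J_random) / (J_H_only - J_random)

def interquartile_mean(scores):
    """IQM: mean of the scores lying between the 25th and 75th percentiles."""
    scores = np.sort(np.asarray(scores))
    n = len(scores)
    return scores[n // 4 : n - n // 4].mean()

def probability_of_improvement(scores_a, scores_b):
    """Simplified PI: fraction of tasks on which method A scores higher than B."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    return float(np.mean(a > b))
```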

6. Robustness, Hyperparameters, and Implementation Recommendations

  • RL subroutine: PPO with RL-Games codebase defaults (collected in the config sketch after this list):
    • $\gamma = 0.99$, GAE $\lambda = 0.95$, PPO clip $\epsilon = 0.2$
    • Policy/value net learning rate $3 \times 10^{-4}$; entropy bonus $\approx 0$
    • Rollouts: 2048 steps; minibatch size 256; 10 PPO epochs per iteration
  • Lagrange multiplier: Initialize $\alpha^{0} = 1.0$; learning rate $\eta \approx 10^{-3}$
  • Batch composition: Even split between $\pi$ and $\pi_H$ rollouts; shared data used for advantage estimation via importance weighting
  • Trial statistics: 5 random seeds per method; 3M (IsaacGym) or 5M (Bi-Dex) steps per run using commodity GPU hardware
  • Hyperparameter sensitivity: Performance is robust to $\lambda$ and $\alpha$ initialization as long as they remain in moderate ranges ($[0.1, 10]$)
  • Ablation studies: The choice of policy improvement reference (enforcing $J(\pi) \geq J(\pi_H)$ vs. $J(\pi) \geq J(\pi_J)$, as in EIPO) and joint trajectory sampling both enhanced stability and performance.
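For convenience, the reported defaults can be collected into a single configuration dictionary; the key names below are illustrative and do not correspond to the RL-Games configuration schema:

```python
# Illustrative HEPO hyperparameter summary (key names are not RL-Games schema keys).
HEPO_DEFAULTS = {
    "gamma": 0.99,          # discount factor
    "gae_lambda": 0.95,     # GAE lambda
    "ppo_clip": 0.2,        # PPO clipping epsilon
    "learning_rate": 3e-4,  # policy/value networks
    "entropy_coef": 0.0,    # entropy bonus approximately zero
    "rollout_steps": 2048,
    "minibatch_size": 256,
    "ppo_epochs": 10,
    "alpha_init": 1.0,      # Lagrange multiplier initialization
    "alpha_lr": 1e-3,       # multiplier step size eta
    "num_seeds": 5,
}
```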

7. Context and Implications

HEPO redefines the integration of heuristic signals in RL by prioritizing improvement over a heuristic baseline rather than only optimizing for a combined reward objective or preserving policy invariance. This strategy provides practitioners with a practical guarantee: the learned policy will not regress below the performance level of the best heuristic policy currently available, and can, in practice, acquire significantly better strategies—even when the heuristic is noisy or suboptimal.

A plausible implication is that HEPO facilitates broader adoption of RL in domains where heuristic engineering is expensive or error-prone, as it demonstrably reduces the cost and expertise required for reward design. Furthermore, the HEPO framework is compatible with standard RL pipeline implementations and is agnostic to the source and fidelity of the heuristic reward, making it widely applicable within both academic research and industry practice (Lee et al., 7 Jul 2025).
