Heuristic Enhanced Policy Optimization (HEPO)

Updated 4 January 2026
  • HEPO is a constrained policy optimization framework that integrates human-inspired heuristic rewards to ensure the task return never falls below that of the heuristic policy.
  • It employs Lagrangian relaxation to adaptively balance task rewards with heuristic signals, resulting in significant empirical improvements across diverse benchmarks.
  • HEPO offers robustness in reinforcement learning by dynamically updating a multiplier to correct misaligned heuristics without manual reward weighting.

Heuristic Enhanced Policy Optimization (HEPO) is a constrained policy optimization framework for reinforcement learning (RL) that systematically integrates heuristic rewards—dense, human-prior-inspired signals—while provably guaranteeing that the achieved task return never falls below the baseline attained by the “heuristic policy.” HEPO addresses the well-known challenge of reward misalignment and “reward hacking” in practical RL, providing a robust alternative to ad hoc weighting of reward terms by recasting the weighting problem as a constrained maximization task. Notably, HEPO delivers significant empirical improvements across both standard and non-expert-designed benchmarks, and is distinguished by its adaptive, Lagrangian-based approach to weighting heuristics and task rewards (Lee et al., 7 Jul 2025).

1. Constrained Formulation of HEPO

Let $J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$ denote the expected task return of policy $\pi$, and $H(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t h(s_t, a_t)\right]$ the expected heuristic return. The "heuristic policy" $\pi_H$ is the policy obtained, e.g., by Proximal Policy Optimization (PPO), that maximizes $H(\pi)$.

Instead of balancing rewards via a scalar $\lambda$ in the objective $\max_\pi J(\pi) + \lambda H(\pi)$, HEPO imposes a constraint:

$$\max_{\pi}\ J(\pi) + H(\pi) \qquad \text{s.t.}\quad J(\pi) \geq J(\pi_H) \tag{1}$$

This formulation ensures the optimized policy cannot, at any point, obtain lower task return than the baseline provided by $\pi_H$. Empirically, well-designed heuristics produce $J(\pi_H) > J(\pi_J)$, where $\pi_J$ is trained on task reward alone; even poorly chosen heuristics are compensated for by HEPO's dynamic weighting mechanism.

2. Lagrangian Relaxation and Update Rule

HEPO introduces a nonnegative Lagrange multiplier $\alpha \geq 0$, yielding the Lagrangian:

$$\mathcal{L}(\pi, \alpha) = J(\pi) + H(\pi) + \alpha\,\big(J(\pi) - J(\pi_H)\big) \quad\Longrightarrow\quad \min_{\alpha \geq 0}\, \max_{\pi}\, \mathcal{L}(\pi, \alpha) \tag{2}$$

Optimizing with respect to $\pi$ reduces to maximizing expected return under a modified reward:

$$r^{\alpha,h}(s,a) = (1+\alpha)\,r(s,a) + h(s,a) \tag{3}$$

The multiplier $\alpha$ is adapted by projected gradient descent on $\mathcal{L}$:

$$\nabla_{\alpha}\,\mathcal{L} = J(\pi) - J(\pi_H) \tag{4}$$

$\alpha$ increases when $J(\pi) < J(\pi_H)$, amplifying the task-reward weighting to restore the guarantee. Empirically, with well-engineered heuristics $\alpha \rightarrow 0$ after convergence, indicating that $H$ is no longer needed once the learner surpasses $\pi_H$.
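In code, the modified reward and the dual update amount to a few lines (a minimal Python sketch; the function names are illustrative and not taken from the released implementation):

```python
def modified_reward(r, h, alpha):
    """HEPO's modified reward: task reward scaled by (1 + alpha) plus the heuristic."""
    return (1.0 + alpha) * r + h

def update_alpha(alpha, J_pi, J_piH, eta_alpha=1e-3):
    """Projected gradient step on the multiplier (Eq. 4): step against
    grad_alpha L = J(pi) - J(pi_H), then project back onto [0, inf)."""
    return max(0.0, alpha - eta_alpha * (J_pi - J_piH))
```

When the constraint is violated the step is positive and $\alpha$ grows; once $J(\pi) \geq J(\pi_H)$, $\alpha$ decays toward zero, matching the convergence behavior described above.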

The practical instantiation with PPO alternates between sampling under $\pi$ and $\pi_H$, preparing advantages $A^{\pi}_r, A^{\pi}_h$ and $A^{\pi_H}_r, A^{\pi_H}_h$, and updating:

  • $\pi$: maximize

$$L_\pi = \mathbb{E}\Big[\min\Big(\rho(\theta)\big((1+\alpha)A^{\mathrm{old}}_r + A^{\mathrm{old}}_h\big),\ \mathrm{clip}\big(\rho(\theta), 1-\epsilon, 1+\epsilon\big)\big((1+\alpha)A^{\mathrm{old}}_r + A^{\mathrm{old}}_h\big)\Big)\Big] \tag{5}$$

  • $\pi_H$: maximize the corresponding clipped surrogate on heuristic-only advantages.

The performance difference lemma enables estimation of $J(\pi) - J(\pi_H)$ using cross-policy advantage rollouts.
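The clipped objective in Eq. (5) translates directly into a NumPy sketch (illustrative; it assumes probability ratios and advantage estimates have already been computed):

```python
import numpy as np

def hepo_ppo_loss(ratio, A_r, A_h, alpha, eps=0.2):
    """Clipped PPO surrogate (Eq. 5) on the modified advantage (1+alpha)*A_r + A_h.
    ratio = pi_theta(a|s) / pi_old(a|s); returns a loss to minimize (negated surrogate)."""
    A = (1.0 + alpha) * A_r + A_h
    unclipped = ratio * A
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * A
    return -np.minimum(unclipped, clipped).mean()
```

The same function serves both policies: for $\pi_H$, pass the heuristic-only advantages and drop the task term (e.g., `A_r = 0`).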

3. Theoretical Properties

HEPO guarantees at every iterate $k$ that $J(\pi_{k+1}) \geq J(\pi_H)$ provided the Lagrangian is optimized sufficiently well, so the learner never underperforms the best available policy trained purely with heuristics. This guarantee stands in contrast to methods based on policy invariance or naive reward addition, which may admit arbitrarily poor $J(\pi)$ if $H$ is poorly chosen.

Under conservative update regimes (e.g., step-size control or $\mathrm{KL}$ trust regions as in TRPO), the standard monotonic improvement bounds are recovered, derived using the performance difference lemma:

$$J(\pi') - J(\pi) = \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d^{\pi'},\, a \sim \pi'}\left[A^{\pi}_r(s,a)\right]$$

A plausible implication is that HEPO inherits the stability of conservative policy iteration while remaining robust to corrupted heuristics through the adaptivity of $\alpha$.
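The performance difference lemma can be checked exactly in a toy one-state MDP, where $V^\pi = \mathbb{E}_\pi[r]/(1-\gamma)$ and $A^\pi(a) = r(a) - \mathbb{E}_\pi[r]$ (a self-contained numerical check, not from the paper):

```python
# One state, two actions, discount gamma; the state distribution d^{pi'} is
# concentrated on the single state, so the lemma's expectation is over a ~ pi'.
gamma = 0.9
r = [1.0, 3.0]          # rewards for actions 0 and 1

def J(p):               # p = probability of taking action 1
    return ((1 - p) * r[0] + p * r[1]) / (1 - gamma)

def advantage(a, p):    # A^pi(a) = r(a) - E_pi[r] in this MDP
    return r[a] - ((1 - p) * r[0] + p * r[1])

p_old, p_new = 0.2, 0.7
lhs = J(p_new) - J(p_old)
rhs = ((1 - p_new) * advantage(0, p_old) + p_new * advantage(1, p_old)) / (1 - gamma)
assert abs(lhs - rhs) < 1e-9  # both sides equal 10 here
```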

4. Algorithmic Workflow and Implementation

The HEPO procedure is implemented as follows:

  1. Data collection: at each iteration, collect $B/2$ trajectories with $\pi$ and $B/2$ with $\pi_H$.
  2. Advantage estimation: compute advantages $A_r^{\pi}, A_h^{\pi}$ from $\pi$ data, and $A_r^{\pi_H}, A_h^{\pi_H}$ from $\pi_H$ data.
  3. Policy updates: optimize $\pi$ via PPO mini-batch updates on the modified (task + heuristic) surrogate, and separately optimize $\pi_H$ with PPO against heuristic-only returns.
  4. Multiplier update: estimate $\Delta J \approx \tfrac{1}{2}\left(\mathbb{E}_\pi[A_r^{\pi_H}] - \mathbb{E}_{\pi_H}[A_r^{\pi}]\right)$ and update $\alpha \leftarrow [\alpha - \eta_\alpha \Delta J]_+$.
  5. Hyperparameters: typical values include $B$ (trajectories per iteration), $\epsilon$ (PPO clip), $\gamma$ (discount), $\lambda_{\mathrm{GAE}}$ (GAE parameter), policy learning rate $\eta_\theta = 3 \times 10^{-4}$, and multiplier learning rate $\eta_\alpha = 10^{-3}$.

Two policies are trained concurrently from shared data via importance sampling, with a network architecture comprising a two-layer, 256-unit MLP with ReLU activations. A typical benchmark requires $\approx 1$B simulator steps and 5 random seeds.
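Step 4's cross-policy estimator, and the outer loop it sits in, can be sketched as follows (schematic Python; the rollout and PPO-update routines are placeholders, not actual APIs):

```python
import numpy as np

def estimate_delta_J(A_r_piH_on_pi, A_r_pi_on_piH):
    """Step 4's estimator of Delta J = J(pi) - J(pi_H): average the cross-policy
    advantages (each policy's advantage evaluated on the other's trajectories)
    over the two halves of the batch."""
    return 0.5 * (np.mean(A_r_piH_on_pi) - np.mean(A_r_pi_on_piH))

# Schematic outer loop (placeholder routines, shown as comments):
#   for it in range(num_iterations):
#       batch_pi  = collect(pi,   B // 2)                      # step 1
#       batch_piH = collect(pi_H, B // 2)
#       adv_pi, adv_piH = gae(batch_pi), gae(batch_piH)        # step 2
#       ppo_update(pi,   modified_surrogate)                   # step 3
#       ppo_update(pi_H, heuristic_surrogate)
#       alpha = max(0.0, alpha - eta_alpha * estimate_delta_J(...))  # step 4
```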

5. Heuristic Reward Construction

Heuristic signals $h(s,a)$ encode human priors as dense rewards, facilitating exploration and overcoming sparse-reward learning plateaus. Examples include:

  • Locomotion (IsaacGym): Forward velocity, foot contact bonuses, joint-torque penalties.
  • Manipulation (Bi-Dex, FrankaCabinet): Gripper-handle distance (positive), grasp-force contact (positive), action magnitude (negative).

Notably, non-expert-designed heuristics frequently misweight or invert terms, as documented in a human study on FrankaCabinet in which 12 graduate students produced reward functions that sometimes erroneously rewarded moving away from the cabinet. HEPO's adaptive $\alpha$ downweights deleterious heuristics in such cases, preserving task performance.
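As an illustration of how such terms compose, here is a hypothetical dense heuristic for a cabinet-opening task (all weights and names are invented for illustration; the benchmark's actual shaping terms differ):

```python
import numpy as np

def cabinet_heuristic(gripper_pos, handle_pos, in_contact, action,
                      w_dist=1.0, w_contact=0.5, w_action=0.01):
    """Hypothetical h(s, a): reward proximity of the gripper to the handle,
    add a bonus for grasp contact, and penalize large action magnitudes."""
    dist = np.linalg.norm(np.asarray(gripper_pos) - np.asarray(handle_pos))
    action_cost = float(np.sum(np.square(action)))
    return -w_dist * dist + w_contact * float(in_contact) - w_action * action_cost
```

Flipping the sign of `w_dist` reproduces the failure mode seen in the human study (a heuristic that rewards moving away from the cabinet), which HEPO's adaptive multiplier is designed to absorb.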

6. Empirical Performance and Robustness

HEPO was evaluated on the IsaacGym locomotion suite and Bi-Dex manipulation (29 tasks), as well as on human-designed (non-expert) heuristics. The normalized task return $\tilde{J}_X$ is defined relative to the random and heuristic-only policies. Key results include:

Method        IQM $\tilde{J}$    PI (over H-only)
HEPO          0.62               0.62 (95% CI > 0.50)
H-only        0.44               n/a
J+H           0.40               n/a
PBRS, HuRL    ≈0.0               n/a
EIPO          0.35               n/a

On FrankaCabinet with 12 non-expert heuristics, HEPO achieved IQM = 0.94 (vs. 0.44 for H-only) and PI = 0.73, strictly outperforming heuristic-only PPO in 9 of 12 cases.

Ablations indicate that only the HEPO constraint anchored to $\pi_H$ allows surpassing the heuristic policy. Joint policy/trajectory sampling reduces off-policy error and improves performance relative to alternating collection. Hyperparameter sweeps demonstrate HEPO's robustness to $\lambda$ and $\eta_\alpha$; naive reward addition is highly sensitive.

7. Practical Implications and Availability

HEPO enables RL practitioners to leverage dense heuristic signals without manual reward weighting, instead relying on a provably constrained formulation that adaptively trades off between task and heuristic rewards. The method preserves or outperforms strong heuristic baselines even under non-expert or misaligned heuristics, with minimal sensitivity to hyperparameter selection. Full implementation details, code, hyperparameters, and learning curves are available at https://github.com/Improbable-AI/hepo (Lee et al., 7 Jul 2025).
