Heuristic Enhanced Policy Optimization (HEPO)

Updated 4 January 2026
  • HEPO is a constrained policy optimization framework that integrates human-inspired heuristic rewards to ensure the task return never falls below that of the heuristic policy.
  • It employs Lagrangian relaxation to adaptively balance task rewards with heuristic signals, resulting in significant empirical improvements across diverse benchmarks.
  • HEPO offers robustness in reinforcement learning by dynamically updating a multiplier to correct misaligned heuristics without manual reward weighting.

Heuristic Enhanced Policy Optimization (HEPO) is a constrained policy optimization framework for reinforcement learning (RL) that systematically integrates heuristic rewards—dense, human-prior-inspired signals—while provably guaranteeing that the achieved task return never falls below the baseline attained by the “heuristic policy.” HEPO addresses the well-known challenge of reward misalignment and “reward hacking” in practical RL, providing a robust alternative to ad hoc weighting of reward terms by recasting the weighting problem as a constrained maximization task. Notably, HEPO delivers significant empirical improvements across both standard and non-expert-designed benchmarks, and is distinguished by its adaptive, Lagrangian-based approach to weighting heuristics and task rewards (Lee et al., 7 Jul 2025).

1. Constrained Formulation of HEPO

Let $J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$ denote the expected task return of policy $\pi$, and $H(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t h(s_t, a_t)\right]$ the expected heuristic return. The "heuristic policy" $\pi_H$ is the policy obtained, e.g., by Proximal Policy Optimization (PPO), that maximizes $H(\pi)$.

Instead of balancing rewards via a scalar $\lambda$ in the objective $\max_\pi J(\pi) + \lambda H(\pi)$, HEPO imposes a constraint:

$$\max_{\pi}\ J(\pi) + H(\pi) \qquad \text{s.t.}\quad J(\pi) \geq J(\pi_H) \tag{1}$$

This formulation ensures the optimized policy cannot, at any point, obtain lower task return than the baseline provided by $\pi_H$. Empirically, well-designed heuristics produce $J(\pi_H) > J(\pi_J)$, where $\pi_J$ is trained on task reward alone; even poorly chosen heuristics are compensated for by HEPO's dynamic weighting mechanism.

2. Lagrangian Relaxation and Update Rule

HEPO introduces a nonnegative Lagrange multiplier $\alpha \geq 0$, yielding the Lagrangian:

$$\mathcal{L}(\pi, \alpha) = J(\pi) + H(\pi) + \alpha\,\big(J(\pi) - J(\pi_H)\big) \quad\Longrightarrow\quad \min_{\alpha \geq 0}\, \max_{\pi}\, \mathcal{L}(\pi, \alpha) \tag{2}$$

Optimizing with respect to $\pi$ reduces to maximizing expected return under a modified reward:

$$r^{\alpha,h}(s,a) = (1+\alpha)\,r(s,a) + h(s,a) \tag{3}$$

The multiplier $\alpha$ is adapted by projected gradient descent on $\mathcal{L}$:

$$\nabla_{\alpha}\,\mathcal{L} = J(\pi) - J(\pi_H) \tag{4}$$

$\alpha$ increases when $J(\pi) < J(\pi_H)$, amplifying the task-reward weighting to restore the guarantee. Empirically, with well-engineered heuristics $\alpha \rightarrow 0$ after convergence, indicating that $H$ is no longer needed once the learner surpasses $\pi_H$.
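In code, the modified reward and the dual update amount to a few lines (a minimal Python sketch; the function names are illustrative and not taken from the released implementation):

```python
def modified_reward(r, h, alpha):
    """HEPO's modified reward: task reward scaled by (1 + alpha) plus the heuristic."""
    return (1.0 + alpha) * r + h

def update_alpha(alpha, J_pi, J_piH, eta_alpha=1e-3):
    """Projected gradient step on the multiplier (Eq. 4): step against
    grad_alpha L = J(pi) - J(pi_H), then project back onto [0, inf)."""
    return max(0.0, alpha - eta_alpha * (J_pi - J_piH))
```

When the constraint is violated the step is positive and $\alpha$ grows; once $J(\pi) \geq J(\pi_H)$, $\alpha$ decays toward zero, matching the convergence behavior described above.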

The practical instantiation with PPO alternates between sampling under $\pi$ and $\pi_H$, preparing advantages $A^{\pi}_r, A^{\pi}_h$ and $A^{\pi_H}_r, A^{\pi_H}_h$, and updating:

  • $\pi$: maximize

$$L_\pi = \mathbb{E}\Big[\min\Big(\rho(\theta)\big((1+\alpha)A^{\mathrm{old}}_r + A^{\mathrm{old}}_h\big),\ \mathrm{clip}\big(\rho(\theta), 1-\epsilon, 1+\epsilon\big)\big((1+\alpha)A^{\mathrm{old}}_r + A^{\mathrm{old}}_h\big)\Big)\Big] \tag{5}$$

  • $\pi_H$: maximize the corresponding clipped surrogate on heuristic-only advantages.

The performance difference lemma enables estimation of $J(\pi) - J(\pi_H)$ using cross-policy advantage rollouts.
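The clipped objective in Eq. (5) translates directly into a NumPy sketch (illustrative; it assumes probability ratios and advantage estimates have already been computed):

```python
import numpy as np

def hepo_ppo_loss(ratio, A_r, A_h, alpha, eps=0.2):
    """Clipped PPO surrogate (Eq. 5) on the modified advantage (1+alpha)*A_r + A_h.
    ratio = pi_theta(a|s) / pi_old(a|s); returns a loss to minimize (negated surrogate)."""
    A = (1.0 + alpha) * A_r + A_h
    unclipped = ratio * A
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * A
    return -np.minimum(unclipped, clipped).mean()
```

The same function serves both policies: for $\pi_H$, pass the heuristic-only advantages and drop the task term (e.g., `A_r = 0`).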

3. Theoretical Properties

HEPO guarantees at every iterate $k$ that $J(\pi_{k+1}) \geq J(\pi_H)$ provided the Lagrangian is optimized sufficiently well, so the learner never underperforms the best available policy trained purely with heuristics. This guarantee stands in contrast to methods based on policy invariance or naive reward addition, which may admit arbitrarily poor $J(\pi)$ if $H$ is poorly chosen.

Under conservative update regimes (e.g., step-size control or $\mathrm{KL}$ trust regions as in TRPO), the standard monotonic improvement bounds are recovered, derived using the performance difference lemma:

$$J(\pi') - J(\pi) = \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d^{\pi'},\, a \sim \pi'}\left[A^{\pi}_r(s,a)\right]$$

A plausible implication is that HEPO inherits the stability of conservative policy iteration while remaining robust to corrupted heuristics through the adaptivity of $\alpha$.
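The performance difference lemma can be checked exactly in a toy one-state MDP, where $V^\pi = \mathbb{E}_\pi[r]/(1-\gamma)$ and $A^\pi(a) = r(a) - \mathbb{E}_\pi[r]$ (a self-contained numerical check, not from the paper):

```python
# One state, two actions, discount gamma; the state distribution d^{pi'} is
# concentrated on the single state, so the lemma's expectation is over a ~ pi'.
gamma = 0.9
r = [1.0, 3.0]          # rewards for actions 0 and 1

def J(p):               # p = probability of taking action 1
    return ((1 - p) * r[0] + p * r[1]) / (1 - gamma)

def advantage(a, p):    # A^pi(a) = r(a) - E_pi[r] in this MDP
    return r[a] - ((1 - p) * r[0] + p * r[1])

p_old, p_new = 0.2, 0.7
lhs = J(p_new) - J(p_old)
rhs = ((1 - p_new) * advantage(0, p_old) + p_new * advantage(1, p_old)) / (1 - gamma)
assert abs(lhs - rhs) < 1e-9  # both sides equal 10 here
```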

4. Algorithmic Workflow and Implementation

The HEPO procedure is implemented as follows:

  1. Data collection: at each iteration, collect $B/2$ trajectories with $\pi$ and $B/2$ with $\pi_H$.
  2. Advantage estimation: compute advantages $A_r^{\pi}, A_h^{\pi}$ from $\pi$ data, and $A_r^{\pi_H}, A_h^{\pi_H}$ from $\pi_H$ data.
  3. Policy updates: optimize $\pi$ via PPO mini-batch updates on the modified (task + heuristic) surrogate, and separately optimize $\pi_H$ with PPO against heuristic-only returns.
  4. Multiplier update: estimate $\Delta J \approx \tfrac{1}{2}\left(\mathbb{E}_\pi[A_r^{\pi_H}] - \mathbb{E}_{\pi_H}[A_r^{\pi}]\right)$ and update $\alpha \leftarrow [\alpha - \eta_\alpha \Delta J]_+$.
  5. Hyperparameters: typical values include $B$ (trajectories per iteration), $\epsilon$ (PPO clip), $\gamma$ (discount), $\lambda_{\mathrm{GAE}}$ (GAE parameter), policy learning rate $\eta_\theta = 3 \times 10^{-4}$, and multiplier learning rate $\eta_\alpha = 10^{-3}$.

Two policies are trained concurrently from shared data via importance sampling, with a network architecture comprising a two-layer, 256-unit MLP with ReLU activations. A typical benchmark requires $\approx 1$B simulator steps and 5 random seeds.
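Step 4's cross-policy estimator, and the outer loop it sits in, can be sketched as follows (schematic Python; the rollout and PPO-update routines are placeholders, not actual APIs):

```python
import numpy as np

def estimate_delta_J(A_r_piH_on_pi, A_r_pi_on_piH):
    """Step 4's estimator of Delta J = J(pi) - J(pi_H): average the cross-policy
    advantages (each policy's advantage evaluated on the other's trajectories)
    over the two halves of the batch."""
    return 0.5 * (np.mean(A_r_piH_on_pi) - np.mean(A_r_pi_on_piH))

# Schematic outer loop (placeholder routines, shown as comments):
#   for it in range(num_iterations):
#       batch_pi  = collect(pi,   B // 2)                      # step 1
#       batch_piH = collect(pi_H, B // 2)
#       adv_pi, adv_piH = gae(batch_pi), gae(batch_piH)        # step 2
#       ppo_update(pi,   modified_surrogate)                   # step 3
#       ppo_update(pi_H, heuristic_surrogate)
#       alpha = max(0.0, alpha - eta_alpha * estimate_delta_J(...))  # step 4
```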

5. Heuristic Reward Construction

Heuristic signals $h(s,a)$ encode human priors as dense rewards, facilitating exploration and overcoming sparse-reward learning plateaus. Examples include:

  • Locomotion (IsaacGym): Forward velocity, foot contact bonuses, joint-torque penalties.
  • Manipulation (Bi-Dex, FrankaCabinet): Gripper-handle distance (positive), grasp-force contact (positive), action magnitude (negative).

Notably, non-expert-designed heuristics frequently misweight or invert terms, as documented in a human study on FrankaCabinet in which 12 graduate students produced reward functions that sometimes erroneously rewarded moving away from the cabinet. HEPO's adaptive $\alpha$ downweights deleterious heuristics in such cases, preserving task performance.
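As an illustration of how such terms compose, here is a hypothetical dense heuristic for a cabinet-opening task (all weights and names are invented for illustration; the benchmark's actual shaping terms differ):

```python
import numpy as np

def cabinet_heuristic(gripper_pos, handle_pos, in_contact, action,
                      w_dist=1.0, w_contact=0.5, w_action=0.01):
    """Hypothetical h(s, a): reward proximity of the gripper to the handle,
    add a bonus for grasp contact, and penalize large action magnitudes."""
    dist = np.linalg.norm(np.asarray(gripper_pos) - np.asarray(handle_pos))
    action_cost = float(np.sum(np.square(action)))
    return -w_dist * dist + w_contact * float(in_contact) - w_action * action_cost
```

Flipping the sign of `w_dist` reproduces the failure mode seen in the human study (a heuristic that rewards moving away from the cabinet), which HEPO's adaptive multiplier is designed to absorb.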

6. Empirical Performance and Robustness

HEPO was evaluated on the IsaacGym locomotion suite and Bi-Dex manipulation (29 tasks), as well as on human-designed (non-expert) heuristics. The normalized task return $\tilde{J}_X$ is defined relative to the random and heuristic-only policies. Key results include:

Method        IQM $\tilde{J}$    PI (over H-only)
HEPO          0.62               0.62 (95% CI > 0.50)
H-only        0.44               n/a
J+H           0.40               n/a
PBRS, HuRL    ≈0.0               n/a
EIPO          0.35               n/a

On FrankaCabinet with 12 non-expert heuristics, HEPO achieved IQM = 0.94 (vs. 0.44 for H-only) and PI = 0.73, strictly outperforming heuristic-only PPO in 9 of 12 cases.

Ablations indicate that only the HEPO constraint anchored to $\pi_H$ allows surpassing the heuristic policy. Joint policy/trajectory sampling reduces off-policy error and improves performance relative to alternating collection. Hyperparameter sweeps demonstrate HEPO's robustness to $\lambda$ and $\eta_\alpha$; naive reward addition is highly sensitive.

7. Practical Implications and Availability

HEPO enables RL practitioners to leverage dense heuristic signals without manual reward weighting, instead relying on a provably constrained formulation that adaptively trades off between task and heuristic rewards. The method preserves or outperforms strong heuristic baselines even under non-expert or misaligned heuristics, with minimal sensitivity to hyperparameter selection. Full implementation details, code, hyperparameters, and learning curves are available at https://github.com/Improbable-AI/hepo (Lee et al., 7 Jul 2025).
