Heuristic Enhanced Policy Optimization (HEPO)
- HEPO is a constrained policy optimization framework that integrates human-inspired heuristic rewards to ensure the task return never falls below that of the heuristic policy.
- It employs Lagrangian relaxation to adaptively balance task rewards with heuristic signals, resulting in significant empirical improvements across diverse benchmarks.
- HEPO offers robustness in reinforcement learning by dynamically updating a multiplier to correct misaligned heuristics without manual reward weighting.
Heuristic Enhanced Policy Optimization (HEPO) is a constrained policy optimization framework for reinforcement learning (RL) that systematically integrates heuristic rewards—dense, human-prior-inspired signals—while provably guaranteeing that the achieved task return never falls below the baseline attained by the “heuristic policy.” HEPO addresses the well-known challenge of reward misalignment and “reward hacking” in practical RL, providing a robust alternative to ad hoc weighting of reward terms by recasting the weighting problem as a constrained maximization task. Notably, HEPO delivers significant empirical improvements across both standard and non-expert-designed benchmarks, and is distinguished by its adaptive, Lagrangian-based approach to weighting heuristics and task rewards (Lee et al., 7 Jul 2025).
1. Constrained Formulation of HEPO
Let $J(\pi)$ denote the expected task return of policy $\pi$, and $H(\pi)$ the expected heuristic return. The “heuristic policy” $\pi_H$ is the policy obtained, e.g., by Proximal Policy Optimization (PPO), that maximizes the heuristic return $H(\pi)$.
Instead of balancing rewards via a scalar weight $\alpha$ in the objective $J(\pi) + \alpha H(\pi)$, HEPO imposes a constraint:

$$\max_{\pi}\; J(\pi) + H(\pi) \quad \text{subject to} \quad J(\pi) \ge J(\pi_H).$$
This formulation ensures the optimized policy cannot, at any point, obtain lower task return than the baseline provided by $\pi_H$. Empirically, well-designed heuristics produce $J(\pi_H) \ge J(\pi_J)$, where $\pi_J$ is trained on task reward alone; however, even poorly chosen heuristics are corrected by HEPO’s dynamic weighting mechanism.
2. Lagrangian Relaxation and Update Rule
HEPO introduces a nonnegative Lagrange multiplier $\lambda \ge 0$, yielding the Lagrangian:

$$\mathcal{L}(\pi, \lambda) = J(\pi) + H(\pi) + \lambda\,\bigl(J(\pi) - J(\pi_H)\bigr).$$
Optimizing $\mathcal{L}$ with respect to $\pi$ reduces to maximizing expected return under a modified reward:

$$\tilde{r}_{\lambda}(s, a) = (1 + \lambda)\, r_J(s, a) + r_H(s, a),$$

where $r_J$ and $r_H$ denote the per-step task and heuristic rewards.
The multiplier is adapted by projected gradient ascent on the constraint violation:

$$\lambda \leftarrow \max\bigl(0,\; \lambda + \eta_{\lambda}\,\bigl(J(\pi_H) - J(\pi)\bigr)\bigr),$$

where $\eta_{\lambda}$ is the multiplier step size.
$\lambda$ increases when $J(\pi) < J(\pi_H)$, amplifying the task-reward weighting to restore the guarantee. Empirically, after convergence on well-engineered heuristics, $\lambda \approx 0$, indicating that the corrective weighting is no longer needed once $J(\pi)$ surpasses $J(\pi_H)$.
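Under these multiplier dynamics, the update can be sketched in a few lines of Python; the helper name `update_multiplier` and the step-size values are illustrative, not taken from the released code:

```python
def update_multiplier(lam, J_pi, J_piH, eta_lam=0.01):
    """Projected gradient step on the constraint violation J(pi_H) - J(pi).

    lam grows when the learner's task return J_pi falls below the heuristic
    policy's task return J_piH, and decays toward zero otherwise; the max(0, .)
    projection keeps the multiplier nonnegative.
    """
    return max(0.0, lam + eta_lam * (J_piH - J_pi))

# Constraint violated (J(pi) < J(pi_H)): lambda grows.
update_multiplier(1.0, J_pi=2.0, J_piH=5.0, eta_lam=0.1)  # -> 1.3
# Constraint satisfied with slack: lambda shrinks toward 0.
update_multiplier(0.2, J_pi=6.0, J_piH=5.0, eta_lam=0.1)  # -> 0.1
```

The projection is what makes corrupted heuristics recoverable: once the constraint is violated persistently, $\lambda$ keeps growing until task reward dominates the modified objective.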
The practical instantiation with PPO alternates between sampling under $\pi$ and $\pi_H$, preparing advantage estimates $A^J$ (task) and $A^H$ (heuristic), and updating:
- $\pi$: maximize the clipped PPO surrogate with the modified advantage $(1 + \lambda)\,A^J + A^H$;
- $\pi_H$: maximize the clipped PPO surrogate with heuristic-only advantages $A^H$.
The performance difference lemma enables estimation of the constraint gap $J(\pi) - J(\pi_H)$ using cross-policy advantage rollouts.
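Assuming the modified advantage $(1+\lambda)A^J + A^H$ described above, the learner's clipped surrogate can be sketched as follows; the function name, argument names, and clip constant are illustrative:

```python
import numpy as np

def hepo_surrogate(ratio, adv_task, adv_heur, lam, clip_eps=0.2):
    """Clipped PPO surrogate evaluated on the HEPO-modified advantage.

    ratio    : pi_new(a|s) / pi_old(a|s) per sampled transition
    adv_task : task-reward advantages A^J
    adv_heur : heuristic-reward advantages A^H
    lam      : current Lagrange multiplier (>= 0)
    """
    adv = (1.0 + lam) * adv_task + adv_heur        # modified advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))
```

With `lam = 0` this reduces to PPO on the naively shaped reward $r_J + r_H$; as `lam` grows large, the task advantages dominate and the objective approaches task-only PPO.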
3. Theoretical Properties
HEPO guarantees at every iterate that $J(\pi) \ge J(\pi_H)$ if the Lagrangian is optimized sufficiently, so the learner never underperforms relative to the best available policy trained purely with heuristics. This guarantee stands in contrast to methods based on policy invariance or naive reward addition, which may admit arbitrarily poor $J(\pi)$ if the mixing weight $\alpha$ is poorly set.
Under conservative update regimes (e.g., step-size control or trust regions as in TRPO), the standard monotonic improvement bounds are recovered, derived using the performance difference lemma:

$$J(\pi') - J(\pi) = \mathbb{E}_{\tau \sim \pi'}\!\left[\sum_{t} \gamma^{t} A^{\pi}(s_t, a_t)\right].$$
A plausible implication is that HEPO possesses the stability of conservative policy iteration, but is robust to heuristic corruptions due to the adaptivity of $\lambda$.
4. Algorithmic Workflow and Implementation
The HEPO procedure is implemented as follows:
- Data collection: At each iteration, collect trajectories with $\pi$ and with $\pi_H$.
- Advantage estimation: Compute task and heuristic advantages $A^J$, $A^H$ from the $\pi$ data, and heuristic advantages from the $\pi_H$ data.
- Policy updates: Optimize $\pi$ via PPO mini-batch updates on the modified (task + heuristic) surrogate, and separately optimize $\pi_H$ with PPO against heuristic-only returns.
- Multiplier update: Estimate $J(\pi_H) - J(\pi)$ and update $\lambda \leftarrow \max(0,\; \lambda + \eta_{\lambda}(J(\pi_H) - J(\pi)))$.
- Hyperparameters: The key settings are the number of trajectories per iteration, the PPO clip ratio, the discount factor $\gamma$, the GAE parameter, the policy learning rate, and the multiplier learning rate $\eta_{\lambda}$; the specific values used are listed in the released code.
Two policies are trained concurrently from shared data via importance sampling, with a network architecture comprising a two-layer, 256-unit MLP with ReLU activations. A typical benchmark run requires on the order of billions of simulator steps and 5 random seeds.
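The workflow above can be condensed into a schematic training loop. This is a sketch under the stated assumptions, not the released implementation: the `rollout` and `ppo_update` helpers are caller-supplied stubs, and the policy handles are opaque placeholders:

```python
def hepo_train(rollout, ppo_update, iters=100, eta_lam=0.05):
    """Schematic HEPO loop: two concurrently trained policies, one multiplier.

    rollout(policy)             -> (trajectory batch, mean task return J)
    ppo_update(policy, data, w) -> updated policy, w = (task_weight, heur_weight)
    """
    pi, pi_H = "pi_0", "piH_0"  # opaque policy handles for this sketch
    lam = 1.0                   # initial Lagrange multiplier (illustrative)
    for _ in range(iters):
        data, J_pi = rollout(pi)       # on-policy data / task return of learner
        data_H, J_piH = rollout(pi_H)  # baseline heuristic policy
        # Learner: PPO on the modified reward (1 + lam) * r_J + r_H.
        pi = ppo_update(pi, data, (1.0 + lam, 1.0))
        # Heuristic policy: PPO on the heuristic reward only.
        pi_H = ppo_update(pi_H, data_H, (0.0, 1.0))
        # Projected gradient step enforcing J(pi) >= J(pi_H).
        lam = max(0.0, lam + eta_lam * (J_piH - J_pi))
    return pi, lam
```

Plugging in stub helpers that report a fixed gap between the two returns shows the expected behavior: $\lambda$ decays to zero once the learner's task return exceeds the baseline, and grows when it falls behind.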
5. Heuristic Reward Construction
Heuristic signals encode human priors as dense rewards, facilitating exploration and overcoming sparse-reward learning plateaus. Examples include:
- Locomotion (IsaacGym): Forward velocity, foot contact bonuses, joint-torque penalties.
- Manipulation (Bi-Dex, FrankaCabinet): Gripper-handle distance (positive), grasp-force contact (positive), action magnitude (negative).
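As a concrete illustration of such a manipulation heuristic, the sketch below combines a distance term, a contact bonus, and an action-magnitude penalty. All field names and coefficients are invented for the example:

```python
import math

def heuristic_reward(gripper_pos, handle_pos, contact_force, action,
                     w_dist=1.0, w_contact=0.5, w_action=0.01):
    """Dense heuristic: approach the handle, reward contact, penalize effort.

    The weights are illustrative; misweighting or sign-flipping any term is
    exactly the failure mode HEPO's adaptive multiplier is designed to absorb.
    """
    dist = math.dist(gripper_pos, handle_pos)
    r_approach = w_dist * math.exp(-dist)              # closer is better
    r_contact = w_contact * (1.0 if contact_force > 0 else 0.0)
    r_effort = -w_action * sum(a * a for a in action)  # action penalty
    return r_approach + r_contact + r_effort
```

Flipping the sign of `r_approach` reproduces the "rewarded moving away" error observed in the human study described below.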
Notably, non-expert-designed heuristics frequently misweight or invert terms, as documented in a human study on FrankaCabinet. Here, 12 graduate students produced functions that sometimes erroneously rewarded moving away from the cabinet. HEPO’s adaptive $\lambda$ downweights deleterious heuristics in these cases, preserving task performance.
6. Empirical Performance and Robustness
HEPO was evaluated on the IsaacGym locomotion suite and Bi-Dex manipulation (29 tasks), as well as on human-designed (non-expert) heuristics. Normalized task return is defined relative to random and heuristic-only policies. Key results (IQM: interquartile mean; PI: probability of improvement over the heuristic-only baseline) include:
| Method | IQM | PI (over H-only) |
|---|---|---|
| HEPO | 0.62 | 0.62 (95% CI > 0.50) |
| H-only | 0.44 | — |
| J+H | 0.40 | — |
| PBRS, HuRL | ≈0.0 | — |
| EIPO | 0.35 | — |
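For reference, the IQM reported above is the mean of the middle 50% of the per-run normalized scores; a minimal computation (with made-up score values) looks like:

```python
def iqm(scores):
    """Interquartile mean: average of the middle 50% of the sorted scores.

    Simple trimming convention; assumes len(scores) is divisible by 4 for an
    exact quartile split (libraries such as rliable handle the general case).
    """
    s = sorted(scores)
    n = len(s)
    trimmed = s[n // 4 : n - n // 4]  # drop the bottom and top quartiles
    return sum(trimmed) / len(trimmed)

iqm([0.0, 0.25, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0])  # -> 1.25
```

IQM is preferred over the plain mean here because it is robust to outlier seeds while using more of the data than the median.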
On FrankaCabinet with 12 non-expert heuristics, HEPO achieved IQM = 0.94 (vs. H-only 0.44), PI = 0.73, strictly outperforming heuristic-only PPO in 9 out of 12 cases.
Ablations indicate that only the HEPO constraint against $J(\pi_H)$ allows surpassing the heuristic policy. Joint policy/trajectory sampling reduces off-policy error and improves performance relative to alternating collection. Hyperparameter sweeps demonstrate HEPO’s robustness to the choice of $\eta_{\lambda}$ and the initial $\lambda$; naive reward addition is highly sensitive.
7. Practical Implications and Availability
HEPO enables RL practitioners to leverage dense heuristic signals without manual reward weighting, instead relying on a provably constrained formulation that adaptively trades off between task and heuristic rewards. The method preserves or outperforms strong heuristic baselines even under non-expert or misaligned heuristics, with minimal sensitivity to hyperparameter selection. Full implementation details, code, hyperparameters, and learning curves are available at https://github.com/Improbable-AI/hepo (Lee et al., 7 Jul 2025).