HEPO: Heuristic Enhanced Policy Optimization
- HEPO is a reinforcement learning framework that integrates heuristic rewards while enforcing a constraint to outperform a given heuristic baseline.
- It reformulates reward augmentation as a constrained maximization problem using a Lagrange multiplier to balance task and heuristic rewards.
- HEPO demonstrates enhanced robustness and superior empirical performance on challenging benchmarks by preventing reward hacking and ensuring policy improvement over the heuristic baseline.
Heuristic Enhanced Policy Optimization (HEPO) is a reinforcement learning (RL) framework designed to leverage heuristic reward functions while robustly preventing reward hacking and underperformance relative to heuristic policies. Rather than assuming that the optimal solution is invariant to the presence of heuristic rewards, HEPO casts reward augmentation as a constrained maximization problem that prioritizes policy improvement over the heuristic baseline. This approach enables both efficient learning from human priors and performance guarantees relative to pre-designed or concurrently optimized heuristic policies, leading to improved robustness in challenging benchmarks and with non-expert heuristic design (Lee et al., 7 Jul 2025).
1. Formulation of the HEPO Objective
HEPO is defined by introducing a constraint that the learned policy should never underperform a heuristic policy with respect to the true task reward. The base objectives are:
- True task objective: $J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]$, the expected discounted return under the true task reward $r$.
- Heuristic objective: $H(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_H(s_t, a_t)\right]$, the expected discounted return under the heuristic reward $r_H$.
Given a heuristic policy π_H (pre-trained or trained concurrently), HEPO formulates the optimization problem as:

$$\max_{\pi}\; H(\pi) \quad \text{subject to} \quad J(\pi) \;\geq\; J(\pi_H).$$

This constraint is operationalized using a Lagrangian relaxation, resulting in an unconstrained min-max objective:

$$\min_{\alpha \geq 0}\; \max_{\pi}\; \mathcal{L}(\pi, \alpha) \;=\; H(\pi) + \alpha \big( J(\pi) - J(\pi_H) \big).$$
The Lagrange multiplier α adaptively re-weights the influence of the primary task reward in response to whether the current policy meets or exceeds the performance of the heuristic policy on the true reward.
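For a fixed multiplier, the Lagrangian collapses to a single expected discounted return plus a constant that does not depend on the policy, which is what makes the reduction to standard RL in the next section possible:

$$\mathcal{L}(\pi, \alpha) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} \big( r_H(s_t, a_t) + \alpha\, r(s_t, a_t) \big)\right] \;-\; \alpha\, J(\pi_H),$$

where the trailing term −α·J(π_H) is constant with respect to π and can be dropped during the policy update.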
2. Policy and Multiplier Update Mechanisms
HEPO’s optimization alternates between policy updates and Lagrange multiplier updates, leveraging the following structure:
- Policy (π) update: For a fixed α, optimizing the Lagrangian over π reduces to standard RL with a "mixed" reward function $r_{\text{mix}}(s, a) = r_H(s, a) + \alpha\, r(s, a)$ (see the code sketch after this list).
Thus, any standard on-policy RL algorithm (e.g., Proximal Policy Optimization, PPO) can be used directly with the modified rewards.
- Heuristic policy (π_H) update: Trained on the heuristic reward, with possible cross-pollination of off-policy advantages from π to π_H and vice versa, increasing sample efficiency.
- Multiplier (α) update: Adjusts as
$$\alpha \;\leftarrow\; P_{[0, \infty)}\!\left[ \alpha - \eta \big( \hat{J}(\pi) - \hat{J}(\pi_H) \big) \right],$$
where $P_{[0, \infty)}$ denotes projection onto the nonnegative reals, η is a step size, and $\hat{J}$ denotes an empirical return estimate. If the candidate policy outperforms the baseline, α is reduced, decreasing emphasis on J; otherwise, α is increased, up-weighting J to push J(π) above J(π_H).
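A minimal sketch of these two computational steps, the reward mixing handed to PPO and the projected dual update, assuming Monte Carlo return estimates from the most recent rollouts (the function names and step size are illustrative rather than taken from the paper's code):

```python
import numpy as np

def mix_rewards(task_rewards: np.ndarray,
                heuristic_rewards: np.ndarray,
                alpha: float) -> np.ndarray:
    """Per-step mixed reward r_H + alpha * r, fed to an unmodified PPO update."""
    return heuristic_rewards + alpha * task_rewards

def update_alpha(alpha: float,
                 task_return_pi: float,
                 task_return_pi_h: float,
                 step_size: float = 1e-2) -> float:
    """Projected (sub)gradient step on the Lagrange multiplier.

    alpha shrinks when the learned policy already beats the heuristic policy
    on the true task return, and grows otherwise. The step size is an
    illustrative placeholder.
    """
    alpha -= step_size * (task_return_pi - task_return_pi_h)
    return max(alpha, 0.0)  # projection onto [0, inf)
```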
3. High-Level Algorithmic Workflow
The practical implementation of HEPO involves the following iterative process (with on-policy subroutines):
- Trajectory collection: Half the trajectories are sampled from the current policy π, half from the heuristic policy π_H.
- Advantage computation: Advantages are estimated for both π and π_H on their respective rewards using Generalized Advantage Estimation (GAE-λ).
- Policy update: Both π and π_H are updated according to mixed-advantage objectives, leveraging importance-sampling ratios for the off-policy portion of the rollouts.
- Lagrange multiplier update: α is updated via stochastic gradient descent using estimated returns from the most recent rollouts.
This structure enables stable improvement over heuristic-only or invariance-based approaches, particularly when off-policy value estimation is feasible due to shared data collection.
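One HEPO iteration might be organized as in the following sketch, where collect, gae, and ppo_step are caller-supplied stand-ins for the rollout, advantage-estimation, and PPO-update routines of a standard on-policy stack; the batch layout and step size are illustrative assumptions:

```python
def hepo_iteration(pi, pi_h, alpha, collect, gae, ppo_step, alpha_lr=1e-2):
    """One HEPO iteration; collect, gae, and ppo_step are hypothetical
    stand-ins for the pieces of a standard on-policy PPO implementation."""
    # 1. Trajectory collection: even split between the two policies.
    batches = [collect(pi), collect(pi_h)]

    # 2. Advantage estimation (GAE-lambda) on every batch for each policy:
    #    pi is trained on the mixed reward r_H + alpha * r,
    #    pi_H on the heuristic reward alone.
    adv_pi = [gae(b, b["r_h"] + alpha * b["r"]) for b in batches]
    adv_h = [gae(b, b["r_h"]) for b in batches]

    # 3. Policy updates; off-policy batches are handled through the
    #    importance-sampling ratios inside ppo_step.
    ppo_step(pi, batches, adv_pi)
    ppo_step(pi_h, batches, adv_h)

    # 4. Dual update: projected gradient step on the multiplier using the
    #    estimated true-task returns of the two policies.
    alpha -= alpha_lr * (batches[0]["task_return"] - batches[1]["task_return"])
    return max(alpha, 0.0)
```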
4. Theoretical Considerations and Empirical Behavior
Traditional approaches, such as potential-based reward shaping, design heuristic augmentation to satisfy a policy invariance property—yielding identical optima for the primary task as training with only the primary reward. However, these methods offer no guarantee of improved performance relative to a heuristic policy and may underutilize informative heuristics, especially if the optimal primary reward policy is difficult to discover.
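For reference, the invariance property applies to shaping terms of the potential-based form

$$F(s, a, s') \;=\; \gamma\, \Phi(s') - \Phi(s)$$

for some state potential Φ (Ng et al., 1999): adding F to the task reward provably leaves the optimal policies of the original task unchanged, but it gives no guarantee that the shaped learner will match, let alone exceed, a policy trained directly on the heuristic signal.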
HEPO, by contrast, applies a policy improvement constraint at each iteration. This is analogous to conservative policy iteration, where the optimization is tethered to a reference policy. This ensures the updated policy does not degrade below the heuristic baseline and often substantially surpasses it, although no guarantee is made about global optimality with respect to J alone under unconstrained updates (Lee et al., 7 Jul 2025).
5. Experimental Evaluation
HEPO was evaluated on extensive RL benchmarks, including IsaacGym locomotion (9 tasks) and Bi-Dexterous-Manipulation (20 tasks), using both well-engineered “expert” heuristic reward functions and heuristics authored by non-expert humans.
Key empirical findings:
- Metrics: per-task returns are normalized to a common scale, and results are aggregated via the interquartile mean (IQM) of normalized return and the probability of improvement (PI) over the heuristic-only policy across all tasks (see the sketch after this list).
- Performance summary: HEPO achieved the highest IQM and the highest PI over the heuristic baseline across all 29 tasks. In contrast, training on the task reward alone ("J-only") significantly underperformed, naive reward combination ("J+H") was highly sensitive to tuning, and other invariance- or constraint-based competitors also underperformed.
- Human heuristics: For a challenging manipulation task (FrankaCabinet) with 12 human-designed heuristics, PPO(H-only) failed in 3 cases, whereas HEPO improved upon all H-only baselines in both IQM and PI.
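IQM and PI are standard robust aggregate statistics; a minimal numpy sketch of both, assuming score matrices of shape (tasks, seeds) and omitting the normalization and confidence-interval machinery used in the evaluation:

```python
import numpy as np

def interquartile_mean(scores: np.ndarray) -> float:
    """IQM: mean of the middle 50% of all runs (scores has shape [tasks, seeds])."""
    flat = np.sort(scores.ravel())
    trim = flat.size // 4  # approximate trimming when size is not divisible by 4
    return float(flat[trim: flat.size - trim].mean())

def probability_of_improvement(scores_a: np.ndarray, scores_b: np.ndarray) -> float:
    """P(A > B): chance a random run of method A beats a random run of B
    on the same task, averaged over tasks (ties count as half a win)."""
    gt = scores_a[:, :, None] > scores_b[:, None, :]
    eq = scores_a[:, :, None] == scores_b[:, None, :]
    per_task = (gt + 0.5 * eq).mean(axis=(1, 2))
    return float(per_task.mean())
```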
6. Robustness, Hyperparameters, and Implementation Recommendations
- RL subroutine: PPO with RL-Games codebase defaults:
- Discount factor γ, GAE λ, PPO clipping range, policy/value network learning rate, and entropy bonus coefficient left at the codebase defaults
- Rollouts: 2048 steps; minibatch size 256; 10 PPO epochs per iteration
- Lagrange multiplier α: fixed initialization and a separate learning rate for the dual update (see the sensitivity note below)
- Batch composition: Even split between π and π_H rollouts; shared data used for advantage estimation via importance weighting
- Trial statistics: 5 random seeds per method; 3M (IsaacGym) or 5M (Bi-Dex) steps per run using commodity GPU hardware
- Hyperparameter sensitivity: Performance is robust to the α learning rate and initialization as long as they remain in moderate ranges
- Ablation studies: The choice of policy-improvement reference (anchoring to the heuristic policy, i.e., enforcing J(π) ≥ J(π_H), rather than to a task-reward-only reference policy as in EIPO) and joint trajectory sampling both enhanced stability and performance.
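Collected in one place, these settings could be expressed as a configuration sketch along the following lines; entries marked "stated" come from the list above, while alpha_init, alpha_lr, and the field names themselves are illustrative placeholders rather than the paper's actual configuration:

```python
# Illustrative consolidation of the settings above. Values marked "stated"
# come from the reported setup; the remaining entries are placeholder
# assumptions, not the actual configuration file.
hepo_config = {
    "ppo": {
        "rollout_length": 2048,   # stated
        "minibatch_size": 256,    # stated
        "ppo_epochs": 10,         # stated
        # discount, GAE lambda, clip range, learning rate, entropy bonus:
        # RL-Games defaults (not reproduced here)
    },
    "hepo": {
        "alpha_init": 0.0,        # placeholder; robust within moderate ranges
        "alpha_lr": 1e-2,         # placeholder; robust within moderate ranges
        "batch_split": 0.5,       # stated: even split between π and π_H rollouts
    },
    "evaluation": {
        "seeds": 5,                                   # stated
        "env_steps": {"IsaacGym": 3_000_000,          # stated
                      "Bi-Dexterous-Manipulation": 5_000_000},
    },
}
```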
7. Context and Implications
HEPO redefines the integration of heuristic signals in RL by prioritizing improvement over a heuristic baseline rather than only optimizing for a combined reward objective or preserving policy invariance. This strategy provides practitioners with a practical guarantee: the learned policy will not regress below the performance level of the best heuristic policy currently available, and can, in practice, acquire significantly better strategies—even when the heuristic is noisy or suboptimal.
A plausible implication is that HEPO facilitates broader adoption of RL in domains where heuristic engineering is expensive or error-prone, as it demonstrably reduces the cost and expertise required for reward design. Furthermore, the HEPO framework is compatible with standard RL pipeline implementations and is agnostic to the source and fidelity of the heuristic reward, making it widely applicable within both academic research and industry practice (Lee et al., 7 Jul 2025).