- The paper introduces a dual-objective RL framework via explicit HJB formulations that decompose value functions for Reach-Always-Avoid and Reach-Reach tasks.
- The methodology leverages state augmentation to track historical best rewards and worst penalties, ensuring deterministic optimal policies.
- The proposed DO-HJ-PPO algorithm improves performance over baselines by achieving higher success rates and fewer steps in complex, multi-objective tasks.
This paper introduces a formal approach to dual-objective reinforcement learning (RL) problems through explicit Hamilton-Jacobi-Bellman (HJB) formulations. The focus is on two classes of objectives: Reach-Always-Avoid (RAA) and Reach-Reach (RR). The work establishes new theoretical constructions, value function decompositions, and corresponding Bellman equations, and presents an efficient Proximal Policy Optimization variant (DO-HJ-PPO) capable of solving these problems more reliably and robustly than prior baselines, particularly in the context of safety-critical and multi-goal RL.
Motivation and Context
Traditional approaches such as Constrained Markov Decision Processes (CMDPs), Multi-Objective RL, and Temporal Logic-based RL have sought to address problems involving safety and liveness or the satisfaction of complex task specifications. However, these approaches often suffer from practical intractability, intricate reward engineering, lack of explicit solution structure, and difficulties in simultaneous, worst-case satisfaction of multiple objectives.
Recent work connecting HJ reachability analysis and RL has provided tools for directly encoding reach and avoid tasks via non-standard value functions, but prior formulations have mostly been limited to "single-objective" variants: reach, avoid, and reach-avoid. This paper generalizes these to dual-objective settings by rigorously deriving tractable value function decompositions and corresponding Bellman equations.
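For reference, the single-objective backups mentioned above admit simple tabular forms. The sketch below is a minimal illustration under our own assumptions (a finite MDP with a deterministic successor table `P[s, a]`, per-state vectors `r` and `q`, and undiscounted fixed-point iteration), not code from the paper:

```python
import numpy as np

# Illustrative tabular HJ-style backups (not the paper's code).
# P: (num_states, num_actions) integer table of successor states.
# r: per-state reward vector; q: per-state constraint vector (q = -penalty).

def reach_value(P, r, n_iters=200):
    # Fixed point of V(s) = max(r(s), max_a V(P[s, a])): best achievable max_t r.
    V = r.copy()
    for _ in range(n_iters):
        V = np.maximum(r, np.max(V[P], axis=1))
    return V

def avoid_value(P, q, n_iters=200):
    # Fixed point of A(s) = min(q(s), max_a A(P[s, a])): best achievable min_t q.
    A = q.copy()
    for _ in range(n_iters):
        A = np.minimum(q, np.max(A[P], axis=1))
    return A

def reach_avoid_value(P, r, q, n_iters=200):
    # Fixed point of V(s) = min(q(s), max(r(s), max_a V(P[s, a]))).
    V = np.minimum(r, q)
    for _ in range(n_iters):
        V = np.minimum(q, np.maximum(r, np.max(V[P], axis=1)))
    return V
```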
The two principal problem classes are:
- Reach-Always-Avoid (RAA): Maximize the minimum of the best reward encountered and the worst penalty incurred along the trajectory.
- Reach-Reach (RR): Maximize the lesser of two best-over-time rewards, ensuring both are attained.
Concretely, the objectives are:
- RAA: maximize min { maxₜ r(sₜ), minₜ q(sₜ) }
- RR: maximize min { maxₜ r₁(sₜ), maxₜ r₂(sₜ) }
where q(s) = –p(s) for "penalty" p, and r(s), r₁(s), r₂(s) are potentially distinct reward functions.
Unlike more common sum-of-discounted-reward objectives, these definitions require policies that reason about the running extrema (max/min) of rewards and penalties across a trajectory. Critically, the paper demonstrates that state augmentation, i.e., tracking the best reward and worst penalty seen so far, provides a minimal sufficient statistic for optimal control, rendering full trajectory histories and stochastic policies unnecessary.
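As a quick illustration of why the running extrema suffice, the snippet below computes the RAA and RR outcomes of a hypothetical trajectory both directly from the full history and recursively from the augmented pair (y, z); the trajectory values and variable names are invented for the example:

```python
import numpy as np

rewards   = np.array([-1.0, 0.5, 2.0, 1.0])   # r(s_t) along a hypothetical trajectory
penalties = np.array([ 3.0, 1.5, 0.8, 2.0])   # q(s_t) = -p(s_t)
rewards2  = np.array([ 0.0, 0.2, 0.1, 1.5])   # r2(s_t), for the RR objective

# Whole-trajectory definitions of the two objectives
raa_direct = min(rewards.max(), penalties.min())   # RAA outcome
rr_direct  = min(rewards.max(), rewards2.max())    # RR outcome

# Recursive computation via the augmented statistics (y, z)
y, z = -np.inf, np.inf           # best reward so far, worst penalty so far
for r_t, q_t in zip(rewards, penalties):
    y, z = max(y, r_t), min(z, q_t)

assert min(y, z) == raa_direct   # the pair (y, z) is a sufficient statistic for RAA
```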
Theoretical Contributions
The main theoretical results can be summarized as follows:
1. Value Function Decomposition
Both RAA and RR problems admit constructive decompositions into a sequence of simpler sub-problems—reach, avoid, and reach-avoid. In particular, the principal results establish that:
- RAA: Can be solved via a preliminary avoid problem (optimal value for q), followed by a reach-avoid problem using the minimum of r and the avoid value as the effective reward.
- RR: Can be solved via two reach subproblems to find the individual optimal reach values, together with a final reach problem over a composed reward.
This decomposition enables the construction of explicit, tractable Bellman equations (and their discounted variants) for these tasks.
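To make the RAA recipe concrete, the sketch below chains the tabular backups from the earlier illustration (it assumes `avoid_value` and `reach_avoid_value` from that snippet are in scope). Keeping q as the avoid constraint in the second stage is our reading of the construction rather than a quoted detail; the analogous RR composition (two reach passes plus a final reach over a composed reward) is omitted here because the exact composed reward is specified in the paper, not reproduced in this summary.

```python
import numpy as np

def raa_value(P, r, q, n_iters=200):
    """Illustrative RAA value via the stated decomposition:
    (1) solve the avoid problem for q;
    (2) solve a reach-avoid problem whose effective reward is min(r, avoid value).
    Assumes q remains the avoid constraint in stage (2)."""
    A = avoid_value(P, q, n_iters)                    # stage 1: best achievable min_t q
    r_eff = np.minimum(r, A)                          # effective reward for stage 2
    return reach_avoid_value(P, r_eff, q, n_iters)    # stage 2: reach-avoid backup
```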
2. Sufficiency of the Augmented State
The necessity of state augmentation is formalized via counterexamples and a general result: no additional historical information or stochasticity improves performance over deterministic policies defined on the augmented state (i.e., the current state together with the current best reward(s) and current worst penalty). Practically, this guarantees that optimal policies can be learned with standard RL machinery once the state is suitably augmented.
3. Explicit Bellman Equations
The decomposed value functions for RAA and RR admit explicit Bellman forms, expressible as min/max reductions over the augmented state variables, and compatible with stochastic RL algorithms by extending the reach-avoid Bellman equation to the stochastic case.
DO-HJ-PPO: An Algorithm for Dual-Objective RL
Building on these theoretical foundations, the paper introduces DO-HJ-PPO, a generalization of PPO that learns the decomposed and composed value functions in lockstep, using specialized Bellman updates tailored to the RAA and RR problem classes.
Key implementation features include:
- Multiple critic and actor heads for each decomposed objective and their composition.
- Augmented MDP encoding to ensure correct historical tracking (see the wrapper sketch after this list).
- Discounted Bellman operators and generalized advantage estimation matched to the task structure.
- Coupled environment resets and synchronized updates of the decomposed and composed value functions, keeping their on-policy estimates consistent.
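A minimal sketch of what the augmented-MDP encoding might look like is given below. The wrapper class, its `reward_fn`/`penalty_fn` arguments, and the old-style `reset()`/`step()` interface are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

class AugmentedEnv:
    """Hypothetical wrapper sketching the augmented-MDP encoding: the observation
    is extended with the running best reward y and worst penalty z, so a standard
    actor-critic can condition on the sufficient statistic."""

    def __init__(self, env, reward_fn, penalty_fn):
        self.env = env                      # wrapped environment with reset()/step()
        self.reward_fn = reward_fn          # maps state -> r(s)
        self.penalty_fn = penalty_fn        # maps state -> q(s) = -p(s)

    def reset(self):
        s = self.env.reset()
        self.y = self.reward_fn(s)          # best reward so far
        self.z = self.penalty_fn(s)         # worst penalty so far
        return np.concatenate([s, [self.y, self.z]])

    def step(self, a):
        s_next, _, done, info = self.env.step(a)
        self.y = max(self.y, self.reward_fn(s_next))    # update running best reward
        self.z = min(self.z, self.penalty_fn(s_next))   # update running worst penalty
        obs = np.concatenate([s_next, [self.y, self.z]])
        # The (y, z) pair is returned for the custom min/max backups rather than
        # a scalar summed reward; this is a design choice for the sketch.
        return obs, (self.y, self.z), done, info
```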
The pseudo-algorithm is as follows:
```python
# DO-HJ-PPO training loop (pseudocode)
for epoch in range(num_epochs):
    transitions = []
    for rollout in range(num_rollouts):
        # Sample a trajectory using the composed and decomposed policies
        s, y, z = reset_augmented_state()      # y: best reward so far, z: worst penalty so far
        for t in range(trajectory_length):
            a = composed_actor(s, y, z)
            s_next, r, q = env.step(a)
            y_next = max(y, r)                 # update running best reward
            z_next = min(z, q)                 # update running worst penalty
            transitions.append((s, y, z, a, r, q, s_next, y_next, z_next))
            s, y, z = s_next, y_next, z_next
    # Compute advantages using the custom Bellman (min/max) reductions
    advantages = compute_custom_GAE(transitions)
    # Policy and value network updates for all decomposed and composed heads
    update_policy_and_value_networks(transitions, advantages)
```
Empirical Evidence
Experiments are conducted on both grid-world scenarios—illustrating policy behaviors in discrete settings—and continuous control domains (Hopper and F16). DO-HJ-PPO is compared against CPPO and other decomposed approaches. Notably, DO-HJ-PPO consistently achieves higher task success rates and fewer steps to completion, often with a statistically significant gap. In particular:
- On RR tasks, CPPO variants frequently fail to solve any trajectories, while DO-HJ-PPO succeeds on a majority.
- On RAA tasks, DO-HJ-PPO attains higher safe-arrival percentages.
The authors report that only minor hyperparameter adaptation, and no extensive tuning, was needed to obtain these results.
Implications and Future Directions
This work broadens the class of RL tasks amenable to principled solution via explicit Bellman reasoning, moving beyond classic safety and reward formulations into richer, compositionally complex objective spaces.
Implications include:
- Safe RL: The methods serve as a rigorous foundation for RL in domains with strict safety specifications, e.g., robotics, aviation, autonomous driving, offering a direct mechanism for multi-constraint satisfaction.
- Multi-Objective Planning: Enables scalable, decomposable learning for tasks where worst-case satisfaction across multiple objectives is critical.
- Algorithm Design: Provides a blueprint for extending RL to new complex objective classes through principled decomposition and augmentation, rather than surrogate or heuristic reformulations.
Future Developments:
- Extension to n-objective settings and generalized compositions (e.g., nested min/max combinations).
- Incorporation with model-based and offline RL, leveraging the explicit structure for robust safety guarantee under uncertainty.
- Application to high-dimensional, partially observable, or multi-agent systems with shared or conflicting safety/reward goals.
- Formal analysis of sample efficiency and convergence properties for the augmented systems in the context of large-scale function approximation.
Summary Table
| Aspect | DO-HJ-PPO (Ours) | Standard Baselines (e.g., CPPO) |
|---|---|---|
| Value structure | Explicit min/max Bellman forms | Lagrangian or surrogate constraints |
| Policy class | Deterministic (augmented state) | Stochastic, potentially suboptimal |
| Task complexity | Multi-objective, worst-case | Typically single-objective or heuristic |
| Empirical results | High success rates, fewer steps | Frequent failure, inconsistent reward |
| Implementation | PPO with augmented states, custom GAE | PPO/Lagrangian modifications, more tuning |
The results demonstrate that explicit dual-objective Bellman formulations yield concrete, tractable algorithms for safety-critical and multi-goal RL, offering improved performance and reliability over existing constrained or heuristic approaches. This marks a substantial step toward a systematic theory of compositional objectives in RL and sets the stage for broader adoption of principled Bellman-based design in safety- and mission-critical learning applications.