Dual-Objective Reinforcement Learning with Novel Hamilton-Jacobi-Bellman Formulations (2506.16016v1)

Published 19 Jun 2025 in cs.AI, cs.SY, and eess.SY

Abstract: Hard constraints in reinforcement learning (RL), whether imposed via the reward function or the model architecture, often degrade policy performance. Lagrangian methods offer a way to blend objectives with constraints, but often require intricate reward engineering and parameter tuning. In this work, we extend recent advances that connect Hamilton-Jacobi (HJ) equations with RL to propose two novel value functions for dual-objective satisfaction. Namely, we address: (1) the Reach-Always-Avoid problem - of achieving distinct reward and penalty thresholds - and (2) the Reach-Reach problem - of achieving thresholds of two distinct rewards. In contrast with temporal logic approaches, which typically involve representing an automaton, we derive explicit, tractable Bellman forms in this context by decomposing our problem into reach, avoid, and reach-avoid problems, as to leverage these aforementioned recent advances. From a mathematical perspective, the Reach-Always-Avoid and Reach-Reach problems are complementary and fundamentally different from standard sum-of-rewards problems and temporal logic problems, providing a new perspective on constrained decision-making. We leverage our analysis to propose a variation of Proximal Policy Optimization (DO-HJ-PPO), which solves these problems. Across a range of tasks for safe-arrival and multi-target achievement, we demonstrate that DO-HJ-PPO produces qualitatively distinct behaviors from previous approaches and out-competes a number of baselines in various metrics.

Summary

  • The paper introduces a dual-objective RL framework via explicit HJB formulations that decompose value functions for Reach-Always-Avoid and Reach-Reach tasks.
  • The methodology leverages state augmentation to track the historical best rewards and worst penalties, showing that deterministic policies on the augmented state suffice for optimality.
  • The proposed DO-HJ-PPO algorithm improves performance over baselines by achieving higher success rates and fewer steps in complex, multi-objective tasks.

Dual-Objective RL via Hamilton-Jacobi-Bellman Formulations: A Formal Perspective

This paper introduces a formal approach to dual-objective reinforcement learning (RL) problems through explicit Hamilton-Jacobi-Bellman (HJB) formulations. The focus is on two classes of objectives: Reach-Always-Avoid (RAA) and Reach-Reach (RR). The work establishes new theoretical constructions, value function decompositions, and corresponding Bellman equations, and presents an efficient Proximal Policy Optimization variant (DO-HJ-PPO) capable of solving these problems more reliably and robustly than prior baselines, particularly in the context of safety-critical and multi-goal RL.

Motivation and Context

Traditional approaches such as Constrained Markov Decision Processes (CMDPs), Multi-Objective RL, and Temporal Logic-based RL have sought to address problems involving safety and liveness or the satisfaction of complex task specifications. However, these approaches often suffer from practical intractability, intricate reward engineering, lack of explicit solution structure, and difficulties in simultaneous, worst-case satisfaction of multiple objectives.

Recent work connecting HJ reachability analysis and RL has provided tools for directly encoding reach and avoid tasks via non-standard value functions, but prior formulations have mostly been limited to "single-objective" variants: reach, avoid, and reach-avoid. This paper generalizes these to dual-objective settings by rigorously deriving tractable value function decompositions and corresponding Bellman equations.

Formal Problem Settings

The two principal problem classes are:

  • Reach-Always-Avoid (RAA): Maximize the minimum of the best reward encountered and the worst penalty value incurred during the trajectory.
  • Reach-Reach (RR): Maximize the lesser of two best-over-time rewards, ensuring both are attained.

Concretely, the objectives are:

  • RAA: maximize min { maxₜ r(sₜ), minₜ q(sₜ) }
  • RR: maximize min { maxₜ r₁(sₜ), maxₜ r₂(sₜ) }

where q(s) = –p(s) for "penalty" p, and r(s), r₁(s), r₂(s) are potentially distinct reward functions.

Unlike more common sum-of-discounted-reward objectives, these definitions require policies that reason about running extrema (max/min) over the trajectory. Critically, the paper demonstrates that state augmentation, i.e., tracking the best reward and worst penalty so far, provides the minimal sufficient statistic for optimal control, rendering full trajectory histories and stochastic policies unnecessary.
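
To make the objectives concrete, the following minimal sketch (with illustrative reward and penalty callables r, q, r1, r2, which are assumptions rather than the paper's code) computes both outcomes for a recorded list of states and shows that only the running best-reward and worst-penalty statistics are needed:

def raa_outcome(states, r, q):
    y, z = float("-inf"), float("inf")
    for s in states:
        y = max(y, r(s))   # best reward encountered so far
        z = min(z, q(s))   # worst value of q = -p encountered so far
    return min(y, z)       # RAA objective: min{ max_t r(s_t), min_t q(s_t) }

def rr_outcome(states, r1, r2):
    y1 = max(r1(s) for s in states)   # best r1 over the trajectory
    y2 = max(r2(s) for s in states)   # best r2 over the trajectory
    return min(y1, y2)                # RR objective: min{ max_t r1(s_t), max_t r2(s_t) }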

Theoretical Contributions

The main theoretical results can be summarized as follows:

1. Value Function Decomposition

Both RAA and RR problems admit constructive decompositions into a sequence of simpler sub-problems—reach, avoid, and reach-avoid. In particular, the principal results establish that:

  • RAA: Can be solved via a preliminary avoid problem (optimal value for q), followed by a reach-avoid problem using the minimum of r and the avoid value as the effective reward.
  • RR: Can be solved via two reach subproblems to find the individual optimal reach values, together with a final reach problem over a composed reward.

This decomposition enables the construction of explicit, tractable Bellman equations (and their discounted variants) for these tasks.
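
As an illustration of how the decomposition can be exercised, the tabular value-iteration sketch below uses the standard undiscounted reach, avoid, and reach-avoid backups from prior HJ-RL work under deterministic dynamics. The array next_state[s, a] (successor-state indices), the helper names, and the use of min(r, avoid value) as the composed reward are assumptions for illustration rather than the paper's exact construction; the analogous RR composition is omitted:

import numpy as np

def iterate(backup, V, iters=500):
    # Repeatedly apply the Bellman backup until (approximately) a fixed point.
    for _ in range(iters):
        V = backup(V)
    return V

def reach_value(r, next_state):
    # Reach backup: V(s) = max( r(s), max_a V(s') ), the best reward attainable from s.
    return iterate(lambda V: np.maximum(r, V[next_state].max(axis=1)), r.copy())

def avoid_value(q, next_state):
    # Avoid backup: V(s) = min( q(s), max_a V(s') ), the best achievable worst-case q over time.
    return iterate(lambda V: np.minimum(q, V[next_state].max(axis=1)), q.copy())

def reach_avoid_value(r, q, next_state):
    # Reach-avoid backup: V(s) = min( q(s), max( r(s), max_a V(s') ) ).
    return iterate(
        lambda V: np.minimum(q, np.maximum(r, V[next_state].max(axis=1))),
        np.minimum(r, q),
    )

def raa_value(r, q, next_state):
    # RAA decomposition described above: solve the avoid problem for q first,
    # then a reach-avoid problem whose effective reward is min(r, avoid value).
    A = avoid_value(q, next_state)
    return reach_avoid_value(np.minimum(r, A), q, next_state)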

2. Augmented MDP Formulation

The necessity of state augmentation is formalized via counterexamples and a general result: no further historical information or stochasticity improves performance over deterministic policies built on the augmented state (i.e., the current state, the current best reward(s), and the current worst penalty). Practically, this guarantees that optimal policies can be learned using standard RL machinery upon suitable augmentation.
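
A minimal sketch of such an augmentation is given below, assuming a hypothetical environment interface in which reset() returns a state and step(a) returns (next_state, done), with reward and penalty evaluated by user-supplied functions; none of these names come from the paper's code:

class DualObjectiveAugmentation:
    # Wraps an environment so that the observation carries the running best
    # reward y and the running worst penalty value z, as described above.
    def __init__(self, env, reward_fn, penalty_fn):
        self.env = env
        self.r = reward_fn       # r(s)
        self.q = penalty_fn      # q(s) = -p(s)

    def reset(self):
        s = self.env.reset()
        self.y = self.r(s)       # best reward so far
        self.z = self.q(s)       # worst penalty value so far
        return (s, self.y, self.z)

    def step(self, a):
        s, done = self.env.step(a)
        self.y = max(self.y, self.r(s))
        self.z = min(self.z, self.q(s))
        return (s, self.y, self.z), done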

3. Closed-form Bellman Equations

The decomposed value functions for RAA and RR admit explicit Bellman forms, expressible as min/max reductions over augmented state variables, and compatible with stochastic RL algorithms by extension of the reach-avoid Bellman equation to the stochastic case.
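
As a purely illustrative instance of such a min/max reduction (a finite-horizon, deterministic sketch under dynamics s' = f(s, a), not the paper's exact discounted or stochastic form), an RAA backup over the augmented state (s, y, z), with y and z initialized to r(s_0) and q(s_0), can be written as

    V_0(s, y, z) = \min\{y, z\},
    V_{k+1}(s, y, z) = \max_a V_k\big(f(s, a),\; \max\{y, r(f(s, a))\},\; \min\{z, q(f(s, a))\}\big),

so that V_k evaluated at the initial augmented state gives the optimal RAA outcome over a horizon of k steps, using only the two scalar running statistics.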

DO-HJ-PPO: An Algorithm for Dual-Objective RL

Based on the above theoretical foundations, the paper advances DO-HJ-PPO, a generalization of PPO that directly learns decomposed and composed value functions in lockstep, leveraging specialized Bellman updates corresponding to the RAA and RR problem classes.

Key implementation features include:

  • Multiple critic and actor heads for each decomposed objective and their composition.
  • Augmented MDP encoding to ensure correct historical tracking.
  • Discounted Bellman operators and generalized advantage estimation matching the task structure.
  • Coupled resets and synchronized learning of decomposed/composed value functions, addressing on-policy estimation.

The pseudo-algorithm is as follows:

for epoch in range(num_epochs):
    transitions = []                                     # on-policy buffer for this epoch
    for rollout in range(num_rollouts):
        # Sample a trajectory using the composed and decomposed policies
        s, y, z = reset_augmented_state()                # y: best reward so far, z: worst penalty value so far
        for t in range(trajectory_length):
            a = composed_actor(s, y, z)
            s_next, r, q = env.step(a)
            y_next = max(y, r)                           # update running best reward
            z_next = min(z, q)                           # update running worst penalty value (q = -p)
            transitions.append((s, y, z, a, r, q, s_next, y_next, z_next))
            s, y, z = s_next, y_next, z_next

    # Compute advantages using the custom min/max Bellman reductions
    advantages = compute_custom_GAE(transitions)
    # Policy and value network updates for all (decomposed and composed) heads
    update_policy_and_value_networks(transitions, advantages)

Empirical Evidence

Experiments are conducted on both grid-world scenarios—illustrating policy behaviors in discrete settings—and continuous control domains (Hopper and F16). DO-HJ-PPO is compared against CPPO and other decomposed approaches. Notably, DO-HJ-PPO consistently achieves higher task success rates and fewer steps to completion, often with a statistically significant gap. In particular:

  • On RR tasks, CPPO variants frequently fail to solve any trajectories, while DO-HJ-PPO succeeds on a majority.
  • On RAA tasks, DO-HJ-PPO attains higher safe-arrival percentages.

The authors report that, beyond minor adaptation, no additional hyperparameter tuning was necessary for these results.

Implications and Future Directions

This work broadens the class of RL tasks amenable to principled solution via explicit Bellman reasoning—moving beyond classic safety and reward formulations into richer, compositionally complex spaces.

Implications include:

  • Safe RL: The methods serve as a rigorous foundation for RL in domains with strict safety specifications, e.g., robotics, aviation, autonomous driving, offering a direct mechanism for multi-constraint satisfaction.
  • Multi-Objective Planning: Enables scalable, decomposable learning for tasks where worst-case satisfaction across multiple objectives is critical.
  • Algorithm Design: Provides a blueprint for extending RL to new complex objective classes through principled decomposition and augmentation, rather than surrogate or heuristic reformulations.

Future Developments:

  • Extension to n-objective settings and generalized compositions (e.g., nested min/max combinations).
  • Integration with model-based and offline RL, leveraging the explicit structure for robust safety guarantees under uncertainty.
  • Application to high-dimensional, partially observable, or multi-agent systems with shared or conflicting safety/reward goals.
  • Formal analysis of sample efficiency and convergence properties for the augmented systems in the context of large-scale function approximation.

Summary Table

| Aspect | DO-HJ-PPO (Ours) | Standard Baselines (e.g., CPPO) |
| --- | --- | --- |
| Value Structure | Explicit min/max Bellman forms | Lagrangian or surrogate constraints |
| Policy Class | Deterministic (augmented state) | Stochastic, potentially suboptimal |
| Task Complexity | Multi-objective, worst-case | Typically single-objective or heuristic |
| Empirical Results | High success, fewer steps | Frequent failure, inconsistent reward |
| Implementation | PPO with augmented states, custom GAE | PPO/Lagrangian modifications, more tuning |

The results demonstrate that explicit dual-objective Bellman formulations yield concrete, tractable algorithms for safety-critical and multi-goal RL, offering improved performance and reliability over existing constrained or heuristic approaches. This marks a substantial step toward a systematic theory of compositional objectives in RL and sets the stage for broader adoption of principled Bellman-based design in safety- and mission-critical learning applications.
