Goal-Conditional POMDPs: Safe-Reachability & Planning
- Goal-Conditional POMDPs are decision-making models that integrate explicit goal conditions and safety constraints to guide actions under uncertainty.
- The framework incorporates bounded model checking and SMT-based synthesis to efficiently generate policies with formal guarantees on reaching goal states.
- Applications in robotic planning and information-gathering demonstrate significant improvements in computational efficiency and robust safety assurance.
A Goal-Conditional Partially Observable Markov Decision Process (POMDP) generalizes the standard POMDP framework by structuring decision-making under uncertainty around explicitly defined goal conditions. In this setting, an agent acts under state and observation uncertainty to either minimize expected cost or maximize the probability of reaching a goal state, potentially while also satisfying strict safety constraints. Recent approaches formalize these problems in goal-centric or safe-reachability settings, synthesize policies with formal guarantees across all possible observation sequences, and introduce algorithmic methods for efficient synthesis and planning (Wang et al., 2018, Shani, 2024).
1. Formal Definition and Problem Statement
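A goal-based POMDP is typically defined by the tuple
$$\langle S, A, O, T, Z, b_0, G, C \rangle,$$
where $S$ is a finite set of (hidden) world states, $A$ is the action set, $O$ is the observation set, $T(s' \mid s, a)$ and $Z(o \mid s', a)$ are the transition and observation kernels, and $b_0$ is the initial belief distribution. $G \subseteq S$ encodes the goal states (the agent's mission is complete when $s \in G$). $C(s, a)$ denotes the cost of action $a$ in state $s$ ($C(s, a) = 0$ if $s \in G$). The belief state $b$ is a probability distribution over $S$, updated after an action-observation pair $(a, o)$ by
$$b^{a,o}(s') = \frac{Z(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)}{\sum_{s'' \in S} Z(o \mid s'', a) \sum_{s \in S} T(s'' \mid s, a)\, b(s)}.$$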
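The Bayesian belief update described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited works: the dictionary layout of `T` and `Z`, the function name `belief_update`, and the two-state "listen" toy model are all assumptions made for the example.

```python
from typing import Dict, Tuple

State = str
Belief = Dict[State, float]

def belief_update(belief: Belief, action: str, obs: str,
                  T: Dict[Tuple[State, str], Dict[State, float]],
                  Z: Dict[Tuple[State, str], Dict[str, float]]) -> Belief:
    """Bayes filter: b'(s') is proportional to Z(o|s',a) * sum_s T(s'|s,a) b(s)."""
    # Prediction step: push the belief through the transition kernel.
    predicted: Belief = {}
    for s, p in belief.items():
        for s_next, pt in T[(s, action)].items():
            predicted[s_next] = predicted.get(s_next, 0.0) + p * pt
    # Correction step: weight each successor by the observation likelihood.
    posterior = {s: p * Z[(s, action)].get(obs, 0.0) for s, p in predicted.items()}
    norm = sum(posterior.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / norm for s, p in posterior.items() if p > 0.0}

# Hypothetical toy model: the hidden state never changes, and "listen"
# reports the correct side with probability 0.85.
T = {("left", "listen"): {"left": 1.0}, ("right", "listen"): {"right": 1.0}}
Z = {("left", "listen"): {"hear_left": 0.85, "hear_right": 0.15},
     ("right", "listen"): {"hear_left": 0.15, "hear_right": 0.85}}

b1 = belief_update({"left": 0.5, "right": 0.5}, "listen", "hear_left", T, Z)
```

Starting from a uniform belief, a single "hear_left" observation shifts the probability of `left` to the sensor accuracy, which is the behaviour the update formula predicts.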
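In the "safe-reachability" extension, the state space is additionally partitioned into goal states $G$ and unsafe states $U$. Two thresholds are fixed:
- $\delta_1$: maximum goal miss probability
- $\delta_2$: maximum allowed probability of visiting unsafe states
The sets
$$\mathrm{Dest} = \Big\{ b : \sum_{s \in G} b(s) > 1 - \delta_1 \Big\}, \qquad \mathrm{Safe} = \Big\{ b : \sum_{s \in U} b(s) < \delta_2 \Big\}$$
define belief states corresponding to goal "almost-certain" and safety satisfaction.
A trajectory $b_0, b_1, \ldots$ satisfies the safe-reachability objective if, for some $k$, $b_k \in \mathrm{Dest}$ and, for all $i < k$, $b_i \in \mathrm{Safe}$.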
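As a hedged illustration of the objective just described, the sketch below checks a finite belief trajectory. It assumes that the goal set contains beliefs whose goal mass exceeds one minus the goal-miss threshold, that the safe set contains beliefs whose unsafe mass is below the safety threshold, and that safety must hold at every step before the goal belief is reached. The function names and the example beliefs are hypothetical.

```python
def mass(belief, states):
    """Total probability the belief assigns to the given set of states."""
    return sum(p for s, p in belief.items() if s in states)

def satisfies_safe_reachability(trajectory, goal_states, unsafe_states,
                                delta1, delta2):
    """True if some belief is in the goal set and all earlier ones are safe."""
    for belief in trajectory:
        if mass(belief, goal_states) > 1.0 - delta1:
            return True   # reached a goal belief; every earlier belief was safe
        if mass(belief, unsafe_states) >= delta2:
            return False  # safety violated before any goal belief was reached
    return False          # the goal was never reached within this trajectory

# Hypothetical trajectories over states "a" (neutral), "g" (goal), "u" (unsafe).
ok = [{"a": 1.0}, {"a": 0.2, "g": 0.8}, {"a": 0.03, "g": 0.97}]
bad = [{"a": 1.0}, {"a": 0.5, "u": 0.5}, {"g": 1.0}]

ok_result = satisfies_safe_reachability(ok, {"g"}, {"u"}, 0.05, 0.1)
bad_result = satisfies_safe_reachability(bad, {"g"}, {"u"}, 0.05, 0.1)
```

The second trajectory ends in a goal belief but is rejected, because an intermediate belief puts too much mass on the unsafe state.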
2. Goal-Constrained Belief Space and Bounded Model Checking
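The conventional belief space $B$ is uncountably infinite. Effective policy synthesis thus requires restricting attention to a relevant subset. The "goal-constrained belief space" of horizon $h$ is defined as
$$B_{\mathcal{G}}^{h} = \Big\{ b_i \;:\; \exists\, k \le h,\ (a_1, o_1, \ldots, a_k, o_k) \text{ with } b_j = \tau(b_{j-1}, a_j, o_j),\ b_k \in \mathrm{Dest},\ b_j \in \mathrm{Safe}\ \ \forall j < k,\ \ 0 \le i \le k \Big\},$$
that is, the beliefs visited by action-observation sequences of length at most $h$ that reach a goal belief while remaining safe beforehand.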
This restriction is algorithmically realized by encoding the belief evolution and safety/liveness requirements as a first-order formula over beliefs, actions, and observations. The formula combines initial conditions, belief transitions, and a disjunction representing satisfaction of the destination and safety constraints:
$$\Phi_h = (b_0 = b_{\text{init}}) \;\wedge\; \bigwedge_{i=1}^{h} \big( b_i = \tau(b_{i-1}, a_i, o_i) \big) \;\wedge\; \bigvee_{k=0}^{h} \Big( b_k \in \mathrm{Dest} \;\wedge\; \bigwedge_{i=0}^{k-1} b_i \in \mathrm{Safe} \Big),$$
where $\tau$ is the deterministic belief-update operator. This encoding enables bounded model checking (BMC) approaches in which only sequences that satisfy safety up to the goal are enumerated, reducing the number of candidates considered by orders of magnitude (Wang et al., 2018).
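To make the bounded encoding concrete, the following sketch enumerates the action-observation sequences of bounded length that reach a goal belief while staying safe beforehand. It is a brute-force depth-bounded search written for illustration, standing in for the SMT-based bounded model checking of the cited work; it is not that method. The three-state toy model, the thresholds, and all identifiers are assumptions.

```python
def mass(belief, states):
    return sum(p for s, p in belief.items() if s in states)

def update(belief, action, obs, T, Z):
    """Deterministic belief update; returns None for impossible observations."""
    predicted = {}
    for s, p in belief.items():
        for s_next, pt in T[(s, action)].items():
            predicted[s_next] = predicted.get(s_next, 0.0) + p * pt
    posterior = {s: p * Z[s].get(obs, 0.0) for s, p in predicted.items()}
    norm = sum(posterior.values())
    if norm == 0.0:
        return None
    return {s: p / norm for s, p in posterior.items() if p > 0.0}

def bounded_goal_sequences(b0, horizon, actions, observations, T, Z,
                           goal, unsafe, delta1, delta2):
    """Collect sequences of length <= horizon that are safe until the goal."""
    found = []

    def expand(belief, prefix):
        if mass(belief, goal) > 1.0 - delta1:
            found.append(prefix)          # destination constraint satisfied
            return
        if len(prefix) == horizon or mass(belief, unsafe) >= delta2:
            return                        # out of budget, or prefix is unsafe
        for a in actions:
            for o in observations:
                nxt = update(belief, a, o, T, Z)
                if nxt is not None:
                    expand(nxt, prefix + [(a, o)])

    expand(b0, [])
    return found

# Hypothetical model: "careful" is slow but safe, "fast" may fall into a pit.
ACTIONS = ("careful", "fast")
OBSERVATIONS = ("done", "none")
T = {("start", "careful"): {"goal": 0.5, "start": 0.5},
     ("start", "fast"): {"goal": 0.8, "pit": 0.2}}
for a in ACTIONS:
    T[("goal", a)] = {"goal": 1.0}
    T[("pit", a)] = {"pit": 1.0}
# Toy assumption: the observation depends only on the resulting state.
Z = {"start": {"none": 1.0}, "pit": {"none": 1.0}, "goal": {"done": 1.0}}

sequences = bounded_goal_sequences({"start": 1.0}, 2, ACTIONS, OBSERVATIONS,
                                   T, Z, {"goal"}, {"pit"}, 0.05, 0.1)
```

Prefixes that enter an unsafe belief, such as `("fast", "none")` here, are cut off immediately and never extended, which is the pruning effect the goal-constrained belief space is meant to capture.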
3. Policy Objectives: Safe-Reachability vs. Expected-Reward Formulations
Traditional POMDPs maximize expected cumulative reward (or minimize expected cost):
$$\pi^{*} = \arg\max_{\pi}\ \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \;\Big|\; b_0, \pi \Big].$$
Safe-reachability objectives enforce
$$\exists\, k:\ b_k \in \mathrm{Dest} \ \wedge\ \forall\, i < k:\ b_i \in \mathrm{Safe}$$
on every execution. This gives worst-case guarantees unattainable by reward-maximizing policies, which can be either overly conservative (never risking rare but necessary unsafe transitions) or risky (tolerating rare catastrophic failures if they make only a small contribution to expected cost). Goal-POMDP objectives (e.g., minimize expected cost-to-goal) and safe-reachability objectives are thus not interchangeable; the latter strictly enforces probabilistic pathwise specifications (Wang et al., 2018, Shani, 2024).
4. Solution Algorithms: Bounded Policy Synthesis and RTDP-BEL
Bounded Policy Synthesis (BPS): BPS searches for a finite-horizon policy of depth $h$ that provably meets safe-reachability objectives in all branches. Using the formula $\Phi_h$, an incremental Satisfiability Modulo Theories (SMT) solver is queried to enumerate valid plans. The BPS algorithm proceeds by proposing a candidate plan, recursively expanding branches for off-nominal observations, and blocking infeasible prefixes, which yields efficient, sound policy synthesis over large belief spaces. BPS leverages the BMC encoding and solver incrementality to reduce the number of solver calls and enables synthesis with worst-case guarantees (Wang et al., 2018).
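The idea of a policy that covers every observation branch can be illustrated with a small recursive search. The sketch below is a plain depth-bounded search over policy trees, used here as a simplified stand-in for BPS: it involves no SMT solver and no incremental blocking of prefixes. The door-finding toy model, the thresholds, and all names are hypothetical.

```python
GOAL, UNSAFE = {"goal"}, {"crash"}
DELTA1, DELTA2 = 0.05, 0.1
ACTIONS = ("go_left", "go_right", "look")
OBSERVATIONS = ("see_left", "see_right", "none")

def trans(s, a):
    """Toy dynamics: going to the side with the door reaches the goal."""
    if s in ("goal", "crash") or a == "look":
        return {s: 1.0}
    return {"goal": 1.0} if s == "door_" + a[3:] else {"crash": 1.0}

def obs_prob(o, s_next, a):
    """Toy sensor: "look" reveals the door side; otherwise nothing is seen."""
    if a == "look" and s_next.startswith("door_"):
        return 1.0 if o == "see_" + s_next[5:] else 0.0
    return 1.0 if o == "none" else 0.0

def mass(belief, states):
    return sum(p for s, p in belief.items() if s in states)

def update(belief, a, o):
    predicted = {}
    for s, p in belief.items():
        for s_next, pt in trans(s, a).items():
            predicted[s_next] = predicted.get(s_next, 0.0) + p * pt
    posterior = {s: p * obs_prob(o, s, a) for s, p in predicted.items()}
    norm = sum(posterior.values())
    if norm == 0.0:
        return None  # this observation cannot occur
    return {s: p / norm for s, p in posterior.items() if p > 0.0}

def synthesize(belief, depth):
    """Return a policy tree valid on every branch, or None if none exists."""
    if mass(belief, GOAL) > 1.0 - DELTA1:
        return "stop"
    if depth == 0 or mass(belief, UNSAFE) >= DELTA2:
        return None
    for a in ACTIONS:
        branches = {}
        for o in OBSERVATIONS:
            nxt = update(belief, a, o)
            if nxt is None:
                continue
            sub = synthesize(nxt, depth - 1)
            if sub is None:
                break             # one failing branch rules out this action
            branches[o] = sub
        else:
            return (a, branches)  # every possible observation is covered
    return None

b0 = {"door_left": 0.5, "door_right": 0.5}
policy = synthesize(b0, 2)
```

With the door location unknown, moving immediately leads to an unsafe belief, so the only depth-2 policy found is to look first and then branch on the observation. Lowering the depth to 1 makes the problem infeasible.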
RTDP-BEL (Real-Time Dynamic Programming over Beliefs): RTDP-BEL is tailored to goal-POMDPs where the agent terminates on reaching a goal belief. It simulates forward trajectories from $b_0$, applying Bellman backups only along sampled paths and using admissible heuristics for initialization:
$$V(b) \leftarrow \min_{a \in A} \Big[ C(b, a) + \sum_{o \in O} \Pr(o \mid b, a)\, V(b^{a,o}) \Big],$$
where $C(b, a) = \sum_{s} b(s)\, C(s, a)$ is the expected cost under belief $b$ and $b^{a,o}$ is the belief-update. On convergence, RTDP-BEL provides a policy minimizing expected cost-to-goal, exploiting structure and focused rollouts to avoid full belief space enumeration (Shani, 2024).
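A compact sketch of the trial-based backup loop may help here. It is an illustrative simplification rather than the implementation evaluated in the cited work: the three-state chain, the assumption that each observation reveals the next state exactly, the zero heuristic (trivially admissible for non-negative costs), and every identifier are choices made for the example.

```python
import random

ACTIONS = {"move": (1.0, 0.8), "dash": (3.0, 1.0)}  # action: (cost, success prob)
NEXT = {"far": "near", "near": "goal", "goal": "goal"}

def trans(s, a):
    if s == "goal":
        return {"goal": 1.0}
    p = ACTIONS[a][1]
    out = {NEXT[s]: p}
    if p < 1.0:
        out[s] = 1.0 - p
    return out

def cost(belief, a):
    """Expected immediate cost under the belief; goal states cost nothing."""
    return sum(p * ACTIONS[a][0] for s, p in belief.items() if s != "goal")

def successors(belief, a):
    """Pairs (Pr(o | b, a), b^{a,o}); the observation reveals the next state."""
    predicted = {}
    for s, p in belief.items():
        for s_next, pt in trans(s, a).items():
            predicted[s_next] = predicted.get(s_next, 0.0) + p * pt
    return [(p, {s_next: 1.0}) for s_next, p in predicted.items() if p > 0.0]

def belief_key(belief):
    return tuple(sorted((s, round(p, 6)) for s, p in belief.items()))

def is_goal(belief):
    return belief.get("goal", 0.0) >= 1.0 - 1e-9

def rtdp_bel(b0, trials=2000, max_depth=50, seed=0, heuristic=lambda b: 0.0):
    rng = random.Random(seed)
    V = {}

    def value(b):
        # Unseen beliefs are initialised lazily from the heuristic.
        return 0.0 if is_goal(b) else V.setdefault(belief_key(b), heuristic(b))

    def q(b, a):
        return cost(b, a) + sum(p * value(nb) for p, nb in successors(b, a))

    for _ in range(trials):
        b = b0
        for _ in range(max_depth):
            if is_goal(b):
                break
            best = min(ACTIONS, key=lambda a: q(b, a))  # greedy action
            V[belief_key(b)] = q(b, best)               # Bellman backup
            probs, nexts = zip(*successors(b, best))
            b = rng.choices(nexts, weights=probs)[0]    # sample an observation
    return V, value(b0)

V, v0 = rtdp_bel({"far": 0.5, "near": 0.5})
```

In this toy chain each "move" costs 1 and succeeds with probability 0.8, so the expected cost is 1.25 per step. The value at the mixed initial belief therefore settles near 1 + 0.1 × 2.5 + 0.5 × 1.25 = 1.875, and only beliefs on sampled paths are ever stored in `V`.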
5. Admissible Heuristics for Goal-POMDPs: Delete-Relaxation and Value-of-Information
Efficient solution of goal-conditional POMDPs in challenging domains relies on informative heuristics. The belief-delete-relaxation, value-of-information heuristic constructs a relaxed planning graph in belief space, simulating positive effects of actions, modeling stochastic delays, and tracking information elimination via sensing. At each layer, actions are tested over the current support of the belief, and sensing is used to iteratively reduce the set of valid states. The procedure optimistically extracts a contingent plan whose length provides an admissible heuristic for RTDP-BEL rollouts (Shani, 2024).
The cost of this heuristic is , which is mitigated via caching on the valid set and most-likely state. In complex information-gathering domains (e.g., Maze, Wumpus), its use reduces the number of RTDP-BEL trajectories by one to two orders of magnitude, shrinking overall solve time by compared to cheaper MDP-based heuristics.
6. Empirical Evaluation and Practical Applications
Robotic Planning Case Study: BPS was evaluated in a simulated PR2 kitchen scavenging scenario with uncertain obstacles. Notable findings include:
- Synthesis time remains within seconds for up to obstacles, with incremental SMT solving yielding speedups over non-incremental approaches.
- The number of SMT calls is , vastly lower than the combinatorial explosion in the full reachable belief tree ( branches at depth 20).
- All synthesized policies provably satisfy the required probability-of-reach and probability-of-unsafe thresholds on every trajectory.
Heuristic-Guided RTDP-BEL: Across classic planning benchmarks (Localization, Logistics, Blocks, Maze, Wumpus), integrating the belief-delete-relaxation heuristic into RTDP-BEL reduces trajectory count and wall-clock solve time dramatically, especially when information-gathering is fundamental to task completion. In simple myopic-sensing domains, lightweight QMDP or hindsight maximum likelihood (HML) heuristics suffice, but complex sensing settings favor the structured heuristic (Shani, 2024).
7. Practical Considerations, Extensions, and Challenges
Embedding effective admissible heuristics and bounded symbolic encodings enables goal-conditional and safe-reachability POMDPs to scale to problems of realistic complexity. Key practical considerations include:
- Caching heuristic computations by belief support or most-likely state.
- Amortizing SMT solver costs via incremental scope management in BPS.
- For expensive heuristics or large support sizes, considering approximate belief tracking or support pruning.
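The first consideration, caching by belief support, can be sketched as a thin wrapper around a heuristic function. The sketch assumes the wrapped heuristic depends only on which states have nonzero probability, which is the property that makes a support-keyed cache valid. The class name, the `eps` cutoff, and the placeholder distance table are hypothetical.

```python
def support_key(belief, eps=1e-9):
    """States with non-negligible probability, as a hashable cache key."""
    return frozenset(s for s, p in belief.items() if p > eps)

class SupportCachedHeuristic:
    """Memoise a support-based heuristic so each support is evaluated once."""

    def __init__(self, heuristic):
        self._heuristic = heuristic  # callable taking a frozenset of states
        self._cache = {}
        self.calls = 0               # number of real (uncached) evaluations

    def __call__(self, belief):
        key = support_key(belief)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._heuristic(key)
        return self._cache[key]

# Placeholder for an expensive computation: optimistic distance to the goal.
DISTANCE = {"a": 3, "b": 1, "c": 2}

def optimistic_distance(support):
    return min(DISTANCE[s] for s in support)

h = SupportCachedHeuristic(optimistic_distance)
first = h({"a": 0.5, "b": 0.5})
second = h({"a": 0.9, "b": 0.1})   # same support, served from the cache
third = h({"a": 0.5, "c": 0.5})    # new support, evaluated once
```

Beliefs that differ only in their probabilities share one cache entry, so the number of real evaluations grows with the number of distinct supports rather than the number of beliefs visited.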
Extensions of these frameworks accommodate non-unit costs, stochastic observations, and integration with point-based POMDP planners for improved solution quality. Domains where sensing is always necessary or belief supports are massive may pose computational challenges, making alternative factored or sampled belief representations necessary (Wang et al., 2018, Shani, 2024).
In summary, goal-conditional and safe-reachability POMDPs represent a critical evolution in planning under partial observability. They provide formal methods for synthesizing robust policies with explicit trajectory-level guarantees on reaching goal states and avoiding unsafe regions, and they support efficient policy computation through symbolic bounded verification and heuristically guided dynamic programming.