Filtering Markov Decision Processes
- Filtering MDPs are modified decision processes that use explicit filtering to remove redundant, unreachable, or unsafe regions, enhancing computational tractability and safety.
- Belief filtering in POMDPs uses a threshold to deduplicate similar probability distributions, reducing sample size and speeding up value backups with modest gains in expected return.
- Structured reachability and safety filtering prune unreachable states and enforce safe actions, balancing computational efficiency with optimality in both classical and partially observable settings.
A Filtering Markov Decision Process (MDP) is any Markov decision process that has been reduced, restricted, or altered via an explicit filtering operation, typically to enhance computational tractability, enforce safety, or accelerate training. In contemporary research, "filtering" encompasses belief-set deduplication in point-based POMDP solvers, structured state-space reachability pruning in factored MDPs, and categorical safe-action enforcement through environmental wrappers. Across these paradigms, the core objective is to discard or avoid regions of the state, belief, or action spaces that are either redundant given the current planning algorithm, cannot be reached from designated initial conditions, or violate critical invariants (e.g., safety). Filtering serves as an essential tool for achieving reliable and scalable solution methods for both classical and partially observable planning problems.
1. Belief Filtering in Point-Based POMDP Solvers
For partially observable Markov decision processes (POMDPs), the most prominent filtering operation is performed within point-based approximate dynamic programming, particularly in algorithms such as PERSEUS. The solver operates in a high-dimensional belief space, where each belief is a probability distribution over latent states.
The filtering method proposed by Li and Hsu (Hsu, 2021) identifies and removes beliefs that are "similar" according to the infinity norm, $\|b - b'\|_\infty \leq \delta$, where $\delta$ is a user-supplied threshold. During belief sampling, beliefs that are within $\delta$ of any previously retained belief are discarded, reducing the set size before the computationally intensive value-backup phase. When applied to benchmark problems such as Hallway2, belief filtering with a suitably chosen $\delta$ reduced the belief set from 10,000 to approximately 3,212, resulting in a 27% wall-clock speedup and a 2.2% increase in expected return compared to a baseline without filtering.
Critically, the effectiveness of this approach is most pronounced when the raw belief ensemble contains a high density of near-duplicates—a condition commonly satisfied in structured environments. However, empirical tuning of $\delta$ is required: overly aggressive filtering can degrade policy quality by removing beliefs critical to optimal decision making. This method is applicable to point-based solvers where per-iteration complexity scales linearly with the belief set size, but is inappropriate for Monte Carlo integration schemes where diversity in samples is essential for accurate estimation (Hsu, 2021).
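As an illustration, the threshold-based deduplication step can be sketched in plain Python. The function name `filter_beliefs` and the greedy keep-first strategy are illustrative assumptions for this sketch, not the implementation from the cited work:

```python
def filter_beliefs(beliefs, delta):
    """Greedily deduplicate belief points (illustrative sketch).

    A belief is kept only if its infinity-norm distance to every
    previously kept belief exceeds `delta`; `beliefs` is an iterable
    of probability vectors (sequences of floats) over latent states.
    """
    kept = []
    for b in beliefs:
        # Infinity norm: the largest componentwise difference.
        if all(max(abs(x - y) for x, y in zip(b, k)) > delta
               for k in kept):
            kept.append(list(b))
    return kept

# Near-duplicates collapse onto their first representative:
# filter_beliefs([[0.5, 0.5], [0.51, 0.49], [0.9, 0.1]], 0.05)
# keeps [0.5, 0.5] and [0.9, 0.1], discarding the near-copy.
```

Because each incoming belief is compared against the retained set, the pass costs $O(|B|^2)$ comparisons in the worst case, which is typically dwarfed by the savings in the subsequent value-backup phase.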
2. Structured Reachability-Based Filtering in Factored MDPs
For fully observable, factored MDPs, Boutilier, Brafman, and Geib (Boutilier et al., 2013) established a family of algorithms for reachability analysis that enable structured domain filtering. Here, the state space is encoded by a vector of discrete variables $X_1, \dots, X_n$, and transitions/rewards are captured by dynamic Bayesian networks (DBNs) and decision-tree representations.
Structured reachability filtering involves two main constructs:
- The recording of reachable variable-values and exclusion constraints (binary or, more generally, $k$-ary), producing an implicit approximation $\hat{R}$ of the true reachable set $R$.
- Iterative alternation between action-level and propositional-level propagation: in each step, action effects and conflict information are used to update the set of reachable variable-values and exclusions.
Fixing the arity $k$ of exclusion-constraint propagation allows a trade-off between completeness and computational cost: larger $k$ increases precision but at a combinatorial computational penalty.
Once reachability filtering converges, MDP filtering is achieved by:
- Pruning CPTs and reward trees to remove unreachable variable-values.
- Pruning conditionals that violate exclusion constraints.
- Collapsing nodes in the DBN that are now constant due to pruning.
This yields a reduced MDP $M'$ that is at least as tractable as the original $M$ and, by construction, equivalent in its solution restricted to reachable initial conditions: $V^{\pi'}(s) = V^{\pi}(s)$ for all $s \in R$, where $\pi'$ and $\pi$ are the corresponding optimal policies for $M'$ and $M$. Thus, structured filtering via reachability can produce exact reductions, provided the reachable set approximation $\hat{R}$ is sound, i.e., $R \subseteq \hat{R}$, so that no genuinely reachable state is pruned (Boutilier et al., 2013).
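The alternating propagation described above can be caricatured, under strong simplifying assumptions, as a fixed-point computation over variable-values. This sketch omits exclusion constraints entirely (the coarsest approximation) and models each action as STRIPS-like precondition/effect sets rather than the DBN/decision-tree representations of the original algorithms:

```python
def reachable_values(initial, actions):
    """Forward-propagate the set of reachable variable-values to a
    fixed point (illustrative sketch of the coarsest approximation).

    `initial` is a set of (variable, value) pairs; each action is a
    (preconditions, effects) pair of such sets. Because value
    combinations are not tracked jointly, the result over-approximates
    the true reachable set, i.e., it is sound but incomplete.
    """
    reachable = set(initial)
    changed = True
    while changed:
        changed = False
        for pre, eff in actions:
            # An action can fire once all its precondition values
            # are individually reachable; its effects then become so.
            if pre <= reachable and not eff <= reachable:
                reachable |= eff
                changed = True
    return reachable
```

Variable-values that never enter `reachable` can be pruned from CPTs and reward trees, mirroring the pruning steps listed above; adding $k$-ary exclusion constraints would tighten the approximation at combinatorial extra cost.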
3. Safety Filtering and Filtered MDP Construction
Safety filtering defines a class of filtered MDPs in which the transition dynamics are altered via a safety filter $\phi$. This filter enforces categorical avoidance of unsafe (failure) states $F$. Given a safety-critical MDP (SC-MDP)
$$\mathcal{M} = (S, A, P, r, \gamma, F)$$
with a maximal controlled-invariant set $\Omega \subseteq S \setminus F$ and safe-action sets
$$A_{\text{safe}}(s) = \{a \in A : \Pr(s' \in \Omega \mid s, a) = 1\}, \qquad s \in \Omega,$$
a perfect filter ensures that all actions taken are mapped to safe actions, i.e., $\phi(s, a) \in A_{\text{safe}}(s)$ for any $(s, a)$, and $\phi(s, a) = a$ if $a$ is already safe.
The resulting filtered MDP $\tilde{\mathcal{M}}$ has modified dynamics and rewards: $\tilde{P}(s' \mid s, a) = P(s' \mid s, \phi(s, a))$ and $\tilde{r}(s, a) = r(s, \phi(s, a))$. Guarantees established in (Oh et al., 20 Oct 2025) are:
- Any standard RL algorithm, when trained in $\tilde{\mathcal{M}}$, yields safe trajectories with probability one.
- The filtered process preserves convergence and asymptotic optimality: the best policy in $\tilde{\mathcal{M}}$ achieves the same value as the optimal policy in the constrained (safe) policy class of $\mathcal{M}$.
- Safety and performance objectives are thus fully separable under proper filtering.
Practical implementation of the safety filter can utilize either value-based monitors (e.g., barrier certificates) or rollout-based checks, with fallback policies to guarantee safe action selection even when exact computation of $\Omega$ is infeasible.
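A minimal sketch of a least-restrictive safety filter and the corresponding environment wrapper, assuming oracles `safe_actions(s)` and `fallback(s)` are available (both hypothetical helpers for this sketch, not an API from the cited work):

```python
def make_safety_filter(safe_actions, fallback):
    """Build a filter phi: (state, action) -> action.

    `safe_actions(s)` returns the safe-action set for state s;
    `fallback(s)` must return a member of that set. Safe actions
    pass through unchanged (least-restrictive); unsafe ones are
    replaced, enforcing categorical safety.
    """
    def phi(state, action):
        if action in safe_actions(state):
            return action          # phi(s, a) = a when a is safe
        return fallback(state)     # otherwise map into A_safe(s)
    return phi


class FilteredEnv:
    """Environment wrapper realizing the filtered MDP: every agent
    action is passed through the safety filter before it reaches the
    base environment, so learned policies only ever execute safely."""

    def __init__(self, env, phi):
        self.env = env
        self.phi = phi

    def step(self, state, action):
        return self.env.step(state, self.phi(state, action))
```

Because the filter is applied inside the wrapper, the RL algorithm on the outside needs no modification, which is what makes the safety and performance objectives separable.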
4. Comparative Tabulation of Filtering Methodologies
The following table synthesizes the key attributes of belief filtering, reachability filtering, and safety filtering.
| Filter Type | Primary Domain | Filtering Principle |
|---|---|---|
| Belief-space Deduplication | Point-based POMDPs | Remove belief points within $\delta$ (infinity norm) of a retained belief |
| Structured Reachability | Factored MDPs | Prune unreachable values via $k$-ary exclusion constraints |
| Safety Filtering | SC-MDP/RL | Map actions to safe set, enforce categorical safety |
Filtering can occur at the level of beliefs (information states), explicit state variables, or action constraints, and may be realized through sampling/deduplication, graph-based propagation, or environment wrappers.
5. Complexity–Quality Trade-offs and Practical Impact
Across methodologies, filtering introduces an explicit trade-off between computational tractability and retained solution quality:
- Belief filtering converts a point-based backup over the full sampled belief set $B$ into one over the filtered set $\tilde{B}$ with $|\tilde{B}| < |B|$ beliefs, often with empirical improvements in policy quality due to the elimination of spurious near-duplicates (Hsu, 2021).
- The cost of structured reachability analysis is parametrized by the exclusion-constraint arity $k$: while binary constraints remain tractable for large models, higher-arity constraints provide more precise MDP reductions at greater cost (Boutilier et al., 2013).
- Safety filtering, when least-restrictive, introduces negligible computational overhead but provides provable guarantees of safety with no asymptotic cost in optimality, fundamentally altering the perceived trade-off in safe RL (Oh et al., 20 Oct 2025).
All methods require careful calibration of filtering parameters (thresholds, constraint sizes, or approximation quality) to avoid over- or under-pruning relevant regions of the domain.
6. Limitations, Best-Case Conditions, and Integrations
Filtering methods are most effective when the following conditions are satisfied:
- The domain inherently generates a high density of redundant or unreachable elements (states, beliefs, actions), as in structured POMDPs or large factored MDPs.
- The filtering method aligns with the solver’s architecture (e.g., point-based backup in the case of belief-space filtering, or DBN representations in structured reachability).
- For safety filtering, the maximal safe set $\Omega$ exists (or is accurately approximated) and a measurable perfect filter $\phi$ is available.
These filters integrate seamlessly with abstraction and aggregation methods:
- After reachability filtering, irrelevant variables can be removed (static abstraction), and value/policy iteration can exploit the reduced state space.
- Filtering can be used as a pre-processing step or as an online environmental wrapper to yield MDPs that are smaller or categorically safe, respectively.
A plausible implication is that advances in efficient approximate filtering (both in belief and structured space), as well as improved constructions of action-level filters, will remain central to the scalability and reliability of MDP- and RL-based planning systems.