
Safety Filtering in Reinforcement Learning

Updated 7 March 2026
  • Safety filtering reinforcement learning is a method that integrates runtime and training-time safety constraints to ensure agents avoid critical failures.
  • It leverages techniques such as control barrier functions, predictive safety filters, and learned safety critics to project unsafe actions onto safe sets.
  • When embedded in RL training, safety filters enhance learning stability, sample efficiency, and transferability while preserving performance.

Safety filtering reinforcement learning refers to the augmentation of reinforcement learning (RL) algorithms with runtime or training-time mechanisms that enforce safety constraints by minimally modifying agent actions or trajectories to prevent critical failures. This approach enables RL to be applied in safety-critical domains—such as autonomous driving, robotics, and industrial control—by providing hard or probabilistic guarantees that the system remains within pre-specified safe sets, even during exploration or policy optimization. Techniques for safety filtering span analytically certified filters (e.g., control barrier functions), predictive control layers, learned critics estimating risk, demonstration-driven trajectory rejection, and modular filter-architecture paradigms.

1. Formalization and Theoretical Guarantees

The safety-filtering paradigm is formally grounded in the notion of a safety-critical Markov decision process (SC-MDP), in which a failure set $\mathcal{F} \subset \mathcal{S}$ is specified and admissible policies must satisfy $\Pr[s_t \notin \mathcal{F},\ \forall t \mid s_0] = 1$ for all allowed initial conditions. A safety filter is defined as a deterministic or stochastic map $\phi: \mathcal{S} \times \mathcal{A} \to \mathcal{A}$ that projects arbitrary candidate actions onto the safe action set $\mathcal{A}_{\rm safe}(s)$ at state $s$—the set of controls whose support remains entirely within the maximal controlled-invariant set $\Omega^*$ (the largest forward-invariant safe set excluding failures).

The key theoretical result is the separation between safety and performance: for any RL (or control) algorithm wrapped with a sufficiently permissive safety filter in a filtered MDP $\mathcal{M}_\phi$, the learning dynamics, convergence rates, and asymptotic optimal return coincide with those of the best safe policy for the original SC-MDP. Thus, the use of such filters enforces categorical (probability-1) safety during both training and deployment without inherent loss of task performance, provided the filter is not pathologically restrictive (Oh et al., 20 Oct 2025). All existing RL convergence and regret bounds carry over to the filtered environment.
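
As a concrete illustration, the following sketch forms the filtered MDP $\mathcal{M}_\phi$ by wrapping an environment with a filter map, using a Gymnasium-style interface; the callable `phi` is a hypothetical stand-in for any of the concrete filters discussed in the sections below.

```python
import gymnasium as gym
import numpy as np

class FilteredEnv(gym.Wrapper):
    """Filtered MDP M_phi: the agent interacts with this wrapper exactly as
    with the raw environment, so standard convergence and regret analyses
    carry over unchanged."""

    def __init__(self, env, phi):
        super().__init__(env)
        self.phi = phi        # safety filter: (state, action) -> safe action
        self._obs = None

    def reset(self, **kwargs):
        self._obs, info = self.env.reset(**kwargs)
        return self._obs, info

    def step(self, action):
        safe_action = self.phi(self._obs, action)   # project onto A_safe(s)
        self._obs, reward, terminated, truncated, info = self.env.step(safe_action)
        # Expose how far the filter moved the action, for logging/diagnostics.
        info["filter_correction"] = float(
            np.linalg.norm(np.asarray(safe_action) - np.asarray(action)))
        return self._obs, reward, terminated, truncated, info
```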

2. Control-Theoretic Safety Filters: Control Barrier Functions

Control barrier functions (CBFs) are a principled class of safety filters for continuous and hybrid dynamical systems, formalizing the safe set $\mathcal{C} = \{x \in \mathbb{R}^n \mid h(x) \ge 0\}$ for a smooth function $h(x)$. For control-affine systems $\dot{x} = f(x) + g(x)u$, safety is enforced by solving a quadratic program (QP) that projects the RL-proposed action $u_{\rm RL}$ to the nearest $u$ satisfying the CBF constraint:

$$L_f h(x) + L_g h(x)\,u + \alpha\bigl(h(x)\bigr) \ge 0, \qquad u_{\min} \le u \le u_{\max},$$

where $\alpha$ is a class-$\mathcal{K}$ function. For higher relative-degree safety constraints, exponential CBF (ECBF) recursions are used. If the QP is infeasible due to actuator saturation or unreachable constraints, slack variables or fallback strategies (such as saturating the control and simulating the safety set evolution) are employed to preserve feasibility (Hailemichael et al., 2023).
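
A minimal sketch of this projection as a cvxpy QP, assuming the caller supplies the Lie derivatives $L_f h(x)$ and $L_g h(x)$, the barrier value $h(x)$, and a linear class-$\mathcal{K}$ gain; the fallback on infeasibility mirrors the saturation strategy described above.

```python
import cvxpy as cp
import numpy as np

def cbf_qp_filter(u_rl, Lf_h, Lg_h, h_x, u_min, u_max, alpha_gain=1.0):
    """Project u_rl onto {u : L_f h + L_g h u + alpha*h >= 0, u_min <= u <= u_max}."""
    u = cp.Variable(len(u_rl))
    problem = cp.Problem(
        cp.Minimize(cp.sum_squares(u - u_rl)),       # minimal modification
        [Lf_h + Lg_h @ u + alpha_gain * h_x >= 0,    # CBF condition
         u >= u_min, u <= u_max])                    # actuator limits
    problem.solve()
    if problem.status not in ("optimal", "optimal_inaccurate"):
        # QP infeasible (e.g., actuator saturation): saturate as a fallback.
        return np.clip(u_rl, u_min, u_max)
    return u.value
```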

CBF-based filters are guaranteed to ensure forward invariance of $\mathcal{C}$ at all times, yielding provable runtime guarantees such as no collisions in RL-based adaptive cruise control systems. In complex settings with uncertainties, disturbance-observer-augmented CBFs or robust CBF variants further tighten the constraints using indicator or worst-case disturbance bounds (Cheng et al., 2022, Emam et al., 2021).

3. Model-Free and Data-Driven Safety Critics

Safety filtering can be achieved model-free by learning an explicit safety critic $Q_{\rm safe}$, which estimates the discounted probability of catastrophic failure under a candidate policy. The safety critic is parameterized and trained via temporal-difference recursion, often using a truncated discount factor ($\gamma_{\rm safe} \ll 1$) to focus on near-term risk. The resulting safety filter masks out any actions which, under the current critic, lead to failure probabilities exceeding a threshold $\epsilon$. This yields a policy-projection operator that ensures, under technical coverage assumptions, that filtered trajectories never violate the prescribed safety budget (Srinivasan et al., 2020).

Recent advances show that model-free Q-learning with carefully designed reward shaping, penalizing entry to unsafe or irrecoverable regions with large negative rewards, produces a safety critic whose value is positive inside the safe set and negative elsewhere. The action filter then blocks any action with $Q_{\rm safe}(x, a)$ below a theoretically derived threshold, allowing seamless combination with arbitrary RL agents without requiring model knowledge or retraining of the nominal control policy (Sue et al., 2024).
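
For a discrete action space, the filter reduces to masking. A minimal sketch under these assumptions: `q_safe` stands in for any trained safety critic, and the threshold is exposed as a plain parameter rather than the theoretically derived value.

```python
import numpy as np

def admissible_actions(q_safe, state, n_actions, threshold=0.0):
    """Indices of actions the safety critic deems admissible."""
    values = np.array([q_safe(state, a) for a in range(n_actions)])
    admissible = np.flatnonzero(values >= threshold)
    if admissible.size == 0:
        # No action clears the threshold: fall back to the least unsafe one.
        admissible = np.array([int(np.argmax(values))])
    return admissible

def safe_greedy_action(q_task, q_safe, state, n_actions, threshold=0.0):
    """Greedy task action restricted to the safety-filtered set."""
    admissible = admissible_actions(q_safe, state, n_actions, threshold)
    task_values = np.array([q_task(state, a) for a in admissible])
    return int(admissible[np.argmax(task_values)])
```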

4. Predictive, Optimization-Based, and Modular Filters

Predictive safety filters (PSF), inspired by model-predictive control (MPC), use system models (analytical or data-driven) to solve a constrained optimization—typically over a receding horizon—to ensure future state and input constraints are satisfied under model and uncertainty bounds. The PSF solves, at every step, for a minimally modified action sequence closest to the proposed RL trajectory but strictly avoiding violation of safety sets and enforcing terminal invariance constraints. PSFs can accommodate model uncertainty, can update beliefs as new data is collected, and can be wrapped "outside" any RL loop for out-of-the-box integration (Wabersich et al., 2018, Selim et al., 2022, Vaaler et al., 2023).
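
A minimal PSF sketch for known linear dynamics $x_{k+1} = A x_k + B u_k$ with box constraints, again via cvxpy; the terminal box standing in for a certified invariant set is a simplifying assumption, and a real implementation would switch to a backup safe controller whenever the solve fails.

```python
import cvxpy as cp

def predictive_safety_filter(u_rl, x0, A, B, N, x_bound, u_bound, x_term):
    """Return the first input of the minimally modified safe sequence, or None."""
    n, m = B.shape
    x = cp.Variable((N + 1, n))
    u = cp.Variable((N, m))
    constraints = [x[0] == x0]
    for k in range(N):
        constraints += [x[k + 1] == A @ x[k] + B @ u[k],  # model rollout
                        cp.abs(x[k + 1]) <= x_bound,      # state constraints
                        cp.abs(u[k]) <= u_bound]          # input constraints
    constraints += [cp.abs(x[N]) <= x_term]               # terminal invariance
    problem = cp.Problem(cp.Minimize(cp.sum_squares(u[0] - u_rl)), constraints)
    problem.solve()
    ok = problem.status in ("optimal", "optimal_inaccurate")
    return u[0].value if ok else None   # None -> engage backup controller
```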

Modular filter architectures, such as discriminating hyperplane filters, learn a parameterized affine map $(a(x), b(x))$ at each state defining a halfspace of admissible actions and project RL-proposed actions via convex QP. These can be learned via supervised learning from trajectories or reinforcement learning, enabling task-agnostic, reusable filters that generalize across new policies or reward functions without re-training (Lavanakul et al., 2024). This modularization decouples safety and performance, increasing code and model reusability.
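
Since the admissible set at each state is a single halfspace, the convex QP projection has a closed form; a sketch in which `a_net` and `b_net` are hypothetical stand-ins for the learned affine map.

```python
import numpy as np

def hyperplane_filter(u_rl, x, a_net, b_net):
    """Project u_rl onto the admissible halfspace {u : a(x)^T u + b(x) >= 0}."""
    a, b = a_net(x), b_net(x)
    residual = a @ u_rl + b
    if residual >= 0:
        return u_rl                           # already admissible
    # Closed-form Euclidean projection onto the halfspace boundary.
    return u_rl - (residual / (a @ a)) * a
```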

5. Training-Time Safety Filtering and Learning Dynamics

Embedding safety filters during RL training—rather than applying them only at deployment—significantly improves learning stability, safety, and sample efficiency. Key mechanisms include: (a) filtering all training actions through the safety filter, so that the agent always experiences the safety-enforced dynamics; (b) shaping the learning reward with a penalty on filter interventions to encourage the policy to internalize constraint avoidance; and (c) ensuring safe episode resets (starting only from safe initial states certified by the filter). Such integration prevents "over-reliance" on the filter (which can lead to chattering or policies that operate at the safety margin), accelerates convergence, and yields smoother control (Bejarano et al., 2024, Yang et al., 16 Oct 2025).
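
A sketch combining the three mechanisms in a single Gymnasium-style wrapper; `phi` and `is_safe_initial_state` are hypothetical user-supplied callables, and the intervention penalty is a plain norm of the filter's correction.

```python
import gymnasium as gym
import numpy as np

class TrainingTimeFilter(gym.Wrapper):
    def __init__(self, env, phi, is_safe_initial_state, penalty_weight=1.0):
        super().__init__(env)
        self.phi = phi
        self.is_safe_initial_state = is_safe_initial_state
        self.penalty_weight = penalty_weight
        self._obs = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        while not self.is_safe_initial_state(obs):   # (c) safe episode resets
            obs, info = self.env.reset(**kwargs)
        self._obs = obs
        return obs, info

    def step(self, action):
        safe_action = self.phi(self._obs, action)    # (a) filter every action
        self._obs, reward, terminated, truncated, info = self.env.step(safe_action)
        # (b) penalize interventions so the policy internalizes the constraint.
        correction = float(np.linalg.norm(np.asarray(safe_action) - np.asarray(action)))
        reward -= self.penalty_weight * correction
        return self._obs, reward, terminated, truncated, info
```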

For CBF-based filters, training-time filtering and CBF-based reward shaping can be combined, and closed-form projection can be used in high-throughput simulators. This hybrid approach enables the RL policy to internalize safety requirements, so that at deployment the filter becomes largely inactive (Yang et al., 16 Oct 2025).
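
For a single affine CBF constraint (box bounds omitted), the projection has the closed form $u^* = u_{\rm RL} - \min(r, 0)\, L_g h / \lVert L_g h \rVert^2$ with residual $r = L_f h + L_g h\, u_{\rm RL} + \alpha h$, which vectorizes over a batch of simulator states; a sketch under those assumptions:

```python
import numpy as np

def cbf_projection_batched(u_rl, Lf_h, Lg_h, h, alpha_gain=1.0):
    """Closed-form CBF projection. Shapes: u_rl, Lg_h: (B, m); Lf_h, h: (B,)."""
    residual = Lf_h + np.einsum("bm,bm->b", Lg_h, u_rl) + alpha_gain * h
    step = np.minimum(residual, 0.0) / np.maximum(
        np.einsum("bm,bm->b", Lg_h, Lg_h), 1e-12)    # active only if violated
    return u_rl - step[:, None] * Lg_h
```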

6. Data-Driven, Trajectory-Based, and Demonstration Filtering

Safety filtering can also leverage data-driven or demonstration-based methods:

  • Confidence-based filters use probabilistic or set-based system identification (e.g., Gaussian processes or regression) to compute high-probability confidence intervals on the system's next state, and only allow actions whose outcomes remain safe for every model consistent with those intervals, typically via robust or pessimistic optimization (Curi et al., 2022).
  • Demonstration-based filters (e.g., DTW-based episode filtering) maintain archives of safe and unsafe trajectories. The current agent trajectory is compared at runtime via dynamic time warping to both sets; if the trajectory aligns closer to unsafe demonstrations, it is blocked or terminated, and a severe penalty is injected into the RL buffer. Such methods significantly reduce catastrophic failures during learning, and are compatible with both off-policy and on-policy agents (Correia et al., 2023). A sketch of the DTW comparison appears after this list.
  • Latent-space safety filtering applies reachability- or CBF-style certificates in a learned representation, often using smoothness-promoting penalty terms on the margin function to address the lack of gradient informativeness observed with saturated classifier-derived safety boundaries (Nakamura et al., 23 Nov 2025).
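
A minimal sketch of the DTW comparison underlying demonstration-based episode filtering; archives are lists of (T, d) state trajectories, and the textbook O(T^2) dynamic program is used for clarity.

```python
import numpy as np

def dtw_distance(traj_a, traj_b):
    """Classic dynamic-time-warping distance between two (T, d) trajectories."""
    n, m = len(traj_a), len(traj_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(traj_a[i - 1] - traj_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def episode_is_unsafe(trajectory, safe_archive, unsafe_archive):
    """Flag the episode when it aligns more closely with unsafe demonstrations."""
    d_safe = min(dtw_distance(trajectory, t) for t in safe_archive)
    d_unsafe = min(dtw_distance(trajectory, t) for t in unsafe_archive)
    return d_unsafe < d_safe
```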

7. Empirical Performance and Domain-Specific Impact

Across diverse domains—autonomous vehicles, marine navigation, drone flight, robotic locomotion, and simulated control benchmarks—safety-filtered RL has achieved:

  • Strict safety guarantees: zero-collision training and deployment across test suites, e.g., adaptive cruise control, marine navigation, or Safety Gymnasium (Hailemichael et al., 2023, Vaaler et al., 2023, Oh et al., 20 Oct 2025).
  • Minimal performance loss: properly designed filters yield identical or higher final task returns compared to unfiltered or reward-penalty baselines, with provable asymptotic optimality under sufficient permissiveness (Oh et al., 20 Oct 2025).
  • Accelerated learning: integrating safety filtering during training typically increases sample efficiency and stabilizes learning, lowering crash rates by orders of magnitude (Bejarano et al., 2024, Srinivasan et al., 2020).
  • Task-agnostic transfer: modular and decoupled filters, such as those based on discriminating hyperplanes or task-independent safe policies, support reuse across tasks and MDPs without retraining (Miret et al., 2020, Lavanakul et al., 2024).
  • Seamless combination with any RL: safety filters can be composed with any model-free or model-based learning pipeline, on- or off-policy actor-critic methods, and in both continuous and discrete-action domains.

Table: Comparison of Select Safety Filtering Approaches

Approach                  Safety Guarantee        Model Requirement     Integration
CBF-QP Filtering          Categorical, analytic   Analytical, affine    Any RL (hybrid)
Robust/Dist. Obs. CBF     Categorical, robust     Lipschitz, DOB/GP     Any RL
Q-learning Critic         Empirical (NB)          Model-free            Plug-in/Any RL
PSF (Predictive Safety)   Prob./deterministic     MPC or data-driven    Any RL
Discrim. Hyperplane       Categorical, modular    None/learned          Plug-in/Any RL
DTW Demonstrations        Empirical, episode      Safe/unsafe demos     RL-agnostic

(NB: empirical safety for Q-learning critic relies on reward shaping, see (Sue et al., 2024))

This spectrum enables researchers to select filtering strategies best matched to system structure, available models, and safety/performance tradeoffs. As RL moves increasingly into mission- and safety-critical deployments, safety filtering constitutes a foundational tool for aligning learning dynamics with hard operational constraints and risk mitigation (Hailemichael et al., 2023, Oh et al., 20 Oct 2025, Bejarano et al., 2024).
