Safety Filtering
- Safety filtering is the process of vetting control actions, using formal safety specifications and real-time interventions, to ensure that systems remain within certified safe sets.
- Modern methodologies employ control barrier functions, Hamilton–Jacobi reachability, and robust optimization to minimally modify actions while guaranteeing safety under uncertainty.
- Integration with reinforcement learning and simulation-based approaches enhances system robustness, ensuring sample-efficient training and minimal safety violations in practical applications.
Safety filtering is the algorithmic process by which control inputs, decisions, or outputs—whether generated by human agents, autonomous controllers, or learned policies—are vetted, modified, or overridden in real time to ensure compliance with formal safety specifications. Safety filters impose hard or probabilistic guarantees that the system will remain within a certified safe set, even amid uncertainty, perception limits, or adversarial conditions. Modern realizations span control barrier functions, Hamilton–Jacobi reachability, distributionally robust optimization, model-predictive safety, and data-driven approaches. Safety filtering is operationalized as an intervention layer between an unverified or performance-oriented "primary" controller and the plant or environment, minimally altering decisions only when necessary to guarantee safety.
1. Formal Definitions and Filtering Logic
Safety filtering can be formalized across continuous, discrete, deterministic, and stochastic systems:
- Run Time Assurance (RTA): At each timestep, a monitor evaluates the proposed control action $u_{\text{des}}$ given the current state $x$. If $u_{\text{des}}$ maintains safety, it is executed; otherwise, a pre-verified backup controller takes over (Hobbs et al., 2021).
- Safety-Critical MDPs: In safety-critical Markov decision processes (SC-MDPs), a filter maps any proposed action into the maximal set of safe actions, yielding a filtered policy that never violates categorical safety constraints (Oh et al., 20 Oct 2025).
- Control Barrier Function (CBF) Filters: CBFs encode forward invariance via inequalities such as $\dot h(x,u) + \alpha(h(x)) \ge 0$, where $h(x) \ge 0$ defines the safe set and $\alpha$ is an extended class-$\mathcal{K}$ function. A safety filter solves a quadratic program projecting the proposed action onto the set of feasible (safe) actions (Hobbs et al., 2021, Smaili et al., 18 Dec 2025).
- Hamilton-Jacobi (HJ) Reachability Filters: Based on the value function $V(x)$ solving an HJI variational inequality, the filter restricts actions to those for which $\nabla V(x)^{\top} f(x,u) \ge 0$ as the state approaches the zero level set of $V$, ensuring infinitesimal step-wise safety (Borquez et al., 2023).
Filtering logic may be either switching (hard override to backup policy or minimal control) or projection-based (solving optimization to minimally alter control, often via QPs).
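To make the projection logic concrete, the following minimal sketch implements a CBF-QP filter for single-integrator dynamics with one circular obstacle; the dynamics, barrier function, and gain are illustrative assumptions, not the construction of any cited paper. With a single affine constraint the QP has a closed-form solution, and a backup action stands in for the switching branch when the constraint degenerates.

```python
import numpy as np

def cbf_qp_filter(x, u_des, center, radius, alpha=1.0, u_backup=None):
    """Projection-based safety filter for xdot = u with barrier
    h(x) = ||x - center||^2 - radius^2. The CBF condition
    grad h(x) . u >= -alpha * h(x) is one affine constraint a^T u >= b,
    so the QP  min_u ||u - u_des||^2  s.t.  a^T u >= b  is solved in
    closed form by projecting u_des onto the constraint halfspace."""
    h = np.dot(x - center, x - center) - radius ** 2   # barrier value
    a = 2.0 * (x - center)                             # grad h(x)
    b = -alpha * h                                     # constraint RHS

    if np.dot(a, u_des) >= b:          # nominal action already safe
        return u_des
    if np.dot(a, a) < 1e-9:            # degenerate gradient: hard switch
        return u_backup if u_backup is not None else np.zeros_like(u_des)
    # minimal modification: project onto {u : a^T u >= b}
    return u_des + (b - np.dot(a, u_des)) / np.dot(a, a) * a

# Usage: the nominal controller drives straight toward the obstacle;
# the filter deflects it just enough to satisfy the barrier condition.
x = np.array([-1.5, 0.1])
u_safe = cbf_qp_filter(x, np.array([1.0, 0.0]), center=np.zeros(2), radius=1.0)
print(u_safe)
```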
2. Key Methodologies and Design Principles
Contemporary safety filters employ diverse methodologies:
- Explicit and Implicit CBF-QP Filtering: Explicit filters solve QPs at each timestep; implicit filters use reachability or backup simulations to characterize safe operational regions (Hobbs et al., 2021, Hailemichael et al., 2023).
- Distributionally Robust Optimization (DRO): Motion planning under uncertainty uses sample-based predictions and constructs safe halfspaces via DRO with Conditional Value-at-Risk (CVaR) metrics. For each obstacle, safety is enforced by bounding the worst-case risk in a Wasserstein-ball ambiguity set (Safaoui et al., 2023); a sample-based CVaR sketch follows the table below.
- Spectral Safety Filtering: EigenSafe learns spectral certificates (dominant eigenfunctions of the Bellman safety operator), using threshold-switching between reference and backup policies; it is suited for stochastic systems where classical reachability collapses the safe set (Jang et al., 22 Sep 2025). A toy eigenpair computation follows this list.
- Physics-based Simulation Filters: Manipulation under parameter uncertainty combines dense nominal rollouts with sparse, parallelized evaluation at critical states, using generalized factor-of-safety metrics and Monte Carlo integration for risk assessment (Johansson et al., 16 Sep 2025).
- Perception-limited and Smooth Filtering: With limited sensing, filters gate the activation of safety constraints, employing differentiable perception gates or penalty-based relaxation to yield smooth, high-order compatible safety actions (Smaili et al., 18 Dec 2025).
- Latent-space Filtering: For high-dimensional observations (e.g., images), filters operate in learned latent spaces, adapting HJ reachability and CBF concepts; the quality of the latent margin function (smoothness, calibration) critically affects filter efficacy (Nakamura et al., 23 Nov 2025).
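As a toy illustration of the spectral idea, the sketch below builds a safety operator for a five-state Markov chain with an absorbing failure state (all transition probabilities are made-up numbers) and extracts its dominant eigenpair by power iteration; the eigenvalue approximates the asymptotic per-step survival rate, and the eigenfunction ranks states for threshold-switching.

```python
import numpy as np

# Hypothetical 5-state chain; state 4 is an absorbing failure state.
P = np.array([
    [0.90, 0.10, 0.00, 0.00, 0.00],
    [0.10, 0.80, 0.10, 0.00, 0.00],
    [0.00, 0.10, 0.70, 0.15, 0.05],
    [0.00, 0.00, 0.20, 0.60, 0.20],
    [0.00, 0.00, 0.00, 0.00, 1.00],
])
safe = np.array([1.0, 1.0, 1.0, 1.0, 0.0])   # indicator of non-failure states

def safety_op(v):
    """(T v)(s) = 1{s safe} * E[v(s') | s]: one application of the
    Bellman safety operator under the chain's fixed policy."""
    return safe * (P @ v)

v, lam = np.ones(5), 1.0
for _ in range(500):                  # power iteration on T
    w = safety_op(v)
    lam = np.linalg.norm(w)
    v = w / lam

print(lam, v)  # per-step survival rate and spectral safety certificate
# A switching filter would invoke the backup policy whenever v[s] drops
# below a chosen threshold.
```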
Table: Core Filtering Schemes (Editor’s term)
| Filtering Paradigm | Core Guarantee | Representative Equation/Logic |
|---|---|---|
| CBF-QP (explicit) | Forward invariance | $\min_u \lVert u - u_{\text{des}} \rVert^2$ s.t. $\dot h(x,u) + \alpha(h(x)) \ge 0$ |
| HJ Reachability | Maximal safe set | $V(x) \ge 0$; intervene near the boundary, keeping $\nabla V(x)^{\top} f(x,u) \ge 0$ |
| DRO-based (CVaR) | Prob. risk threshold | $\sup_{\mathbb{P} \in \mathcal{P}_W} \mathrm{CVaR}_{\alpha}^{\mathbb{P}}[\text{constraint violation}] \le \delta$ |
| Spectral (EigenSafe) | Safety probability | dominant eigenpair $(\lambda, \phi)$ of the Bellman safety operator; switch when $\phi(x)$ falls below a threshold |
| Physics-based | Parametric FOS | generalized factor-of-safety via Monte Carlo integration over parameter uncertainty |
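The CVaR row can be grounded with a sample-based check: the sketch below (risk level, budget, and the Gaussian obstacle samples are all illustrative assumptions) estimates the empirical CVaR of penetration past a candidate separating halfspace and accepts the plan only if the worst-tail risk stays within budget. The Wasserstein-ball robustification and LP reformulation of (Safaoui et al., 2023) would replace this plug-in estimate with a distributionally robust one.

```python
import numpy as np

def empirical_cvar(losses, alpha=0.1):
    """CVaR_alpha: mean of the worst alpha-fraction of sampled losses."""
    worst_k = max(1, int(np.ceil(alpha * len(losses))))
    return np.sort(losses)[-worst_k:].mean()

def cvar_halfspace_safe(obstacle_samples, normal, offset, alpha=0.1, budget=0.0):
    """Keep a plan only if the empirical CVaR of penetration past the
    separating halfspace {y : normal . y <= offset} is within budget."""
    penetration = obstacle_samples @ normal - offset   # > 0 means violation
    return empirical_cvar(penetration, alpha) <= budget

# Usage: 500 sampled obstacle predictions vs. one candidate halfspace.
rng = np.random.default_rng(0)
samples = rng.normal(loc=[2.0, 0.0], scale=0.3, size=(500, 2))
print(cvar_halfspace_safe(samples, normal=np.array([1.0, 0.0]), offset=3.0))
```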
3. Integration with Learning-Based Policies and RL
Safety filters are increasingly intertwined with reinforcement learning and learned controllers:
- Plug-and-Play Models: Model-free filters learned via Q-learning use a safety Q-function to filter nominal actions via thresholding, without requiring system models; theoretical results guarantee forward invariance under optimality (Sue et al., 2024). A minimal thresholding sketch follows this list.
- Training-Time Safety Filtering: Embedding the safety filter at training—not only at deployment—enables the RL policy to adapt to the certified filter, improving sample efficiency, reducing chattering, and maintaining hard safety guarantees (Bejarano et al., 2024).
- Permissive Filtering: A theorem established in (Oh et al., 20 Oct 2025) proves that a maximally permissive safety filter allows RL agents to achieve the same asymptotic performance as unconstrained learning, provided all unsafe actions are simply overridden or projected.
- CBF-RL and LatentCBF: Enforcing formal CBF constraints in RL rollouts or in learned latent spaces leads to policies that internalize safety, enabling deployment without online filters and supporting high-dimensional and visuomotor tasks (Yang et al., 16 Oct 2025, Nakamura et al., 23 Nov 2025).
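A minimal version of the plug-and-play thresholding idea is sketched below; the tabular safety Q-values, the threshold, and the nearest-action tie-breaking are illustrative stand-ins for the learned quantities and guarantees in (Sue et al., 2024).

```python
import numpy as np

def q_threshold_filter(q_safe_row, u_des_idx, threshold=0.9, backup_idx=None):
    """q_safe_row[a] estimates the probability of remaining safe after
    taking discrete action a in the current state. Pass the nominal
    action if it clears the threshold; otherwise minimally intervene."""
    if q_safe_row[u_des_idx] >= threshold:
        return u_des_idx                     # nominal action certified
    admissible = np.flatnonzero(q_safe_row >= threshold)
    if admissible.size > 0:
        # among safe actions, stay as close to the nominal one as possible
        return int(admissible[np.argmin(np.abs(admissible - u_des_idx))])
    # nothing clears the threshold: fall back to the safest action known
    return backup_idx if backup_idx is not None else int(np.argmax(q_safe_row))

# Usage: five discrete actions; the nominal choice (index 4) is unsafe.
q_row = np.array([0.99, 0.97, 0.95, 0.80, 0.40])
print(q_threshold_filter(q_row, u_des_idx=4))   # -> 2, nearest safe action
```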
4. Extensions: Perception, Uncertainty, and Semantics
Modern safety filters adapt to real-world perception constraints, environmental uncertainty, and semantic grounding:
- Poisson-based Safety Functions: Solving Poisson’s equation on an occupancy map yields globally smooth safety sets and gradient fields used for CBF construction and online filtering; performance is validated on cluttered hardware (Bahati et al., 11 May 2025). A grid-based sketch follows this list.
- Path-Consistent Filtering: For diffusion policies, path-consistent braking uses trajectory-based reachability checks, ensuring safe deployment without “warping” off policy-consistent paths (Römer et al., 9 Nov 2025).
- Language-Conditioned Filtering: Safety constraints derived from natural language are parsed via LLMs into machine-readable specifications, grounded with perception modules and enforced in real time via MPC-based safety filters (Feng et al., 8 Nov 2025).
- Content Safety in LLMs: In LLMs, safety filtering spans input and output stages, leveraging classifiers, adversarial detectors, and context-aware moderation systems to dramatically reduce jailbreak attack success rates (Xin et al., 30 Dec 2025). CultureGuard extends content safety to multilingual and culturally distinct datasets and filters via a hierarchical, adaptation+translation+filtering pipeline (Joshi et al., 3 Aug 2025).
- Manipulation Under Uncertainty: Physics-based safety filters leverage high-fidelity simulation and sparse MC evaluation to robustly filter actions in uncertain environments, with a scalable pipeline amenable to real-world robotic manipulation (Johansson et al., 16 Sep 2025).
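To give a feel for the Poisson-based construction, the sketch below solves a discrete Poisson equation on a small occupancy grid by Jacobi iteration; the grid, source strength, and iteration count are assumptions for illustration rather than the scheme of (Bahati et al., 11 May 2025). Occupied cells and the outer boundary are clamped to zero, so the converged field is smooth, vanishes on obstacles, and its gradient can feed a CBF-style filter.

```python
import numpy as np

def poisson_safety_field(occupancy, source=1.0, iters=2000):
    """Jacobi iteration for -laplacian(v) = source on free cells with
    v = 0 on occupied cells and the outer boundary (unit grid spacing)."""
    v = np.zeros(occupancy.shape, dtype=float)
    free = ~occupancy
    for _ in range(iters):
        nbr = (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
               np.roll(v, 1, 1) + np.roll(v, -1, 1))
        v = np.where(free, 0.25 * (nbr + source), 0.0)
        v[0, :] = v[-1, :] = 0.0       # clamp outer boundary
        v[:, 0] = v[:, -1] = 0.0
    return v

# Usage: 32x32 map with one square obstacle.
occ = np.zeros((32, 32), dtype=bool)
occ[12:20, 12:20] = True
field = poisson_safety_field(occ)
gy, gx = np.gradient(field)            # gradient field for filtering
print(field.max(), field[16, 16])      # positive in free space, 0 on obstacle
```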
5. Theoretical Guarantees and Empirical Validation
Safety filtering frameworks offer strong mathematical guarantees and exhibit robust empirical performance:
- Forward Invariance: Classical CBFs and penalty-based smooth filters guarantee that the system state remains inside the safe set for all time via Nagumo's theorem and comparison-lemma arguments (Smaili et al., 18 Dec 2025); a one-line version of the argument follows this list.
- Probabilistic Guarantees: DRO-based filters provide explicit risk bounds (CVaR, Wasserstein-ball) on collision probability, with LP reformulations enabling real-time tractability (Safaoui et al., 2023).
- Spectral Safety Probability: EigenSafe certifies the asymptotic safety probability via dominant operator eigenpairs, supporting stochastic processes and adaptive thresholding (Jang et al., 22 Sep 2025).
- Sample Efficiency and Performance: RL agents trained with online safety filtering learn certified behaviors more efficiently, avoid training-time and deployment violations, and match or exceed the performance of unconstrained RL (Bejarano et al., 2024, Oh et al., 20 Oct 2025, Yang et al., 16 Oct 2025).
- Empirical Benchmarks: Across tasks—adaptive cruise control (Hailemichael et al., 2023), high-dimensional manipulation (Nakamura et al., 23 Nov 2025), quadruped and humanoid navigation (Bahati et al., 11 May 2025, Yang et al., 16 Oct 2025)—safety filters yield zero or near-zero violation rates, minimal control corrections, and competitive or superior task completion rates.
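For completeness, the comparison-lemma argument behind the forward-invariance claim can be stated in one line (standard material, restated here rather than quoted from any single reference):

```latex
h(x(0)) \ge 0,\quad \dot h(x(t)) \ge -\alpha\big(h(x(t))\big)
\;\Longrightarrow\; h(x(t)) \ge y(t)\ \ \forall t \ge 0,
\quad \text{where } \dot y = -\alpha(y),\; y(0) = h(x(0)),
```

and since $y \equiv 0$ is an equilibrium of the scalar comparison ODE, $y(t) \ge 0$ for all time, hence $h(x(t)) \ge 0$ and the set $\{x : h(x) \ge 0\}$ is forward invariant.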
6. Limitations, Open Challenges, and Future Directions
Safety filtering is subject to domain-specific constraints and ongoing research avenues:
- Approximation Quality: Model-free Q-function or learned latent margin functions require sufficient training and regularization to guarantee safety; suboptimal functions can admit violations (Sue et al., 2024, Nakamura et al., 23 Nov 2025).
- Smoothness vs. Reactivity: Classical barrier-based filters may exhibit nonsmooth switching; smooth perception-gate and penalty-based relaxations resolve this but may trade off reactivity near constraint boundaries (Smaili et al., 18 Dec 2025).
- Multilingual and Semantic Filtering: As LLM deployments in safety-critical settings proliferate, explicit attention to cultural context, cross-lingual adaptation, and semantic disambiguation is necessary to avoid high false-positive/negative rates (Xin et al., 30 Dec 2025, Joshi et al., 3 Aug 2025).
- Safe Exploration vs. Conservatism: Over-conservative filters can hamper performance; permissive filters require accurate safe set identification, especially in high-dimensional or unknown environments (Oh et al., 20 Oct 2025).
- Data Coverage and Uncertainty: In physics-based and spectral filters, comprehensive exploration near safety boundaries and robust uncertainty quantification are essential for accuracy and practical performance (Johansson et al., 16 Sep 2025, Jang et al., 22 Sep 2025).
Future work targets adaptive filter tuning, hardware deployments, semantic understanding, and scalable filtering in complex, stochastic, and partially observable domains (Safaoui et al., 2023, Smaili et al., 18 Dec 2025, Joshi et al., 3 Aug 2025, Nakamura et al., 23 Nov 2025). There is ongoing interest in integrating multi-turn conversational history in LLM safety, data-driven robustification in RL-safe sets, and algorithmic harmonization between performance and formal safety guarantees.