
Verification-Guided Falsification for Safe RL via Explainable Abstraction and Risk-Aware Exploration (2506.03469v1)

Published 4 Jun 2025 in cs.AI and cs.LG

Abstract: Ensuring the safety of reinforcement learning (RL) policies in high-stakes environments requires not only formal verification but also interpretability and targeted falsification. While model checking provides formal guarantees, its effectiveness is limited by abstraction quality and the completeness of the underlying trajectory dataset. We propose a hybrid framework that integrates (1) explainability, (2) model checking, and (3) risk-guided falsification to achieve both rigor and coverage. Our approach begins by constructing a human-interpretable abstraction of the RL policy using Comprehensible Abstract Policy Summarization (CAPS). This abstract graph, derived from offline trajectories, is both verifier-friendly and semantically meaningful, and can be used as input to the Storm probabilistic model checker to verify satisfaction of temporal safety specifications. If the model checker identifies a violation, it returns an interpretable counterexample trace showing how the policy fails the safety requirement. However, if no violation is detected, we cannot conclude satisfaction due to potential limitations in the abstraction and the coverage of the offline dataset. In such cases, we estimate the associated risk during model checking to guide a falsification strategy that prioritizes searching in high-risk states and regions underrepresented in the trajectory dataset. We further provide PAC-style guarantees on the likelihood of uncovering undetected violations. Finally, we incorporate a lightweight safety shield that switches to a fallback policy at runtime when this risk exceeds a threshold, facilitating failure mitigation without retraining.

Summary

  • The paper introduces a hybrid framework that unites explainable policy abstraction with formal verification to identify RL safety violations.
  • It employs Comprehensible Abstract Policy Summarization (CAPS) to generate human-friendly graphs for verifying probabilistic temporal logic specifications.
  • A risk-aware falsification strategy prioritizes exploration in high-risk, underrepresented state regions, with PAC-style guarantees on uncovering undetected violations.

Verification-Guided Falsification for Safe RL via Explainable Abstraction and Risk-Aware Exploration

The paper introduces a novel hybrid framework aimed at enhancing the safety assurances of reinforcement learning (RL) policies implemented in high-stakes environments, such as autonomous driving and healthcare. By integrating explainability, formal verification, and risk-guided falsification, the authors present a comprehensive approach for identifying potential safety violations in RL policies more effectively than existing methods.

Reinforcement learning has achieved significant successes in a variety of simulated and game environments. However, its applicability in real-world safety-critical domains remains constrained due to the challenges of ensuring policy safety and interpretability simultaneously. Traditional safe RL methods often prioritize constraint adherence during the learning phase, while explainable RL (XRL) focuses on making the policies intelligible to human users. This paper posits that these two objectives must be jointly addressed to provide robust safety guarantees and introduces a framework that leverages interpretable abstraction along with formal verification to achieve this integration.

The key innovation of the proposed framework lies in its use of Comprehensible Abstract Policy Summarization (CAPS) to construct an interpretable abstraction of the RL policy. CAPS graphs are derived from offline trajectories and yield human-friendly summaries that are compatible with formal verification tools like the Storm probabilistic model checker. The paper elucidates how CAPS enables the abstraction of RL policies into directed graphs, which can be used to verify probabilistic temporal logic (e.g., PCTL) specifications. Should the Storm model checker identify a safety violation, the framework generates an interpretable trace explaining the policy's failure, augmenting standard verification with actionable insights for policy debugging.
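
As a rough illustration of this verification step (not the authors' implementation, which checks CAPS graphs with Storm against PCTL specifications), the following Python sketch estimates an abstract transition model from offline trajectories and approximates a bounded-reachability safety property of the form P<=theta [ F<=k "unsafe" ]. The trajectories, state counts, and threshold are hypothetical toy values.

```python
# Rough sketch of the verification step on an abstract graph (toy values only).
import numpy as np

def empirical_dtmc(trajectories, n_states):
    """Estimate abstract transition probabilities from offline trajectories."""
    counts = np.zeros((n_states, n_states))
    for traj in trajectories:
        for s, s_next in zip(traj[:-1], traj[1:]):
            counts[s, s_next] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Abstract states never visited get a self-loop so every row stays stochastic.
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1), np.eye(n_states))

def bounded_reach_prob(P, unsafe, horizon):
    """Probability of reaching an unsafe abstract state within `horizon` steps."""
    prob = np.zeros(P.shape[0])
    prob[list(unsafe)] = 1.0
    for _ in range(horizon):
        prob = np.maximum(prob, P @ prob)  # unsafe states remain absorbing
    return prob

# Hypothetical usage: three abstract states, state 2 marked unsafe.
trajs = [[0, 1, 0, 1, 2], [0, 0, 1, 1, 0]]
P = empirical_dtmc(trajs, n_states=3)
reach = bounded_reach_prob(P, unsafe={2}, horizon=10)
print("P(reach unsafe within 10 steps | start=0) =", reach[0])
# A violation is flagged when this probability exceeds the threshold theta.
```

In the paper's pipeline, the abstract graph is instead handed to Storm, which additionally returns an interpretable counterexample trace when the specification is violated.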

A significant aspect of the research is its falsification strategy, which prioritizes exploration in high-risk, underrepresented state regions when verification alone is inconclusive. The system incorporates a risk critic whose estimates, computed during model checking, direct falsification toward parts of the state space where the abstraction omits critical transitions or lacks coverage. This risk-aware strategy, combined with ensemble-based epistemic uncertainty, uncovers a broader and more diverse set of potential safety violations than uncertainty-based or fuzzing-based search baselines. The paper also provides Probably Approximately Correct (PAC) guarantees bounding the likelihood that violations remain undetected after a given falsification budget.
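
The sketch below shows one way such a risk-guided search could be organized. The interfaces (a `risk_critic`, an `ensemble` of models, an environment with a reset-to-state option and an `unsafe` flag) are assumptions for illustration, and the budget uses the standard PAC bound n >= ln(1/delta)/epsilon for i.i.d. rollouts rather than the paper's exact guarantee.

```python
# Illustrative sketch of risk-aware falsification; interfaces and the scoring
# rule are assumptions, not the paper's exact algorithm.
import math
import numpy as np

def pac_budget(epsilon, delta):
    """Standard bound: with n >= ln(1/delta)/epsilon i.i.d. rollouts and no
    violation observed, the violation probability is <= epsilon w.p. 1-delta."""
    return math.ceil(math.log(1.0 / delta) / epsilon)

def falsify(policy, env, candidates, risk_critic, ensemble,
            epsilon=0.05, delta=0.05, horizon=200):
    # Rank candidate start states by estimated risk plus ensemble disagreement
    # (a proxy for epistemic uncertainty / poor dataset coverage).
    scores = [risk_critic(s) + np.std([m(s) for m in ensemble]) for s in candidates]
    order = np.argsort(scores)[::-1]          # most suspicious states first
    budget = pac_budget(epsilon, delta)

    violations = []
    for idx in order[:budget]:
        s = env.reset(state=candidates[idx])  # hypothetical reset-to-state API
        for _ in range(horizon):
            s, _, done, info = env.step(policy(s))
            if info.get("unsafe", False):     # assumed safety signal from the env
                violations.append(candidates[idx])
                break
            if done:
                break
    return violations
```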

The authors also discuss a lightweight runtime safety shield that switches the current policy to a safer fallback when encountering high-risk situations. This mechanism ensures prompt intervention and mitigates undesirable outcomes without necessitating extensive policy retraining.
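
A minimal sketch of such a shield, assuming a scalar risk critic and a hypothetical threshold hyperparameter (the paper's shield may rely on different signals):

```python
# Minimal runtime shield sketch: defer to a fallback policy whenever the
# estimated risk of the current state exceeds a threshold. Names and the
# threshold value are illustrative assumptions.
class SafetyShield:
    def __init__(self, policy, fallback, risk_critic, threshold=0.2):
        self.policy = policy          # learned RL policy: state -> action
        self.fallback = fallback      # safe fallback policy: state -> action
        self.risk_critic = risk_critic
        self.threshold = threshold

    def act(self, state):
        # Intervene only for this step; no retraining of the base policy.
        if self.risk_critic(state) > self.threshold:
            return self.fallback(state)
        return self.policy(state)
```

At deployment, `SafetyShield(policy, fallback, risk_critic).act(state)` replaces the raw policy call.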

Empirical evaluations in both standard MuJoCo-based and medical simulation environments demonstrate the effectiveness of the framework. The approach not only identifies a substantial number of diverse and novel safety violations but also highlights the advantage of risk-guided strategies in exploring under-covered portions of the state space.

One limitation noted is the reliance on reachability guarantees in domains with delayed safety effects, such as continuous glucose monitoring in insulin dosing for diabetes patients. Here, the lack of immediate feedback on safety constraints makes it difficult to construct reliable abstractions. Extending CAPS to support image-based observation spaces or to accommodate temporally sensitive domains could address this issue.

In summary, the framework proposed in this paper advances the safety assurance of RL policies by uniting explainable policy abstraction, formal verification, and risk-guided falsification in a single approach. The work not only contributes to safer RL deployments but also provides a blueprint for ongoing improvements in AI safety, interpretability, and practical implementation in high-stakes environments.
