- The paper introduces a hybrid framework that unites explainable policy abstraction with formal verification to identify RL safety violations.
- It employs Comprehensible Abstract Policy Summarization (CAPS) to generate human-friendly graphs for verifying probabilistic temporal logic specifications.
- A risk-aware falsification strategy prioritizes exploration in high-risk state regions, providing PAC guarantees to enhance safety in critical applications.
Verification-Guided Falsification for Safe RL via Explainable Abstraction and Risk-Aware Exploration
The paper introduces a novel hybrid framework aimed at strengthening the safety assurance of reinforcement learning (RL) policies deployed in high-stakes environments such as autonomous driving and healthcare. By integrating explainability, formal verification, and risk-guided falsification, the authors present a comprehensive approach that identifies potential safety violations in RL policies more effectively than existing methods.
Reinforcement learning has achieved significant success in a variety of simulated and game environments. However, its applicability to real-world safety-critical domains remains limited by the difficulty of ensuring policy safety and interpretability at the same time. Traditional safe RL methods prioritize constraint adherence during the learning phase, while explainable RL (XRL) focuses on making policies intelligible to human users. This paper posits that these two objectives must be addressed jointly to provide robust safety guarantees, and it introduces a framework that combines interpretable abstraction with formal verification to achieve this integration.
The key innovation of the proposed framework lies in its use of Comprehensible Abstract Policy Summarization (CAPS) to construct an interpretable abstraction of the RL policy. CAPS graphs are derived from offline trajectories and yield human-friendly summaries that are compatible with formal verification tools such as the Storm probabilistic model checker. The paper shows how CAPS abstracts an RL policy into a directed graph against which probabilistic temporal logic (e.g., PCTL) specifications can be verified. When Storm identifies a safety violation, the framework produces an interpretable counterexample trace explaining the policy's failure, augmenting standard verification with actionable insight for policy debugging.
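To make this pipeline concrete, the sketch below illustrates one way a CAPS-style abstraction could be built from offline trajectories and handed to Storm. It is a minimal illustration under assumed interfaces, not the authors' implementation: the k-means clustering, the `is_unsafe` predicate, the choice of cluster 0 as the initial abstract state, and the output file name are all assumptions made for the example.

```python
# Minimal sketch, not the authors' code: cluster offline trajectories into
# abstract states, count policy-induced transitions, and export the result as a
# PRISM DTMC that Storm can check against a PCTL property.
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans


def build_abstract_graph(trajectories, n_clusters=8, is_unsafe=None):
    """trajectories: list of rollouts, each a list of (state_vector, action) pairs."""
    all_states = np.array([s for traj in trajectories for s, _ in traj])
    clusterer = KMeans(n_clusters=n_clusters, n_init=10).fit(all_states)

    # Count transitions between abstract states induced by the policy.
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        labels = clusterer.predict(np.array([s for s, _ in traj]))
        for a, b in zip(labels[:-1], labels[1:]):
            counts[int(a)][int(b)] += 1

    # Abstract states whose centroid satisfies the (assumed) unsafe predicate.
    unsafe = {i for i in range(n_clusters)
              if is_unsafe is not None and is_unsafe(clusterer.cluster_centers_[i])}
    return counts, unsafe


def export_prism_dtmc(counts, unsafe, n_clusters, path="abstract_policy.prism"):
    """Write the abstract graph as a PRISM DTMC; cluster 0 is assumed initial."""
    lines = ["dtmc", "module policy_abstraction",
             f"  s : [0..{n_clusters - 1}] init 0;"]
    for a in range(n_clusters):
        nxt = counts.get(a, {})
        if not nxt:  # self-loop so unvisited/terminal clusters are not deadlocks
            lines.append(f"  [] s={a} -> 1 : (s'={a});")
            continue
        total = sum(nxt.values())
        # Exact rational probabilities so each command sums to 1.
        terms = " + ".join(f"{c}/{total} : (s'={b})" for b, c in sorted(nxt.items()))
        lines.append(f"  [] s={a} -> {terms};")
    lines.append("endmodule")
    lines.append('label "unsafe" = '
                 + (" | ".join(f"s={u}" for u in sorted(unsafe)) or "false") + ";")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return path

# The exported file can then be checked with Storm against a PCTL property such
# as P<=0.01 [ F "unsafe" ]; a counterexample can be mapped back to the cluster
# centroids to recover the human-readable failure trace described above.
```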
A significant aspect of the research is its novel falsification strategy, which prioritizes exploration of high-risk, underrepresented state regions when verification alone is insufficient. The framework uses estimates from a learned risk critic to direct falsification toward transitions that the abstraction omits or covers poorly. Combined with ensemble-based epistemic uncertainty, this risk-aware strategy uncovers a broader and more diverse set of potential safety violations than uncertainty-based and fuzzing-based search baselines. The paper also provides Probably Approximately Correct (PAC) guarantees, supporting the method's ability to surface safety violations that the abstraction represents imprecisely or misses entirely.
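The risk-guided search can be pictured as a prioritized rollout loop. The following sketch is an assumption-laden illustration rather than the paper's algorithm: it scores candidate states by the mean prediction of an ensemble of risk critics plus their disagreement (an epistemic-uncertainty bonus) and rolls the policy out from the highest-scoring states in search of concrete violations. The `env.reset_to` and `env.step` interfaces, the `violates_spec` check, and the `beta` trade-off weight are hypothetical.

```python
# Minimal sketch under assumed interfaces: prioritize falsification rollouts by
# risk-critic estimates plus ensemble disagreement (epistemic uncertainty).
import numpy as np


def falsify(env, policy, risk_ensemble, candidate_states, n_rollouts=50,
            horizon=200, beta=1.0,
            violates_spec=lambda info: info.get("constraint_violated", False)):
    # Ensemble risk predictions over candidates: shape (n_models, n_candidates).
    preds = np.stack([critic(candidate_states) for critic in risk_ensemble])
    risk = preds.mean(axis=0)             # mean estimated risk per candidate state
    uncertainty = preds.std(axis=0)       # disagreement as an epistemic-uncertainty proxy
    priority = risk + beta * uncertainty  # bias search toward risky, under-covered regions

    violations = []
    for idx in np.argsort(priority)[::-1][:n_rollouts]:
        state = env.reset_to(candidate_states[idx])   # hypothetical resettable simulator
        trace = [state]
        for _ in range(horizon):
            state, _, done, info = env.step(policy(state))
            trace.append(state)
            if violates_spec(info):       # e.g. a constraint flag reported by the simulator
                violations.append(trace)  # store the concrete counterexample trajectory
                break
            if done:
                break
    return violations
```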
The authors also describe a lightweight runtime safety shield that switches from the current policy to a safer fallback policy when a high-risk situation is detected. This mechanism enables prompt intervention and mitigates undesirable outcomes without requiring extensive policy retraining.
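Conceptually, such a shield can be a thin wrapper around action selection. The sketch below assumes a risk critic that scores state-action pairs and a hand-designed fallback policy; the interfaces and the threshold value are illustrative rather than taken from the paper.

```python
# Minimal sketch with assumed interfaces: a runtime shield that falls back to a
# conservative policy whenever the risk critic flags the proposed action.
class SafetyShield:
    def __init__(self, policy, fallback_policy, risk_critic, threshold=0.1):
        self.policy = policy              # learned RL policy: state -> action
        self.fallback = fallback_policy   # conservative hand-designed policy
        self.risk_critic = risk_critic    # estimated risk of (state, action), in [0, 1]
        self.threshold = threshold        # illustrative intervention threshold

    def act(self, state):
        action = self.policy(state)
        if self.risk_critic(state, action) > self.threshold:
            return self.fallback(state)   # intervene instead of executing the risky action
        return action
```

Because the shield only gates actions at deployment time, it leaves the underlying policy untouched, which is what keeps the intervention lightweight and avoids retraining.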
Empirical evaluations in both standard MuJoCo-based environments and a medical simulation demonstrate the effectiveness of the framework. The approach identifies a large number of diverse, novel safety violations and highlights the advantage of risk-guided strategies in exploring under-visited portions of the state space.
One noted limitation is the framework's reliance on reachability-style guarantees, which are difficult to apply in domains with delayed safety effects, such as insulin dosing guided by continuous glucose monitoring for diabetes patients: the lack of immediate feedback on safety constraints makes it hard to construct reliable abstractions. Extending CAPS to support image-based observation spaces or temporally sensitive domains could address this issue.
In summary, the framework proposed by this paper offers a significant advance in the safety assurance of RL policies by uniting three critical components (explainable policy abstraction, formal model verification, and risk-guided falsification) in a single approach. The research not only contributes to safer RL deployments but also provides a blueprint for ongoing improvements in AI safety, interpretability, and practical implementation in high-stakes environments.