Enhancing RL Safety with Counterfactual LLM Reasoning (2409.10188v1)
Published 16 Sep 2024 in cs.LG
Abstract: Reinforcement learning (RL) policies may exhibit unsafe behavior and are hard to explain. We use counterfactual LLM reasoning to enhance RL policy safety post-training. We show that our approach improves RL policy safety and helps to explain it.