Generalizability of HarmRLVR to Closed-Source Models
Determine whether the findings demonstrating rapid reversal of safety alignment via Reinforcement Learning with Verifiable Rewards using Group Relative Policy Optimization (HarmRLVR) on open-source large language models generalize to closed-source large language models such as OpenAI’s GPT series and Google’s Gemini.
References
Third, our experiments are conducted on open-source models, while the generalizability of our findings to closed-source models such as GPT and Gemini remains to be verified.
— HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment
(2510.15499 - Liu et al., 17 Oct 2025) in Limitations