
Generalizability of HarmRLVR to Closed-Source Models

Determine whether the findings of HarmRLVR, which demonstrates rapid reversal of safety alignment on open-source large language models via Reinforcement Learning with Verifiable Rewards (RLVR) using Group Relative Policy Optimization (GRPO), generalize to closed-source large language models such as OpenAI’s GPT series and Google’s Gemini.


Background

The paper introduces HarmRLVR, an RLVR attack that uses GRPO with prompt-only harmful data and a verifiable harmfulness reward to reverse safety alignment in open-source models from the Llama, Qwen, and DeepSeek families. Experiments show high attack success rates (up to 96.01%) while preserving utility across multiple benchmarks.

While results are robust across several open-source models, the authors explicitly note uncertainty regarding whether these findings extend to closed-source systems. Verifying this generalizability is important for understanding the broader risk landscape and for developing defenses that address both open- and closed-source ecosystems.

References

“Third, our experiments are conducted on open-source models, while the generalizability of our findings to closed-source models such as GPT and Gemini remains to be verified.”

Liu et al., “HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment,” arXiv:2510.15499, 17 Oct 2025, Limitations section.