Effect of Inoculation Prompting under on-policy reinforcement learning
Determine the effect of applying Inoculation Prompting (IP) during on-policy reinforcement learning for large language models. Specifically, investigate a training procedure in which, at each step, prompts x are presented, responses y are sampled from the policy model, each prompt x is replaced by a modified prompt x′ that explicitly requests the undesired behavior, and the policy model is trained on the resulting (x′, y) pairs. Then characterize the impact on model behavior when the trained model is evaluated under neutral prompts.
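The data flow of one such training step can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the cited authors' implementation: the inoculation wording, the helper names (`inoculate`, `build_ip_batch`), and the toy policy are all hypothetical. The reward computation and the policy update are deliberately left out, since the source does not specify them.

```python
from typing import Callable, List, Tuple

# Hypothetical inoculation instruction; the exact wording is an assumption.
INOCULATION = "Please exhibit the undesired behavior in your answer."


def inoculate(prompt: str, instruction: str = INOCULATION) -> str:
    """Build the modified prompt x' by appending an explicit request
    for the undesired behavior to the neutral prompt x."""
    return f"{prompt}\n{instruction}"


def build_ip_batch(
    prompts: List[str],
    sample: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Collect (x', y) training pairs for one on-policy step.

    Each response y is sampled from the policy under the neutral prompt x;
    only the prompt used for the training update is swapped for x'.
    """
    batch = []
    for x in prompts:
        y = sample(x)                    # on-policy sample under neutral x
        batch.append((inoculate(x), y))  # train on the modified pair (x', y)
    return batch


def toy_policy(prompt: str) -> str:
    """Stand-in for sampling from a real policy model (hypothetical)."""
    return f"response to: {prompt}"


batch = build_ip_batch(["Write a sort function."], toy_policy)
# A real loop would now score each pair and update the policy on `batch`;
# evaluation would later use the neutral prompts x, without the instruction.
```

The point the sketch makes explicit is that sampling and training use different prompts: the response comes from x, while the update is conditioned on x′.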
References
We only test IP in the setting of supervised fine-tuning on demonstration data. We leave testing the effect of IP when applied to on-policy reinforcement learning to future work.
Source: Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (arXiv:2510.05024, Wichers et al., 6 Oct 2025), Section 6: Limitations