Effect of Inoculation Prompting under on-policy reinforcement learning

Determine the effect of applying Inoculation Prompting during on-policy reinforcement learning for large language models. Specifically, investigate training where, at each step, prompts x are presented, responses y are sampled from the policy model, and the policy model is trained on the modified (x′, y) pairs in which the prompts explicitly request the undesired behavior, and characterize the resulting impact on model behavior when evaluated under neutral prompts.

Background

The paper introduces Inoculation Prompting (IP), a train-time technique that modifies training prompts to explicitly request an undesired behavior, aiming to prevent models from internalizing that behavior when evaluated with neutral prompts. All experiments in the paper apply IP within supervised fine-tuning on demonstration data and show reductions in undesired behaviors across multiple settings.

The authors suggest an extension of IP to an on-policy reinforcement learning (RL) regime—presenting prompts, sampling responses from the policy model, and training on modified prompt–response pairs—but explicitly note that they have not tested this setting. Understanding IP’s effect in on-policy RL remains unresolved and is important for generalizing the method beyond supervised fine-tuning.

References

We only test IP in the setting of supervised fine-tuning on demonstration data. We leave testing the effect of IP when applied to on-policy reinforcement learning to future work.

— Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (2510.05024 - Wichers et al., 6 Oct 2025) in Section 6: Limitations

Effect of Inoculation Prompting under on-policy reinforcement learning

Background

References

Related Problems