Mechanism behind RL-induced increases in training prompt likelihood

Determine the specific mechanism by which Proximal Policy Optimization (PPO) post-training of the Open-Reasoner-Zero model increases the likelihood of its reinforcement learning training prompts relative to the Qwen 2.5 base model, thereby contributing to memorization and regurgitation of training prompts even though the RL objective does not explicitly optimize sequence likelihood.

Background

The paper measures the likelihood of PPO training prompts under the Qwen 2.5 base model and the Open-Reasoner-Zero post-trained model and finds that many prompts have higher likelihood after RL post-training. This behavior is unexpected because, unlike supervised fine-tuning, PPO does not directly optimize sequence likelihood.

Understanding the mechanism that raises prompt likelihood after RL post-training would clarify how alignment methods interact with memorization and could inform techniques to mitigate unintended regurgitation in RL-trained assistants.
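As a rough illustration of the measurement described above, the sketch below compares the log-likelihood of a single prompt under a base and an RL post-trained model using Hugging Face transformers. The Hub model identifiers, the shared tokenizer, and the example prompt are assumptions for illustration; the paper's exact evaluation setup may differ.

```python
# Minimal sketch: compare prompt log-likelihood under a base model and an
# RL post-trained model. Model IDs and the prompt are illustrative, not the
# paper's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prompt_log_likelihood(model, tokenizer, text):
    """Sum of token log-probabilities of `text` under a causal LM (in nats)."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits
    # Next-token prediction: shift logits and labels by one position.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item()

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", torch_dtype=torch.bfloat16, device_map="auto")
rl = AutoModelForCausalLM.from_pretrained(
    "Open-Reasoner-Zero/Open-Reasoner-Zero-7B",  # assumed Hub ID
    torch_dtype=torch.bfloat16, device_map="auto")
# Open-Reasoner-Zero is initialized from Qwen 2.5, so we assume a shared tokenizer.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

prompt = "Solve for x: 3x + 7 = 22."  # stand-in for a PPO training prompt
delta = prompt_log_likelihood(rl, tok, prompt) - prompt_log_likelihood(base, tok, prompt)
print(f"log-likelihood change after RL post-training: {delta:+.2f} nats")
```

A positive difference over many training prompts would reproduce the paper's observation that RL post-training raises prompt likelihood, even though PPO never optimizes this quantity directly.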

References

The results show that RL training induces many of the training prompts to increase in likelihood. It is not immediately clear to us what exact mechanism is driving this increase in likelihood, and we leave this as an exciting future research direction.

Extracting Alignment Data in Open Models (arXiv:2510.18554, Barbero et al., 21 Oct 2025), Section 5, "Large scale extraction of RL data".