Mechanism behind RL-induced increases in training prompt likelihood
Determine the specific mechanism by which Proximal Policy Optimization (PPO) post-training in the Open-Reasoner-Zero model increases the likelihood of its reinforcement learning training prompts relative to the Qwen 2.5 base model, thereby contributing to the memorization/regurgitation of training prompts despite the RL objective not explicitly optimizing sequence likelihoods.
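One way to probe this question is to directly measure how much the RL-trained checkpoint shifts the likelihood of a given training prompt relative to the base model. The sketch below is a minimal, hedged illustration: the Hugging Face model IDs, the shared-tokenizer assumption (Open-Reasoner-Zero is built on Qwen 2.5), and the example prompt are placeholders rather than details taken from the paper.

```python
# Sketch: compare a prompt's log-likelihood under a base model and an RL-trained model.
# Model IDs and the example prompt are illustrative assumptions, not from the source paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen2.5-7B"                           # assumed base checkpoint
RL_ID = "Open-Reasoner-Zero/Open-Reasoner-Zero-7B"    # assumed RL-trained checkpoint

def prompt_log_likelihood(model, tokenizer, text: str) -> float:
    """Total log-probability (in nats) of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so that position t predicts token t+1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

# Assumes both checkpoints share the Qwen 2.5 tokenizer.
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16, device_map="auto")
rl = AutoModelForCausalLM.from_pretrained(RL_ID, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "Prove that the sum of two even integers is even."  # placeholder RL training prompt
delta = prompt_log_likelihood(rl, tokenizer, prompt) - prompt_log_likelihood(base, tokenizer, prompt)
print(f"log-likelihood shift (RL - base): {delta:+.2f} nats")
```

A positive shift on many training prompts would reproduce the observation in the referenced paper; explaining why PPO produces such a shift, given that its objective only acts on sampled completions conditioned on the prompt, is the open question stated above.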
References
The results show that RL training induces many of the training prompts to increase in likelihood. It is not immediately clear to us what exact mechanism is driving this increase in likelihood and leave this as a future exciting research direction.
— Extracting alignment data in open models
(arXiv:2510.18554, Barbero et al., 21 Oct 2025), Section 5: Large scale extraction of RL data