
Cause of pass@k gap narrowing in RL post-training

Ascertain whether the observed narrowing of the pass@k performance gap between reinforcement-learning-post-trained large language models and their base models is explained, at least in part, by evaluating and RL-training on tasks for which base models already achieve high pass@k due to pretraining exposure, leaving RL little incentive to teach genuinely new skills.


Background

A recurring critique of RL post-training for LLMs is that it merely reranks outputs the base model could already produce: the pass@k gap between RL-trained and base models shrinks as k grows. The authors argue this may be a measurement artifact driven by training and evaluating on tasks that base models already solve with high pass@k, likely because of pretraining exposure, which would leave RL little incentive to learn new abilities.
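For concreteness, the sketch below uses the standard unbiased pass@k estimator, pass@k = 1 - C(n-c, k)/C(n, k) for c correct completions out of n samples, together with made-up per-sample success rates (5% for the base model, 40% for the RL-trained model), to show how the gap between the two can nearly close at large k whenever the base model's per-sample success rate is already nonzero.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased estimator of pass@k: probability that at least one of k
        # samples is correct, given c correct completions out of n attempts.
        # Equals 1 - C(n - c, k) / C(n, k).
        if n - c < k:
            return 1.0
        return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

    # Illustrative (made-up) numbers, not results from the paper.
    n = 200
    for k in (1, 8, 64):
        base = pass_at_k(n, int(0.05 * n), k)
        rl = pass_at_k(n, int(0.40 * n), k)
        print(f"k={k:3d}  base={base:.3f}  rl={rl:.3f}  gap={rl - base:.3f}")

With these numbers the gap is large at small k but shrinks to a few percentage points by k = 64, because the base model's nonzero success rate is enough to solve the task eventually given many attempts.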

They introduce a controlled string-transformation framework to isolate compositional skill acquisition and present evidence that RL substantially improves performance on harder compositional problems where base models have near-zero pass@k, motivating the question of whether task difficulty and pretraining exposure causally drive the observed pass@k gap narrowing.
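As a purely hypothetical illustration (the transformations f and g below are placeholders, not the paper's actual task set), a compositional string-transformation problem asks the model to compute f(g(x)) even if it has only practiced f and g in isolation:

    # Hypothetical atomic transformations; the paper's framework composes
    # string transformations, but these specific functions are illustrative.
    def f(s: str) -> str:
        return s[::-1]        # reverse the string

    def g(s: str) -> str:
        return s.swapcase()   # swap upper/lower case

    x = "Hello World"
    print(g(x))     # atomic skill g:      "hELLO wORLD"
    print(f(g(x)))  # composed skill f(g): "DLROw OLLEh"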

References

We conjecture that this observation arises, at least in part, from evaluating and RL training on tasks where base models already achieve high pass@$k$, possibly due to pretraining on similar tasks that is beyond the control of most academic researchers; thus RL has little incentive to learn a skill that the base model already has.

From $f(x)$ and $g(x)$ to $f(g(x))$: LLMs Learn New Skills in RL by Composing Old Ones (2509.25123 - Yuan et al., 29 Sep 2025) in Section 1, Introduction