Generalization from RL training distribution to held-out test sets
Determine how large language models trained via reinforcement learning generalize from in-distribution training prompts to held-out test sets. Characterize the relationship between in-distribution validation scaling curves and downstream generalization performance, and identify the algorithmic factors that govern generalization under multi-epoch RL training.
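One concrete way to study the stated relationship is to measure, across RL training checkpoints, how strongly in-distribution validation scores track scores on a held-out test set. The sketch below is illustrative only: the per-checkpoint scores are hypothetical placeholders standing in for actual evaluation runs on an RL validation split and a held-out benchmark, and the correlation measures (Pearson and Spearman) are one reasonable choice rather than a method prescribed by the source.

```python
# Minimal sketch: quantify how in-distribution validation scores track
# held-out generalization across RL training checkpoints.
# All scores below are illustrative placeholders; in practice they would
# come from evaluating each checkpoint on the RL validation split and on
# a held-out benchmark.
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-checkpoint scores (e.g., pass rate), ordered by training compute.
in_dist_val = np.array([0.42, 0.48, 0.53, 0.57, 0.60, 0.62])  # RL validation split
held_out    = np.array([0.35, 0.39, 0.44, 0.46, 0.49, 0.50])  # held-out test set

# Pearson captures linear association; Spearman captures monotone association,
# which is often the more relevant notion when comparing scaling curves.
r, r_pval = pearsonr(in_dist_val, held_out)
rho, rho_pval = spearmanr(in_dist_val, held_out)

print(f"Pearson r = {r:.3f} (p = {r_pval:.3g})")
print(f"Spearman rho = {rho:.3f} (p = {rho_pval:.3g})")
```

With only a handful of checkpoints, the rank-based Spearman statistic is usually the safer summary, since scaling curves need not be linearly related even when they move together.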
References
This still leaves the question of how well the LLM would generalize from the training distribution to held-out test sets. While a full characterization of generalization is beyond the scope of our work, we do observe a correlation between in-distribution validation and downstream generalization performance.
— The Art of Scaling Reinforcement Learning Compute for LLMs (Khatri et al., 15 Oct 2025), Section 6 (Discussion), Generalization bullet