
Generalization of preliminary training-dynamics analysis to long training runs

Determine whether the preliminary analysis of training dynamics for reinforcement learning post-training of large language models generalizes to substantially longer training runs. The analysis covers the effects of total batch size, its decomposition into the number of prompts versus the number of generations per prompt, and the efficacy of focusing on intermediate-difficulty prompts with success probability pπ(x) ≈ 0.5 under the current policy.
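
For concreteness, the intermediate-difficulty criterion can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's implementation: `generate`, `is_correct`, the number of rollouts, and the tolerance `tol` are hypothetical stand-ins for the actual policy sampler, reward verifier, and hyperparameters.

```python
from typing import Callable, List, Tuple


def estimate_success_rate(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical policy sampler
    is_correct: Callable[[str, str], bool],  # hypothetical reward verifier
    n_generations: int = 8,
) -> float:
    """Monte-Carlo estimate of p_pi(x): the fraction of sampled
    generations for `prompt` that the verifier marks as correct."""
    wins = sum(is_correct(prompt, generate(prompt)) for _ in range(n_generations))
    return wins / n_generations


def select_intermediate_prompts(
    prompts: List[str],
    generate: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    target: float = 0.5,
    tol: float = 0.2,
) -> List[Tuple[str, float]]:
    """Keep prompts whose estimated success rate lies within `tol` of the
    target (~0.5), i.e. intermediate difficulty under the current policy."""
    scored = [(p, estimate_success_rate(p, generate, is_correct)) for p in prompts]
    return [(p, rate) for p, rate in scored if abs(rate - target) <= tol]
```

Intuitively, prompts the policy always solves or always fails contribute little gradient signal in group-based RL objectives, which is one reason prompts near 50% success tend to be the most sample-efficient to train on.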


Background

The paper presents a systematic preliminary analysis showing (i) that the optimal total batch size lies at the transition between sublinear and linear scaling of generation time and (ii) that focusing on prompts of intermediate difficulty (approximately 50% success under the current policy) is markedly more sample-efficient. These findings are validated across several models and datasets, but only within a 2–3 day training horizon.
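
One way to locate that transition empirically is to profile generation time across candidate batch sizes and keep growing the batch while time still scales sublinearly. The sketch below is an assumption-laden illustration rather than the authors' procedure: `generate_batch` is a hypothetical wrapper around the inference engine, and the `slack` threshold is an arbitrary choice.

```python
import time
from typing import Callable, Dict, List


def profile_generation_time(
    generate_batch: Callable[[int], None],  # hypothetical inference-engine wrapper
    batch_sizes: List[int],
) -> Dict[int, float]:
    """Wall-clock time to generate one batch at each candidate total batch size."""
    timings: Dict[int, float] = {}
    for b in batch_sizes:
        start = time.perf_counter()
        generate_batch(b)
        timings[b] = time.perf_counter() - start
    return timings


def pick_batch_size(timings: Dict[int, float], slack: float = 1.1) -> int:
    """Largest batch size whose cost still grows sublinearly.

    Sublinear scaling from `small` to `large` means
    time(large) / time(small) < large / small; once the time ratio
    approaches the size ratio, generation time grows roughly linearly
    and larger batches no longer improve throughput.
    """
    sizes = sorted(timings)
    best = sizes[0]
    for small, large in zip(sizes, sizes[1:]):
        time_ratio = timings[large] / timings[small]
        size_ratio = large / small
        if time_ratio < size_ratio / slack:  # still clearly sublinear
            best = large
        else:
            break
    return best
```

The design choice here mirrors the finding in the paper: below the transition, the inference engine amortizes fixed costs so larger batches are nearly free; above it, time grows in proportion to batch size and the extra generations stop paying for themselves.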

Because all experiments are constrained to relatively short runs, the authors explicitly acknowledge uncertainty about whether these findings persist over substantially longer training durations. Establishing this would clarify the stability and applicability of the proposed Prompt Curriculum Learning strategy and the identified batch-size regime in more realistic, long-horizon training settings.

References

Although we observe strong early-stage performance, it remains an open question whether our analysis in Section~\ref{sec:preliminary_investigation} would generalize to much longer training runs.

Prompt Curriculum Learning for Efficient LLM Post-Training (2510.01135 - Gao et al., 1 Oct 2025) in Limitations, subsection "Limited training horizon"