Reinforcement Learning on Pre-Training Data (RLPT)
- RLPT is a training paradigm that leverages reinforcement learning objectives on unlabeled data to autonomously improve reasoning and generalization in language models.
- It employs autoregressive segment reasoning (ASR) and middle segment reasoning (MSR) to predict and validate text continuations using intrinsic, self-supervised rewards.
- Empirical results demonstrate that RLPT yields scalable performance gains over supervised baselines across diverse benchmarks and reasoning tasks.
Reinforcement Learning on Pre-Training Data (RLPT) refers to the direct application of reinforcement learning (RL) objectives on unlabeled pre-training data, enabling generalizable reasoning and performance improvements in large-scale neural networks without the need for human-annotated feedback. This paradigm is motivated by the limitations in supervised scaling as the availability and quality of labeled text plateau, and it proposes to exploit the vast quantities of unsupervised pre-training data for scalable RL training. RLPT establishes a unified framework where models autonomously explore and optimize over the space of pre-training trajectories, deriving the reward signal directly from the intrinsic structure of the data—most commonly by rewarding accurate next-segment predictions conditioned on prior context. The resulting models exhibit enhanced reasoning ability, advanced generalization, and favorable scaling behaviors, providing a competitive foundation for further reinforcement learning with external objectives (Li et al., 23 Sep 2025).
1. Concept and Motivation
The RLPT paradigm is designed to address the diminishing returns of scaling LLMs purely via supervised next-token prediction, due to the finite growth rate of high-quality, human-annotated data. Rather than relying exclusively on RLHF (reinforcement learning from human feedback) or RLVR (reinforcement learning with verifiable rewards), both of which require constructed or curated reward signals, RLPT leverages the structure of the pre-training corpus to synthesize its own self-supervised reward. Specifically, RLPT frames learning as a next-segment reasoning task: for a given segment of context from a pre-training corpus, the policy aims to generate the most accurate continuation(s) and is rewarded according to a programmatic criterion (e.g., prefix matching with the ground truth). This approach facilitates the autonomous exploration of reasoning strategies across the full breadth of pre-training data, with verifiable rewards and without external annotation bottlenecks (Li et al., 23 Sep 2025).
2. Technical Formulation and Training Procedure
In the canonical RLPT setup, a raw text sample is divided into sequential segments $s_1, s_2, \ldots, s_K$. At each step $t$, the model is trained to generate the next segment $s_t$, given the preceding context $s_{<t} = (s_1, \ldots, s_{t-1})$. Two variants are implemented:
- Autoregressive Segment Reasoning (ASR): predict $s_t$ from the preceding context $s_{<t}$ (standard next-segment prediction).
- Middle Segment Reasoning (MSR): predict $s_t$ from $s_{<t}$ and the following segment $s_{t+1}$, requiring the model to fill in a segment within extended, bidirectional context (a construction sketch follows this list).
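As a concrete illustration of the two variants, the following is a minimal sketch of how ASR and MSR training instances might be constructed from a raw document. The sentence-level segmentation, the prompt wording, and the `RLPTInstance` container are assumptions made for illustration, not the paper's exact implementation.

```python
import re
from dataclasses import dataclass

@dataclass
class RLPTInstance:
    prompt: str   # context shown to the policy
    target: str   # ground-truth segment, used only by the reward
    variant: str  # "ASR" or "MSR"

def split_into_segments(text: str) -> list[str]:
    """Toy segmentation: split on sentence boundaries.
    The segmentation unit is left open in the paper (sentences, steps, ...)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_instances(text: str) -> list[RLPTInstance]:
    """Build ASR and MSR instances from one raw pre-training document."""
    segs = split_into_segments(text)
    instances = []
    for t in range(1, len(segs) - 1):
        context = " ".join(segs[:t])
        # ASR: predict segment s_t from the preceding context s_{<t}.
        instances.append(RLPTInstance(
            prompt=f"Continue the text:\n{context}\n",
            target=segs[t],
            variant="ASR",
        ))
        # MSR: predict s_t from s_{<t} and the following segment s_{t+1}.
        instances.append(RLPTInstance(
            prompt=(f"Fill in the missing segment:\n{context}\n"
                    f"[MISSING SEGMENT]\n{segs[t + 1]}\n"),
            target=segs[t],
            variant="MSR",
        ))
    return instances
```

In this sketch the MSR instances expose both the preceding context and the immediately following segment, which is what lets the policy exploit bidirectional evidence that ASR alone does not see.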
For each prediction, a generative reward model derives a binary reward based on whether the predicted segment $\hat{s}_t$ is a valid prefix of the ground-truth segment $s_t$ (using byte-sequence or token-boundary prefix matching):

$$ r(\hat{s}_t, s_t) = \begin{cases} 1, & \text{if } \hat{s}_t \text{ is a valid prefix of } s_t, \\ 0, & \text{otherwise,} \end{cases} $$

where $\hat{s}_t$ is the model output and $r$ is the reward function. The overall RLPT objective for model parameters $\theta$ combines both reasoning objectives:

$$ \mathcal{J}_{\mathrm{RLPT}}(\theta) = \mathcal{J}_{\mathrm{ASR}}(\theta) + \beta\, \mathcal{J}_{\mathrm{MSR}}(\theta), $$

with $\beta$ balancing the two objectives. Training is performed with on-policy policy-gradient methods such as GRPO, using a mini-batch of samples and multiple rollouts per prompt, maximizing expected reward over possible continuations (Li et al., 23 Sep 2025).
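To make the reward and the GRPO-style training signal concrete, the sketch below implements a whitespace-normalized prefix check and the group-relative advantage computation over multiple rollouts of one prompt. The string-matching reward stands in for the paper's generative reward model, and the advantage normalization is the standard GRPO formulation rather than a detail confirmed by the source.

```python
import numpy as np

def prefix_match_reward(prediction: str, target: str) -> float:
    """Binary reward: 1.0 if the prediction is a (whitespace-normalized)
    prefix of the ground-truth segment, else 0.0. A generative reward
    model performs this judgement in the paper; string matching stands
    in for it here."""
    pred = " ".join(prediction.split())
    ref = " ".join(target.split())
    return 1.0 if pred and ref.startswith(pred) else 0.0

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Group-relative advantages: normalize the rewards of the G rollouts
    sampled for the same prompt by their group mean and std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy usage: G = 4 rollouts for one ASR prompt.
target = "The derivative of x**2 is 2*x."
rollouts = [
    "The derivative of x**2 is 2*x.",    # exact match  -> reward 1
    "The derivative of x**2",            # valid prefix -> reward 1
    "The derivative of x**3 is 3*x**2",  # wrong        -> reward 0
    "It equals 2x.",                     # wrong        -> reward 0
]
rewards = [prefix_match_reward(o, target) for o in rollouts]
advantages = grpo_advantages(rewards)    # positive for correct rollouts
print(rewards, advantages)
```

The resulting advantages would then weight the per-token log-probability ratios in the clipped surrogate objective, as in standard PPO/GRPO-style policy optimization.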
3. Distinction from RLHF, RLVR, and Other Paradigms
A fundamental distinction of RLPT is its reward source—derived directly from natural textual structure—compared to RLHF (which depends on human preference annotation) or RLVR (which requires explicit, reference-based correctness). RLPT rewards the policy for accurate segment prediction, obviating the need for external annotation or evaluation and making it suitable for scaling across the entire pre-training data distribution. This facilitates autonomous and scalable training, encouraging models to generalize reasoning abilities across a broader contextual landscape. The RLPT objective also differs in its explicit use of reasoning traces and “rollouts,” moving beyond pure supervised sequence modeling toward a more flexible, exploratory policy learning paradigm (Li et al., 23 Sep 2025).
4. Empirical Results and Scaling Behavior
Large-scale experiments with RLPT, particularly on models such as Qwen3-4B-Base, demonstrate substantial and consistent improvements across general-domain and mathematical reasoning benchmarks. Representative gains include:
| Model | Dataset | Supervised Baseline | RLPT Score | Absolute Gain |
|---|---|---|---|---|
| Qwen3-4B-Base | MMLU | 77.8 | 80.8 | +3.0 |
| Qwen3-4B-Base | MMLU-Pro | 59.7 | 64.8 | +5.1 |
| Qwen3-4B-Base | GPQA-Diamond | 31.3 | 39.4 | +8.1 |
| Qwen3-4B-Base | KOR-Bench | 50.7 | 56.7 | +6.0 |
| Qwen3-4B-Base | AIME24 (math) | - | - | +6.6 (Pass@1) |
| Qwen3-4B-Base | AIME25 (math) | - | - | +5.3 (Pass@1) |
Scaling curves fit a power-law relation with training compute, showing that additional training resources yield continued performance improvements, a crucial property for data- and compute-intensive LLM training. These effects are reproduced across other model scales (Qwen3-8B-Base, Llama-3.2-3B-Base), underscoring RLPT's robustness and generality (Li et al., 23 Sep 2025).
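As an illustration of how such scaling curves are typically obtained, the snippet below fits a saturating power law, score(C) = s_inf - a * C^(-b), to compute/score pairs. The data points, the functional form, and the initial guesses are placeholders for illustration, not measurements or fit parameters from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, s_inf, a, b):
    """Saturating power law: score approaches s_inf as compute grows."""
    return s_inf - a * np.power(compute, -b)

# Placeholder (compute, benchmark-score) pairs; NOT values from the paper.
compute = np.array([1e19, 2e19, 4e19, 8e19, 1.6e20, 3.2e20])
score = np.array([55.0, 58.2, 60.9, 62.8, 64.1, 65.0])

# Fit in a numerically friendlier unit (exaFLOPs) to avoid huge exponents.
params, _ = curve_fit(power_law, compute / 1e18, score,
                      p0=[70.0, 30.0, 0.3], maxfev=10000)
s_inf, a, b = params
print(f"fit: score(C) = {s_inf:.1f} - {a:.1f} * C^(-{b:.3f})")
print("extrapolated score at 1e21 FLOPs:", power_law(1e21 / 1e18, *params))
```

In practice one would fit against held-out benchmark scores logged at several compute budgets and compare the fitted exponent across model scales.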
5. Generalization and Reasoning Capability
The self-supervised next-segment reasoning structure of RLPT facilitates the emergence of latent reasoning skills. By encountering and being rewarded for correct—but not necessarily “memorized”—continuations, the model is incentivized to discover and utilize general, compositional reasoning strategies, improving its capacity for in-context understanding. Inclusion of both ASR and MSR objectives diversifies the set of reasoning paths explored during training, enabling the model to handle varied context lengths and types of reasoning queries. Empirically, this manifests as greater robustness to domain and task variation, as measured on diverse benchmarks, and as improved performance in follow-on tasks such as RLVR (reinforcement learning with verifiable rewards) (Li et al., 23 Sep 2025).
6. Connections, Limitations, and Future Directions
RLPT complements and extends self-supervised scaling by providing an RL-based mechanism that scales naturally with unlabeled data. The framework is also extensible: segmentation units need not be sentences—they could be atomic reasoning steps or subproblems inferred by the model, potentially yielding further performance gains. Refinements in the reward function (e.g., more nuanced prefix matching or alternative correctness criteria) may improve training stability and outputs. Potential future work includes combining RLPT with test-time scaling approaches (e.g., chain-of-thought prompting), domain-adaptive objectives, or hybrid reward functions. Open challenges include optimal design of segmentations, reward noise mitigation at scale, and rigorous evaluations of downstream generalization, especially in out-of-distribution tasks and under long-range context (Li et al., 23 Sep 2025).
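As one example of what a more nuanced correctness criterion could look like, the sketch below returns partial credit proportional to the length of the correct leading-token span. This is a hypothetical variant offered for illustration, not a reward studied in the paper.

```python
def graded_prefix_reward(prediction: str, target: str) -> float:
    """Softer alternative to a binary prefix reward: the fraction of the
    predicted tokens that form a correct prefix of the target segment.
    Purely illustrative; not the reward used in the paper."""
    pred_tokens = prediction.split()
    ref_tokens = target.split()
    if not pred_tokens:
        return 0.0
    matched = 0
    for p, r in zip(pred_tokens, ref_tokens):
        if p != r:
            break
        matched += 1
    return matched / len(pred_tokens)

# Example: a partially correct continuation receives partial credit.
print(graded_prefix_reward("The cat sat on the mat", "The cat sat on a rug"))  # 0.666...
```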
7. Broader Impact and Research Implications
RLPT introduces a training-time scaling paradigm that enables reinforcement learning over pre-training data without human-annotated feedback. This fosters more autonomous, scalable, and generalizable LLMs and provides a solid foundation for continued RL optimization (e.g., RLVR). By exploiting the structure of unlabeled corpora, RLPT has the potential to extend the data-efficient frontier of LLMs beyond existing supervised and RLHF regimes. Its ability to autonomously encourage rich reasoning trajectories and robust generalization will be central as models are deployed in increasingly challenging reasoning, mathematical, and scientific settings (Li et al., 23 Sep 2025).