Reinforcement Pre-Training (2506.08007v1)
Abstract: In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for LLMs and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where the model receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance LLM pre-training.
Summary
- The paper presents a novel reinforcement pre-training paradigm that reframes next-token prediction as a reasoning task using intrinsic rewards.
- It leverages multiple reasoning trajectories per context and uses prefix matching rewards to improve prediction accuracy and generalization.
- Experiments show that RPT outperforms standard approaches on challenging benchmarks and scales effectively with increased training compute.
This paper introduces Reinforcement Pre-Training (RPT), a novel paradigm for pre-training LLMs by reframing the traditional next-token prediction task as a next-token reasoning task trained using reinforcement learning (RL). Instead of directly predicting the next token, the model is incentivized to generate a chain-of-thought reasoning sequence before making its prediction. It receives a verifiable, intrinsic reward based on whether its predicted token matches the ground-truth next token from the pre-training corpus. This approach allows RL to be scaled to vast amounts of unannotated text data, addressing the scalability and generality limitations of current RL applications in LLMs, which often rely on domain-specific annotated data or costly human feedback.
Core Idea and Motivation
Current LLM pre-training primarily uses self-supervised next-token prediction. While RL has been effective for fine-tuning (e.g., RLHF for alignment, RLVR for specific skills), its application in pre-training has been limited by the need for annotated data or the risk of reward hacking with learned reward models. RPT aims to bridge this gap by using the pre-training corpus itself to provide verifiable rewards for a reasoning-augmented next-token prediction process.
The key advantages proposed for RPT are:
- Scalability and Generality: Leverages abundant unannotated text data for general-purpose RL pre-training.
- Reduced Reward Hacking: Uses rule-based rewards (correctness of prediction) which are less prone to hacking.
- Improved Generalization: Encourages deeper understanding and reasoning patterns over rote memorization of token sequences.
- Enhanced Prediction Accuracy: The internal reasoning process is akin to allocating more "thought" or computation per prediction, improving next-token prediction.
Methodology: Reinforcement Pre-Training (RPT)
- Next-Token Reasoning Task: For a given context $x_{<t}$ from the training corpus, the model $\pi_\theta$ generates a chain-of-thought reasoning sequence $c_t$ followed by a prediction $y_t$ for the next token, so the model's output is $o_t = (c_t, y_t)$. This transforms the pre-training corpus into a large-scale set of reasoning problems.
- Pre-Training with Reinforcement Learning:
  - On-policy rollouts: RPT uses on-policy RL. For a context $x_{<t}$, the LLM $\pi_\theta$ generates $G$ different "thinking trajectories" $\{o_t^i\}_{i=1}^{G}$, where each $o_t^i = (c_t^i, y_t^i)$.
  - Reward Function (Prefix Matching): Let $x_{\geq t}$ denote the byte sequence of the ground-truth completion and let the prediction $y_t^i$ be a byte sequence of length $l$. The reward $r_t^i$ is 1 if $y_t^i$ is an exact prefix of $x_{\geq t}$ and $l$ falls on a valid token boundary of the ground-truth completion (the set of such lengths is denoted $L_{\mathrm{gt}}$); otherwise it is 0. This supports multi-token predictions and out-of-vocabulary tokens. A sketch of this check appears at the end of this section.
    $$r_t^i = \begin{cases} 1 & \text{if } y_t^i = x_{\geq t}[1{:}l] \ \text{and}\ l \in L_{\mathrm{gt}} \\ 0 & \text{otherwise} \end{cases}$$
  - Objective Function: The model is trained to maximize the expected reward: $\mathcal{J}_{\mathrm{RPT}}(\theta) = \mathbb{E}_{(x_{<t},\, x_{\geq t}) \sim \mathcal{D},\ \{o_t^i\}_{i=1}^{G} \sim \pi_\theta(\cdot \mid x_{<t})}\big[\, r_t^i \,\big]$.
- Pre-Training Setup:
  - Dataset: OmniMATH (4,428 competition-level mathematical problems and solutions).
  - Data Filtering: A small proxy model (Deepseek-R1-Distill-Qwen-1.5B) computes the entropy of its next-token predictions; low-entropy (easy-to-predict) tokens are filtered out so that training focuses on challenging tokens.
  - Base Model: Deepseek-R1-Distill-Qwen-14B.
  - Training Framework: Implemented with the `verl` library, using `vLLM` for inference.
  - RL Algorithm: GRPO, with a learning rate of 1×10⁻⁶, zero KL penalty, and a batch size of 256 questions.
  - Rollouts: G = 8 responses sampled per question, with a sampling temperature of 0.8.
  - Prediction Extraction: The sequence inside the last `\boxed{}` after the special token `</think>` is taken as the next-token prediction.
  - Training Details: 8k training length, dynamic sampling enabled after 500 steps, and 1,000 total training steps for the main experiment.
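The prefix-matching reward and the `\boxed{}` extraction rule above are simple enough to sketch directly. The snippet below is a minimal illustration, not the paper's released code: the function names are hypothetical, and the use of a Hugging Face-style tokenizer to enumerate valid token boundaries (the set $L_{\mathrm{gt}}$) is an assumption.

```python
import re

def extract_prediction(rollout: str) -> str | None:
    """Return the content of the last \\boxed{...} appearing after the </think> token."""
    answer_part = rollout.split("</think>")[-1]
    matches = re.findall(r"\\boxed\{(.*?)\}", answer_part, flags=re.DOTALL)
    return matches[-1] if matches else None  # None if no boxed answer was produced

def prefix_matching_reward(prediction: str | None, ground_truth: str, tokenizer) -> float:
    """Reward 1.0 iff the predicted bytes are an exact prefix of the ground-truth
    completion and the prefix length lands on a valid token boundary (l in L_gt)."""
    if not prediction:
        return 0.0
    pred_bytes = prediction.encode("utf-8")
    gt_bytes = ground_truth.encode("utf-8")

    # Valid boundaries: cumulative byte lengths of the ground-truth tokens.
    # (A simplification; byte-level tokenizers may require offset mappings instead.)
    boundaries, total = set(), 0
    for tok_id in tokenizer.encode(ground_truth, add_special_tokens=False):
        total += len(tokenizer.decode([tok_id]).encode("utf-8"))
        boundaries.add(total)

    is_prefix = gt_bytes.startswith(pred_bytes)
    return 1.0 if is_prefix and len(pred_bytes) in boundaries else 0.0
```

In training, each of the G rollouts for a context would be scored this way against the ground-truth continuation, and the resulting 0/1 rewards fed to the GRPO update.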
Evaluation and Experiments
- Language Modeling Performance:
- Evaluated on a held-out OmniMATH validation set (200 samples), categorized into easy, medium, and hard splits based on entropy thresholds.
- Results: RPT-14B consistently outperformed the R1-Distill-Qwen-14B baseline (both standard NTP and next-token reasoning modes) and the original Qwen2.5-14B across all difficulty levels. For instance, on the hard split, RPT-14B achieved 23.75% accuracy compared to 20.43% for R1-Distill-Qwen-14B (standard NTP) and 1.41% for R1-Distill-Qwen-14B (next-token reasoning without RPT). RPT-14B's average performance matched that of a larger R1-Distill-Qwen-32B model.
Next-token prediction accuracy (%) by difficulty split:

| Model | Easy | Medium | Hard |
| --- | --- | --- | --- |
| Qwen2.5-14B (NTP) | 41.90 | 30.03 | 20.65 |
| R1-Distill-Qwen-14B (NTP) | 41.60 | 29.46 | 20.43 |
| R1-Distill-Qwen-14B (Reasoning) | 3.31 | 1.66 | 1.41 |
| RPT-14B (Reasoning) | 45.11 | 33.56 | 23.75 |

- Scaling Properties:
- Next-token prediction accuracy of RPT was evaluated at various training steps (compute levels).
- Results: Accuracy consistently improved with increased training compute, following a power-law relationship $P(C) = \frac{A}{C^{\alpha}} + P^{*}$ with high $R^2$ values, indicating a good fit and scalability (an illustrative curve-fitting sketch follows this list).
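As an illustration of how such a power-law fit could be produced, the sketch below fits the reconstructed form P(C) = A / C^α + P* with SciPy. The (compute, accuracy) values are hypothetical placeholders, not the paper's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(C, A, alpha, P_star):
    """Reconstructed scaling form: P(C) = A / C**alpha + P*."""
    return A / np.power(C, alpha) + P_star

# Hypothetical (compute, accuracy) pairs in arbitrary relative compute units --
# placeholders only; the paper fits one curve per difficulty split.
compute = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
accuracy = np.array([0.200, 0.212, 0.221, 0.228, 0.233])

params, _ = curve_fit(power_law, compute, accuracy, p0=(-0.05, 0.5, 0.24), maxfev=10_000)
A, alpha, P_star = params

residuals = accuracy - power_law(compute, *params)
r_squared = 1.0 - residuals.var() / accuracy.var()
print(f"A={A:.3g}, alpha={alpha:.3g}, P*={P_star:.3g}, R^2={r_squared:.3f}")
```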
- Reinforcement Fine-Tuning with RPT:
- RPT models were continually fine-tuned with RLVR on a Skywork-OR1 dataset sample (256 training, 200 testing examples).
- Results: RPT-14B achieved higher performance both before (56.3% vs 51.2%) and after (58.3% vs 52.7%) RLVR fine-tuning compared to the baseline R1-Distill-Qwen-14B. Continual NTP training on the same data significantly degraded reasoning ability.
Accuracy (%) before and after RLVR fine-tuning:

| Model | Before RL | After RL |
| --- | --- | --- |
| R1-Distill-Qwen-14B | 51.2 | 52.7 |
| + Continual NTP training | 10.7 | 13.0 |
| RPT-14B | 56.3 | 58.3 |

- Zero-Shot Performance on End Tasks:
- Evaluated on MMLU-Pro and SuperGPQA benchmarks in reasoning mode.
- Results: RPT-14B outperformed R1-Distill-Qwen-14B (both NTP and reasoning modes) and even the larger R1-Distill-Qwen-32B (NTP mode) on both benchmarks.
Zero-shot scores:

| Model | SuperGPQA | MMLU-Pro |
| --- | --- | --- |
| R1-Distill-Qwen-14B (NTP) | 32.0 | 48.4 |
| R1-Distill-Qwen-32B (NTP) | 37.2 | 56.5 |
| R1-Distill-Qwen-14B (Reasoning) | 36.1 | 68.9 |
| RPT-14B (Reasoning) | 39.0 | 71.1 |

- Next-Token Reasoning Pattern Analysis:
- Compared reasoning patterns (e.g., hypothesis, deduction, breakdown) in RPT-14B (for next-token reasoning) and R1-Distill-Qwen-14B (for problem-solving) on OmniMATH.
- Results: RPT-14B showed significantly more "hypothesis" (161.8% more) and "deduction" (26.2% more) patterns, while problem-solving relied more on "breakdown." This suggests RPT elicits a different, more inferential reasoning process. Case studies illustrated the model's deliberative process, analyzing context, brainstorming, and weighing alternatives.
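For intuition, pattern statistics of this kind can be gathered with simple keyword matching over generated reasoning traces. The sketch below is hypothetical: the keyword lists and matching rule are illustrative and do not reproduce the paper's exact analysis protocol.

```python
from collections import Counter

# Hypothetical keyword groups; the paper's exact taxonomy and matching rules may differ.
PATTERN_KEYWORDS = {
    "hypothesis": ["suppose", "what if", "maybe the next", "could be"],
    "deduction": ["therefore", "thus", "it follows that", "so the next token"],
    "breakdown": ["first,", "step by step", "break this down"],
}

def count_reasoning_patterns(traces: list[str]) -> Counter:
    """Count keyword occurrences for each reasoning pattern across a set of traces."""
    counts = Counter()
    for trace in traces:
        lowered = trace.lower()
        for pattern, keywords in PATTERN_KEYWORDS.items():
            counts[pattern] += sum(lowered.count(kw) for kw in keywords)
    return counts

# Example: tally pattern frequencies for a batch of next-token-reasoning traces.
rpt_traces = ["Suppose the sentence continues with a number; therefore the next token is likely '3'."]
print(count_reasoning_patterns(rpt_traces).most_common())
```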
Practical Implementation Considerations:
- Computational Cost: RPT involves generating multiple (G) trajectories (reasoning + prediction) for each training step, making it more computationally intensive than standard NTP. However, the paper argues this is akin to inference-time scaling applied at training time.
- Base Model Choice: RPT was initialized from a model (Deepseek-R1-Distill-Qwen-14B) already possessing some reasoning capabilities. Starting from a standard base LLM might require more tuning or yield different insights.
- Data Filtering: The entropy-based filtering of "easy" tokens is crucial for focusing RL on samples where reasoning is more beneficial. The choice of proxy model and entropy threshold for this filtering can impact performance (a minimal sketch of this step appears after this list).
- Prompt Engineering: The prompt template used to elicit reasoning is important. The paper explored several variants (Appendix D), noting that clear prompts significantly improve initial performance. The main experiments used a specific template (`v0`).
- Reward Design: While prefix matching was used, other reward designs (first-token matching, dense rewards) were explored and found to yield comparable performance, suggesting robustness in this aspect (Appendix A).
- Hyperparameter Tuning: RL training is sensitive to hyperparameters (learning rate, batch size, PPO parameters, sampling temperature). The paper provides a set of hyperparameters used (Appendix B).
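The following is a minimal sketch of the entropy-based filtering step referenced above, assuming the Hugging Face `transformers` API and the public DeepSeek-R1-Distill-Qwen-1.5B checkpoint as the proxy model; the threshold value is a hypothetical placeholder rather than the paper's setting.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROXY = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(PROXY)
model = AutoModelForCausalLM.from_pretrained(PROXY, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def next_token_entropies(text: str) -> list[tuple[int, float]]:
    """Return (position, entropy) for each next-token prediction made by the proxy model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits[0, :-1]                # logits at position t predict token t+1
    log_probs = torch.log_softmax(logits.float(), dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(-1)  # Shannon entropy per position
    return list(enumerate(entropy.tolist()))

# Keep only positions whose next token is hard to predict (entropy above a chosen threshold).
ENTROPY_THRESHOLD = 0.5  # hypothetical value; the paper's threshold is not reproduced here
hard_positions = [p for p, h in next_token_entropies("Some training document ...")
                  if h > ENTROPY_THRESHOLD]
```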
Pseudocode for RPT (Conceptual):
```text
Initialize LLM parameters θ
For each training epoch:
    For each batch of contexts {x_<t} drawn from the pre-training corpus D:
        For each context x_<t in the batch:
            // On-policy rollouts: sample G trajectories o_i = (c_i, y_i) from π_θ(· | x_<t),
            // where c_i is the chain-of-thought reasoning and y_i the next-token prediction
            {o_i}_{i=1..G} ← sample(π_θ, x_<t, G)

            // Score each trajectory with the prefix-matching reward against the
            // ground-truth continuation x_≥t (1 if y_i is a valid prefix, else 0)
            For each trajectory o_i:
                r_i ← prefix_matching_reward(y_i, x_≥t)

            // Store trajectories and rewards for the policy update
            Add (x_<t, {o_i}, {r_i}) to the rollout buffer

        // Update model parameters with an RL algorithm (e.g., GRPO),
        // maximizing the expected reward E[r]
        Update θ using the rollout buffer
```
Conclusion and Future Work
RPT is presented as a promising new scaling paradigm that improves next-token prediction, enhances zero-shot reasoning on downstream tasks, and provides a better foundation for RL fine-tuning. Limitations include experiments primarily on a 14B model and a math-focused corpus. Future work includes:
- Scaling to larger and more diverse corpora (general web text).
- Scaling up training compute.
- Establishing scaling laws specifically for RPT.
- Integrating hybrid thinking to adaptively trigger next-token reasoning.
- Investigating RPT training from standard base LLMs.
This approach effectively teaches the model to "think before it speaks" at the most fundamental level of language modeling – predicting the next token – by making the "thinking" process itself a target for optimization via reinforcement learning.
Related Papers
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking (2024)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025)
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models (2025)
- Concise Reasoning via Reinforcement Learning (2025)
- RAST: Reasoning Activation in LLMs via Small-model Transfer (2025)
HackerNews
- Reinforcement Pre-Training (70 points, 18 comments)
- Reinforcement Pre-Training (38 points, 9 comments)
- Reinforcement Pre-Training (17 points, 0 comments)
- Reinforcement Pre-Training (14 points, 5 comments)