Reinforcement Pre-Training (2506.08007v1)
Abstract: In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for LLMs and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where the model receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance LLM pre-training.
Summary
- The paper presents a novel reinforcement pre-training paradigm that reframes next-token prediction as a reasoning task using intrinsic rewards.
- It leverages multiple reasoning trajectories per context and uses prefix matching rewards to improve prediction accuracy and generalization.
- Experiments show that RPT outperforms standard approaches on challenging benchmarks and scales effectively with increased training compute.
This paper introduces Reinforcement Pre-Training (RPT), a novel paradigm for pre-training LLMs by reframing the traditional next-token prediction task as a next-token reasoning task trained using reinforcement learning (RL). Instead of directly predicting the next token, the model is incentivized to generate a chain-of-thought reasoning sequence before making its prediction. It receives a verifiable, intrinsic reward based on whether its predicted token matches the ground-truth next token from the pre-training corpus. This approach allows RL to be scaled to vast amounts of unannotated text data, addressing the scalability and generality limitations of current RL applications in LLMs, which often rely on domain-specific annotated data or costly human feedback.
Core Idea and Motivation
Current LLM pre-training primarily uses self-supervised next-token prediction. While RL has been effective for fine-tuning (e.g., RLHF for alignment, RLVR for specific skills), its application in pre-training has been limited by the need for annotated data or the risk of reward hacking with learned reward models. RPT aims to bridge this gap by using the pre-training corpus itself to provide verifiable rewards for a reasoning-augmented next-token prediction process.
The key advantages proposed for RPT are:
- Scalability and Generality: Leverages abundant unannotated text data for general-purpose RL pre-training.
- Reduced Reward Hacking: Uses rule-based rewards (correctness of prediction) which are less prone to hacking.
- Improved Generalization: Encourages deeper understanding and reasoning patterns over rote memorization of token sequences.
- Enhanced Prediction Accuracy: The internal reasoning process is akin to allocating more "thought" or computation per prediction, improving next-token prediction.
Methodology: Reinforcement Pre-Training (RPT)
- Next-Token Reasoning Task: For a given context $x_{<t}$ from the training corpus, the model $\pi_\theta$ generates a chain-of-thought reasoning sequence $c_t$ followed by a prediction $y_t$ for the next token, so the model's output is $o_t = (c_t, y_t)$. This transforms the pre-training corpus into a large-scale set of reasoning problems.
- Pre-Training with Reinforcement Learning:
  - On-policy rollouts: RPT uses on-policy RL. For a context $x_{<t}$, the LLM $\pi_\theta$ generates $G$ different "thinking trajectories" $\{o_t^i\}_{i=1}^{G}$, where each $o_t^i = (c_t^i, y_t^i)$.
  - Reward Function (Prefix Matching): Let $x_{\geq t}$ denote the byte sequence of the ground-truth completion and let the prediction $y_t^i$ be a byte sequence of length $l$. The reward $r_t^i$ is 1 if $y_t^i$ is an exact prefix of $x_{\geq t}$ and $l$ falls on a valid token boundary of the ground-truth completion (the set of such lengths is denoted $L_{\mathrm{gt}}$); otherwise it is 0. This supports multi-token predictions and out-of-vocabulary tokens. A sketch of this check appears at the end of this section.
    $$r_t^i = \begin{cases} 1 & \text{if } y_t^i = x_{\geq t}[1{:}l] \ \text{and}\ l \in L_{\mathrm{gt}} \\ 0 & \text{otherwise} \end{cases}$$
  - Objective Function: The model is trained to maximize the expected reward: $\mathcal{J}_{\mathrm{RPT}}(\theta) = \mathbb{E}_{(x_{<t},\, x_{\geq t}) \sim \mathcal{D},\ \{o_t^i\}_{i=1}^{G} \sim \pi_\theta(\cdot \mid x_{<t})}\big[\, r_t^i \,\big]$.
- Pre-Training Setup:
  - Dataset: OmniMATH (4,428 competition-level mathematical problems and solutions).
  - Data Filtering: A small proxy model (Deepseek-R1-Distill-Qwen-1.5B) computes the entropy of its next-token predictions; low-entropy (easy-to-predict) tokens are filtered out so that training focuses on challenging tokens.
  - Base Model: Deepseek-R1-Distill-Qwen-14B.
  - Training Framework: Implemented with the `verl` library, using `vLLM` for inference.
  - RL Algorithm: GRPO, with a learning rate of 1×10⁻⁶, zero KL penalty, and a batch size of 256 questions.
  - Rollouts: G = 8 responses sampled per question, with a sampling temperature of 0.8.
  - Prediction Extraction: The sequence inside the last `\boxed{}` after the special token `</think>` is taken as the next-token prediction.
  - Training Details: 8k training length, dynamic sampling enabled after 500 steps, and 1,000 total training steps for the main experiment.
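The prefix-matching reward and the `\boxed{}` extraction rule above are simple enough to sketch directly. The snippet below is a minimal illustration, not the paper's released code: the function names are hypothetical, and the use of a Hugging Face-style tokenizer to enumerate valid token boundaries (the set $L_{\mathrm{gt}}$) is an assumption.

```python
import re

def extract_prediction(rollout: str) -> str | None:
    """Return the content of the last \\boxed{...} appearing after the </think> token."""
    answer_part = rollout.split("</think>")[-1]
    matches = re.findall(r"\\boxed\{(.*?)\}", answer_part, flags=re.DOTALL)
    return matches[-1] if matches else None  # None if no boxed answer was produced

def prefix_matching_reward(prediction: str | None, ground_truth: str, tokenizer) -> float:
    """Reward 1.0 iff the predicted bytes are an exact prefix of the ground-truth
    completion and the prefix length lands on a valid token boundary (l in L_gt)."""
    if not prediction:
        return 0.0
    pred_bytes = prediction.encode("utf-8")
    gt_bytes = ground_truth.encode("utf-8")

    # Valid boundaries: cumulative byte lengths of the ground-truth tokens.
    # (A simplification; byte-level tokenizers may require offset mappings instead.)
    boundaries, total = set(), 0
    for tok_id in tokenizer.encode(ground_truth, add_special_tokens=False):
        total += len(tokenizer.decode([tok_id]).encode("utf-8"))
        boundaries.add(total)

    is_prefix = gt_bytes.startswith(pred_bytes)
    return 1.0 if is_prefix and len(pred_bytes) in boundaries else 0.0
```

In training, each of the G rollouts for a context would be scored this way against the ground-truth continuation, and the resulting 0/1 rewards fed to the GRPO update.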
Evaluation and Experiments
- Language Modeling Performance:
- Evaluated on a held-out OmniMATH validation set (200 samples), categorized into easy, medium, and hard splits based on entropy thresholds.
- Results: RPT-14B consistently outperformed the R1-Distill-Qwen-14B baseline (both standard NTP and next-token reasoning modes) and the original Qwen2.5-14B across all difficulty levels. For instance, on the hard split, RPT-14B achieved 23.75% accuracy compared to 20.43% for R1-Distill-Qwen-14B (standard NTP) and 1.41% for R1-Distill-Qwen-14B (next-token reasoning without RPT). RPT-14B's average performance matched that of a larger R1-Distill-Qwen-32B model.
Next-token prediction accuracy (%) by difficulty split:

| Model | Easy | Medium | Hard |
| --- | --- | --- | --- |
| Qwen2.5-14B (NTP) | 41.90 | 30.03 | 20.65 |
| R1-Distill-Qwen-14B (NTP) | 41.60 | 29.46 | 20.43 |
| R1-Distill-Qwen-14B (Reasoning) | 3.31 | 1.66 | 1.41 |
| RPT-14B (Reasoning) | 45.11 | 33.56 | 23.75 |

- Scaling Properties:
- Next-token prediction accuracy of RPT was evaluated at various training steps (compute levels).
- Results: Accuracy consistently improved with increased training compute, following a power-law relationship $P(C) = \frac{A}{C^{\alpha}} + P^{*}$ with high $R^2$ values, indicating a good fit and scalability (an illustrative curve-fitting sketch follows this list).
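As an illustration of how such a power-law fit could be produced, the sketch below fits the reconstructed form P(C) = A / C^α + P* with SciPy. The (compute, accuracy) values are hypothetical placeholders, not the paper's measurements.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(C, A, alpha, P_star):
    """Reconstructed scaling form: P(C) = A / C**alpha + P*."""
    return A / np.power(C, alpha) + P_star

# Hypothetical (compute, accuracy) pairs in arbitrary relative compute units --
# placeholders only; the paper fits one curve per difficulty split.
compute = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
accuracy = np.array([0.200, 0.212, 0.221, 0.228, 0.233])

params, _ = curve_fit(power_law, compute, accuracy, p0=(-0.05, 0.5, 0.24), maxfev=10_000)
A, alpha, P_star = params

residuals = accuracy - power_law(compute, *params)
r_squared = 1.0 - residuals.var() / accuracy.var()
print(f"A={A:.3g}, alpha={alpha:.3g}, P*={P_star:.3g}, R^2={r_squared:.3f}")
```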
- Reinforcement Fine-Tuning with RPT:
- RPT models were continually fine-tuned with RLVR on a Skywork-OR1 dataset sample (256 training, 200 testing examples).
- Results: RPT-14B achieved higher performance both before (56.3% vs 51.2%) and after (58.3% vs 52.7%) RLVR fine-tuning compared to the baseline R1-Distill-Qwen-14B. Continual NTP training on the same data significantly degraded reasoning ability.
Accuracy (%) before and after RLVR fine-tuning:

| Model | Before RL | After RL |
| --- | --- | --- |
| R1-Distill-Qwen-14B | 51.2 | 52.7 |
| + Continual NTP training | 10.7 | 13.0 |
| RPT-14B | 56.3 | 58.3 |

- Zero-Shot Performance on End Tasks:
- Evaluated on MMLU-Pro and SuperGPQA benchmarks in reasoning mode.
- Results: RPT-14B outperformed R1-Distill-Qwen-14B (both NTP and reasoning modes) and even the larger R1-Distill-Qwen-32B (NTP mode) on both benchmarks.
Zero-shot scores:

| Model | SuperGPQA | MMLU-Pro |
| --- | --- | --- |
| R1-Distill-Qwen-14B (NTP) | 32.0 | 48.4 |
| R1-Distill-Qwen-32B (NTP) | 37.2 | 56.5 |
| R1-Distill-Qwen-14B (Reasoning) | 36.1 | 68.9 |
| RPT-14B (Reasoning) | 39.0 | 71.1 |

- Next-Token Reasoning Pattern Analysis:
- Compared reasoning patterns (e.g., hypothesis, deduction, breakdown) in RPT-14B (for next-token reasoning) and R1-Distill-Qwen-14B (for problem-solving) on OmniMATH.
- Results: RPT-14B showed significantly more "hypothesis" (161.8% more) and "deduction" (26.2% more) patterns, while problem-solving relied more on "breakdown." This suggests RPT elicits a different, more inferential reasoning process. Case studies illustrated the model's deliberative process, analyzing context, brainstorming, and weighing alternatives.
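For intuition, pattern statistics of this kind can be gathered with simple keyword matching over generated reasoning traces. The sketch below is hypothetical: the keyword lists and matching rule are illustrative and do not reproduce the paper's exact analysis protocol.

```python
from collections import Counter

# Hypothetical keyword groups; the paper's exact taxonomy and matching rules may differ.
PATTERN_KEYWORDS = {
    "hypothesis": ["suppose", "what if", "maybe the next", "could be"],
    "deduction": ["therefore", "thus", "it follows that", "so the next token"],
    "breakdown": ["first,", "step by step", "break this down"],
}

def count_reasoning_patterns(traces: list[str]) -> Counter:
    """Count keyword occurrences for each reasoning pattern across a set of traces."""
    counts = Counter()
    for trace in traces:
        lowered = trace.lower()
        for pattern, keywords in PATTERN_KEYWORDS.items():
            counts[pattern] += sum(lowered.count(kw) for kw in keywords)
    return counts

# Example: tally pattern frequencies for a batch of next-token-reasoning traces.
rpt_traces = ["Suppose the sentence continues with a number; therefore the next token is likely '3'."]
print(count_reasoning_patterns(rpt_traces).most_common())
```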
Practical Implementation Considerations:
- Computational Cost: RPT involves generating multiple (G) trajectories (reasoning + prediction) for each training step, making it more computationally intensive than standard NTP. However, the paper argues this is akin to inference-time scaling applied at training time.
- Base Model Choice: RPT was initialized from a model (Deepseek-R1-Distill-Qwen-14B) already possessing some reasoning capabilities. Starting from a standard base LLM might require more tuning or yield different insights.
- Data Filtering: The entropy-based filtering of "easy" tokens is crucial for focusing RL on samples where reasoning is more beneficial. The choice of proxy model and entropy threshold for this filtering can impact performance (a minimal sketch of this step appears after this list).
- Prompt Engineering: The prompt template used to elicit reasoning is important. The paper explored several variants (Appendix D), noting that clear prompts significantly improve initial performance. The main experiments used a specific template (`v0`).
- Reward Design: While prefix matching was used, other reward designs (first-token matching, dense rewards) were explored and found to yield comparable performance, suggesting robustness in this aspect (Appendix A).
- Hyperparameter Tuning: RL training is sensitive to hyperparameters (learning rate, batch size, PPO parameters, sampling temperature). The paper provides a set of hyperparameters used (Appendix B).
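The following is a minimal sketch of the entropy-based filtering step referenced above, assuming the Hugging Face `transformers` API and the public DeepSeek-R1-Distill-Qwen-1.5B checkpoint as the proxy model; the threshold value is a hypothetical placeholder rather than the paper's setting.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PROXY = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(PROXY)
model = AutoModelForCausalLM.from_pretrained(PROXY, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def next_token_entropies(text: str) -> list[tuple[int, float]]:
    """Return (position, entropy) for each next-token prediction made by the proxy model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits[0, :-1]                # logits at position t predict token t+1
    log_probs = torch.log_softmax(logits.float(), dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(-1)  # Shannon entropy per position
    return list(enumerate(entropy.tolist()))

# Keep only positions whose next token is hard to predict (entropy above a chosen threshold).
ENTROPY_THRESHOLD = 0.5  # hypothetical value; the paper's threshold is not reproduced here
hard_positions = [p for p, h in next_token_entropies("Some training document ...")
                  if h > ENTROPY_THRESHOLD]
```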
Pseudocode for RPT (Conceptual):
```text
Initialize LLM parameters θ
For each training epoch:
    For each batch of contexts {x_<t} drawn from the pre-training corpus D:
        For each context x_<t in the batch:
            // On-policy rollouts: sample G trajectories o_i = (c_i, y_i) from π_θ(· | x_<t),
            // where c_i is the chain-of-thought reasoning and y_i the next-token prediction
            {o_i}_{i=1..G} ← sample(π_θ, x_<t, G)

            // Score each trajectory with the prefix-matching reward against the
            // ground-truth continuation x_≥t (1 if y_i is a valid prefix, else 0)
            For each trajectory o_i:
                r_i ← prefix_matching_reward(y_i, x_≥t)

            // Store trajectories and rewards for the policy update
            Add (x_<t, {o_i}, {r_i}) to the rollout buffer

        // Update model parameters with an RL algorithm (e.g., GRPO),
        // maximizing the expected reward E[r]
        Update θ using the rollout buffer
```
Conclusion and Future Work
RPT is presented as a promising new scaling paradigm that improves next-token prediction, enhances zero-shot reasoning on downstream tasks, and provides a better foundation for RL fine-tuning. Limitations include experiments primarily on a 14B model and a math-focused corpus. Future work includes:
- Scaling to larger and more diverse corpora (general web text).
- Scaling up training compute.
- Establishing scaling laws specifically for RPT.
- Integrating hybrid thinking to adaptively trigger next-token reasoning.
- Investigating RPT training from standard base LLMs.
This approach effectively teaches the model to "think before it speaks" at the most fundamental level of language modeling – predicting the next token – by making the "thinking" process itself a target for optimization via reinforcement learning.
Related Papers
- Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking (2024)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025)
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models (2025)
- Concise Reasoning via Reinforcement Learning (2025)
- RAST: Reasoning Activation in LLMs via Small-model Transfer (2025)
HackerNews
- Reinforcement Pre-Training (70 points, 18 comments)
- Reinforcement Pre-Training (38 points, 9 comments)
- Reinforcement Pre-Training (17 points, 0 comments)
- Reinforcement Pre-Training (14 points, 5 comments)