
RLP: Reinforcement as a Pretraining Objective (2510.01265v1)

Published 26 Sep 2025 in cs.LG, cs.AI, and cs.CL

Abstract: The dominant paradigm for training large reasoning models starts with pre-training using next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful in scaling reasoning, is introduced only as the very last phase of post-training, preceded by supervised fine-tuning. While dominant, is this an optimal way of training? In this paper, we present RLP, an information-driven reinforcement pretraining objective, that brings the core spirit of reinforcement learning -- exploration -- to the last phase of pretraining. The key idea is to treat chain-of-thought as an exploratory action, with rewards computed based on the information gain it provides for predicting future tokens. This training objective essentially encourages the model to think for itself before predicting what comes next, thus teaching an independent thinking behavior earlier in the pretraining. More concretely, the reward signal measures the increase in log-likelihood of the next token when conditioning on both context and a sampled reasoning chain, compared to conditioning on context alone. This approach yields a verifier-free dense reward signal, allowing for efficient training for the full document stream during pretraining. Specifically, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning. Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an eight-benchmark math-and-science suite by 19%. With identical post-training, the gains compound, with the largest improvements on reasoning-heavy tasks such as AIME25 and MMLU-Pro. Applying RLP to the hybrid Nemotron-Nano-12B-v2 increases the overall average from 42.81% to 61.32% and raises the average on scientific reasoning by 23%, demonstrating scalability across architectures and model sizes.

Summary

  • The paper introduces RLP, a novel reinforcement pretraining method that uses chains of thought to generate dense reward signals, improving model reasoning.
  • The methodology integrates gradient updates on chain-of-thought tokens with traditional likelihood training, yielding a 19% overall performance boost and 23% gain in scientific tasks.
  • The approach scales effectively across different LLM architectures, reducing data requirements while improving performance on mathematical and scientific reasoning benchmarks.

Reinforcement as a Pretraining Objective: An Exploration of RLP

Introduction

The paper "RLP: Reinforcement as a Pretraining Objective" explores the novel reinforcement learning pretraining method (RLP) for LLMs. Traditional training paradigms rely heavily on next-token prediction and incorporate reinforcement learning (RL) predominantly in the post-training phase. This approach aims to introduce exploring thought processes at the pretraining stage itself, using chains of thought (CoT) to drive predictive success and employing an information-gain based reward mechanism.

RLP Methodology

RLP supplements the typical pretraining phase by treating CoT as exploratory actions within the model's decision-making process. The model first samples a CoT before predicting subsequent tokens. The sampled CoT earns a reward based on the improvement in log-likelihood of the predicted token relative to a no-think Exponential Moving Average (EMA) baseline (Figure 1).

Figure 1: Visualization of the RLP framework. A chain-of-thought is sampled before next-token prediction. Rewards are computed by contrasting the predictor conditioned on the CoT with a No-think EMA baseline, yielding a verifier-free, dense signal.
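
To make the reward concrete: the true next tokens are scored once by the model conditioned on context plus the sampled CoT, and once by the no-think EMA baseline conditioned on context alone; the difference in log-likelihood is the dense reward. The sketch below illustrates this computation under stated assumptions (PyTorch tensors of logits; the function name and shapes are illustrative, not taken from the paper's code).

```python
import torch
import torch.nn.functional as F

def rlp_reward(policy_logits: torch.Tensor,
               ema_logits: torch.Tensor,
               next_token_ids: torch.Tensor) -> torch.Tensor:
    """Dense, verifier-free RLP-style reward: the information gain of thinking.

    policy_logits:  next-token logits when conditioning on context + sampled CoT, shape (B, T, V)
    ema_logits:     next-token logits from the no-think EMA baseline (context only), shape (B, T, V)
    next_token_ids: ground-truth next tokens, shape (B, T)
    """
    logp_think = F.log_softmax(policy_logits, dim=-1)
    logp_nothink = F.log_softmax(ema_logits, dim=-1)

    idx = next_token_ids.unsqueeze(-1)                              # (B, T, 1)
    gain = logp_think.gather(-1, idx) - logp_nothink.gather(-1, idx)
    return gain.squeeze(-1)                                         # (B, T); positive when the CoT helped
```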

RLP's reward mechanism differs from traditional RL approaches by providing dense, verifier-free signals applicable across domains. Parameters are optimized through gradient updates on the CoT tokens only, using group-relative advantages and a clipped surrogate loss, and these updates are interpolated with standard likelihood training.
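
The clipped surrogate with group-relative advantages resembles GRPO-style policy optimization applied only to CoT tokens. The sketch below is an assumption-laden illustration of that structure; the helper name, clipping range, and advantage normalization are illustrative rather than the paper's exact formulation.

```python
import torch

def clipped_cot_loss(logp_new, logp_old, rewards, cot_mask, clip_eps=0.2):
    """GRPO-style clipped surrogate restricted to chain-of-thought tokens.

    logp_new, logp_old: per-token log-probs of the sampled CoTs under the
                        current and rollout policies, shape (G, L)
    rewards:            scalar reward per rollout in the group, shape (G,)
    cot_mask:           1.0 on CoT tokens, 0.0 elsewhere, shape (G, L)
    """
    # Group-relative advantage: each rollout is compared to the group mean.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(-1)                                     # (G, 1), broadcast over tokens

    ratio = torch.exp(logp_new - logp_old)                      # importance ratio per CoT token
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv

    # Maximize the clipped surrogate -> minimize its negative, on CoT tokens only.
    per_token = -torch.minimum(unclipped, clipped) * cot_mask
    return per_token.sum() / cot_mask.sum().clamp(min=1.0)
```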

Experimental Insights

RLP was tested on diverse data corpora ranging from academic texts to freely available web data, and the results were benchmarked against strong baselines using Qwen3-1.7B-Base and Nemotron-Nano-12B-v2.

The empirical results show substantial improvement across reasoning benchmarks, with an average gain of 19% over models trained via continued pretraining. Gains were strongest on mathematical and scientific reasoning tasks, including a 23% improvement on scientific reasoning scores. Importantly, the impact of RLP remained pronounced even after strong post-training stages.

Scaling and Generalization

RLP's architecture-agnostic design supports its generalizability. Scaled to the Nemotron-Nano-12B-v2 model, RLP improved performance while requiring substantially less data, demonstrating efficient scaling across model architectures and sizes. This indicates that reinforcement pretraining with RLP can elevate base-model capabilities before the standard post-training pipeline.

Comparisons and Ablations

The paper benchmarks RLP against reinforcement pretraining through prefix-matching rewards (RPT). Under matched data and compute conditions, RLP shows superior average accuracies across tasks, validating its robust reward mechanisms.

The paper also examines RLP's behavior under varying rollout counts, completion lengths, and KL coefficients during training. The results indicate that configurations such as 16 rollouts and a completion length of 2048 tokens yield significant performance gains, without requiring a KL penalty in the objective.
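
For concreteness, the settings highlighted by the ablations could be collected into a configuration like the following; the parameter names are hypothetical, and only the values (16 rollouts, 2048-token completions, no KL penalty) come from the text above.

```python
# Illustrative RLP training configuration reflecting the ablation findings;
# parameter names are hypothetical, values follow the settings reported above.
rlp_config = {
    "num_rollouts": 16,             # CoT samples drawn per context (group size)
    "max_completion_length": 2048,  # maximum tokens per sampled chain-of-thought
    "kl_coefficient": 0.0,          # ablations found no KL penalty was needed
}
```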

Conclusion

RLP effectively reframes reinforcement learning for reasoning as a pretraining objective, improving the reasoning abilities of LLMs while requiring less training data. It provides a sound, scalable methodology that integrates reinforcement principles from the inception of training, in contrast with traditional post-training adjustments. This makes the framework a valuable contribution to reasoning-driven applications across broad linguistic domains in AI research.
