- The paper introduces RLV, which unifies policy reasoning and generative verification to enhance test-time compute scaling and accuracy.
- It demonstrates that integrating a value-based verifier during RL training significantly improves generalization and boosts performance by over 20% on the MATH dataset.
- The unified training of a single LLM streamlines inference by enabling efficient selection of answers via Best-of-N and weighted voting strategies.
This paper introduces RLV, a method that enhances reinforcement learning (RL) fine-tuning of LLMs for reasoning tasks by restoring the verification capability that a learned value function would otherwise provide, a capability given up by prevalent "value-free" RL approaches such as GRPO (DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, 5 Feb 2024), Leave-One-Out PPO (Reinforcement Learning for Long-Horizon Interactive LLM Agents, 3 Feb 2025), and VinePPO (VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment, 2 Oct 2024). While these value-free methods improve training efficiency by forgoing a separate value network, they sacrifice the ability to use a value-based verifier at test time, which is crucial for efficient compute scaling through techniques like Best-of-N or weighted majority voting. RLV addresses this by jointly training the LLM as both a reasoner (policy) and a generative verifier, leveraging the data generated during the RL training process itself.
The core idea of RLV is to train a single LLM using a unified objective function that combines the standard RL fine-tuning objective (JRL) with a generative verification objective (JVerify). The JVerify objective is a supervised fine-tuning (SFT) loss where the model is trained to predict a specific token (e.g., 'Yes' or 'No') indicating the correctness of a problem-solution pair. This training data consists of the solutions generated by the LLM policy during RL training, labeled with their correctness (reward). The unified objective is defined as:
JUnified(θ) := JRL(θ) + λJVerify(θ), where λ is a hyperparameter balancing the two objectives.
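A minimal PyTorch-style sketch of this combined objective is below; the tensor shapes, the REINFORCE-style stand-in for JRL, and the helper name `unified_loss` are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def unified_loss(policy_logprobs, advantages, verify_logits,
                 yes_token_id, no_token_id, correctness, lam=1.0):
    """Loss for J_Unified = J_RL + lambda * J_Verify (negated, so it can be minimized).

    policy_logprobs: (B, T) log-probs of the sampled solution tokens
    advantages:      (B,)   per-sample advantage from the value-free RL method
                            (e.g. group-relative as in GRPO, or leave-one-out)
    verify_logits:   (B, V) next-token logits at the verification position, i.e. after
                            "<problem> <solution> Is this solution correct? Answer Yes or No."
    correctness:     (B,)   bool tensor, True if the sampled solution is correct
    """
    # J_RL: a REINFORCE-style surrogate standing in for the chosen value-free
    # objective -- maximize advantage-weighted log-likelihood of the solution.
    j_rl = (advantages.unsqueeze(-1) * policy_logprobs).sum(dim=-1).mean()

    # J_Verify: generative verification as SFT -- next-token cross-entropy toward
    # 'Yes' for correct solutions and 'No' for incorrect ones, built from the very
    # samples (and rewards) produced during RL training.
    targets = torch.where(correctness,
                          torch.tensor(yes_token_id),
                          torch.tensor(no_token_id))
    j_verify_nll = F.cross_entropy(verify_logits, targets)

    # Minimizing this quantity maximizes J_RL + lambda * J_Verify.
    return -j_rl + lam * j_verify_nll
```

In practice the JRL term would be whichever surrogate the chosen value-free method uses (GRPO, Leave-One-Out PPO, VinePPO); only the added JVerify term is specific to RLV.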
At test time, the same fine-tuned LLM acts as both the reasoner, generating candidate solutions, and the verifier. For a given problem x and a generated solution y, the LLM scores the solution by computing the probability it assigns to the 'Yes' token when prompted with the verification instruction: s(x, y) := πθ(Yes ∣ x, y, 'Is this solution correct? Answer Yes or No.'). These scores are then used to select the final answer among the N sampled candidate solutions. The paper evaluates three selection strategies: Majority Voting (verifier-free baseline), Best-of-N (select the highest-scored solution), and Weighted Voting (sum the scores of samples that reach the same final answer and select the answer with the highest total).
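The scoring and selection step can be sketched as follows; the `verifier_score` helper assumes a Hugging Face-style causal LM `model`/`tokenizer` interface, and all names are illustrative rather than taken from the paper's code.

```python
from collections import defaultdict
import torch

VERIFY_PROMPT = "Is this solution correct? Answer Yes or No."

def verifier_score(model, tokenizer, problem, solution):
    """s(x, y): probability the unified model assigns to 'Yes' after the verification prompt."""
    text = f"{problem}\n{solution}\n{VERIFY_PROMPT}\n"
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]   # distribution over the next token
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    return next_token_logits.softmax(dim=-1)[yes_id].item()

def best_of_n(candidates):
    """candidates: list of (final_answer, score); pick the highest-scored sample's answer."""
    return max(candidates, key=lambda c: c[1])[0]

def weighted_voting(candidates):
    """Sum verifier scores over identical final answers; return the answer with the largest total."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)

def majority_voting(candidates):
    """Verifier-free baseline: most frequent final answer, ignoring scores."""
    counts = defaultdict(int)
    for answer, _ in candidates:
        counts[answer] += 1
    return max(counts, key=counts.get)
```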
The practical benefits of RLV demonstrated in experiments are significant:
- Improved Test-Time Compute Scaling: On the MATH dataset, RLV boosts accuracy by over 20% compared to the base RL method when using parallel sampling and enables 8-32x more efficient test-time compute scaling. The unified verifier is shown to be much more effective for scoring solutions than simply using the base RL policy via LLM-as-a-Judge prompting (Learning to summarize from human feedback, 2020).
- Strong Generalization: RLV exhibits strong generalization capabilities. It performs well not only on problems from the same distribution (MATH500) but also generalizes to harder problems (MATH²) and out-of-domain tasks like GPQA Physics, consistently outperforming the base RL methods across varying numbers of sampled solutions.
- Positive Transfer to Pass@1: The unified training process can positively impact the reasoner's performance even without additional test-time compute, leading to improved pass@1 accuracy (solving the problem with a single generation).
- Complementary to Sequential Scaling: RLV is shown to complement the sequential scaling often used in long Chain-of-Thought (CoT) models like DeepSeek-R1 (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 22 Jan 2025). When scaling both parallel samples and sequential generation length (using budget forcing (s1: Simple test-time scaling, 31 Jan 2025)), GRPOV consistently outperforms the base GRPO method. The verifier confidence can also be used to dynamically allocate more sequential compute (longer generation length) to more difficult problems until a confidence threshold is met (see the sketch after this list).
- Effective Unified Training: The generative verification approach used in RLV is empirically shown to be superior to alternative methods like using separate verification heads trained with binary cross-entropy or regression, both in terms of reasoner and verifier accuracy.
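A minimal sketch of that verifier-guided sequential scaling loop is below; the budget schedule, the 0.9 threshold, and the `generate`/`verifier_score` callables are hypothetical stand-ins for whatever budget-forcing setup is used.

```python
def solve_with_adaptive_budget(problem, generate, verifier_score,
                               budgets=(1024, 2048, 4096, 8192),
                               threshold=0.9):
    """Spend more sequential compute (longer CoT) only while the unified model's
    own verifier remains unconfident.

    generate(problem, max_new_tokens) -> candidate solution string
    verifier_score(problem, solution) -> p('Yes') from the same model
    Both callables, the budget schedule, and the threshold are illustrative.
    """
    solution, score = None, 0.0
    for budget in budgets:
        solution = generate(problem, max_new_tokens=budget)  # longer thinking each round
        score = verifier_score(problem, solution)
        if score >= threshold:  # confident enough: stop spending sequential compute
            break
    return solution, score
```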
Implementation considerations include the choice of the base "value-free" RL algorithm and tuning the verification coefficient λ. The paper finds that Leave-One-Out PPOV shows a better trade-off between reasoner and verifier accuracy compared to GRPOV, with peak reasoner performance occurring at an intermediate λ. For test-time inference strategies, Weighted Voting generally performs best for models generating shorter CoTs, while the optimal strategy can differ for long CoT models.
In summary, RLV provides a practical and efficient way to reintroduce verification capabilities into modern value-free RL methods for LLM reasoning. By training a single model jointly on reasoning and generative verification using RL-generated data, it significantly enhances test-time scaling efficiency and accuracy across various tasks and model sizes, while also showing promising generalization and complementarity with sequential reasoning techniques. The approach avoids the overhead of training and deploying separate verifier models.