- The paper introduces RLV, which unifies policy reasoning and generative verification to enhance test-time compute scaling and accuracy.
- It demonstrates that integrating a value-based verifier during RL training significantly improves generalization and boosts performance by over 20% on the MATH dataset.
- The unified training of a single LLM streamlines inference by enabling efficient selection of answers via Best-of-N and weighted voting strategies.
This paper introduces RLV, a method that enhances reinforcement learning (RL) fine-tuning of LLMs for reasoning tasks by restoring the verification capability that a learned value function would otherwise provide, a capability given up by prevalent "value-free" RL approaches such as GRPO (DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, 5 Feb 2024), Leave-One-Out PPO (Reinforcement Learning for Long-Horizon Interactive LLM Agents, 3 Feb 2025), and VinePPO (VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment, 2 Oct 2024). While these value-free methods improve training efficiency by forgoing a separate value network, they sacrifice the ability to use a value-based verifier at test time, which is crucial for efficient compute scaling through techniques like Best-of-N or weighted majority voting. RLV addresses this by jointly training the LLM as both a reasoner (policy) and a generative verifier, leveraging the data generated during the RL training process itself.
The core idea of RLV is to train a single LLM using a unified objective function that combines the standard RL fine-tuning objective (JRL) with a generative verification objective (JVerify). The JVerify objective is a supervised fine-tuning (SFT) loss where the model is trained to predict a specific token (e.g., 'Yes' or 'No') indicating the correctness of a problem-solution pair. This training data consists of the solutions generated by the LLM policy during RL training, labeled with their correctness (reward). The unified objective is defined as:
JUnified(θ) := JRL(θ) + λJVerify(θ), where λ is a hyperparameter balancing the two objectives.
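A minimal PyTorch-style sketch of this combined objective is below; the tensor shapes, the REINFORCE-style stand-in for JRL, and the helper name `unified_loss` are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def unified_loss(policy_logprobs, advantages, verify_logits,
                 yes_token_id, no_token_id, correctness, lam=1.0):
    """Loss for J_Unified = J_RL + lambda * J_Verify (negated, so it can be minimized).

    policy_logprobs: (B, T) log-probs of the sampled solution tokens
    advantages:      (B,)   per-sample advantage from the value-free RL method
                            (e.g. group-relative as in GRPO, or leave-one-out)
    verify_logits:   (B, V) next-token logits at the verification position, i.e. after
                            "<problem> <solution> Is this solution correct? Answer Yes or No."
    correctness:     (B,)   bool tensor, True if the sampled solution is correct
    """
    # J_RL: a REINFORCE-style surrogate standing in for the chosen value-free
    # objective -- maximize advantage-weighted log-likelihood of the solution.
    j_rl = (advantages.unsqueeze(-1) * policy_logprobs).sum(dim=-1).mean()

    # J_Verify: generative verification as SFT -- next-token cross-entropy toward
    # 'Yes' for correct solutions and 'No' for incorrect ones, built from the very
    # samples (and rewards) produced during RL training.
    targets = torch.where(correctness,
                          torch.tensor(yes_token_id),
                          torch.tensor(no_token_id))
    j_verify_nll = F.cross_entropy(verify_logits, targets)

    # Minimizing this quantity maximizes J_RL + lambda * J_Verify.
    return -j_rl + lam * j_verify_nll
```

In practice the JRL term would be whichever surrogate the chosen value-free method uses (GRPO, Leave-One-Out PPO, VinePPO); only the added JVerify term is specific to RLV.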
At test time, the same fine-tuned LLM acts as both the reasoner, generating candidate solutions, and the verifier. For a given problem x and a generated solution y, the LLM scores the solution by computing the probability it assigns to the 'Yes' token when prompted with the verification instruction: s(x, y) := πθ(Yes ∣ x, y, 'Is this solution correct? Answer Yes or No.'). These scores are then used to select the final answer among the N sampled candidate solutions. The paper evaluates three selection strategies: Majority Voting (verifier-free baseline), Best-of-N (select the highest-scored solution), and Weighted Voting (sum the scores of samples that reach the same final answer and select the answer with the highest total).
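The scoring and selection step can be sketched as follows; the `verifier_score` helper assumes a Hugging Face-style causal LM `model`/`tokenizer` interface, and all names are illustrative rather than taken from the paper's code.

```python
from collections import defaultdict
import torch

VERIFY_PROMPT = "Is this solution correct? Answer Yes or No."

def verifier_score(model, tokenizer, problem, solution):
    """s(x, y): probability the unified model assigns to 'Yes' after the verification prompt."""
    text = f"{problem}\n{solution}\n{VERIFY_PROMPT}\n"
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]   # distribution over the next token
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    return next_token_logits.softmax(dim=-1)[yes_id].item()

def best_of_n(candidates):
    """candidates: list of (final_answer, score); pick the highest-scored sample's answer."""
    return max(candidates, key=lambda c: c[1])[0]

def weighted_voting(candidates):
    """Sum verifier scores over identical final answers; return the answer with the largest total."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)

def majority_voting(candidates):
    """Verifier-free baseline: most frequent final answer, ignoring scores."""
    counts = defaultdict(int)
    for answer, _ in candidates:
        counts[answer] += 1
    return max(counts, key=counts.get)
```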
The practical benefits of RLV demonstrated in experiments are significant:
- Improved Test-Time Compute Scaling: On the MATH dataset, RLV boosts accuracy by over 20% compared to the base RL method when using parallel sampling and enables 8-32x more efficient test-time compute scaling. The unified verifier is shown to be much more effective for scoring solutions than simply using the base RL policy via LLM-as-a-Judge prompting (Learning to summarize from human feedback, 2020).
- Strong Generalization: RLV exhibits strong generalization capabilities. It performs well not only on problems from the same distribution (MATH500) but also generalizes to harder problems (MATH²) and out-of-domain tasks like GPQA Physics, consistently outperforming the base RL methods across varying numbers of sampled solutions.
- Positive Transfer to Pass@1: The unified training process can positively impact the reasoner's performance even without additional test-time compute, leading to improved pass@1 accuracy (solving the problem with a single generation).
- Complementary to Sequential Scaling: RLV is shown to complement the sequential scaling often used in long Chain-of-Thought (CoT) models like DeepSeek-R1 (DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 22 Jan 2025). When scaling both parallel samples and sequential generation length (using budget forcing (s1: Simple test-time scaling, 31 Jan 2025)), GRPOV consistently outperforms the base GRPO method. The verifier confidence can also be used to dynamically allocate more sequential compute (longer generation length) to more difficult problems until a confidence threshold is met (see the sketch after this list).
- Effective Unified Training: The generative verification approach used in RLV is empirically shown to be superior to alternative methods like using separate verification heads trained with binary cross-entropy or regression, both in terms of reasoner and verifier accuracy.
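A minimal sketch of that verifier-guided sequential scaling loop is below; the budget schedule, the 0.9 threshold, and the `generate`/`verifier_score` callables are hypothetical stand-ins for whatever budget-forcing setup is used.

```python
def solve_with_adaptive_budget(problem, generate, verifier_score,
                               budgets=(1024, 2048, 4096, 8192),
                               threshold=0.9):
    """Spend more sequential compute (longer CoT) only while the unified model's
    own verifier remains unconfident.

    generate(problem, max_new_tokens) -> candidate solution string
    verifier_score(problem, solution) -> p('Yes') from the same model
    Both callables, the budget schedule, and the threshold are illustrative.
    """
    solution, score = None, 0.0
    for budget in budgets:
        solution = generate(problem, max_new_tokens=budget)  # longer thinking each round
        score = verifier_score(problem, solution)
        if score >= threshold:  # confident enough: stop spending sequential compute
            break
    return solution, score
```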
Implementation considerations include the choice of the base "value-free" RL algorithm and tuning the verification coefficient λ. The paper finds that Leave-One-Out PPOV shows a better trade-off between reasoner and verifier accuracy compared to GRPOV, with peak reasoner performance occurring at an intermediate λ. For test-time inference strategies, Weighted Voting generally performs best for models generating shorter CoTs, while the optimal strategy can differ for long CoT models.
In summary, RLV provides a practical and efficient way to reintroduce verification capabilities into modern value-free RL methods for LLM reasoning. By training a single model jointly on reasoning and generative verification using RL-generated data, it significantly enhances test-time scaling efficiency and accuracy across various tasks and model sizes, while also showing promising generalization and complementarity with sequential reasoning techniques. The approach avoids the overhead of training and deploying separate verifier models.