Reinforcing General Reasoning without Verifiers (2505.21493v1)

Published 27 May 2025 in cs.LG and cs.CL

Abstract: The recent paradigm shift towards training LLMs using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.

Summary

  • The paper presents VeriFree, a novel RL framework that improves LLMs' general reasoning by directly maximizing the likelihood of reference answers without explicit verifiers.
  • It leverages reduced gradient variance and Reinforce Leave-One-Out (RLOO) to achieve faster convergence and superior performance on benchmarks like MMLU-Pro and SuperGPQA.
  • The method demonstrates transferable reasoning skills across domains, reducing reliance on verifier models and cutting computational overhead.

This paper introduces VeriFree, a novel verifier-free reinforcement learning (RL) method designed to enhance the general reasoning capabilities of LLMs without relying on explicit answer verifiers (2505.21493). The authors address the limitations of existing DeepSeek-R1-Zero-style RL, which excels in domains like math and coding where rule-based answer verification is feasible but struggles with general reasoning tasks (e.g., chemistry, law, biology) where such verification is difficult or impossible. While model-based verifiers (using another LLM) are a workaround, they introduce dependencies, potential for reward hacking, and computational overhead.

VeriFree bypasses the need for any verifier by directly maximizing the probability of generating the reference answer given a question and a model-generated reasoning trace. The core idea is to:

  1. Have the LLM (policy $\pi_\theta$) generate a reasoning trace $c$ in response to a question $q$.
  2. Concatenate this generated reasoning trace $c$ with the known reference answer $a^\star$ from the dataset.
  3. Evaluate the likelihood $\pi_\theta(a^\star|q,c)$ of the reference answer $a^\star$ conditioned on the question $q$ and the generated reasoning trace $c$.
  4. This likelihood $\pi_\theta(a^\star|q,c)$ serves as the reward signal (a minimal sketch of this reward computation follows below).
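
The following is a minimal sketch of that reward computation for a generic Hugging Face-style causal LM; the function name and prompt handling are illustrative assumptions, not the authors' released implementation.

```python
import torch

def verifree_reward(model, tokenizer, prompt_and_trace: str, reference_answer: str) -> float:
    """Return R = pi_theta(a* | q, c) as a scalar probability."""
    # Tokenize context (q + c) and reference answer a* separately; concatenating
    # the pieces mirrors the "patching point" issue discussed later in the paper.
    ctx_ids = tokenizer(prompt_and_trace, return_tensors="pt").input_ids
    ans_ids = tokenizer(reference_answer, add_special_tokens=False,
                        return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)

    logits = model(input_ids).logits                       # [1, T, vocab]
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
    # Log-probs of the answer tokens, conditioned on the question and trace.
    ans_logprobs = logprobs[0, ctx_ids.size(1) - 1:, :].gather(
        -1, ans_ids[0].unsqueeze(-1)).squeeze(-1)
    # During training the same log-likelihood also carries gradients (see the
    # surrogate loss below); here it is only evaluated as the reward signal.
    return ans_logprobs.sum().exp().item()
```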

The VeriFree objective function is $J_{\text{VeriFree}}(\theta; q, a^\star) = E_{c \sim \pi_\theta(\cdot|q)}\big[\pi_\theta(a^\star|q,c)\big]$. This is shown to be equivalent in expectation to the verifier-based objective $J_{\text{Verifier}}(\theta; q, a^\star) = E_{c \sim \pi_\theta(\cdot|q)}\, E_{a \sim \pi_\theta(\cdot|q,c)}\big[\mathds{1}_{\{a \equiv a^\star\}}\big]$ when there is a unique correct answer string.
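
The equivalence follows from marginalizing the sampled answer analytically: under the single-correct-answer assumption, the only string equivalent to $a^\star$ is $a^\star$ itself, so

$$E_{a \sim \pi_\theta(\cdot|q,c)}\big[\mathds{1}_{\{a \equiv a^\star\}}\big] = \sum_{a} \pi_\theta(a|q,c)\,\mathds{1}_{\{a \equiv a^\star\}} = \pi_\theta(a^\star|q,c).$$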

The gradient estimator for VeriFree is derived as:

$$\nabla_\theta J_{\text{VeriFree}}(\theta; q, a^\star) = E_{c \sim \pi_\theta(\cdot|q)}\bigg[R_{\text{VeriFree}}(q, a^\star, c)\big[\nabla_\theta\log\pi_\theta(c|q) + \nabla_\theta\log\pi_\theta(a^\star|q,c)\big]\bigg]$$

where $R_{\text{VeriFree}}(q, a^\star, c) = \pi_\theta(a^\star|q,c)$. The first term $\nabla_\theta\log\pi_\theta(c|q)$ is a policy gradient for the reasoning trace, and the second term $\nabla_\theta\log\pi_\theta(a^\star|q,c)$ is a reward-weighted supervised learning term for the reference answer.
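
In practice this gradient can be obtained by backpropagating through a surrogate loss in which the reward acts as a detached weight. The PyTorch-style sketch below is an illustration under that assumption, not the authors' released code; it assumes per-sequence log-probabilities have already been summed over tokens.

```python
import torch

def verifree_surrogate_loss(logp_trace: torch.Tensor,
                            logp_answer: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient matches the estimator above.

    logp_trace:  log pi_theta(c_i | q), summed over trace tokens, shape [G]
    logp_answer: log pi_theta(a* | q, c_i), summed over answer tokens, shape [G]
    """
    # Reward R = pi_theta(a* | q, c), detached so it weights the gradient
    # without contributing a gradient path of its own.
    reward = logp_answer.exp().detach()
    # Differentiating this term gives R * (grad log pi(c|q) + grad log pi(a*|q,c)).
    surrogate = reward * (logp_trace + logp_answer)
    # Negate because optimizers minimize; minimizing the loss ascends J_VeriFree.
    return -surrogate.mean()
```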

A key theoretical advantage highlighted is variance reduction. Theorem 1 states that the variance of the VeriFree gradient estimator is less than or equal to that of the verifier-based estimator, a result of Rao-Blackwellization by analytically marginalizing out the answer sampling step. The final on-policy gradient estimator incorporates RLOO (Reinforce Leave-One-Out) for further variance reduction:

$$\nabla_\theta J_{\text{VeriFree}}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \Big[A_i\,\nabla_\theta\log\pi_\theta(c_i|q) + R_i\,\nabla_\theta\log\pi_\theta(a^\star|q,c_i)\Big]$$

where $c_i \sim \pi_\theta(\cdot|q)$, $R_i = \pi_\theta(a^\star|q,c_i)$, and $A_i = R_i - \frac{1}{G-1}\sum_{j\neq i} \pi_\theta(a^\star|q,c_j)$.
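
A compact sketch of how this estimator might be assembled, again as an illustration rather than the released code: the leave-one-out baseline is computed from the rewards of the other $G-1$ traces, and only the trace term is advantage-weighted.

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """A_i = R_i minus the mean reward of the other G-1 traces (leave-one-out baseline)."""
    G = rewards.numel()
    loo_mean = (rewards.sum() - rewards) / (G - 1)
    return rewards - loo_mean

def verifree_rloo_loss(logp_trace: torch.Tensor,
                       logp_answer: torch.Tensor) -> torch.Tensor:
    """Surrogate loss for the on-policy estimator with the RLOO baseline."""
    rewards = logp_answer.exp().detach()      # R_i = pi_theta(a* | q, c_i)
    advantages = rloo_advantages(rewards)     # A_i, applied only to the trace term
    surrogate = advantages * logp_trace + rewards * logp_answer
    return -surrogate.mean()
```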

A practical implementation challenge addressed is tokenization at the "patching point" where the generated reasoning trace $c$ meets the reference answer $a^\star$. To ensure consistent tokenization, the authors define the end of $c$ at the token corresponding to "<answer" (without the closing ">"), which is equivalent to using "<answer" as a stop word during sampling.
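
A minimal sketch of that patching step, assuming an `<answer>...</answer>` output template; the exact template and sampling interface are assumptions, not taken from the released code.

```python
STOP_STRING = "<answer"  # the sampled trace c ends here, before the closing ">"

def patch_reference_answer(trace: str, reference_answer: str) -> str:
    """Append the reference answer a* to a trace that stopped at '<answer'.

    Because the trace is cut before '>', the patched continuation
    '>' + a* + '</answer>' tokenizes the same way it would have if the
    model had generated the answer itself.
    """
    assert trace.endswith(STOP_STRING), "sampling should stop at the '<answer' stop string"
    return trace + ">" + reference_answer + "</answer>"
```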

Experiments and Results:

  • Models: Qwen3 base models (1.7B, 4B, 8B parameters).
  • Training Data: "WebData," a curated dataset of ~61,000 samples from WebInstruct, filtered for quality and answer length.
  • Evaluation Benchmarks:
    • General Reasoning: MMLU-Pro, SuperGPQA, GPQA.
    • Math Reasoning: MATH-500, OlympiadBench, Minerva Math, GSM8K, AMC, AIME24.
  • Baselines:
    • "Verifier": A verifier-based approach using a fine-tuned Qwen2.5-Math-1.5B model as the verifier, optimized with Dr.GRPO.
    • Qwen3 base and instruct models, and other publicly available RL-tuned models.

Key Findings:

  1. Improved General Reasoning: VeriFree significantly improved the general reasoning capabilities of base LLMs on MMLU-Pro (12%-40% average accuracy gain) and SuperGPQA, often matching or surpassing instruct models and the "Verifier" baseline.
  2. Better Learning Efficiency: VeriFree demonstrated faster convergence and higher final accuracy compared to the verifier-based baseline, attributed to reduced gradient variance. (See Figure 4 for training dynamics).
  3. Model Confidence as Proxy: A strong positive correlation ($\rho = 0.82$) was found between MMLU-Pro accuracy and the model's average confidence $\pi_\theta(a^\star|q,c)$ during training, suggesting this confidence is a good proxy for reasoning capability.
  4. Transferable Reasoning Skills: A model trained with VeriFree on non-math data showed improved performance on math benchmarks, indicating that VeriFree learns generalizable reasoning skills.
  5. Ablation Studies:
    • The proposed tokenization-aware splitting strategy for reasoning traces was crucial for stable optimization.
    • RLOO significantly contributed to performance.
    • Incorporating an equivalence class of correct answers (instead of a single reference answer) yielded slight performance improvements on math tasks, suggesting that the single-reference-answer assumption is a minor limitation and an area for future work.

Comparison to Existing Verifier-Free Approaches:

The paper distinguishes VeriFree from variational inference-based methods like JLB (2503.19618) and LaTRO (2411.04282). While these methods also treat the reasoning trace as a latent variable, VeriFree's objective is argued to be closer to the original verifier-based objective (under a single-correct-answer assumption). VeriFree weights the reference-answer term by $\pi_\theta(a^\star|q,c)$, unlike JLB and LaTRO, which use a weight of 1; this weighting potentially avoids reinforcing mismatches between flawed reasoning and correct answers.
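
To make the distinction concrete, the following sketch contrasts the two weightings as described in the paper; it is a schematic illustration, not the JLB or LaTRO reference implementation.

```python
import torch

def answer_term(logp_answer: torch.Tensor, use_verifree_weight: bool) -> torch.Tensor:
    """Per-trace contribution of the reference-answer log-likelihood term."""
    if use_verifree_weight:
        # VeriFree: weight by the (detached) answer likelihood pi_theta(a* | q, c),
        # so traces that poorly support a* contribute little supervised signal.
        weight = logp_answer.exp().detach()
    else:
        # JLB / LaTRO, as characterized in the paper: unit weight on the answer term.
        weight = torch.ones_like(logp_answer)
    return weight * logp_answer
```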

Implementation Pseudocode Comparison:

| Verifier-based (R1-Zero) | VeriFree (Ours) |
| --- | --- |
| Model generates reasoning trace $c$ and answer $a$. | Model generates reasoning trace $c$. |
| Extract the answer $a$. | Patch in the correct answer $a^\star$. |
| Check the answer using a verifier. | Evaluate the probability $\pi_\theta(a^\star|q,c)$. |
| Reward $R_{\text{Verifier}} = 1$ if correct, $0$ otherwise. | Reward $R_{\text{VeriFree}} = \pi_\theta(a^\star|q,c)$. |
| Train with $\nabla_\theta J_{\text{Verifier}}$. | Train with $\nabla_\theta J_{\text{VeriFree}}$. |

Conclusion:

VeriFree offers a practical and effective method for extending R1-Zero-style RL training to general reasoning domains where verifiers are unavailable or costly. It achieves this by directly optimizing the likelihood of the reference answer given a generated reasoning trace, leading to comparable or superior performance to verifier-based methods with reduced computational requirements and improved learning efficiency due to lower variance gradients. The work provides a new perspective for LLM RL and a path towards building more general-purpose reasoners.
