- The paper presents VeriFree, a novel RL framework that improves LLMs' general reasoning by directly maximizing the likelihood of reference answers without explicit verifiers.
- It leverages a reduced-variance gradient estimator and REINFORCE Leave-One-Out (RLOO) to achieve faster convergence and superior performance on benchmarks like MMLU-Pro and SuperGPQA.
- The method demonstrates transferable reasoning skills across domains, reducing reliance on verifier models and cutting computational overhead.
This paper introduces VeriFree, a novel verifier-free reinforcement learning (RL) method designed to enhance the general reasoning capabilities of LLMs without relying on explicit answer verifiers (2505.21493). The authors address the limitations of existing DeepSeek-R1-Zero-style RL, which excels in domains like math and coding where rule-based answer verification is feasible but struggles with general reasoning tasks (e.g., chemistry, law, biology) where such verification is difficult or impossible. While model-based verifiers (using another LLM) are a workaround, they introduce dependencies, potential for reward hacking, and computational overhead.
VeriFree bypasses the need for any verifier by directly maximizing the probability of generating the reference answer given a question and a model-generated reasoning trace. The core idea (sketched in code after this list) is to:
- Have the LLM (policy $\pi_\theta$) generate a reasoning trace $c$ in response to a question $q$.
- Concatenate this generated reasoning trace $c$ with the known reference answer $a^\star$ from the dataset.
- Evaluate the likelihood $\pi_\theta(a^\star \mid q, c)$ of the reference answer $a^\star$ conditioned on the question $q$ and the generated reasoning trace $c$.
- This likelihood $\pi_\theta(a^\star \mid q, c)$ serves as the reward signal.
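A minimal sketch of this reward computation, assuming a Hugging Face causal LM; the model id, `<answer>` tags, and function names here are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works for the sketch; the paper trains Qwen3 base models (model id assumed).
model_name = "Qwen/Qwen3-1.7B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def verifree_reward(question: str, trace: str, reference_answer: str) -> torch.Tensor:
    """Return R = pi_theta(a* | q, c): the probability of the reference-answer
    tokens conditioned on the question and the generated reasoning trace."""
    # The trace is assumed to end at the "<answer" patching point (see the
    # tokenization note below); the reference answer and closing tag are patched in.
    prefix = question + trace                       # q followed by c, ending in "<answer"
    patched = ">" + reference_answer + "</answer>"  # tokens whose likelihood is scored

    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    answer_ids = tokenizer(patched, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=1)

    with torch.no_grad():                           # drop no_grad when training
        logits = model(input_ids).logits            # (1, T, vocab_size)

    log_probs = F.log_softmax(logits[:, :-1], dim=-1)            # predictions for tokens 1..T-1
    first = prefix_ids.shape[1]                                  # index of first patched token
    positions = torch.arange(first - 1, input_ids.shape[1] - 1)  # rows predicting patched tokens
    token_logps = log_probs[0, positions].gather(
        -1, input_ids[0, first:].unsqueeze(-1)
    ).squeeze(-1)

    return token_logps.sum().exp()                  # product of per-token probabilities
```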
The VeriFree objective function is $J_{\text{VeriFree}}(\theta; q, a^\star) = \mathbb{E}_{c \sim \pi_\theta(\cdot \mid q)}\big[\pi_\theta(a^\star \mid q, c)\big]$. This is shown to be equivalent in expectation to the verifier-based objective $J_{\text{Verifier}}(\theta; q, a^\star) = \mathbb{E}_{c \sim \pi_\theta(\cdot \mid q)} \mathbb{E}_{a \sim \pi_\theta(\cdot \mid q, c)}\big[\mathds{1}_{\{a \equiv a^\star\}}\big]$ when there is a unique correct answer string.
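The equivalence follows by writing out the inner expectation over sampled answers, assuming $a^\star$ is the single correct answer string:

$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid q, c)}\big[\mathds{1}_{\{a \equiv a^\star\}}\big] = \sum_{a} \pi_\theta(a \mid q, c)\, \mathds{1}_{\{a \equiv a^\star\}} = \pi_\theta(a^\star \mid q, c), \quad\text{hence}\quad J_{\text{Verifier}}(\theta; q, a^\star) = J_{\text{VeriFree}}(\theta; q, a^\star).$$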
The gradient estimator for VeriFree is derived as:

$$\nabla_\theta J_{\text{VeriFree}}(\theta; q, a^\star) = \mathbb{E}_{c \sim \pi_\theta(\cdot \mid q)}\Big[R_{\text{VeriFree}}(q, a^\star, c)\,\big(\nabla_\theta \log \pi_\theta(c \mid q) + \nabla_\theta \log \pi_\theta(a^\star \mid q, c)\big)\Big]$$

where $R_{\text{VeriFree}}(q, a^\star, c) = \pi_\theta(a^\star \mid q, c)$. The first term, $\nabla_\theta \log \pi_\theta(c \mid q)$, is a policy gradient for the reasoning trace, and the second term, $\nabla_\theta \log \pi_\theta(a^\star \mid q, c)$, is a reward-weighted supervised learning term for the reference answer.
A key theoretical advantage highlighted is variance reduction. Theorem 1 states that the variance of the VeriFree gradient estimator is less than or equal to that of the verifier-based estimator, a result of Rao-Blackwellization by analytically marginalizing out the answer sampling step.
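A one-line version of the standard Rao-Blackwell argument (a sketch of the intuition, not the paper's exact statement): writing $g(c, a)$ for the verifier-based single-sample gradient estimator, the VeriFree estimator is its conditional expectation $\mathbb{E}[g(c,a) \mid c]$, and by the law of total variance

$$\operatorname{Var}\!\big(\mathbb{E}[g(c,a) \mid c]\big) = \operatorname{Var}\big(g(c,a)\big) - \mathbb{E}\big[\operatorname{Var}(g(c,a) \mid c)\big] \;\le\; \operatorname{Var}\big(g(c,a)\big).$$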
The final on-policy gradient estimator incorporates RLOO (REINFORCE Leave-One-Out) for further variance reduction:

$$\nabla_\theta J_{\text{VeriFree}}(\theta) \approx \frac{1}{G}\sum_{i=1}^{G}\Big[A_i \cdot \nabla_\theta \log \pi_\theta(c_i \mid q) + R_i \cdot \nabla_\theta \log \pi_\theta(a^\star \mid q, c_i)\Big]$$

where $c_i \sim \pi_\theta(\cdot \mid q)$, $R_i = \pi_\theta(a^\star \mid q, c_i)$, and $A_i = R_i - \frac{1}{G-1}\sum_{j \neq i} \pi_\theta(a^\star \mid q, c_j)$.
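A minimal PyTorch-style sketch of this update, assuming the per-sample summed log-probabilities have already been computed with gradients attached; variable and function names are illustrative:

```python
import torch

def verifree_loss(logp_trace: torch.Tensor, logp_answer: torch.Tensor) -> torch.Tensor:
    """RLOO-style VeriFree loss for one question with G sampled traces.

    logp_trace:  (G,) summed log pi_theta(c_i | q) over trace tokens
    logp_answer: (G,) summed log pi_theta(a* | q, c_i) over reference-answer tokens
    """
    R = logp_answer.exp()                # R_i = pi_theta(a* | q, c_i)
    G = R.shape[0]
    baseline = (R.sum() - R) / (G - 1)   # leave-one-out mean of the other rewards
    A = (R - baseline).detach()          # advantage A_i, treated as a constant
    # Negative of the estimator: A_i * grad log pi(c_i|q) + R_i * grad log pi(a*|q,c_i);
    # rewards/advantages are detached so only the log-probability terms carry gradients.
    loss = -(A * logp_trace + R.detach() * logp_answer).mean()
    return loss
```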
A practical implementation challenge addressed is tokenization at the "patching point" where the generated reasoning trace $c$ meets the reference answer $a^\star$. To ensure consistent tokenization, the authors define the end of $c$ at the token corresponding to "<answer" (without the closing ">"), which is equivalent to using "<answer" as a stop word during sampling.
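A small sketch of how that patching point might be handled at the string level; the stop word and tags follow the description above, everything else is illustrative:

```python
def split_and_patch(generated: str, reference_answer: str) -> tuple[str, str]:
    """Truncate the generation at the '<answer' patching point and build the
    patched continuation containing the reference answer."""
    stop = "<answer"                                # stop word; note: no closing ">"
    trace = generated.split(stop)[0] + stop         # reasoning trace c ends exactly at "<answer"
    patched = ">" + reference_answer + "</answer>"  # tokens whose likelihood is evaluated
    return trace, patched
```

In practice the same effect is obtained by passing "<answer" as a stop string to the sampler, so the generated trace never tokenizes past the patching point.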
Experiments and Results:
- Models: Qwen3 base models (1.7B, 4B, 8B parameters).
- Training Data: "WebData," a curated dataset of ~61,000 samples from WebInstruct, filtered for quality and answer length.
- Evaluation Benchmarks:
- General Reasoning: MMLU-Pro, SuperGPQA, GPQA.
- Math Reasoning: MATH-500, OlympiadBench, Minerva Math, GSM8K, AMC, AIME24.
- Baselines:
- "Verifier": A verifier-based approach using a fine-tuned Qwen2.5-Math-1.5B model as the verifier, optimized with Dr.GRPO.
- Qwen3 base and instruct models, and other publicly available RL-tuned models.
Key Findings:
- Improved General Reasoning: VeriFree substantially improved the general reasoning capabilities of base LLMs on MMLU-Pro (average accuracy gains of 12%-40%) and SuperGPQA, often matching or surpassing instruct models and the "Verifier" baseline.
- Better Learning Efficiency: VeriFree demonstrated faster convergence and higher final accuracy compared to the verifier-based baseline, attributed to reduced gradient variance. (See Figure 4 for training dynamics).
- Model Confidence as Proxy: A strong positive correlation ($\rho = 0.82$) was found between MMLU-Pro accuracy and the model's average confidence $\pi_\theta(a^\star \mid q, c)$ during training, suggesting this confidence is a good proxy for reasoning capability.
- Transferable Reasoning Skills: A model trained with VeriFree on non-math data showed improved performance on math benchmarks, indicating that VeriFree learns generalizable reasoning skills.
- Ablation Studies:
- The proposed tokenization-aware splitting strategy for reasoning traces was crucial for stable optimization.
- RLOO significantly contributed to performance.
- Incorporating an equivalence class of correct answers (instead of a single reference answer) yielded slight performance improvements on math tasks, suggesting the single-reference-answer assumption is a minor limitation and an area for future work.
Comparison to Existing Verifier-Free Approaches:
The paper distinguishes VeriFree from variational inference-based methods like JLB (2503.19618) and LaTRO (2411.04282). While these methods also treat the reasoning trace as a latent variable, VeriFree's objective is argued to be closer to the original verifier-based objective (under a single-correct-answer assumption). VeriFree weights the reference answer term by $\pi_\theta(a^\star \mid q, c)$, unlike JLB and LaTRO, which use a weight of 1, potentially avoiding reinforcement of mismatches between flawed reasoning and correct answers.
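Schematically (as a paraphrase of the distinction drawn in the paper, not the exact objectives of JLB or LaTRO), the answer-likelihood term in the gradient differs only in its weight:

$$\underbrace{\pi_\theta(a^\star \mid q, c)\,\nabla_\theta \log \pi_\theta(a^\star \mid q, c)}_{\text{VeriFree}} \qquad \text{vs.} \qquad \underbrace{1 \cdot \nabla_\theta \log \pi_\theta(a^\star \mid q, c)}_{\text{weight-1 variants}}$$

so a trace under which the reference answer is unlikely contributes little to the supervised answer term in VeriFree, whereas the weight-1 variant still pulls the answer likelihood up regardless of the trace's quality.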
Implementation Pseudocode Comparison:
| Verifier-based (R1-Zero) | VeriFree (Ours) |
| --- | --- |
| Model generates reasoning trace $c$ and answer $a$. | Model generates reasoning trace $c$. |
| Extract the answer $a$. | Patch in the correct answer $a^\star$. |
| Check the answer using a verifier. | Evaluate the probability $\pi_\theta(a^\star \mid q, c)$. |
| Reward $R_{\text{Verifier}} = 1$ if correct, $0$ otherwise. | Reward $R_{\text{VeriFree}} = \pi_\theta(a^\star \mid q, c)$. |
| Train with $\nabla_\theta J_{\text{Verifier}}$. | Train with $\nabla_\theta J_{\text{VeriFree}}$. |
Conclusion:
VeriFree offers a practical and effective method for extending R1-Zero-style RL training to general reasoning domains where verifiers are unavailable or costly. It achieves this by directly optimizing the likelihood of the reference answer given a generated reasoning trace, yielding performance comparable or superior to verifier-based methods, with reduced computational requirements and improved learning efficiency due to lower-variance gradients. The work provides a new perspective on RL for LLMs and a path toward building more general-purpose reasoners.