- The paper introduces a decoupled RLVR approach where majority voting proposes answers and Lean-based verification assigns rewards.
- It demonstrates robust performance across mathematical reasoning, code generation, and multi-task benchmarks with enhanced diversity.
- The ResZero fallback mechanism preserves training stability by preventing consensus collapse and sustaining exploration.
JURY-RL: A Decoupled, Label-Free Framework for Reinforcement Learning with Verifiable Rewards
Motivation and Context
Improving LLM reasoning in machine-checkable domains via RL with verifiable rewards (RLVR) remains a central challenge, largely because human-annotated ground-truth rewards and carefully constructed programmatic reward specifications are costly and scale poorly. Label-free surrogates such as majority voting, entropy minimization, and LLM-as-a-Judge have been proposed to reduce annotation costs. However, these approaches are prone to reward hacking and training collapse, and they carry a severe risk of reinforcing consensus on incorrect outputs, which undermines both training stability and ground-truth alignment.
This work introduces JURY-RL, a label-free RLVR paradigm that separates answer proposal from reward disposal: model rollouts propose a candidate via scalable majority voting, and an external formal verifier (Lean) acts as the reward gatekeeper. The method is designed for scalability, strict truth alignment, and optimization stability, directly targeting the failure modes of existing label-free RLVR methods.
Framework Description
Decoupled Proposal and Disposal ("Votes Propose, Proofs Dispose")
JURY-RL's core innovation is the strict decoupling of proposal (answer generation by majority voting among rollouts) from reward assignment (disposal of positive reward via formal verification). Given an input $x$, $G$ rollouts from the policy yield candidate answers $\{a_i\}_{i=1}^{G}$; a majority vote selects the proposal $\hat{a}$. The formal verifier then checks $\hat{a}$ against the specification $\mathrm{spec}(x)$, which stands in for the ground truth.
This strategy ensures that:
- Verification is run only on the majority-voted candidate of each group, so costly formal verification is limited to one call per group.
- Only rollouts agreeing with a formally verified proposal are rewarded; no reinforcement of unverifiable consensus is permitted.
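The proposal and disposal steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the binary 1/0 reward values, and the `verify` callback (a stand-in for the Lean-based verifier) are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent answer and its vote share in the group."""
    counts = Counter(answers)
    proposal, votes = counts.most_common(1)[0]
    return proposal, votes / len(answers)


def proof_gated_rewards(
    question: str,
    answers: List[str],
    verify: Callable[[str, str], bool],  # hypothetical stand-in for the formal verifier
) -> Optional[List[float]]:
    """Reward rollouts that agree with a verified majority proposal.

    Returns None when the proposal is not verified, signalling that a
    fallback reward should be used instead.
    """
    proposal, _ = majority_vote(answers)
    # Exactly one verifier call per group, on the majority proposal only.
    if verify(question, proposal):
        # Assumed binary reward: 1 for agreement with the verified proposal.
        return [1.0 if a == proposal else 0.0 for a in answers]
    return None


def toy_verify(question: str, answer: str) -> bool:
    """Toy verifier for the demo; a real system would call a proof checker."""
    return question == "2+2" and answer == "4"
```

With the toy verifier, a group whose majority answer is "4" receives rewards only on the agreeing rollouts, while a group whose majority answer is wrong returns `None` and is handed to the fallback.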
Residual-Zero (ResZero) Fallback Reward
When the majority proposal cannot be formally verified (either due to incorrectness or inconclusive verification/failure in upstream autoformalization), JURY-RL activates the ResZero reward. This scheme:
- Assigns zero-mean, variance-preserving rewards to the residual (minority) answers, penalizing the unverifiable majority proportionally to its support.
- Ensures a stable, non-degenerate optimization gradient, thereby preventing both consensus collapse and optimization stagnation when verification is silent.
ResZero adaptively amplifies minority rewards as the (wrong) consensus strengthens, which empirically suppresses mode collapse, a critical deficiency of naive majority-voting and self-consistency approaches.
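To make the stated properties concrete, here is one possible construction that satisfies them. It is not the paper's ResZero formula, which this summary does not give: the function name `reszero_like_rewards`, the `penalty_scale` parameter, and the use of residual vote shares as the minority signal are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import List


def reszero_like_rewards(answers: List[str], penalty_scale: float = 1.0) -> List[float]:
    """Illustrative zero-mean fallback reward for an unverified majority.

    Assumed construction (not the paper's exact formula):
      * majority rollouts get a penalty proportional to the majority share,
      * residual rollouts get their vote share among residuals, scaled by
        the majority share so the signal grows as consensus strengthens,
      * the whole group is then recentred to have zero mean.
    """
    n = len(answers)
    proposal, votes = Counter(answers).most_common(1)[0]
    alpha = votes / n  # support of the unverified majority

    residual = [a for a in answers if a != proposal]
    if not residual:
        # Unanimous group: no residual answers, so no signal to distribute.
        return [0.0] * n

    residual_counts = Counter(residual)
    share = {a: c / len(residual) for a, c in residual_counts.items()}

    raw = [
        -penalty_scale * alpha if a == proposal else alpha * share[a]
        for a in answers
    ]
    mean = sum(raw) / n
    return [r - mean for r in raw]  # zero-mean by construction
```

For the group `["7", "7", "7", "3", "5"]`, the three majority rollouts end up with a negative reward and the two minority rollouts with a positive one, and the rewards sum to zero, so a group-normalized objective still receives a non-degenerate signal.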
Experimental Evaluation
JURY-RL is evaluated on diverse open-source instruction-tuned and base models (Qwen3-1.7B-Base, Llama-3.2-3B-Instruct, Qwen2.5-7B), trained on mathematical reasoning problems (MATH5000) and tested across mathematical, code, and general multi-task benchmarks. Strong baselines include Majority-Voting, Self-Certainty, Entropy Minimization, CoReward, LLM-as-a-Judge, LLM-KD, and fully supervised GRPO (GT).
Key findings:
- Math Reasoning: JURY-RL consistently attains or exceeds pass@1 accuracy of supervised ground-truth RL (GT), and yields significant improvement in pass@k, indicating increased solution diversity and broader coverage.
- Code and Instruction Generalization: The method transfers robustly to code generation (LiveCodeBench, CRUX) and multi-task language benchmarks (IFEval, MMLU-Pro), outperforming all label-free methods and matching or exceeding GT and LLM-KD.
- Training Stability: Unlike most label-free RLVR baselines, JURY-RL avoids entropy collapse, sustaining both performance gains and exploration throughout training. Empirically, majority-voting and zero-reward fallbacks degrade into consensus collapse and gradient starvation, respectively.
- Verifier Fidelity: The Lean-based verifier achieves 84.5% precision and 86.2% F1, compared with 75.9% precision and 84.8% F1 for LLM-as-a-Judge, which reduces false positives and the opportunity for reward hacking.
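For readers unfamiliar with the pass@k metric cited in the findings above, the sketch below implements the unbiased estimator commonly used in code-generation evaluation. The summary does not state which estimator the authors used, so treat this as an assumption about the metric rather than a description of their evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Because pass@k credits a problem as soon as any of the k samples is correct, a gain in pass@k beyond the gain in pass@1 indicates that correct answers are spread over more distinct rollouts, which is why it is read as a sign of solution diversity.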
Theoretical Analysis
The formulation of ResZero and the proof-gated reward is supported by rigorous theoretical analysis:
- Brittleness of Majority Voting: Under the group-normalized RL objective (GRPO), a vanilla majority-voting reward causes the group advantage of supporters to vanish and that of dissenters to diverge toward negative infinity as consensus strengthens, driving the policy toward entropy collapse and suppressing exploration.
- Zero-Mean, Variance-Preserving Fallback: ResZero is constructed to guarantee groupwise zero-mean reward and non-zero variance among residuals unless trivial consensus is achieved, which is critical for optimizer-agnostic stability and learning progress.
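The brittleness result can be checked with a short derivation. The following is a sketch under simplifying assumptions that may differ from the paper's setup: a binary reward for agreeing with the majority answer, normalization by the population standard deviation of the group, and no stabilizing epsilon term.

```latex
% Assumptions: r_i is 1 if rollout i agrees with the majority answer \hat{a}, else 0;
% \alpha is the majority share in a group of G rollouts, with 0 < \alpha < 1.
r_i = \mathbb{1}[a_i = \hat{a}], \qquad
\alpha = \frac{1}{G}\sum_{i=1}^{G} r_i, \qquad
\mu = \alpha, \qquad
\sigma = \sqrt{\alpha(1-\alpha)}

% Group-normalized advantage A_i = (r_i - \mu) / \sigma for each type of rollout:
A_{\text{supporter}} = \frac{1-\alpha}{\sqrt{\alpha(1-\alpha)}}
                     = \sqrt{\frac{1-\alpha}{\alpha}}
                     \;\xrightarrow{\;\alpha \to 1\;}\; 0

A_{\text{dissenter}} = \frac{0-\alpha}{\sqrt{\alpha(1-\alpha)}}
                     = -\sqrt{\frac{\alpha}{1-\alpha}}
                     \;\xrightarrow{\;\alpha \to 1\;}\; -\infty
```

As the majority share approaches 1, agreeing rollouts receive almost no learning signal while the few remaining dissenters are penalized ever more strongly, which matches the collapse dynamic described above.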
Implications and Future Directions
JURY-RL systematically addresses the pathologies of label-free RLVR by limiting reward assignment to externally verified correctness and ensuring that fallback signals perpetuate exploration without reward hacking or learning collapse. Empirically, it unifies the strongest properties of scalable, label-free RL and formal correctness, effectively matching (and in some metrics surpassing) supervised or judge-based training. Notably, the method increases response diversity and generalization robustness, which are essential for practical AI deployment in domains reliant on correctness over rhetorical plausibility.
From a theoretical perspective, the framework operationalizes a new paradigm: decoupling generative proposal from stringent, externally certified disposal of reward, a principle extendable to diverse machine-verifiable domains beyond mathematics (e.g., code, logic, or procedural planning). Integration with more advanced formal verifiers or autoformalization pipelines could further improve coverage and reward density without sacrificing reward precision. Extending the approach to domains lacking external checkers, or leveraging the method for online test-time RL adaptation, are promising future directions.
Conclusion
JURY-RL establishes a practical, theoretically principled approach to RLVR that is label-free, scalable, and robust to the failure modes of consensus- or judge-based surrogates. It attains performance and generalization on par with strong supervised RL, with superior diversity and stability. The framework demonstrates that reward signals rooted in sparse but highly reliable formal verification, coupled with a conservative fallback on ambiguous cases, offer a promising route to scalable, truth-aligned reasoning in LLMs.
Reference:
JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR (2604.25419)