Unsupervised RL with Verifiable Rewards
- The paper introduces URLVR, a novel framework that leverages deterministic, verifiable rewards to optimize policies without relying on human-labeled feedback.
- It presents a comprehensive taxonomy and composite reward schemes, integrating intrinsic and external signals to address reward hacking and verifier noise.
- Empirical results demonstrate robust scaling for applications including code generation and LLM fine-tuning, with adaptive corrections enhancing sample efficiency.
Unsupervised Reinforcement Learning with Verifiable Rewards (URLVR) is a framework in which reinforcement learning is applied to domains where reward signals are automatically computed through deterministic, objective verification processes, rather than through human annotation or learned preference models. URLVR enables policy optimization in LLMs and other systems by relying on signals such as reference answer matching, code execution results, or structured reference-based checks, thereby supporting unsupervised or semi-supervised learning at scale.
1. Formal Foundations and Taxonomy of Verifiable Rewards
Verifiable rewards are signals that are deterministically computed by checking a generated output against a ground truth or specification using an automated process. Formally, for an input $x$ and model output $y$, a verifiable reward $r(x, y) \in \{0, 1\}$ is a discrete signal that encodes objective correctness and compliance with a prescribed output format. For instance, in medical multiple-choice question answering, the reward function is:

$$r(x, y) = \begin{cases} 1 & \text{if the extracted answer matches the reference and the output format is valid} \\ 0 & \text{otherwise} \end{cases}$$
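In code, such a binary reward can be sketched as follows (a minimal sketch; the concrete format convention, a single answer letter inside `\boxed{...}`, is an illustrative assumption):

```python
import re

def verifiable_mcq_reward(output: str, reference: str) -> float:
    """Binary verifiable reward for multiple-choice QA.

    Returns 1.0 only when the output follows the assumed format
    (a single answer letter inside \\boxed{...}) AND the extracted
    letter matches the reference answer; otherwise 0.0.
    """
    match = re.search(r"\\boxed\{([A-D])\}", output)
    if match is None:  # format violation -> zero reward
        return 0.0
    return 1.0 if match.group(1) == reference else 0.0
```

Because the check is deterministic and automated, no human annotation or learned preference model is involved at reward time.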
URLVR methods are distinguished from reward modeling approaches such as RLHF or RLAIF, which rely on dense, differentiable, and often subjective signals. In contrast, verifiable rewards are sparse and objective, but susceptible to specification gaming, where models exploit loopholes in the verification process ("reward hacking").
A comprehensive taxonomy separates URLVR methods by the source of their reward signals (He et al., 9 Mar 2026):
- Intrinsic rewards: Use only model-internal signals such as certainty (token probabilities, self-consistency, entropy) or ensemble-based majority voting over model generations.
- External rewards: Rely on truths external to the model, such as reference answers, code/test execution, or structured information from unlabeled data.
Intrinsic rewards are limited by the model's prior knowledge and confidence alignment with correctness, while external rewards can scale as new robust verification mechanisms are introduced.
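As an illustration of the intrinsic category, a majority-voting reward over an ensemble of sampled generations can be sketched as follows. Note that it rewards agreement with the model's own consensus, never consulting ground truth:

```python
from collections import Counter

def majority_vote_reward(samples: list[str]) -> list[float]:
    """Intrinsic reward via ensemble majority voting.

    Each sampled answer receives reward 1.0 if it agrees with the
    majority answer across the ensemble, else 0.0. The signal is
    purely model-internal: no ground truth is consulted.
    """
    majority, _ = Counter(samples).most_common(1)[0]
    return [1.0 if s == majority else 0.0 for s in samples]
```

This makes the limitation above concrete: if the model's initial majority answer is wrong, the reward still reinforces it.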
2. Reward Design and Algorithms in URLVR
Reward functions in URLVR can be pure binary, composite (including format, structure, and content penalties), or soft model-based. A canonical composite reward function for verifiable QA is:

$$R(x, y) = R_{\text{ans}}(x, y) - \lambda_{\text{leak}}\, P_{\text{leak}}(y) - \lambda_{\text{struct}}\, P_{\text{struct}}(y)$$

where:
- $R_{\text{ans}}$ captures correctness and format,
- $P_{\text{leak}}$ and $P_{\text{struct}}$ penalize premature answer revelation and structural non-compliance, and
- $\lambda_{\text{leak}}, \lambda_{\text{struct}}$ are tunable weights (Tarek et al., 19 Sep 2025).
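A minimal sketch of such a composite reward follows. The `<think>...</think>` reasoning format and the leak check are illustrative assumptions, not the paper's exact specification:

```python
import re

def composite_reward(output: str, reference: str,
                     lam_leak: float = 0.5, lam_struct: float = 0.5) -> float:
    """Composite verifiable reward: correctness minus weighted penalties.

    Assumed (hypothetical) format: reasoning inside <think>...</think>,
    then the final answer inside \\boxed{...}. The leak check is an
    illustrative stand-in: it fires when 'Answer:' appears inside the
    reasoning block, i.e. the answer is revealed prematurely.
    """
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    boxed = re.search(r"\\boxed\{([A-D])\}", output)

    r_ans = 1.0 if (boxed and boxed.group(1) == reference) else 0.0
    p_struct = 0.0 if (think and boxed) else 1.0  # structural non-compliance
    p_leak = 1.0 if (think and "Answer:" in think.group(1)) else 0.0
    return r_ans - lam_leak * p_leak - lam_struct * p_struct
```

The penalty terms make otherwise-correct answers strictly less rewarding when they violate structure or leak the answer early, which is the mechanism used against reward hacking in Section 3.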
Soft reward models may be learned via self-supervised distillation approaches. For example, URLVR can utilize a "generative verifier," a distilled LLM trained to score responses via binary or soft confidence (Su et al., 31 Mar 2025). The standard RL objective with such a learned reward is:

$$\max_{\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] - \beta\, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)$$

where $r_\phi(x, y)$ is the (possibly soft) output of a reward-model LLM, $\pi_{\text{ref}}$ is the reference policy, and $\beta$ controls the strength of the KL regularizer.
In the code domain, VeRPO constructs dense rewards by weighing each passed unit test according to its empirical rarity and local density, then combining this with a global correctness anchor. This scheme produces both dense and verifiable feedback while maintaining robustness against reward misalignment (Wang et al., 7 Jan 2026).
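The rarity-weighting idea can be sketched as follows. This is a simplified interpretation rather than VeRPO's exact formulation: each test is weighted by the inverse of its empirical pass rate across rollouts, and full correctness earns a global anchor bonus:

```python
def dense_test_reward(passed: list[bool], pass_rates: list[float],
                      anchor: float = 1.0) -> float:
    """Dense verifiable reward over unit tests (simplified sketch).

    Each passed test contributes weight 1/pass_rate, so rarely passed
    (harder) tests count more; the sum is normalized by total weight.
    Passing every test adds a global correctness anchor, keeping the
    signal tied to full verifiable success.
    """
    weights = [1.0 / max(p, 1e-6) for p in pass_rates]
    dense = sum(w for w, ok in zip(weights, passed) if ok) / sum(weights)
    bonus = anchor if all(passed) else 0.0
    return dense + bonus
```

The anchor term is what keeps partial credit from dominating: a policy cannot collect the full reward by gaming easy tests alone.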
3. Robustness, Noise, and Reward Hacking Mitigation
A central challenge in URLVR is vulnerability to specification gaming and verifier noise. Verifier hacking can occur when a model places the correct answer in inappropriate regions of the output, or circumvents output-structure rules while still technically satisfying the checker.
Composite reward schemes explicitly penalize these behaviors, for example by embedding penalties for content leaks or structural violations (Tarek et al., 19 Sep 2025).
Automated verifiers can be imperfect, introducing false positives (FP) and false negatives (FN). These are modeled as a stochastic reward channel with error rates $\rho_{+}$ (FP) and $\rho_{-}$ (FN): the true binary reward $r$ is observed as $\tilde{r}$, with $\Pr[\tilde{r}=1 \mid r=0] = \rho_{+}$ and $\Pr[\tilde{r}=0 \mid r=1] = \rho_{-}$.
To recover the true policy gradient, lightweight corrections are applied:
- Backward correction uses the de-noised unbiased estimator $\hat{r} = (\tilde{r} - \rho_{+}) / (1 - \rho_{+} - \rho_{-})$, which requires estimates of both error rates.
- Forward correction reweights the score-function terms of the policy gradient, needing only an estimate of the FN rate $\rho_{-}$.
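The backward correction follows directly from the noise-channel model; a minimal sketch, assuming the error rates are known or estimated:

```python
def backward_corrected_reward(r_obs: float, rho_fp: float, rho_fn: float) -> float:
    """Unbiased de-noised reward estimate under a binary noise channel.

    With P(observe 1 | true 0) = rho_fp and P(observe 0 | true 1) = rho_fn,
    the estimator (r_obs - rho_fp) / (1 - rho_fp - rho_fn) has expectation
    equal to the true reward. Requires rho_fp + rho_fn < 1.
    """
    denom = 1.0 - rho_fp - rho_fn
    assert denom > 0, "noise rates must satisfy rho_fp + rho_fn < 1"
    return (r_obs - rho_fp) / denom
```

Note the corrected value can be negative (when a likely-false positive is observed); it is unbiased in expectation, not clipped per sample.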
Empirically, both corrections effectively recover oracle performance under synthetic and real-world verifier noise; the forward variant offers improved stability at high noise rates. FN rates can be estimated online via appeals to a lightweight LLM judge (Cai et al., 1 Oct 2025).
4. Scaling, Generalization, and Practical Algorithms
URLVR frameworks scale to broad domains by distilling robust reward models from high-confidence verifiers without per-domain annotation, enabling application to free-form medical, chemistry, psychology, and educational tasks (Su et al., 31 Mar 2025). Binary verifications remain consistent across LLMs when expert-written references exist, and generative scoring with small (7B) LLMs has been shown to match the performance of 72B-class teacher models.
In open-ended generation tasks, the RLVRR paradigm extends binary verification to "reward chains" by extracting key content points and style constraints from high-quality references, applying both content-based and style-based verifiable signals to guide policy optimization (Jiang et al., 26 Jan 2026). RLVRR achieves superior generalization and diversity compared to SFT and learned-RM RL, with minimal additional computation.
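One way a content-based verifiable signal of this kind could be computed is as coverage over reference key points. This is a simplified sketch, not the paper's procedure; exact substring matching stands in for the semantic matching a real verifier would use:

```python
def content_coverage_reward(output: str, key_points: list[str]) -> float:
    """Fraction of reference-derived key points mentioned in the output.

    Simplified sketch: case-insensitive substring matching approximates
    the semantic check a production verifier would perform.
    """
    text = output.lower()
    hits = sum(1 for kp in key_points if kp.lower() in text)
    return hits / len(key_points) if key_points else 0.0
```

Chaining such content checks with analogous style checks yields a graded yet still reference-verifiable signal for open-ended generation.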
Online data selection using contextual bandit samplers further enhances sample efficiency and robustness by adaptively choosing high-value rollouts for policy updates, with theoretical sublinear regret guarantees (Lu et al., 9 Feb 2026).
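A minimal UCB-style selector illustrates the flavor of such bandit-driven data selection (a hypothetical sketch; real contextual samplers condition on prompt features rather than a flat index):

```python
import math

def ucb_select(values: list[float], counts: list[int], t: int, c: float = 1.0) -> int:
    """Pick the rollout-source index with the highest upper confidence bound.

    values[i]: running mean training value observed for source i,
    counts[i]: number of times source i has been chosen, t: current step.
    Unvisited sources are selected first to guarantee exploration.
    """
    for i, n in enumerate(counts):
        if n == 0:
            return i
    scores = [v + c * math.sqrt(math.log(t) / n)
              for v, n in zip(values, counts)]
    return scores.index(max(scores))
```

The exploration bonus shrinks as a source is sampled more, which is what yields the sublinear-regret behavior cited above in the idealized bandit setting.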
In code generation, dense verifiable signals, difficulty weighting, and outcome anchoring enable stable and scalable RL without reliance on external reward models or critics, as evidenced by VeRPO's gains and negligible computational overhead (Wang et al., 7 Jan 2026).
5. Fundamental Limits and Theoretical Insights
Intrinsic URLVR methods, including majority-voting, entropy, and self-certainty rewards, universally drive a sharpening mechanism that amplifies the model's initial answer preference ("confidence->correctness" dynamics). The theoretical analysis demonstrates that under majority stability and effective learning, intrinsic rewards inexorably concentrate the policy on its initial majority answer, regardless of ground-truth correctness (He et al., 9 Mar 2026).
Empirical studies reveal a rise–then–fall dynamic: validation accuracy initially increases but eventually collapses as the model overfits to its prior. Collapse timing is determined by the alignment of model confidence with true correctness, not by algorithmic modifications. The Model Collapse Step, the RL training step at which reward accuracy falls below a threshold, is highly predictive of model trainability and scales closely with RLVR performance across models.
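Given a log of per-step reward accuracy, the Model Collapse Step can be located with a simple scan (a sketch; the threshold value here is an assumption):

```python
def model_collapse_step(reward_acc: list[float], threshold: float = 0.5) -> int:
    """Return the first training step at which reward accuracy falls
    below `threshold` (the Model Collapse Step), or -1 if no collapse.

    reward_acc[t] is the fraction of rewarded rollouts at step t that
    are actually correct against held-out ground truth.
    """
    for step, acc in enumerate(reward_acc):
        if acc < threshold:
            return step
    return -1
```

Computing this online requires a small held-out labeled set, which is why it serves as a diagnostic rather than a training signal.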
Intrinsic RLVR is robust for small-scale, test-time adaptation on in-domain datasets, but not for large-scale training. Intrinsic methods should therefore be reserved for localized adaptation, while scalable RLVR must engage external verification mechanisms.
6. Emerging Directions and Limitations
External URLVR methods leverage rapid verification mechanisms for scalable RL, including code execution, mathematical checking, or policy-extracted signals from unlabeled data. Such approaches can decouple model improvement from initial priors and escape the confidence–correctness ceiling imposed by intrinsic rewards.
Preliminary findings confirm that self-verification signals grounded in executable or checkable outputs (e.g., arithmetic, proof checking, end-to-end code execution) yield sustained improvement without collapse, opening the path for scalable, label-efficient RLVR in complex domains (Liao et al., 2 Mar 2026, Wang et al., 7 Jan 2026, He et al., 9 Mar 2026).
Nevertheless, several limitations persist:
- Quality and coverage of automated verifiers fundamentally constrain reward reliability.
- Reward hacking persists at the edges of specification, necessitating continual advancement of composite, reference-based, and style/content-informed verifiable signals (Tarek et al., 19 Sep 2025, Jiang et al., 26 Jan 2026).
- The design of subchecks and the avoidance of trivial partial success require careful curation, as in VeRPO (Wang et al., 7 Jan 2026).
- Scalability to free-form, ambiguous tasks remains limited by reference corpus coverage and verifiable signal construction (Jiang et al., 26 Jan 2026).
7. Empirical Summary and Practical Guidance
Key results across URLVR variants are summarized below:
| Method/Setting | Main Gain/Effect | Source |
|---|---|---|
| Composite rewards | Curb reward hacking (premature/leak) | (Tarek et al., 19 Sep 2025) |
| Generative reward model | Matches or exceeds SOTA with 7B LLM | (Su et al., 31 Mar 2025) |
| RLVRR (open-ended) | Outperforms SFT (10× data), improves diversity | (Jiang et al., 26 Jan 2026) |
| VeRPO (code) | Up to +8.8% pass@1, negligible overhead | (Wang et al., 7 Jan 2026) |
| Intrinsic URLVR | Rise–fall pattern, collapse at large scale | (He et al., 9 Mar 2026) |
| External verifiable RLVR | Sustained improvement, escapes collapse | (He et al., 9 Mar 2026, Liao et al., 2 Mar 2026) |
For practitioners:
- Prefer external verifiable reward mechanisms for large-scale RLVR.
- Use composite and reference-based signals to mitigate reward hacking and expand coverage.
- Measure model prior via Model Collapse Step; use intrinsic RLVR only for small-scale test-time tuning.
- Employ contextual bandit scheduling and dense, difficulty-weighted rewards for improved sample efficiency.
Research in URLVR continues to drive advances in scalable, robust, and efficient reinforcement learning for domains where automated verification is feasible, with ongoing work addressing reward hacking, verifier noise, and generalization beyond structured tasks.