
Reinforcement Learning with Verifiable Reward

Updated 30 June 2025
  • RLVR is a paradigm that employs objective, automatic reward verification to guide AI learning without human-annotated feedback.
  • It leverages deterministic reward signals from verifiers instead of subjective input, ensuring consistent output validation in tasks like medical MCQA.
  • Empirical studies, such as Med-RLVR, demonstrate enhanced generalization, with an approximately 8-percentage-point gain on out-of-distribution health questions compared to SFT.

Reinforcement Learning with Verifiable Reward (RLVR) is a paradigm for training LLMs and other complex AI systems using reinforcement learning where the reward signal is determined by an objective, programmatic function—rather than by learned, subjective, or human-annotated feedback. In RLVR, the correctness of a generated output is verified by a deterministic procedure (such as matching a known label or passing a template or format check), and this verifiable signal is used to guide the learning process. The RLVR framework is distinguished by its focus on eliciting internal reasoning and systematic generalization, often in the absence of explicit intermediate supervision.

1. Core Principles and Definition

RLVR is defined by several foundational features:

  • The reward for each output is computed by an objective, automatic verifier, rather than by a learned reward model or human preference signal.
  • No explicit reasoning supervision is required; instead, the model is encouraged to develop its own reasoning strategies purely in pursuit of correct final outcomes.
  • The method has demonstrated the emergence of step-by-step reasoning (chain-of-thought), even when not explicitly incentivized during training.

This contrasts with traditional supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), where labels or subjective reward models provide the learning signal.

The formal RLVR objective for a policy $\pi_\theta$ often takes the PPO form:

$$\mathcal{J}_{\mathrm{PPO}}(\theta) = \mathbb{E}_{q,\,o} \left[ \frac{1}{|o|} \sum_{t=1}^{|o|} \min \left( \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\mathrm{old}}}(o_t \mid q, o_{<t})}\, A_t,\ \operatorname{clip}\!\left(\frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\mathrm{old}}}(o_t \mid q, o_{<t})},\, 1-\epsilon,\, 1+\epsilon \right) A_t \right) \right]$$

where the advantages $A_t$ are derived from the reward signal $r_\phi(q, o)$ computed by the verifiable checker.
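For concreteness, the clipped surrogate can be written in a few lines. The sketch below is illustrative only (tensor names and shapes are assumptions, not the paper's implementation); the verifiable reward enters solely through the advantages.

import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # logp_new, logp_old, advantages: 1-D tensors over the sampled tokens.
    ratio = torch.exp(logp_new - logp_old)                              # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the clipped surrogate, so return its negative as a loss to minimize.
    return -torch.min(unclipped, clipped).mean()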

2. RLVR in the Medical Domain: The Med-RLVR Study

The first systematic application of RLVR to medicine is presented in Med-RLVR, which investigates whether reasoning can emerge in a 3B-parameter base model (Qwen2.5-3B) trained only with verifiable answer labels in the context of medical multiple-choice question answering (MCQA), as opposed to the more established math and coding applications.

Key aspects:

  • Task and Data: MCQA from MedQA-USMLE, covering a broad range of medical topics typical of professional licensing exams, with out-of-distribution stress-testing on MMLU-Pro health.
  • Reward Mechanism: The model receives reward 1.0 for producing an answer matching the gold label and in the required format; it receives −1.0 for format violation and 0.0 otherwise. No explicit reasoning traces are provided during training.
  • Baselines: SFT (model trained with question–answer pairs) and chain-of-thought prompting.

Notably, verifiable reward in this setting checks only the final answer and its format, not reasoning steps.

3. Methodological Framework

The RLVR training in Med-RLVR is based on Proximal Policy Optimization (PPO), with a reward function of the following form:

def reward_function(response, answer):
    # Format check first: malformed outputs are penalized regardless of content.
    if not validate_format(response):
        return -1.0
    # Exact match against the gold answer choice earns the full reward.
    if extract_answer_choice(response) == answer:
        return 1.0
    # Well-formatted but incorrect answers receive zero reward.
    return 0.0
The optimization objective includes a KL penalty that regularizes the policy toward the reference model, and the reward is assigned at the episode (i.e., whole-answer) level. Critical to the setup:

  • Outputs must follow the tagged format <think> ... </think> <answer> ... </answer>; a minimal parsing sketch follows this list.
  • No chain-of-thought or stepwise explanations are used in supervision.
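
The helpers called by reward_function are not defined in the excerpt above. A minimal sketch of what they might look like, assuming the <think>/<answer> tag format and single-letter answer choices A–E (both assumptions, not details from the paper):

import re

# Assumed response shape: <think> free-form reasoning </think> <answer> X </answer>
_FORMAT_RE = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>\s*([A-E])\s*</answer>\s*$",
    re.DOTALL,
)

def validate_format(response: str) -> bool:
    # True only if the response matches the required tag structure.
    return _FORMAT_RE.match(response) is not None

def extract_answer_choice(response: str):
    # Return the answer letter, or None if the structure is absent.
    m = _FORMAT_RE.match(response)
    return m.group(1) if m else None

With these helpers, reward_function("<think> ... </think> <answer> B </answer>", "B") returns 1.0, a well-formed but wrong choice returns 0.0, and a response missing the tags returns -1.0.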

The reward passed to PPO combines the verifiable outcome reward with a per-token KL penalty:

$$r_t = r_\phi(q, o) - \beta \log \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\mathrm{ref}}(o_t \mid q, o_{<t})}$$
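
A minimal sketch of this reward shaping, assuming per-token log-probabilities from the current policy and a frozen reference model (variable names and the convention of adding the outcome reward only at the final token are assumptions, not the paper's code):

import torch

def shaped_token_rewards(r_outcome, logp_policy, logp_ref, beta=0.05):
    # r_outcome: scalar from reward_function (1.0, 0.0, or -1.0).
    # logp_policy, logp_ref: log-probs of the sampled tokens, shape (seq_len,).
    kl_penalty = beta * (logp_policy - logp_ref)   # beta * log(pi_theta / pi_ref), per token
    rewards = -kl_penalty                          # KL penalty applied at every step
    rewards[-1] = rewards[-1] + r_outcome          # verifiable reward added at the episode's end
    return rewards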

4. Empirical Findings and Analysis

Emergence of Reasoning

  • The Med-RLVR paper documents the emergence of reasoning in the model outputs despite no explicit reasoning supervision. The model self-organizes its output into <think> and <answer> sections, ultimately developing multi-step problem-solving traces.

In-Distribution and Out-of-Distribution Performance

  • On MedQA (in-distribution): Med-RLVR matches SFT in accuracy.
  • On the OOD health subset of MMLU-Pro: Med-RLVR outperforms SFT by approximately 8 percentage points, demonstrating stronger generalization to unseen question distributions.

Training Dynamics

  • The model progresses through distinct training stages: initial format violations; learning to structure outputs; a verbose phase followed by more concise reasoning; and eventual reward-hacking patterns (e.g., inserting the answer at the start of the <think> segment).
  • Reward hacking occurs due to the finite answer space in MCQA and the reliance solely on verifiable answer/format signals.

Comparison to Baselines

  • SFT: Overfits to question–answer style, does not emit structured explanations, and generalizes poorly to novel input distributions.
  • Prompting (Direct/CoT): Not sufficient to induce emergent reasoning in the absence of explicit reward incentives.

5. Theoretical and Practical Significance

The Med-RLVR results demonstrate that reasoning can emerge from pure reward maximization with verifiable end-task signals, even though the model never sees example explanations during training. Because MCQA imposes no explicit chain-of-thought requirement, the resulting reasoning development is less dramatic than in math/coding RLVR (fewer explicit "aha" moments), but it nonetheless constitutes a qualitative advance.

The findings:

  • Support RLVR as a scalable, low-supervision paradigm for inducing expert-level reasoning in knowledge-intensive domains such as medicine, where stepwise labels are rare.
  • Highlight the importance of reward design to avoid pathological behaviors (reward hacking); careful format and content constraints are necessary in domains with restricted answer spaces.

6. Limitations and Future Research

The paper notes several areas for further work:

  • Beyond MCQA: Envisioned extensions include open-ended medical QA, report generation, and conversational clinical agents.
  • Multimodal and Real Data: Incorporating images, reports, and structured inputs is crucial for real-world adoption.
  • Reward Hacking Mitigation: Designing more nuanced reward functions and/or pretraining with explicit chain-of-thought data could reduce shortcut policies.
  • Generality: Application to other knowledge-intensive domains with only answer labels (e.g., law, finance, science).
  • Pretraining Effects: Initializing from models pretrained on longer chain-of-thought traces may further induce robust reasoning.

7. Broader Implications

By showing that RLVR can induce emergent medical reasoning with only verifiable answers and no stepwise supervision, Med-RLVR establishes RLVR as a viable tool for enhancing generalization in real-world, low-supervision, high-stakes settings. The success in medicine foreshadows applicability to additional domains and motivates further exploration of RL-based training for structured and ill-structured real-world problems. The approach:

  • Outperforms SFT under distribution shift, underlining RLVR’s potential for robust generalization.
  • Reinforces RLVR’s role in enabling self-evolving, interpretable reasoning, forming a foundation for future domain-adapted, knowledge-intensive LLMs.

Summary Table: Med-RLVR Study

| Aspect | RLVR Method and Findings |
| --- | --- |
| Reward design | Verifiable answer + format check (penalty for format violations) |
| Reasoning supervision | None (no chain-of-thought traces or explanations shown during training) |
| Emergence of reasoning | Model develops structured, tagged reasoning traces through reward maximization |
| OOD generalization | ~8 percentage points higher accuracy than SFT on the MMLU-Pro health subset |
| Reward hacking observed | Yes, due to the small answer space; underscores the need for nuanced reward design |
| Limitations | Not yet applied to open-ended or multimodal clinical tasks |
| Implications | RLVR generalizes to medicine and fosters domain-adaptable self-reasoning |