Med-RLVR Framework for Medical Reasoning
- The paper introduces Med-RLVR, which unifies deterministic reward functions with curriculum-guided reinforcement learning for robust medical reasoning.
- It leverages transformer-based architectures and PPO-like optimization to generate structured responses for tasks such as QA, VQA, and VIE.
- Empirical results demonstrate significant in-domain and OOD improvements, with accuracy gains up to 11.4% and enhanced performance in low-annotation settings.
The Med-RLVR (Medical Reinforcement Learning with Verifiable Rewards) framework comprises a class of reinforcement learning (RL) approaches that target robust, clinically relevant reasoning in medical LLMs and multimodal vision-LLMs (VLMs). Med-RLVR systems unify rule-based, deterministic reward functions with curriculum- or data-driven RL protocols, incentivizing explicit reasoning patterns and out-of-distribution (OOD) generalization without reliance on handcrafted chain-of-thought (CoT) supervision. These frameworks are central to several recent state-of-the-art medical LLMs and VLMs, facilitating curriculum-guided skill acquisition, expert-level inference, and domain-specific extraction in medical question answering (QA), visual question answering (VQA), visual information extraction (VIE), and retrieval-augmented reasoning.
1. Foundations and Problem Scope
Med-RLVR frameworks formalize medical reasoning tasks as episodic Markov Decision Processes (MDPs), where each episode corresponds to generating a reasoning trace and/or answer sequence for a medical prompt—textual, visual, or multimodal—using a sequence of token-level actions. States encode the evolving sequence of generated tokens, images, external retrievals, and any supporting schema or structured context. The agent’s policy is a (multimodal) autoregressive LLM parameterized by $\theta$.
A defining feature is the use of “verifiable” rewards: deterministic, rule-based functions derived from answer correctness, response format, adherence to prescribed schema (e.g., XML/JSON structures), and, where relevant, clinical knowledge or extraction quality (such as measured by precision-recall on extracted fields or answer similarity metrics like ROUGE/BERTScore). Unlike human-centric reward models, these verifiers are robust to “reward hacking,” can be computed without learned critics, and enable scaling RL to thousands of samples in diverse clinical applications (Zhang et al., 27 Feb 2025, Rui et al., 25 May 2025, Liu et al., 16 Jun 2025).
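As a minimal illustration of such a verifier (a sketch, not the exact reward used in any of the cited works), a rule-based multiple-choice QA reward can combine a template check with exact-match scoring; the tag names and weights below are assumptions:

```python
import re

# Assumed template: the model wraps its rationale and final choice in tags.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
ANSWER_RE = re.compile(r"<answer>\s*([A-E])\s*</answer>", re.IGNORECASE)

def verifiable_qa_reward(response: str, gold_choice: str,
                         format_weight: float = 0.1) -> float:
    """Deterministic reward: format adherence plus exact-match correctness.

    The 0.1 format bonus and single-letter answer space are illustrative only.
    """
    reward = 0.0
    if THINK_RE.search(response) and ANSWER_RE.search(response):
        reward += format_weight                      # reward well-formed structure
    match = ANSWER_RE.search(response)
    if match and match.group(1).upper() == gold_choice.upper():
        reward += 1.0                                # reward the correct final answer
    return reward

# Example: verifiable_qa_reward("<think>...</think><answer>B</answer>", "B") -> 1.1
```

Because the reward depends only on string parsing, it is reproducible, requires no learned critic, and is cheap to evaluate over large rollout batches.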
Med-RLVR has been instantiated for:
- Multiple-choice and open-ended medical QA (text and image modalities)
- Medical Visual Information Extraction (VIE) from documents and reports
- Multimodal reasoning (e.g., VQA over radiology/image datasets)
- Retrieval-augmented and agentic clinical reasoning
2. Model Architectures and RL Algorithms
The core Med-RLVR protocol typically utilizes a transformer-based LLM (e.g., Qwen2.5-VL, LLaMA3.x, Gemma) as the reasoning agent. Vision-based variants employ BLIP-2/LLaVA-style architectures, with a frozen vision transformer (ViT) and Q-former cross-modal fusion for patch-level image features (Rui et al., 25 May 2025).
Policy optimization in Med-RLVR eschews value function estimation and instead leverages Group Relative Policy Optimization (GRPO) or PPO-like clipped surrogate objectives, driven by group-wise normalized advantages. Specifically, for each prompt $q$:
- $G$ candidate rollouts $o_1, \dots, o_G$ are sampled from the current or previous policy.
- Each rollout $o_i$ is assigned a deterministic, verifiable reward $r_i$.
- The advantage is normalized within the group: $A_i = \dfrac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$.
- The policy is updated by maximizing the clipped surrogate objective
  $$\mathcal{J}(\theta) = \frac{1}{G}\sum_{i=1}^{G} \min\!\big(\rho_i A_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\big) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big), \qquad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}.$$
Hyperparameters are problem- and model-specific but typically include small KL penalties ($\beta$), clipping thresholds ($\epsilon$), group sizes of up to $G = 16$, and correspondingly small learning rates (Rui et al., 25 May 2025, Huang et al., 4 Aug 2025, Liu et al., 18 Sep 2025).
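The update can be sketched at the sequence level as follows; the default hyperparameter values, tensor shapes, and simplified KL term are assumptions for illustration rather than the exact objective of any cited paper (real implementations operate per token with attention masks):

```python
import torch

def grpo_loss(logp_new: torch.Tensor,   # (G,) summed log-probs under current policy
              logp_old: torch.Tensor,   # (G,) summed log-probs under sampling policy
              logp_ref: torch.Tensor,   # (G,) summed log-probs under frozen reference
              rewards: torch.Tensor,    # (G,) verifiable rewards, one per rollout
              clip_eps: float = 0.2,    # illustrative clipping threshold
              kl_beta: float = 0.01) -> torch.Tensor:  # illustrative KL weight
    """Group-relative clipped surrogate, sequence-level sketch."""
    # Group-normalized advantage: A_i = (r_i - mean(r)) / (std(r) + eps)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv).mean()
    # Simple proxy penalizing drift from the reference policy
    kl = (logp_new - logp_ref).mean()
    return -(surrogate - kl_beta * kl)   # negate: optimizer minimizes, objective is maximized
```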
Sampling strategies (e.g., field-subset vs. full-schema for VIE (Liu et al., 16 Jun 2025)) and curriculum schedules (close-ended → open-ended; easy → hard) are central to training stability in medical domains.
3. Reward Engineering and Verifiability
Reward functions in Med-RLVR are designed for complete verifiability and task specificity:
- QA/VQA: Binary reward for exact match with ground truth, possibly with additional format checks enforcing `<think>…</think><answer>…</answer>` templates (Zhang et al., 27 Feb 2025, Huang et al., 4 Aug 2025, Liu et al., 18 Sep 2025).
- VIE: Weighted sum of precision and recall over extracted key–value pairs from JSON outputs, regularized by format incentives (correct presence of structured “thinking” and “answer” blocks) (Liu et al., 16 Jun 2025).
- Open-ended tasks: Lexical and semantic similarity metrics (BLEU, ROUGE, BERTScore), often mixed with exact-match and structure constraints (Rui et al., 25 May 2025).
- Retrieval-augmented reasoning: Additional rewards measuring retrieval quality (e.g., evidence level), semantic alignment, logical path overlap, and confidence gain from retrieved context (Lu et al., 31 Jul 2025, Wang et al., 21 Oct 2025).
The use of deterministic reward parsing (e.g., string-matching, schema validation) enhances reproducibility and prevents exploitation by adversarial outputs.
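A sketch of such a verifier for the VIE setting is given below, assuming a JSON answer block and illustrative precision/recall weights; the tag names, weights, and schema are not taken from any specific cited implementation:

```python
import json
import re

def vie_reward(response: str, gold: dict,
               w_prec: float = 0.5, w_rec: float = 0.5,
               format_bonus: float = 0.1) -> float:
    """Deterministic VIE reward: weighted precision/recall over extracted
    key-value pairs, plus a bonus for a well-formed thinking/answer structure."""
    reward = 0.0
    if re.search(r"<think>.*?</think>", response, re.DOTALL):
        reward += format_bonus
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if not answer:
        return reward                        # no parsable answer block
    try:
        pred = json.loads(answer.group(1))   # expected shape: {"field": "value", ...}
    except json.JSONDecodeError:
        return reward                        # malformed JSON earns no content reward
    if not isinstance(pred, dict) or not pred:
        return reward
    hits = sum(1 for k, v in pred.items()
               if str(gold.get(k, "")).strip() == str(v).strip())
    precision = hits / len(pred)
    recall = hits / max(len(gold), 1)
    return reward + w_prec * precision + w_rec * recall
```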
4. Curriculum and Training Strategies
A central principle of Med-RLVR is curriculum-aware or staged reinforcement learning. Training is partitioned into sequenced stages that stabilize policy acquisition and mitigate conflicts between reward-signal gradients:
- Close-ended → open-ended stages: Initial training on discrete, low-variance close-ended VQA or QA (yes/no, multiple-choice) to anchor domain foundations, followed by open-ended, high-diversity reasoning (Rui et al., 25 May 2025).
- Skill consolidation → hard-case mining: After skill acquisition on easier samples, RL is focused on difficult or repeatedly failed samples, boosting performance on challenging clinical reasoning (Liu et al., 18 Sep 2025).
- Field-subset → full-schema alternation for extraction: Curriculum over extracted fields (few → all) improves learning efficiency and reasoning quality in annotation-limited regimes (Liu et al., 16 Jun 2025).
- Expert-data bootstrapping: Pretraining on distilled CoT or SFT traces (“CoT cold-start”) before RL can further enhance verifiability and logical coherence (Liu et al., 18 Sep 2025, Lin et al., 30 May 2025).
These staging strategies are empirically shown to yield 3–8 point gains in both in-distribution and OOD benchmarks, and to prevent catastrophic forgetting or gradient collapse seen in joint or “flat” RLVR approaches.
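A high-level sketch of such a staged schedule is shown below; the stage tuples and the `rlvr_update` / `accuracy` helpers are hypothetical placeholders standing in for a GRPO step with verifiable rewards and a rollout-based pass check:

```python
def staged_rlvr_training(policy, stages, rlvr_update, accuracy,
                         fail_threshold=0.5, batch_size=8):
    """Staged curriculum sketch: train stages in sequence (e.g. close-ended
    -> open-ended), then mine hard cases for a final focused pass.
    `rlvr_update(policy, batch)` and `accuracy(policy, example)` are assumed
    helpers, not a published API."""
    def batches(examples):
        return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]

    for _name, examples, epochs in stages:       # e.g. ("close_ended", ds1, 2), ("open_ended", ds2, 2)
        for _ in range(epochs):
            for batch in batches(examples):
                rlvr_update(policy, batch)       # RL step with verifiable rewards
    # Hard-case mining: re-train on prompts the current policy still fails
    hard = [ex for _name, examples, _ in stages
            for ex in examples if accuracy(policy, ex) < fail_threshold]
    for batch in batches(hard):
        rlvr_update(policy, batch)
    return policy
```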
5. Empirical Performance, Ablations, and Limitations
Med-RLVR frameworks have demonstrated:
- Substantial in-domain and OOD improvements versus supervised finetuning (SFT) and naïve RL baselines; e.g., in VQA, curriculum RL yields +11.4% absolute accuracy over SFT on in-domain tasks and +5.7% on OOD tasks (Rui et al., 25 May 2025).
- High robustness in low-annotation (<100 samples) regimes for structured extraction, with F1 boosts >10 points by combining sampling, reasoning incentives, and balanced reward design (Liu et al., 16 Jun 2025).
- For text and multimodal reasoning, consistent advantages for RLVR over SFT, especially with careful difficulty-centric data curation; pure SFT on CoT traces can degrade performance (Huang et al., 4 Aug 2025).
Ablations consistently confirm:
- Curriculum and staged RL treatments outperform joint or random task mixtures.
- Data filtering (using larger teacher or same-family LLMs) for question selection yields better domain-specific performance, but aggressive self-filtering may trade off generalization (Qiu et al., 16 Apr 2025).
- Structured outputs (e.g., enforcing strict templates) are essential for reward verifiability and RL convergence across clinical reasoning, extraction, and retrieval tasks.
Limitations:
- RLVR remains sample-inefficient for open-ended generation.
- Reward function coverage is limited by domain-specific knowledge; RL cannot compensate for missing foundational knowledge.
- Automatic evaluation of deep clinical reasoning (multi-hop, temporal, or abstract) is unsolved; reward proxies may not fully capture reasoning correctness or safety (Rui et al., 25 May 2025, Liu et al., 18 Sep 2025).
6. Integration of Domain Knowledge and Interpretability
Med-RLVR injects and maintains clinical validity via:
- Rule-based reward verifiers encoding clinical guidelines (e.g., matching radiological findings, disease markers, guideline-concordant answers) (Rui et al., 25 May 2025).
- Consistency auditors and data refinement, which inspect and revise under-specified or hallucinated training data (e.g., 72B Qwen2.5 consistency auditor (Rui et al., 25 May 2025)).
- Structured output formats that separate rationales from final decisions, easing medical interpretability and downstream auditability by clinicians and validators (a parsing sketch follows this list).
- No explicit knowledge graph in the RL loop; domain knowledge is instead embedded via curated datasets, reward checkers, and format constraints.
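A minimal parsing sketch for this rationale/decision separation, assuming the `<think>`/`<answer>` template used by the format rewards above:

```python
import re
from typing import NamedTuple, Optional

class AuditRecord(NamedTuple):
    rationale: Optional[str]   # free-text reasoning trace for clinician review
    answer: Optional[str]      # final decision, scored by the reward verifier

def parse_structured_output(response: str) -> AuditRecord:
    """Split a templated response into its rationale and final answer;
    missing blocks are returned as None so validators can flag them."""
    think = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return AuditRecord(
        rationale=think.group(1).strip() if think else None,
        answer=answer.group(1).strip() if answer else None,
    )
```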
A plausible implication is that the Med-RLVR paradigm’s insistence on verifiable outputs and structured reasoning may enhance the trust and reliability of medical LLMs in safety-critical settings, a hypothesis under active exploration in clinical deployment studies.
7. Broader Impact, Best Practices, and Future Directions
Med-RLVR frameworks constitute a reproducible, scalable approach for cultivating robust, generalizable, and interpretable reasoning in medical LLMs and VLMs (Zhang et al., 27 Feb 2025, Rui et al., 25 May 2025, Liu et al., 16 Jun 2025, Huang et al., 4 Aug 2025, Liu et al., 18 Sep 2025, Lu et al., 31 Jul 2025, Wang et al., 21 Oct 2025, Lin et al., 30 May 2025, Qiu et al., 16 Apr 2025). Key operational recommendations include:
- Pre-filter data using intermediate or teacher models to select mid-difficulty samples for RLVR training (see the sketch after this list).
- Employ deterministic, rule-based reward functions, and enforce output templates for reasoning chains and answers.
- Use staged curriculum and hard-case mining to mitigate RL signal sparsity and overfitting.
- Regularize RL training with KL and PPO-style clipping to avoid policy collapse.
- Benchmark domain-specific and OOD performance, and analyze per-category robustness.
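The pre-filtering recommendation can be sketched as follows, assuming a hypothetical `teacher_solve(prompt) -> bool` helper that samples one teacher response and checks it with the verifiable reward; the pass-rate band is illustrative, not taken from a cited paper:

```python
def select_mid_difficulty(prompts, teacher_solve, n_samples=8,
                          low=0.2, high=0.8):
    """Keep prompts that a teacher or intermediate model solves some, but
    not all, of the time; trivially easy or hopeless prompts give little
    RL signal and are dropped."""
    selected = []
    for prompt in prompts:
        passes = sum(teacher_solve(prompt) for _ in range(n_samples))
        pass_rate = passes / n_samples
        if low <= pass_rate <= high:       # mid-difficulty band only
            selected.append(prompt)
    return selected
```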
Open challenges persist in designing reward functions for deeper clinical rationales, evaluating hallucination and faithfulness in real-world medical workflows, and scaling Med-RLVR to rich, multimodal electronic health record reasoning. Future work will likely integrate knowledge-graph–guided reward verification, human-in-the-loop evaluation, and safety monitoring to meet clinical deployment standards.