
Open-Medical-R1: How to Choose Data for RLVR Training at Medicine Domain (2504.13950v1)

Published 16 Apr 2025 in cs.LG and cs.AI

Abstract: This paper explores optimal data selection strategies for Reinforcement Learning with Verified Rewards (RLVR) training in the medical domain. While RLVR has shown exceptional potential for enhancing reasoning capabilities in LLMs, most prior implementations have focused on mathematics and logical puzzles, with limited exploration of domain-specific applications like medicine. We investigate four distinct data sampling strategies from MedQA-USMLE: random sampling (baseline), and filtering using Phi-4, Gemma-3-27b-it, and Gemma-3-12b-it models. Using Gemma-3-12b-it as our base model and implementing Group Relative Policy Optimization (GRPO), we evaluate performance across multiple benchmarks including MMLU, GSM8K, MMLU-Pro, and CMMLU. Our findings demonstrate that models trained on filtered data generally outperform those trained on randomly selected samples. Notably, training on self-filtered samples (using Gemma-3-12b-it for filtering) achieved superior performance in medical domains but showed reduced robustness across different benchmarks, while filtering with larger models from the same series yielded better overall robustness. These results provide valuable insights into effective data organization strategies for RLVR in specialized domains and highlight the importance of thoughtful data selection in achieving optimal performance. You can access our repository (https://github.com/Qsingle/open-medical-r1) to get the codes.

Summary

Overview of Optimal Data Selection Strategies for RLVR in Medicine

The paper "Open-Medical-R1: How to Choose Data for RLVR Training at Medicine Domain" explores strategies for optimizing data selection in the context of Reinforcement Learning with Verified Rewards (RLVR), specifically within medical applications. The research investigates how data sampling influences the performance of RLVR-integrated models in both medicine and general reasoning tasks, presenting nuanced insights into data filtering methods and their consequent impacts.

Objective and Methodology

The primary objective of the paper is to determine best practices for selecting data when training LLMs with RLVR on medical datasets, notably MedQA-USMLE. The paper assesses four data sampling strategies:

  1. Random Sampling: Serving as the baseline approach.
  2. Phi-4 Filtering: Selecting samples with the Phi-4 model based on predefined criteria.
  3. Gemma-3-27b-it Filtering: Selecting samples with the larger model from the base model's own series.
  4. Gemma-3-12b-it Filtering: Self-filtering, in which the base model selects its own training samples.
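The summary does not spell out the paper's filtering criteria, so the sketch below assumes one common RLVR heuristic: sample each question several times with the filter model and keep only questions it solves some, but not all, of the time (neither trivial nor hopeless). The `answer_fn` callable and the `low`/`high` thresholds are hypothetical, not taken from the paper.

```python
# Hypothetical model-based difficulty filter (an assumption, not the paper's
# exact criteria). Each question is a dict with a ground-truth "answer" key.

def difficulty_filter(questions, answer_fn, k=4, low=1, high=3):
    """Keep questions whose pass count over k samples lies in [low, high].

    answer_fn(question) -> predicted answer letter; assumed stochastic
    (e.g. sampling at temperature > 0), so repeated calls can differ.
    """
    kept = []
    for q in questions:
        passes = sum(answer_fn(q) == q["answer"] for _ in range(k))
        if low <= passes <= high:  # moderately hard for the filter model
            kept.append(q)
    return kept
```

Under this heuristic, filtering with a stronger model (e.g. Gemma-3-27b-it) and with the base model itself would retain different subsets, which is consistent with the differing robustness results reported below.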

The paper employs the Gemma-3-12b-it model along with Group Relative Policy Optimization (GRPO) for training and evaluates performance across multiple benchmarks such as MMLU, GSM8K, MMLU-Pro, and CMMLU.
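The core of GRPO is that it replaces a learned value baseline with a group-relative one: for each prompt, several completions are sampled, and each completion's verified reward is normalized against the group's mean and standard deviation. A minimal sketch of that advantage computation, assuming a binary verified reward (1 if the answer matches the reference, else 0):

```python
# Group-relative advantage as used in GRPO: each of G sampled completions
# for one prompt is scored, then normalized within the group.

def group_relative_advantages(rewards, eps=1e-8):
    """Return (r - mean(group)) / (std(group) + eps) for each reward."""
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 sampled answers to one MedQA question, 2 verified correct.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct completions receive a positive advantage and incorrect ones a negative advantage, so the policy update pushes probability toward verified answers without training a separate critic.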

Key Findings

  • Performance Improvements: Models trained on filtered data generally outperform those trained on random samples. This suggests that thoughtful data selection is crucial for optimizing RLVR-based learning.
  • Specialized Performance vs. General Robustness: Training on self-filtered data using the Gemma-3-12b-it model resulted in improved performance within the medical domain but showed diminished robustness across diverse benchmarks. Conversely, using larger models from the same series for filtering contributed to better overall robustness.
  • Sample Selection Strategy: The results confirm that strategic data selection improves learning efficiency and reasoning capability. In particular, the choice of filtering model governs the trade-off between domain-specific performance and general reasoning: self-filtering favors the former, while filtering with a larger model from the same series favors the latter.

Implications and Future Directions

The findings underscore the importance of data quality and composition in developing robust reasoning patterns within LLMs. For practitioners and researchers in AI, this paper offers a critical perspective on how data sampling methodologies significantly impact model performance, particularly in complex domains like medicine.

Moving forward, there is potential to explore more advanced data sampling techniques or integrate additional capabilities, such as search tools during training, to further bolster LLMs' reasoning and accuracy across domain-specific tasks. Additionally, full-parameter training could yield further insights into optimizing model learning given the nuances of filtered data.

In summary, the research presented in "Open-Medical-R1" provides meaningful contributions toward the understanding of data-driven strategies in RLVR applications in the medical field, with broader implications for AI development in specialized domains.
