Analyzing ReasonMed: A 370K Multi-Agent Generated Dataset for Medical Reasoning
The paper "ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning" introduces a novel approach to addressing the limitations of LLMs in the medical domain, particularly in complex question answering. Despite the proficiency of LLMs in domains such as mathematics and programming, their ability to handle knowledge-intensive medical queries has remained unsatisfactory, largely because of inadequate datasets. The authors tackle this problem by presenting ReasonMed, a comprehensive medical reasoning dataset that stands out for its scale and refinement, specifically crafted to bolster the performance of reasoning models in medical question answering (QA).
Dataset Construction and Methodology
ReasonMed is the largest dataset of its kind, consisting of 370,000 verified medical reasoning samples distilled from an initial pool of 1.7 million reasoning paths. The construction process leverages a multi-agent system built on three LLMs: Qwen-2.5-72B, DeepSeek-R1-Distill-Llama-70B, and HuatuoGPT-o1-70B. This ensemble generated diverse reasoning paths, which were then subjected to a rigorous verification and refinement pipeline. An Error Refiner was deployed to improve reasoning paths by identifying and rectifying error-prone steps, thus strengthening the dataset's validity.
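The generate-verify-refine loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the model calls are stubbed out, and the function and class names (`verify`, `refine`, `build_dataset`, `ReasoningPath`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ReasoningPath:
    question: str
    steps: list   # chain-of-thought steps
    answer: str

def verify(path: ReasoningPath, gold: str) -> bool:
    """Stub verifier: accept a path only if its final answer matches the gold label."""
    return path.answer.strip().lower() == gold.strip().lower()

def refine(path: ReasoningPath, gold: str) -> ReasoningPath:
    """Stub Error Refiner: in the real pipeline an LLM rewrites the
    error-prone steps; here we simply append a corrected conclusion."""
    return ReasoningPath(path.question, path.steps + ["corrected conclusion"], gold)

def build_dataset(candidates, gold_answers):
    """Keep verified paths; route failures through one refinement pass."""
    kept = []
    for path in candidates:
        gold = gold_answers[path.question]
        if verify(path, gold):
            kept.append(path)
        else:
            fixed = refine(path, gold)
            if verify(fixed, gold):
                kept.append(fixed)
    return kept
```

In the actual pipeline, the refinement step invokes a model rather than a rule, and paths that still fail after refinement are discarded, which is how 1.7 million candidates shrink to 370K verified samples.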
The methodology emphasizes the combination of detailed chain-of-thought (CoT) reasoning along with concise answer summaries, which proved to be the most effective fine-tuning strategy for medical reasoning models. This strategic hybrid approach yields a significant enhancement, evidenced by the trained ReasonMed-7B model, which notably exceeds the accuracy of its predecessors by 4.17% and outperforms the LLaMA3.1-70B on PubMedQA by 4.60%.
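To make the hybrid training target concrete, the sketch below combines detailed CoT reasoning with a concise answer summary into a single fine-tuning example. The template and the function name `format_hybrid_example` are illustrative assumptions, not the paper's exact format.

```python
def format_hybrid_example(question: str, cot_steps: list, summary: str) -> str:
    """Join a question, its numbered chain-of-thought steps, and a concise
    summary answer into one supervised fine-tuning target string."""
    reasoning = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(cot_steps))
    return f"Question: {question}\n{reasoning}\nSummary: {summary}"
```

The key design point is that the model is supervised on both the intermediate reasoning and the final summary, rather than on the answer alone.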
Evaluation Strategy and Results
The multi-agent framework not only amplifies the dataset's depth but also enriches its coverage by tapping into diverse medical insights. Each datum in ReasonMed comprises both detailed CoT reasoning and a summary answer, facilitating an analysis of effective reasoning patterns. The hybrid approach used in fine-tuning demonstrates that explicit reasoning supervision is crucial for boosting LLM performance in medical QA. ReasonMed-7B's results are remarkable, establishing a new benchmark among models under 10B parameters.
During the verification process, questions were sorted into tiers of difficulty based on validation pass rates. This three-tiered system of easy, medium, and difficult questions allowed each tier to be processed differently while maintaining stringent refinement standards. The resulting consistently high-quality data yields significant performance gains over existing datasets such as medical-o1-reasoning-SFT and Medical-R1-Distill-Data.
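A pass-rate-based tiering rule like the one described can be sketched as below. The thresholds and the function name `tier` are illustrative; the paper defines its own cutoffs.

```python
def tier(pass_rate: float) -> str:
    """Bucket a question by the fraction of its generated reasoning
    paths that pass verification (thresholds are assumptions)."""
    if pass_rate >= 0.7:
        return "easy"       # most paths verify: light-touch processing
    if pass_rate >= 0.3:
        return "medium"     # mixed results: selective refinement
    return "difficult"      # few paths verify: heaviest refinement
```

Routing harder questions through more aggressive refinement is what lets the pipeline keep difficult items in the dataset instead of discarding them outright.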
Implications and Future Directions
Practically, ReasonMed could serve as a bedrock for developing more competent medical QA systems, enhancing diagnostic tools, and potentially serving as auxiliary decision-support systems in clinical settings. Theoretically, it emphasizes the importance of dataset quality and inter-model ability integration for advancing AI's reasoning capabilities. The paper argues convincingly for the potential of multi-agent systems in generating high-quality, diverse datasets.
Looking ahead, the methodological insights garnered from this paper pave the way for the further exploration of multi-agent collaborations in different knowledge-intensive domains. While the current implementation is focused on medical reasoning, the principles can extend to other fields requiring nuanced understanding and robust inferential capabilities. Furthermore, scaling these methodologies to even larger models could yield additional performance benefits, offering a wider scope for real-world applications and collaborations.
In conclusion, the ReasonMed dataset represents a significant step forward in addressing the challenges of medical reasoning within LLM frameworks. Its construction methodology, coupled with substantive results against existing benchmarks, establishes it as a pivotal resource for both current applications and future AI innovations in the medical domain.