
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding (2501.18362v3)

Published 30 Jan 2025 in cs.AI, cs.CL, cs.CV, and cs.LG

Abstract: We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 18 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models. Code and data are available at: https://github.com/TsinghuaC3I/MedXpertQA

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

The paper introduces MedXpertQA, a highly challenging and comprehensive benchmark for evaluating expert-level medical reasoning and understanding in AI models. MedXpertQA includes 4,460 questions across 17 medical specialties and 11 body systems, organized into two subsets: MedXpertQA Text for text-only evaluation and MedXpertQA MM for multimodal evaluation.

Key Contributions and Methodology

MedXpertQA addresses several inadequacies found in existing medical AI benchmarks. Current benchmarks often lack sufficient difficulty, specialty-specific evaluations, and the ability to simulate real-world diagnostic complexities. MedXpertQA overcomes these limitations by integrating expert-level exam questions that include diverse images and contextual clinical information, such as patient records and examination results.

The development of MedXpertQA involved rigorous data curation, filtering, and augmentation processes:

  1. Data Curation and Filtering: The authors curated questions from professional medical exams and textbooks, including USMLE, COMLEX-USA, and 17 American specialty board exams. They employed adaptive filtering using Brier scores and semantic similarity measures to ensure the selected questions challenge both humans and AI models (a minimal sketch of this filtering idea appears after this list).
  2. Data Synthesis and Expert Review: To mitigate data leakage risks, the paper describes a data augmentation process that involves rewriting questions and options. This was followed by multiple rounds of expert reviews to ensure the accuracy and validity of the benchmark content.
  3. Multimodal Assessment: MedXpertQA MM provides a multimodal benchmark that includes diverse image types and real-world clinical scenarios, simulating the broad spectrum of visual and textual information encountered in medical practice.
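
The filtering step in item 1 relies on difficulty signals such as Brier scores. The sketch below illustrates how Brier-score-based difficulty filtering could work in principle; the function names, the model_probs data layout, and the 0.1 threshold are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

def brier_score(pred_probs: np.ndarray, correct_idx: int) -> float:
    """Mean squared error between predicted option probabilities and the one-hot answer."""
    target = np.zeros_like(pred_probs)
    target[correct_idx] = 1.0
    return float(np.mean((pred_probs - target) ** 2))

def filter_easy_questions(questions, threshold=0.1):
    """Keep questions whose mean Brier score across reference models exceeds the threshold,
    i.e. drop items that models already answer confidently and correctly (too easy)."""
    kept = []
    for q in questions:
        scores = [brier_score(np.asarray(p, dtype=float), q["answer_idx"])
                  for p in q["model_probs"]]
        if np.mean(scores) > threshold:
            kept.append(q)
    return kept

# Hypothetical usage: each question carries per-model option probabilities.
questions = [
    {"id": "q1", "answer_idx": 2,
     "model_probs": [[0.05, 0.05, 0.85, 0.05], [0.1, 0.1, 0.7, 0.1]]},  # easy -> dropped
    {"id": "q2", "answer_idx": 0,
     "model_probs": [[0.3, 0.3, 0.2, 0.2], [0.25, 0.25, 0.25, 0.25]]},  # hard -> kept
]
print([q["id"] for q in filter_easy_questions(questions)])  # -> ['q2']
```

A semantic similarity filter could be layered on top of such a score, for example to drop near-duplicates of publicly available items, but the paper's exact criteria are not reproduced here.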

Evaluation and Results

The benchmark was used to evaluate 18 leading AI models, both proprietary and open-source, including inference-time scaled models such as OpenAI's o1. The results demonstrate that current models still struggle with complex medical reasoning, indicating a substantial gap in expert-level reasoning capability: even the most advanced models achieve limited performance, particularly on the reasoning-oriented subset.
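
For context, scoring on MedXpertQA's multiple-choice questions reduces to exact-match accuracy over the chosen option letters. The sketch below is a generic illustration of that metric, not the authors' evaluation harness, and the data shown is made up.

```python
def exact_match_accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of questions where the model's chosen option letter matches the answer key."""
    correct = sum(1 for qid, answer in gold.items() if predictions.get(qid) == answer)
    return correct / len(gold)

# Made-up example with option-letter answers.
gold = {"q1": "C", "q2": "A", "q3": "E"}
preds = {"q1": "C", "q2": "B", "q3": "E"}
print(f"accuracy = {exact_match_accuracy(preds, gold):.2%}")  # -> accuracy = 66.67%
```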

Implications and Future Directions

MedXpertQA has significant implications for the development of AI in healthcare. By setting a higher standard for medical AI benchmarking, it paves the way for more robust and clinically relevant AI applications. From a theoretical standpoint, the benchmark highlights the importance of integrating comprehensive and challenging evaluations to push AI boundaries in specialized domains like medicine.

Future developments could leverage the insights gained from MedXpertQA to inform reinforcement learning strategies, improve multimodal AI capabilities, and guide the design of models that better synthesize complex medical data. Additionally, MedXpertQA could serve as a template for developing benchmarks in other specialized fields that require advanced machine reasoning.

In summary, MedXpertQA represents a significant step forward in benchmarking AI for expert-level medical reasoning and understanding. Its rigorous design ensures comprehensive coverage and presents a meaningful challenge to current AI systems, driving further advancements in the field.

Authors (9)
  1. Yuxin Zuo (11 papers)
  2. Shang Qu (7 papers)
  3. Yifei Li (92 papers)
  4. Zhangren Chen (3 papers)
  5. Xuekai Zhu (12 papers)
  6. Ermo Hua (16 papers)
  7. Kaiyan Zhang (33 papers)
  8. Ning Ding (122 papers)
  9. Bowen Zhou (141 papers)