DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models

Published 20 May 2025 in cs.CL and cs.AI | (2505.14107v4)

Abstract: The emergence of groundbreaking LLMs capable of performing complex reasoning tasks holds significant promise for addressing various scientific challenges, including those arising in complex clinical scenarios. To enable their safe and effective deployment in real-world healthcare settings, it is urgently necessary to benchmark the diagnostic capabilities of current models systematically. Given the limitations of existing medical benchmarks in evaluating advanced diagnostic reasoning, we present DiagnosisArena, a comprehensive and challenging benchmark designed to rigorously assess professional-level diagnostic competence. DiagnosisArena consists of 1,113 pairs of segmented patient cases and corresponding diagnoses, spanning 28 medical specialties, deriving from clinical case reports published in 10 top-tier medical journals. The benchmark is developed through a meticulous construction pipeline, involving multiple rounds of screening and review by both AI systems and human experts, with thorough checks conducted to prevent data leakage. Our study reveals that even the most advanced reasoning models, o3, o1, and DeepSeek-R1, achieve only 51.12%, 31.09%, and 17.79% accuracy, respectively. This finding highlights a significant generalization bottleneck in current LLMs when faced with clinical diagnostic reasoning challenges. Through DiagnosisArena, we aim to drive further advancements in AI's diagnostic reasoning capabilities, enabling more effective solutions for real-world clinical diagnostic challenges. We provide the benchmark and evaluation tools for further research and development https://github.com/SPIRAL-MED/DiagnosisArena.

Abstract PDF Upgrade to Chat

Authors (9)

Summary

The paper demonstrates that current LLMs achieve modest accuracy on complex diagnostic tasks, with models scoring 17.79% to 45.82%.
The benchmark is built via a rigorous four-stage pipeline, turning clinical case reports into standardized, reasoning-focused evaluations across 28 specialties.
Results show that converting the task into multiple-choice questions improves performance, highlighting the potential oversimplification of traditional formats.

DiagnosisArena: Benchmarking Diagnostic Reasoning for LLMs

Introduction

The paper "DiagnosisArena: Benchmarking Diagnostic Reasoning for LLMs" (2505.14107) introduces a comprehensive benchmark aimed at evaluating the diagnostic reasoning abilities of LLMs. DiagnosisArena is designed to address the limitations of existing medical benchmarks, which often fail to adequately challenge the reasoning capabilities of LLMs in diagnostic settings. It comprises 1,113 pairs of segmented patient cases and corresponding diagnoses, spanning 28 medical specialties derived from clinical case reports published in top-tier medical journals. The benchmark, constructed through a meticulous pipeline, highlights a significant generalization bottleneck in current LLM capabilities when faced with complex clinical diagnostic reasoning challenges.

Benchmark Construction

DiagnosisArena was developed through a four-stage pipeline which includes data collection, data structuring, iterative filtering, and expert-AI collaborative verification.

Figure 1: Overview of the DiagnosisArena Benchmark. (a) The pipeline for constructing the DiagnosisArena dataset consists of four stages: data collection from journals, data structuring, iterative filtering of non-reasoning examples, and expert-AI collaborative verification. (b) DiagnosisArena is sourced from 10 top-tier medical journals. (c) It covers 28 medical specialties. (d) DiagnosisArena offers information-dense clinical cases with greater reasoning complexity.

Data Collection: DiagnosisArena sources its 1,113 clinical cases from case reports published in leading medical journals such as Lancet, NEJM, and JAMA. These journals provide real-world, challenging diagnostic cases with rich contextual information essential for effective AI model evaluation.
Data Structuring: The raw case reports are transformed into standardized segmented formats, ensuring the inclusion of only diagnostic-relevant content. This process involves rule-based filtering and AI-based structuring, resulting in detailed case information, physical examination findings, diagnostic test results, and ground truth final diagnoses.
Iterative Filtering and Expert-AI Collaborative Verification: Cases are screened to ensure complexity and logical consistency, involving both AI and human reviewers. The benchmark excludes cases whose diagnoses can be easily inferred from memorized facts or lack sufficient diagnostic complexity.

Experimental Results

The experimental analysis in the paper demonstrates that current SOTA reasoning models, such as o3-mini, o1, and DeepSeek-R1, struggle significantly on DiagnosisArena, achieving modest accuracy rates of 45.82\%, 31.09\%, and 17.79\% respectively. These results underscore the difficulties faced by LLMs in handling the intricacies of clinical reasoning.

Figure 2: Performance of SOTA models on DiagnosisArena and other benchmarks.

Additionally, converting the diagnostic task into multiple-choice questions revealed marked performance improvements, highlighting a potential oversimplification in traditional benchmark formats. For instance, in the multiple-choice setting, o1 reached 61.90\% accuracy, showcasing the potential reduction in task difficulty when constraining the answer space.

Figure 3: Performance of Different Models on DiagnosisArena. (a) The Top- $k$ metric represents the proportion of cases where the correct answer is included among the Top- $k$ predictions, ranked in descending order of confidence. (b) The MCQ presents the multiple-choice version of DiagnosisArena, with o1 reaching 61.90\% accuracy.

Data Leakage Study

The paper explores potential data leakage by analyzing model performance against case publication dates, ensuring robustness in evaluation results. The study concludes that data leakage is rare and does not negatively impact model assessment accuracy.

Figure 4: Leakage Detection on DiagnosisArena.

Case Study

Analyzing a specific case from DiagnosisArena reveals the diagnostic challenges LLMs face. For example, DeepSeek-R1 misinterpreted critical diagnostic features due to over-reliance on reasoning paths of common diseases, underscoring the limited adaptation of current models to complex medical reasoning requirements.

Figure 5: A Case Study of DiagnosisArena. Except for o3-mini, which provided the correct answer, other models missed the correct diagnosis.

Conclusion

DiagnosisArena serves as a pivotal step toward refining diagnostic reasoning capabilities of LLMs in clinical settings. By presenting a rigorous benchmark, it reveals current limitations and directs future research towards enhancing reasoning abilities to support complex medical diagnostics effectively. This benchmark aims to drive further advancements in AI’s diagnostic proficiency, ultimately contributing to safer and more effective integration of AI systems into real-world healthcare environments.

Markdown Report Issue