DR.BENCH Diagnostic Reasoning Benchmark

Updated 9 February 2026
  • DR.BENCH is a clinically motivated benchmark suite that integrates text and multimodal data to assess diagnostic reasoning capabilities in AI models.
  • The suite standardizes six key tasks, including EMRQA and MedQA, with unified prompt formatting to enable reproducible evaluations.
  • It leverages diverse data sources and rigorous preprocessing to drive advancements in both clinical text and dental multimodal diagnostic assessments.

DR.BENCH (Diagnostic Reasoning Benchmark) is a comprehensive, clinically motivated suite of generative and classification benchmarks designed to evaluate and drive progress in artificial intelligence systems for diagnostic and medical reasoning across both clinical NLP and multimodal vision-language tasks. Developed to capture the complexity of clinical decision-making, DR.BENCH includes tasks grounded in diverse data modalities—text, images, and structured electronic health records (EHR)—and is structured to support rigorous model assessment and comparison. The primary DR.BENCH suite targets clinical text-based reasoning (Sharma et al., 2023, Gao et al., 2022), while an independent dataset under the same acronym addresses oro-dental multimodal diagnostics (Lv et al., 7 Nov 2025).

1. Scope and Motivation

The creation of DR.BENCH is motivated by the need to move beyond entity recognition and relation extraction towards the computational modeling of clinical diagnostic reasoning. Diagnostic error is a leading source of patient harm; cognitive overload and information-processing bottlenecks in healthcare systems are significant contributors. Existing benchmarks have largely sidestepped the core cognitive tasks of reasoning from patient data to diagnoses (Gao et al., 2022). DR.BENCH was constructed to fill this gap, offering standardized generative tasks that span such cognitive skills as medical knowledge representation, evidence integration, and diagnosis generation. This design encourages the development of models that robustly replicate human-like diagnostic inference, while also surfacing the current limitations and opportunities in clinical NLP and medical AI (Sharma et al., 2023, Gao et al., 2022).

2. Benchmark Design and Task Structure

DR.BENCH consists of six primary generative tasks, each cast in a text-to-text scheme with unified prompt formatting to enable consistent model training and evaluation. These tasks sample the key cognitive skill domains required in clinical reasoning:

  1. Clinical Text Understanding and Integration
    • EMRQA (Extractive Clinical QA): Models answer evidence-seeking questions over de-identified discharge summaries from the EMRQA corpus, leveraging i2b2 challenge data (≥42,607 training examples) (Gao et al., 2022).
    • SOAP Section Labeling: Models classify progress-note lines into BIO-style tags corresponding to Subjective, Objective, Assessment, or Plan sections, trained on over 100,000 examples (Gao et al., 2022).
  2. Medical Knowledge Representation and Reasoning
    • MedNLI: Clinical natural language inference on MIMIC-III sentences, requiring models to judge entailment, contradiction, or neutrality in premise-hypothesis pairs (Gao et al., 2022).
    • Assessment–Plan Relation Labeling: Predicts the causal or relevance relation between an assessment statement and a treatment plan excerpt, based on N2C2 2018 annotations (Gao et al., 2022).
  3. Diagnosis Generation and Summarization
    • MedQA: Multi-choice, USMLE-style board exam medical question answering, with both closed-book and open-book variants (contextualized with textbook retrieval) (Gao et al., 2022).
    • Problem Summarization: Abstractive summarization of problem lists from progress notes (MIMIC-III), either from the Assessment section or from combined Subjective, Objective, and Assessment sections (Sharma et al., 2023, Gao et al., 2022).

Each task provides standardized data splits (train/validation/test), prompt conventions, and evaluation scripts, promoting reproducibility and direct comparison across models.
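To make the text-to-text casting concrete, the minimal sketch below shows one plausible way to serialize task instances into prompt-tagged input/output pairs. The prefix strings, field layout, and example contents are illustrative assumptions, not the released DR.BENCH prompt templates.

```python
# Hypothetical sketch of text-to-text casting with unified, prompt-tagged inputs;
# the task prefixes and example contents are assumptions, not the official templates.

def to_text2text(task: str, source: str, target: str) -> dict:
    """Cast a task instance into a prompt-tagged (input, output) pair."""
    return {"input": f"{task}: {source}", "output": target}

examples = [
    to_text2text("mednli",
                 "premise: The patient is afebrile. hypothesis: The patient has a fever.",
                 "contradiction"),
    to_text2text("problem_summarization",
                 "assessment: 67M admitted with volume overload, started on IV diuresis ...",
                 "acute on chronic systolic heart failure"),
]

for ex in examples:
    print(ex["input"], "->", ex["output"])
```

Casting every task into the same (input, output) interface is what allows a single sequence-to-sequence model to be trained and evaluated jointly across all six tasks.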

3. Data Sources and Preprocessing

The DR.BENCH clinical text suite unifies ten publicly available datasets and applies rigorous preprocessing. Key data sources include MIMIC-III progress notes, i2b2 challenges, N2C2 relations, MedNLI (de-identified clinical sentence pairs), and MedQA (medical board exam corpus). Each component task undergoes normalization (e.g., sentence splitting, lower-casing, removal of protected health information, section-boundary tagging) (Sharma et al., 2023). EMRQA leverages de-identified i2b2 notes; SOAP labeling uses lines and surrounding context for robust BIO-tag inference. MedNLI is limited to sentence pairs of approximately 25 tokens for efficient inference (Gao et al., 2022).
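The following is a minimal sketch of the kind of normalization described above (sentence splitting, lower-casing, masking of de-identification placeholders); the regular expressions and the [PHI] token are illustrative assumptions, not the released preprocessing code.

```python
import re

# MIMIC-III style de-identification placeholders look like [**2101-3-4**];
# this pattern and the [PHI] replacement token are illustrative assumptions.
DEID_PATTERN = re.compile(r"\[\*\*.*?\*\*\]")

def normalize_note(text: str) -> list[str]:
    """Mask de-identified spans, lower-case, and split into sentences."""
    text = DEID_PATTERN.sub("[PHI]", text)
    text = text.lower()
    # Naive sentence split on ., !, ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s.strip() for s in sentences if s.strip()]

note = "Pt seen on [**2101-3-4**]. Afebrile, vitals stable! Plan: continue abx."
print(normalize_note(note))
```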

In the multimodal dental DR.BENCH dataset (Lv et al., 7 Nov 2025), the corpus comprises 8,775 dental checkups from 4,800 unique patients. It includes:

  • 50,000 intraoral RGB JPEG images (resized, aspect ratio preserved),
  • 8,056 radiographic images (PeX, PaX, and CBCT 2D slices),
  • EMR-derived structured text, with manual and automated translation to English,
  • Systematic annotation for both six-way dental anomaly classification and generative diagnostic report writing.

All text and images are systematically paired, preprocessed, and made available for direct ingestion by vision-LLMs.
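A plausible per-checkup record layout for this pairing is sketched below; the field names, label encoding, and example values are hypothetical and do not reflect the released schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical record layout for one paired dental checkup; field names and
# the integer encoding of the six anomaly classes are assumptions.

@dataclass
class DentalCheckup:
    patient_id: str
    intraoral_images: list = field(default_factory=list)   # paths to RGB JPEG images
    radiographs: list = field(default_factory=list)         # paths to radiographic images/slices
    emr_text: str = ""                                      # English EMR-derived text
    anomaly_label: Optional[int] = None                     # one of six anomaly classes
    diagnostic_report: str = ""                             # reference diagnostic report

record = DentalCheckup(
    patient_id="anon-0001",
    intraoral_images=["imgs/anon-0001/upper.jpg"],
    radiographs=["xray/anon-0001/pan.png"],
    emr_text="Patient reports sensitivity in the upper left molar.",
    anomaly_label=2,
    diagnostic_report="Caries suspected on tooth 26; radiographic follow-up recommended.",
)
print(record.patient_id, len(record.intraoral_images), record.anomaly_label)
```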

4. Evaluation Protocols and Metrics

Each DR.BENCH task is linked to evaluation metrics reflecting the nature of its outputs:

| Task | Metric(s) | Baseline (select) |
| --- | --- | --- |
| EMRQA (QA) | Exact-match accuracy | 39.20% (T5-Large-RelPaths+Defs) (Gao et al., 2022) |
| SOAP Section Labeling | Overall accuracy | 60.06% (T5-Large-Defs) (Gao et al., 2022) |
| MedNLI | Exact-match accuracy | 84.88% (SciFive-Large) (Gao et al., 2022) |
| AP Relation Labeling | Macro F1-score | 80.09% (T5-Large-EHR) (Gao et al., 2022) |
| MedQA | MC accuracy (open/closed) | 24.59% / 22.69% (T5-Base-RelPaths/EHR) (Gao et al., 2022) |
| Problem Summarization | ROUGE-L F | 18.72–24.84 (baselines) (Gao et al., 2022, Sharma et al., 2023) |

For the multi-task clinical model suite, statistical significance is established through 95% confidence intervals (bootstrap, N=1,000). For the dental multimodal benchmarks, metrics include cross-entropy loss, accuracy, precision, recall, F1, BLEU, METEOR, ROUGE, and cosine similarity (averaged across encoder models) (Lv et al., 7 Nov 2025).
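The bootstrap confidence intervals described above can be reproduced with a short resampling routine like the sketch below, which operates on any list of per-example scores; the scoring function itself (accuracy, ROUGE-L, etc.) is abstracted away, and the example inputs are placeholders.

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """95% bootstrap CI of the mean score, using n_resamples resamples with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Placeholder per-example ROUGE-L scores for illustration.
per_example_rouge = [0.21, 0.34, 0.18, 0.29, 0.25, 0.31, 0.27, 0.22]
print(bootstrap_ci(per_example_rouge))
```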

5. Model Architectures and Training Regimes

DR.BENCH supports evaluation with a variety of generative architectures:

  • Text-Only Suite: T5-small (60M), T5-base (220M), T5-large (770M) pretrained on the C4 web corpus, SciFive-large (770M, biomedical pretraining), and Clinical-T5 (770M, in-domain MIMIC-III) (Sharma et al., 2023).
  • Pretraining Domains: C4 (general), PubMed/PMC (biomedical; SciFive), MIMIC-III (clinical; Clinical-T5).
  • Fine-Tuning Regimes: Single-task (isolation) and multi-task (joint, prompt-tagged) supervision. Multi-task loss is the sum of per-task cross-entropy losses,

$$L_{\mathrm{MTL}} = \sum_{i=1}^{6} L_i$$

where $L_i$ denotes the loss for task $i$ (Sharma et al., 2023); a minimal training sketch follows the list below.

  • Hyperparameters: Adam (β₁=0.9, β₂=0.999), learning rate 1×10⁻⁵, batch size 8, beam size 5, early stopping, training up to 100 epochs, single NVIDIA A100 GPU (Sharma et al., 2023).
  • Vision-Language Dental Suite: Qwen-VL-3B and Qwen-VL-7B with parameter-efficient LoRA adaptation, batch sizes 2–3, a cosine learning-rate schedule, trained for three epochs on NVIDIA A800 GPUs (Lv et al., 7 Nov 2025).
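As a concrete illustration of the summed multi-task objective $L_{\mathrm{MTL}}$ above, the PyTorch/Transformers sketch below runs one optimization step over a toy batch per task; the prompt prefixes, example texts, and use of only two tasks are simplifying assumptions rather than the paper's released training script.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Toy illustration of L_MTL = sum_i L_i: one tiny "batch" per task passes through
# a shared T5 model, and the per-task cross-entropy losses are summed before a
# single optimizer step. Prompts and examples are placeholders.

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, betas=(0.9, 0.999))

task_batches = {
    "mednli": ("mednli: premise: the patient is afebrile. hypothesis: the patient has a fever.",
               "contradiction"),
    "problem_summarization": ("problem_summarization: assessment: 67m with volume overload ...",
                              "acute on chronic systolic heart failure"),
}

optimizer.zero_grad()
total_loss = 0.0
for task, (source, target) in task_batches.items():
    enc = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    loss = model(**enc, labels=labels).loss   # per-task cross-entropy L_i
    total_loss = total_loss + loss            # accumulate L_MTL across tasks
total_loss.backward()
optimizer.step()
```

In the reported setup, the same scheme extends over all six tasks, with the prompt tags distinguishing them under a single shared model.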

6. Key Results and Comparative Analysis

On problem summarization, baseline ROUGE-L performance was 18.72–24.84, with Clinical-T5 (multi-task, in-domain) achieving state-of-the-art 28.55 (24.29–32.80 95% CI) (Sharma et al., 2023). In-domain pretraining is consistently advantageous (SciFive, Clinical-T5 outperform C4-pretrained T5). Multi-task fine-tuning benefits models pretrained in-domain (transductive scenario) but degrades general-domain T5 accuracy, suggesting that broad-domain models are susceptible to clinical-domain distributional shift in multi-task setups. Larger model size positively correlates with performance.

For dental anomaly classification, fine-tuned Qwen-VL-7B improves from 52.65% to 78.92% accuracy (and F1 from 52.99% to 79.39%). Qwen-VL-3B similarly demonstrates substantial post-finetuning gains. On generative diagnostic reporting, Qwen-VL-7B yields BLEU 71.85, METEOR 71.46, and 71.53% cosine similarity, outperforming zero-shot and GPT-4o baselines (Lv et al., 7 Nov 2025).

7. Data Access, Licensing, and Usage

The clinical NLP DR.BENCH suite is open-source (MIT license), with the codebase and training scripts hosted on GitLab. Underlying dataset access requires relevant data user agreements; for example, MIMIC-III (PhysioNet DUA), N2C2 (Harvard), UMLS (National Library of Medicine) (Gao et al., 2022). The dental multimodal DR.BENCH is publicly available via Hugging Face, with de-identified data and minimal restrictions apart from ethical requirements for anonymized data management (Lv et al., 7 Nov 2025).

Typical project usage for the text suite involves conda environment setup, targeted sub-dataset downloads via provided shell scripts, and execution of fine-tuning/evaluation scripts for each task. The data directory structure supports modular training and reproducibility, and evaluation utilities are included for all standard metrics (Gao et al., 2022). Both base and finetuned checkpoints are supported for benchmarking.


DR.BENCH establishes a rigorous, multifaceted benchmark for clinical AI and vision-LLMs, facilitating systematic advancement in both machine-mediated diagnostic support and the fundamental understanding of medical reasoning in computational settings (Gao et al., 2022, Sharma et al., 2023, Lv et al., 7 Nov 2025).
