DR.BENCH: Diagnostic Reasoning Benchmark
- DR.BENCH is a clinically motivated benchmark suite that integrates text and multimodal data to assess diagnostic reasoning capabilities in AI models.
- The suite standardizes six key tasks, including EMRQA and MedQA, with unified prompt formatting to enable reproducible evaluations.
- It leverages diverse data sources and rigorous preprocessing to drive advancements in both clinical text and dental multimodal diagnostic assessments.
DR.BENCH (Diagnostic Reasoning Benchmark) is a comprehensive, clinically motivated suite of generative and classification benchmarks designed to evaluate and drive progress in artificial intelligence systems for diagnostic and medical reasoning across both clinical NLP and multimodal vision-language tasks. Developed to capture the complexity of clinical decision-making, DR.BENCH includes tasks grounded in diverse data modalities—text, images, and structured electronic health records (EHR)—and is structured to support rigorous model assessment and comparison. The primary DR.BENCH suite targets clinical text-based reasoning (Sharma et al., 2023, Gao et al., 2022), while an independent dataset under the same acronym addresses oro-dental multimodal diagnostics (Lv et al., 7 Nov 2025).
1. Scope and Motivation
The creation of DR.BENCH is motivated by the need to move beyond entity recognition and relation extraction towards the computational modeling of clinical diagnostic reasoning. Diagnostic error is a leading source of patient harm; cognitive overload and information-processing bottlenecks in healthcare systems are significant contributors. Existing benchmarks have largely sidestepped the core cognitive tasks of reasoning from patient data to diagnoses (Gao et al., 2022). DR.BENCH was constructed to fill this gap, offering standardized generative tasks that span such cognitive skills as medical knowledge representation, evidence integration, and diagnosis generation. This design encourages the development of models that robustly replicate human-like diagnostic inference, and it highlights the current limitations and opportunities in clinical NLP and medical AI (Sharma et al., 2023, Gao et al., 2022).
2. Benchmark Design and Task Structure
DR.BENCH consists of six primary generative tasks, each cast in a text-to-text scheme with unified prompt formatting to enable consistent model training and evaluation. These tasks sample the key cognitive skill domains required in clinical reasoning:
- Clinical Text Understanding and Integration
- EMRQA (Extractive Clinical QA): Models answer evidence-seeking questions over de-identified discharge summaries from the EMRQA corpus, leveraging i2b2 challenge data (≥42,607 training examples) (Gao et al., 2022).
- SOAP Section Labeling: Models classify progress-note lines into BIO-style tags corresponding to Subjective, Objective, Assessment, or Plan sections, trained on over 100,000 examples (Gao et al., 2022).
- Medical Knowledge Representation and Reasoning
- MedNLI: Clinical natural language inference on MIMIC-III sentences, requiring models to judge entailment, contradiction, or neutrality in premise-hypothesis pairs (Gao et al., 2022).
- Assessment–Plan Relation Labeling: Predicts the causal or relevance relation between an assessment statement and a treatment plan excerpt, based on N2C2 2018 annotations (Gao et al., 2022).
- Diagnosis Generation and Summarization
- MedQA: Multiple-choice, USMLE-style medical board exam question answering, with both closed-book and open-book variants (contextualized with textbook retrieval) (Gao et al., 2022).
- Problem Summarization: Abstractive summarization of problem lists from progress notes (MIMIC-III), either from the Assessment section or from combined Subjective, Objective, and Assessment sections (Sharma et al., 2023, Gao et al., 2022).
Each task provides standardized data splits (train/validation/test), prompt conventions, and evaluation scripts, promoting reproducibility and direct comparison across models.
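The unified text-to-text formatting means each example is rendered as an input prompt string and a plain-text target. A minimal sketch of what such prompt construction could look like is given below; the task tags and field names are illustrative assumptions, not the benchmark's exact templates.

```python
# Illustrative text-to-text prompt construction. Task tags and field names are
# assumptions for this sketch, not the official DR.BENCH templates.
def format_prompt(task: str, **fields) -> str:
    """Render one example as a single input string for a seq2seq model."""
    if task == "mednli":
        return f"mednli premise: {fields['premise']} hypothesis: {fields['hypothesis']}"
    if task == "emrqa":
        return f"emrqa question: {fields['question']} context: {fields['context']}"
    if task == "problem_summarization":
        return f"summarize problems: {fields['assessment']}"
    raise ValueError(f"unknown task: {task}")

# Example usage: the target side is plain text (e.g., "contradiction" for MedNLI).
print(format_prompt("mednli",
                    premise="The patient is afebrile.",
                    hypothesis="The patient has a fever."))
```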
3. Data Sources and Preprocessing
The DR.BENCH clinical text suite unifies ten publicly available datasets and applies rigorous preprocessing. Key data sources include MIMIC-III progress notes, i2b2 challenges, N2C2 relations, MedNLI (de-identified clinical sentence pairs), and MedQA (medical board exam corpus). Each component task undergoes normalization (e.g., sentence splitting, lower-casing, removal of protected health information, section-boundary tagging) (Sharma et al., 2023). EMRQA leverages de-identified i2b2 notes; SOAP labeling uses lines and surrounding context for robust BIO-tag inference. MedNLI is limited to sentence pairs of approximately 25 tokens for efficient inference (Gao et al., 2022).
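As a rough illustration of the normalization steps described above, the sketch below applies lower-casing, removal of de-identification placeholders, and naive sentence splitting. It is an assumption-laden stand-in for the actual preprocessing scripts; the PHI pattern targets MIMIC-style `[** ... **]` markers.

```python
import re

# Placeholder for MIMIC-III-style de-identification markers, e.g. [**2101-10-20**].
PHI_TAG = re.compile(r"\[\*\*.*?\*\*\]")

def normalize_note(text: str) -> list[str]:
    """Lower-case, strip PHI placeholder tags, and split into rough sentences."""
    text = PHI_TAG.sub(" ", text).lower()
    # Naive split on terminal punctuation; real pipelines use clinical-aware tokenizers.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```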
In the multimodal dental DR.BENCH dataset (Lv et al., 7 Nov 2025), the corpus comprises 8,775 dental checkups from 4,800 unique patients. It includes:
- 50,000 intraoral RGB JPEG images (resized, aspect ratio preserved),
- 8,056 radiographic images (PeX, PaX, and CBCT 2D slices),
- EMR-derived structured text, with manual and automated translation to English,
- Systematic annotation for both six-way dental anomaly classification and generative diagnostic report writing.
All text and images are systematically paired, preprocessed, and made available for direct ingestion by vision-LLMs.
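A hypothetical per-checkup record layout, sketched below, shows how the images, translated EMR text, and the two supervision targets could be paired; the field names are assumptions for illustration, not the dataset's published schema.

```python
from dataclasses import dataclass, field

@dataclass
class DentalCheckup:
    """Hypothetical record pairing one checkup's images, text, and labels."""
    patient_id: str
    intraoral_images: list[str] = field(default_factory=list)  # paths to RGB JPEGs
    radiographs: list[str] = field(default_factory=list)       # PeX / PaX / CBCT 2D slices
    emr_text_en: str = ""                                       # EMR text translated to English
    anomaly_labels: list[int] = field(default_factory=list)    # six-way anomaly classification
    diagnostic_report: str = ""                                 # generative reporting target
```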
4. Evaluation Protocols and Metrics
Each DR.BENCH task is linked to evaluation metrics reflecting the nature of its outputs:
| Task | Metric(s) | Baseline (select) |
|---|---|---|
| EMRQA (QA) | Exact-match accuracy | 39.20% (T5-Large-RelPaths+Defs) (Gao et al., 2022) |
| SOAP Section Labeling | Overall accuracy | 60.06% (T5-Large-Defs) (Gao et al., 2022) |
| MedNLI | Exact-match accuracy | 84.88% (SciFive-Large) (Gao et al., 2022) |
| AP Relation Labeling | Macro F1-score | 80.09% (T5-Large-EHR) (Gao et al., 2022) |
| MedQA | Multiple-choice accuracy (open-/closed-book) | 24.59%/22.69% (T5-Base-RelPaths/EHR) (Gao et al., 2022) |
| Problem Summarization | ROUGE-L F-score | 18.72–24.84 (baselines) (Gao et al., 2022, Sharma et al., 2023) |
For the multi-task clinical model suite, statistical significance is established through 95% confidence intervals (bootstrap, N=1,000). For the dental multimodal benchmarks, metrics include cross-entropy loss, accuracy, precision, recall, F1, BLEU, METEOR, ROUGE, and cosine similarity (averaged across encoder models) (Lv et al., 7 Nov 2025).
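To make the interval estimation concrete, the sketch below computes a percentile-bootstrap 95% confidence interval over per-example scores with N=1,000 resamples; it is a generic illustration of the procedure, not the benchmark's evaluation code.

```python
import random
import statistics

def bootstrap_ci(scores: list[float], n_boot: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of per-example scores."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        resample = [rng.choice(scores) for _ in scores]  # sample with replacement
        means.append(statistics.mean(resample))
    means.sort()
    return means[int((alpha / 2) * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]
```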
5. Model Architectures and Training Regimes
DR.BENCH supports evaluation with a variety of generative architectures:
- Text-Only Suite: T5-small (60M), T5-base (220M), T5-large (770M) pretrained on the C4 web corpus, SciFive-large (770M, biomedical pretraining), and Clinical-T5 (770M, in-domain MIMIC-III) (Sharma et al., 2023).
- Pretraining Domains: C4 (general), PubMed/PMC (biomedical; SciFive), MIMIC-III (clinical; Clinical-T5).
- Fine-Tuning Regimes: Single-task (isolation) and multi-task (joint, prompt-tagged) supervision. The multi-task objective is the sum of per-task cross-entropy losses, $\mathcal{L}_{\mathrm{MT}} = \sum_{t=1}^{T} \mathcal{L}_t$, where $\mathcal{L}_t$ denotes the loss for task $t$ (Sharma et al., 2023); see the sketch after this list.
- Hyperparameters: Adam optimizer (β₁=0.9, β₂=0.999), batch size 8, beam size 5, early stopping, and training for up to 100 epochs on a single NVIDIA A100 GPU (Sharma et al., 2023).
- Vision-Language Dental Suite: Qwen-VL-3B and Qwen-VL-7B with parameter-efficient LoRA adaptation, batch sizes of 2–3, a cosine learning-rate schedule, and three epochs of training on NVIDIA A800 GPUs (Lv et al., 7 Nov 2025).
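A minimal sketch of the summed multi-task objective referenced above is given below. It is an illustrative PyTorch helper, not the authors' training code, and assumes each task contributes token-level logits and label ids for one batch.

```python
import torch
import torch.nn.functional as F

def multitask_loss(batches_by_task: dict[str, tuple[torch.Tensor, torch.Tensor]]) -> torch.Tensor:
    """Sum of per-task cross-entropy losses over prompt-tagged seq2seq batches."""
    total = torch.zeros(())
    for task, (logits, labels) in batches_by_task.items():
        # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len) of token ids.
        total = total + F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    return total
```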
6. Key Results and Comparative Analysis
On problem summarization, baseline ROUGE-L performance was 18.72–24.84, with Clinical-T5 (multi-task, in-domain) achieving state-of-the-art 28.55 (24.29–32.80 95% CI) (Sharma et al., 2023). In-domain pretraining is consistently advantageous (SciFive, Clinical-T5 outperform C4-pretrained T5). Multi-task fine-tuning benefits models pretrained in-domain (transductive scenario) but degrades general-domain T5 accuracy, suggesting that broad-domain models are susceptible to clinical-domain distributional shift in multi-task setups. Larger model size positively correlates with performance.
For dental anomaly classification, fine-tuned Qwen-VL-7B improves from 52.65% to 78.92% accuracy (and F1 from 52.99% to 79.39%). Qwen-VL-3B similarly demonstrates substantial post-finetuning gains. On generative diagnostic reporting, Qwen-VL-7B yields BLEU 71.85, METEOR 71.46, and 71.53% cosine similarity, outperforming zero-shot and GPT-4o baselines (Lv et al., 7 Nov 2025).
7. Data Access, Licensing, and Usage
The clinical NLP DR.BENCH suite is open-source (MIT license), with the codebase and training scripts hosted on GitLab. Underlying dataset access requires relevant data user agreements; for example, MIMIC-III (PhysioNet DUA), N2C2 (Harvard), UMLS (National Library of Medicine) (Gao et al., 2022). The dental multimodal DR.BENCH is publicly available via Hugging Face, with de-identified data and minimal restrictions apart from ethical requirements for anonymized data management (Lv et al., 7 Nov 2025).
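For orientation, a hedged loading sketch using the Hugging Face `datasets` library is shown below; the repository id is a placeholder, not the dataset's published name.

```python
from datasets import load_dataset

# Placeholder repository id -- substitute the identifier published by the authors.
dental = load_dataset("org-name/drbench-dental")
print(dental)  # inspect available splits and features
```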
Typical project usage for the text suite involves conda environment setup, targeted sub-dataset downloads via the provided shell scripts, and execution of fine-tuning/evaluation scripts for each task. The data directory structure supports modular training and reproducibility, and evaluation utilities are included for all standard metrics (Gao et al., 2022). Both base and fine-tuned checkpoints are supported for benchmarking.
DR.BENCH establishes a rigorous, multifaceted benchmark for clinical AI and vision-LLMs, facilitating systematic advancement in both machine-mediated diagnostic support and the fundamental understanding of medical reasoning in computational settings (Gao et al., 2022, Sharma et al., 2023, Lv et al., 7 Nov 2025).