MMOral-Bench: Dental Multimodal Benchmark

Updated 4 July 2026

MMOral-Bench is a dental multimodal benchmark that evaluates LVLMs on panoramic X-rays, featuring 1,100 clinically grounded QA pairs across five diagnostic dimensions.
It combines closed-ended accuracy metrics with open-ended LLM-as-a-judge scoring to assess detailed diagnostic reasoning and free-text interpretation.
Its results show that models are comparatively strong at jaw analysis and weaker at tooth-level pathology and historical treatment recognition.

MMOral-Bench is a dentistry-specific multimodal benchmark lineage for evaluating large vision-language models (LVLMs) on oral image interpretation. In its original formulation, MMOral-Bench is a curated evaluation suite built on top of the MMOral dataset for panoramic dental X-ray interpretation, consisting of 100 panoramic X-ray images paired with 500 closed-ended and 600 open-ended question-answer pairs across five clinically grounded diagnostic dimensions: Teeth, Patho, HisT, Jaw, and SumRec (Hao et al., 11 Sep 2025). In later dental MLLM literature, the naming broadens: MMOral-OPG denotes the panoramic-radiograph benchmark inherited from MMOral, while MMOral-Uni denotes a unified benchmark spanning five modalities and five tasks with 2,809 open-ended QA pairs (Hao et al., 27 Nov 2025). The resulting terminology is not fully stable across papers, but the common objective is consistent: clinically grounded evaluation of multimodal reasoning for dentistry (Hao et al., 11 Sep 2025, Hao et al., 27 Nov 2025).

1. Definition, naming, and scope

In the original MMOral paper, MMOral-Bench is the benchmark component of MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation (Hao et al., 11 Sep 2025). MMOral itself comprises 20,563 panoramic X-rays with 1.3 million instruction-following instances, organized into MMOral-Attribute, MMOral-Report, MMOral-VQA, and MMOral-Chat; MMOral-Bench is the high-quality evaluation subset used to assess LVLMs on panoramic dental radiographs (Hao et al., 11 Sep 2025).

Later papers preserve the panoramic core but change the surface nomenclature. OralGPT-Omni treats MMOral-OPG as the pre-existing panoramic X-ray benchmark and introduces MMOral-Uni as “the first unified benchmark for dental multimodal imaging analysis, spanning five modalities and five tasks” (Hao et al., 27 Nov 2025). DentalGPT in turn evaluates on MMOral-OPG-Bench, explicitly identifying it as the panoramic benchmark inherited from the MMOral/OralGPT line (Cai et al., 12 Dec 2025).

Benchmark term	Scope in the source	Defining paper
MMOral-Bench	100 panoramic X-ray images, 1,100 QA pairs, five diagnostic dimensions	(Hao et al., 11 Sep 2025)
MMOral-OPG	Open-ended VQA benchmark for panoramic radiographs	(Hao et al., 27 Nov 2025)
MMOral-Uni	Unified benchmark with 2,809 open-ended QA pairs across five modalities and five tasks	(Hao et al., 27 Nov 2025)

This naming pattern suggests a benchmark family centered on dental multimodal evaluation, with the panoramic benchmark as the original anchor artifact (Hao et al., 11 Sep 2025, Hao et al., 27 Nov 2025).

2. Original MMOral-Bench design for panoramic X-ray analysis

The original MMOral-Bench is a VQA-only benchmark specialized for panoramic dental X-ray interpretation (Hao et al., 11 Sep 2025). It consists of 100 panoramic X-rays, 500 closed-ended multiple-choice questions, and 600 open-ended questions, for 1,100 QA pairs in total (Hao et al., 11 Sep 2025). Every QA is mapped to one or more of five diagnostic dimensions: Teeth, Patho, HisT, Jaw, and SumRec (Hao et al., 11 Sep 2025).

The benchmark images come exclusively from the Do et al. 2024 apical periodontitis dataset, acquired at a high-quality dental treatment centre associated with Hanoi Medical University (Hao et al., 11 Sep 2025). The underlying MMOral generation pipeline first uses 10 visual specialist models trained on 10 public dental datasets, covering 49 anatomical/pathological categories; these produce grounding captions, which are then transformed into medical reports through a two-stage LLM process using DeepSeek-R1-Distill-Llama-70B followed by GPT-4-turbo revision (Hao et al., 11 Sep 2025). Reports and captions are then used to generate QA content (Hao et al., 11 Sep 2025).

For MMOral-Bench specifically, QA pairs that could not be reliably answered from the image were removed, incorrect answers detected during manual review were re-annotated, and each QA was manually assigned to one or more diagnostic dimensions (Hao et al., 11 Sep 2025). The benchmark therefore functions as a deliberately cleaned evaluation set rather than as a generic sample of the full instruction corpus.

The five dimensions encode clinically distinct forms of reasoning. Teeth concerns tooth presence, numbering, morphology, and tooth-level anomalies; Patho targets diseases or lesions such as caries and periapical lesions; HisT addresses previous dental interventions visible on the radiograph; Jaw addresses bone structures and macro-anatomy, including mandibular canal, maxillary sinus, and bone loss; and SumRec targets global understanding, summary, and recommendation generation (Hao et al., 11 Sep 2025). This structure makes the benchmark diagnostically interpretable rather than a generic visual question-answering exercise.

3. Evaluation protocol and scoring methodology

MMOral-Bench combines discrete and free-form evaluation. For closed-ended VQA, the reported metric is accuracy:

$\text{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{\hat{y}_i = y_i\} \times 100\%.$

This is computed per diagnostic dimension and overall (Hao et al., 11 Sep 2025).
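The accuracy formula above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's released evaluation code: the record fields `pred`, `gold`, and `dims` are hypothetical names, and a question tagged with several dimensions is assumed to count once towards each of them.

```python
from collections import defaultdict

def accuracy_by_dimension(records):
    """Return accuracy (in percent) per diagnostic dimension and overall.

    Each record is a dict with hypothetical keys:
      'pred' - extracted model answer, 'gold' - reference answer,
      'dims' - list of dimension names assigned to the question.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        correct = int(r["pred"] == r["gold"])  # indicator 1{y_hat = y}
        # Assumption: a multi-dimension question counts in every dimension.
        for dim in list(r["dims"]) + ["Overall"]:
            hits[dim] += correct
            totals[dim] += 1
    return {dim: 100.0 * hits[dim] / totals[dim] for dim in totals}

# Toy records, invented for illustration only.
demo = [
    {"pred": "A", "gold": "A", "dims": ["Teeth"]},
    {"pred": "B", "gold": "C", "dims": ["Teeth", "Patho"]},
    {"pred": "D", "gold": "D", "dims": ["Jaw"]},
    {"pred": "A", "gold": "A", "dims": ["Patho"]},
]
scores = accuracy_by_dimension(demo)
print(scores)
```

On the toy records, three of four answers match, so the overall accuracy is 75%, while Teeth and Patho each score 50%.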

For open-ended VQA, the benchmark uses LLM-as-a-judge scoring based on MM-Vet / MM-Vet v2. A few-shot prompt with 9 examples is provided to GPT-4-turbo, including fully correct, fully incorrect, and partially correct answers, and the evaluator returns a score $s_i \in [0,1]$ for each sample (Hao et al., 11 Sep 2025). The aggregate score is

$S = \frac{\sum_{i=1}^{N} s_i}{N} \times 100\%.$

For a category $C$ with $N_C$ samples, the category score is

$S_C = \frac{\sum_{i \in C} s_i}{N_C} \times 100\%.$

(Hao et al., 11 Sep 2025)
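The aggregate and per-category formulas amount to averaging judge scores. The sketch below assumes the judge scores $s_i$ have already been collected; the field names `score` and `dims` are hypothetical, and the sample values are invented.

```python
from statistics import fmean

def open_ended_scores(samples):
    """Compute S (overall) and S_C (per category) in percent.

    Each sample is a dict with hypothetical keys:
      'score' - judge score s_i in [0, 1],
      'dims'  - categories the sample belongs to.
    """
    overall = 100.0 * fmean(s["score"] for s in samples)
    categories = sorted({c for s in samples for c in s["dims"]})
    per_category = {
        c: 100.0 * fmean(s["score"] for s in samples if c in s["dims"])
        for c in categories
    }
    return overall, per_category

# Invented judge scores, for illustration only.
demo = [
    {"score": 1.0, "dims": ["Jaw"]},
    {"score": 0.5, "dims": ["Teeth", "Patho"]},
    {"score": 0.0, "dims": ["Patho"]},
    {"score": 0.5, "dims": ["SumRec"]},
]
overall, per_category = open_ended_scores(demo)
print(overall, per_category)
```

Because partial credit is allowed, a category score such as Patho here (25%) can fall between the values that an all-or-nothing accuracy metric would produce.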

Closed-ended prediction extraction uses a CMMMU-style extraction pipeline. Model outputs are parsed with robust regexes to recover option letters or option text; if multiple candidate options are present, the option with the highest occurrence is selected, and if no valid option can be extracted, a random selection is used as a fallback (Hao et al., 11 Sep 2025). Open-ended scoring instead delegates semantic matching and partial credit to the evaluator model (Hao et al., 11 Sep 2025).
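The extraction logic described above (regex parsing, most-frequent option, random fallback) might look roughly like the following. This is a simplified sketch written for this article, not the CMMMU or MMOral code; the function name `extract_choice` and the patterns are assumptions, and a production pipeline would need more careful handling of cases such as a sentence starting with the article "A".

```python
import random
import re
from collections import Counter

def extract_choice(response, options, rng=random):
    """Recover a multiple-choice answer from free-form model output.

    options: dict mapping option letter -> option text (hypothetical format).
    """
    letters = "".join(options)
    # 1) Standalone uppercase option letters such as "B", "(B)" or "B.".
    candidates = re.findall(
        rf"(?<![A-Za-z])([{letters}])(?![A-Za-z])", response
    )
    # 2) Mentions of the option text itself, case-insensitively.
    lowered = response.lower()
    for letter, text in options.items():
        candidates += [letter] * lowered.count(text.lower())
    # 3) No valid option found: fall back to a random choice.
    if not candidates:
        return rng.choice(sorted(options))
    # 4) Several candidates: keep the most frequently occurring option.
    return Counter(candidates).most_common(1)[0][0]

# Invented options, for illustration only.
options = {
    "A": "caries",
    "B": "periapical lesion",
    "C": "impacted tooth",
    "D": "bone loss",
}
by_letter = extract_choice("The answer is (B).", options)
by_text = extract_choice("Findings suggest bone loss in the mandible.", options)
by_majority = extract_choice("B or maybe C, but B fits best", options)
fallback = extract_choice("I cannot tell.", options, rng=random.Random(0))
```

The random fallback means that a model which refuses or answers off-format still scores at chance level on average, which is worth keeping in mind when reading closed-ended results near 25% on four-option questions.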

The original paper reports two forms of reliability analysis. First, comparison with two professional dentists on the 600 open-ended MMOral-Bench questions showed average absolute differences on the Overall score of about 2.07 points for GPT-4o outputs and 0.37 points for HealthGPT-XL32 outputs (Hao et al., 11 Sep 2025). Second, repeating GPT-4-turbo scoring 5 times with temperature $T=0$ yielded a standard deviation $\sigma \le 0.434$ and a coefficient of variation of roughly $0.6\%$ to $1.3\%$ on the Overall score, indicating high evaluator stability (Hao et al., 11 Sep 2025).
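The stability statistics can be computed from repeated judge runs as sketched below. The five run scores are hypothetical placeholders, not values from the paper, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

def judge_stability(run_scores):
    """Return (standard deviation, coefficient of variation in percent)
    for Overall scores obtained from repeated judge runs."""
    mu = mean(run_scores)
    sigma = pstdev(run_scores)  # assumption: population std, not sample std
    return sigma, 100.0 * sigma / mu

# Hypothetical Overall scores from five repeated scoring runs.
runs = [37.2, 37.6, 37.5, 37.4, 37.8]
sigma, cv = judge_stability(runs)
print(f"sigma={sigma:.3f}, CV={cv:.2f}%")
```

The coefficient of variation normalizes the spread by the mean, so it allows evaluator stability to be compared across models whose Overall scores differ in magnitude.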

The later MMOral-Uni benchmark preserves the same general philosophy of open-ended, clinically realistic free-text evaluation, but uses GPT-5-mini as the judge, with a few-shot evaluation prompt containing five in-context examples and scores in $[0,1]$ averaged per task and overall (Hao et al., 27 Nov 2025). Repeated evaluation there was likewise summarized by the standard deviation and coefficient of variation of the overall scores, again supporting LLM-based scoring for dental multimodal benchmarks (Hao et al., 27 Nov 2025).

4. Reported performance and model failure modes

The original MMOral-Bench paper presents the benchmark as challenging for all tested models. It evaluates 64 LVLMs, including proprietary, open-source general-purpose, and medical-specific systems, and reports that even the best model, GPT-4o, reaches an average score of only 41.45% when closed-ended and open-ended performance are combined (Hao et al., 11 Sep 2025). GPT-4o’s reported breakdown is 45.40% closed-ended overall and 37.50 on the open-ended GPT-scoring scale, with its strongest dimension being Jaw and weaker performance in Teeth, Patho, and HisT (Hao et al., 11 Sep 2025).

A robust empirical pattern in the original benchmark is the gap between closed-ended and open-ended performance. The paper states that virtually all models perform worse on open-ended than on closed-ended questions, and among 53 open-source models, 33 fall below 25% open-ended overall (Hao et al., 11 Sep 2025). This indicates that multiple-choice constraints partially conceal deficits in clinically coherent free-form reasoning.

The paper also states that medical-specific LVLMs do not show clear advantage on MMOral-Bench (Hao et al., 11 Sep 2025). The best medical model cited there is HealthGPT-XL32 at 39.59 average, still below leading general models, which the authors interpret as evidence that dentistry requires tailored instruction data rather than assuming transfer from generic medical multimodal systems (Hao et al., 11 Sep 2025).

The error analysis identifies several recurrent failure modes. Models make fine-grained tooth-level reasoning errors, including confusion in FDI tooth numbering and incorrect localization of tooth-specific pathologies (Hao et al., 11 Sep 2025). They show partial detection of complex treatments, such as identifying crown restorations while missing root canal treatment, or vice versa (Hao et al., 11 Sep 2025). They also exhibit inadequate global understanding for SumRec and report-like questions, often missing priority issues or producing incomplete recommendations (Hao et al., 11 Sep 2025). Some proprietary systems show safety-driven refusals, and many models display a bias toward large, high-contrast structures, performing better on Jaw than on small lesions or subtle restorative details (Hao et al., 11 Sep 2025).

The same paper uses MMOral-Bench to evaluate OralGPT, a dental LVLM obtained by supervised fine-tuning Qwen2.5-VL-7B on MMOral instruction data (Hao et al., 11 Sep 2025). The baseline Qwen2.5-VL-7B obtains 27.00% closed-ended overall, 15.92 open-ended overall, and 21.46% average (Hao et al., 11 Sep 2025). After one epoch of supervised fine-tuning on MMOral-Report + MMOral-VQA + MMOral-Chat, the resulting OralGPT reaches 39.60% closed-ended overall, 52.77 open-ended overall, and 46.19% average, an improvement of 24.73 percentage points (Hao et al., 11 Sep 2025). The ablation results further report 31.81% average for Report only, 39.67% for VQA only, and 44.53% for Report + VQA, indicating that the different instruction subsets contribute complementary gains (Hao et al., 11 Sep 2025).
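As a consistency check on these figures, and assuming the reported average is the unweighted mean of the closed-ended and open-ended overall scores (an inference from the numbers, not a definition quoted from the paper), the arithmetic works out as:

$\text{Avg}_{\text{Qwen2.5-VL-7B}} = \frac{27.00 + 15.92}{2} = 21.46, \qquad \text{Avg}_{\text{OralGPT}} = \frac{39.60 + 52.77}{2} \approx 46.19, \qquad \Delta = 46.19 - 21.46 = 24.73.$

The same rule reproduces GPT-4o’s 41.45% average from its 45.40 closed-ended and 37.50 open-ended scores.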

5. Expansion from panoramic VQA to a multimodal benchmark suite

OralGPT-Omni generalizes the MMOral paradigm by introducing MMOral-Uni, described as “the first unified benchmark for dental multimodal imaging analysis, spanning five modalities and five tasks,” with 2,809 open-ended QA pairs (Hao et al., 27 Nov 2025). Its modality-task composition includes 1,462 intraoral-image abnormality diagnosis samples, 539 periapical X-ray diagnosis samples, 383 pathological-image diagnosis samples, 300 cephalometric CVM stage prediction samples, 100 intraoral tooth localization and counting samples, 15 treatment planning interleaved image-text cases, and 10 dental treatment video comprehension cases (Hao et al., 27 Nov 2025).
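The subset sizes listed above can be checked against the stated total with a short script. The counts are taken from the text; the dictionary keys are shortened labels chosen here for readability.

```python
# Sample counts per MMOral-Uni subset, as listed in the text above.
composition = {
    "intraoral abnormality diagnosis": 1462,
    "periapical X-ray diagnosis": 539,
    "pathological-image diagnosis": 383,
    "cephalometric CVM stage prediction": 300,
    "intraoral tooth localization and counting": 100,
    "treatment planning (interleaved image-text)": 15,
    "dental treatment video comprehension": 10,
}

total = sum(composition.values())
shares = {name: 100.0 * n / total for name, n in composition.items()}

# Print subsets from largest to smallest share of the benchmark.
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:45s} {composition[name]:5d}  {pct:5.1f}%")
print("total:", total)
```

The counts sum to the stated 2,809 QA pairs. Intraoral abnormality diagnosis makes up slightly more than half of them, while the treatment-planning and video subsets together account for under 1%, so an unweighted overall average would be dominated by the larger subsets.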

MMOral-Uni also broadens abnormality coverage. For abnormality diagnosis, it spans 40 categories of diseases/abnormal conditions, including Caries, Gingivitis, Ulcer, Tooth discoloration, Defective dentition, Cancer, Orthodontics, Pulpitis, Periodontitis, Apical periodontitis, Bone loss, Root canal treatment, and several pathology-image cellular categories (Hao et al., 27 Nov 2025). All QA pairs are open-ended / free-text, and answers are generated from sparse annotations using GPT-5-mini and then validated and refined by two experienced dentists (Hao et al., 27 Nov 2025).

On this newer benchmark family, OralGPT-Omni reports 51.84 overall on MMOral-Uni and 45.31 overall on MMOral-OPG, outperforming the reported GPT-5 baselines in those evaluation settings (Hao et al., 27 Nov 2025). The paper’s ablations further report that adding TRACE-CoT improves MMOral-Uni performance from 44.31 to 48.67 in the supervised fine-tuning stage, and the full four-stage training paradigm raises overall performance from 22.88 for the baseline Qwen2.5-VL-7B to 51.84 for OralGPT-Omni (Hao et al., 27 Nov 2025).

DentalGPT provides a second line of evidence that the panoramic benchmark remains a central reference point for dental reasoning evaluation. Using the MMOral-OPG-Bench open-ended test split, DentalGPT reports 60.0% accuracy, compared with 27.0 for the untuned Qwen2.5-VL-7B backbone and 56.8 after Stage I domain adaptation without Stage II reinforcement learning (Cai et al., 12 Dec 2025). The same ablation table reports that Stage II GRPO reinforcement learning raises average performance across five dental benchmarks from 63.2 to 67.1, with the panoramic benchmark improving from 56.8 to 60.0 (Cai et al., 12 Dec 2025). Because MMOral-Bench, MMOral-OPG, and MMOral-Uni do not all share the same task mix or metric, these later scores are not numerically interchangeable with the original 41.45% MMOral-Bench average; they instead document how the MMOral family became a reference substrate for progressively more domain-specialized dental MLLMs (Hao et al., 11 Sep 2025, Hao et al., 27 Nov 2025, Cai et al., 12 Dec 2025).

The principal significance of MMOral-Bench lies in making dental multimodal evaluation clinically specific. The benchmark is not a generic medical VQA set; it operationalizes tooth-level localization, pathology recognition, historical treatment interpretation, jawbone analysis, and integrative summary/recommendation generation in the domain of panoramic dental radiography (Hao et al., 11 Sep 2025). Later extensions preserve this orientation while widening the modality range to intraoral photographs, periapical radiographs, cephalometric radiographs, pathological images, videos, and interleaved image-text treatment planning cases (Hao et al., 27 Nov 2025).

The original benchmark also has clear limitations. It includes only panoramic X-rays and therefore does not test periapical X-rays, intraoral photographs, cephalometric radiographs, CBCT, or MRI (Hao et al., 11 Sep 2025). The paper further notes possible label noise inherited from public datasets used to train the visual specialists, possible geographic/demographic bias, and limited coverage of very rare pathologies (Hao et al., 11 Sep 2025). MMOral-Uni addresses the modality limitation but still contains very small subsets for some tasks, notably 15 treatment-planning cases and 10 video-comprehension cases (Hao et al., 27 Nov 2025).

A recurrent misconception is to treat “MMOral-Bench” as a single fixed artifact across the entire literature. The papers themselves use related but non-identical labels: MMOral-Bench in the original panoramic benchmark paper, MMOral-OPG and MMOral-Uni in OralGPT-Omni, and MMOral-OPG-Bench in DentalGPT (Hao et al., 11 Sep 2025, Hao et al., 27 Nov 2025, Cai et al., 12 Dec 2025). This suggests terminological drift rather than a contradiction in scientific purpose.

The broader dental MLLM ecosystem has also produced adjacent benchmark efforts that move beyond the MMOral naming line. OralMLLM-Bench evaluates 27 clinically grounded tasks across periapical, panoramic, and lateral cephalometric radiographs using the cognitive categories perception, comprehension, prediction, and decision-making, with 3,820 clinician assessments and explicit metrics such as Balanced Accuracy, PCAS, and CRA (Wang et al., 2 May 2026). COde presents “a benchmark multimodal oro-dental dataset for large vision-language models,” comprising 8,775 dental checkups from 4,800 patients, with 50,000 intraoral images, 8,056 radiographs, and rich bilingual clinical text, and defines benchmark tasks for six-class anomaly classification and full diagnostic report generation (Lv et al., 7 Nov 2025). These works indicate that the research agenda inaugurated by MMOral-Bench has expanded from panoramic VQA into broader multimodal oral AI evaluation, but the original MMOral benchmark remains the canonical starting point for dentistry-specific LVLM benchmarking (Hao et al., 11 Sep 2025, Wang et al., 2 May 2026, Lv et al., 7 Nov 2025).