MMOral: Dental Multimodal X-Ray Benchmark

Updated 4 July 2026

MMOral is a dental multimodal framework that integrates panoramic X-ray interpretation with a large-scale instruction dataset tailored for dental diagnostics.
The dataset comprises 20,563 annotated images and 1.3M instruction-following instances across tasks like attribute extraction, report generation, VQA, and dialogue.
The OralGPT baseline, fine-tuned on this domain-specific data, raises the MMOral-Bench score of its base model from 21.46% to 46.19%, highlighting the impact of targeted instruction tuning.

MMOral is a dental multimodal data and evaluation framework centered on panoramic X-ray interpretation. It was introduced as the first large-scale multimodal instruction dataset and benchmark tailored for this modality, with 20,563 annotated images, 1.3 million instruction-following instances, a domain-grounded benchmark, and a supervised fine-tuned baseline model, OralGPT (Hao et al., 11 Sep 2025). In subsequent work, the original panoramic benchmark is treated as part of a broader MMOral ecosystem, notably through the names MMOral-OPG, MMOral-Uni, and MMOral-X, which extend evaluation from panoramic radiography to wider dental multimodality and to more holistic diagnostic reasoning (Hao et al., 27 Nov 2025, Fan et al., 6 Mar 2026).

1. Origin, scope, and problem setting

MMOral emerged from the observation that large vision-language models (LVLMs) had shown strong performance on general-purpose medical tasks, while dentistry, and especially panoramic radiography, remained underrepresented in both benchmarks and instruction-tuning corpora (Hao et al., 11 Sep 2025). Panoramic X-rays are described as one of the most commonly used oral radiology modalities because they depict the full dentition and surrounding structures in a single image, yet they are difficult to interpret owing to dense anatomy, subtle lesions, and fine-grained relationships among teeth, bone, and prior treatments. The original MMOral paper positions these characteristics as the reason a dental-specific instruction dataset and benchmark are necessary.

The framework is explicitly oriented toward panoramic dental X-ray analysis rather than generic medical VQA. Its central premise is that clinically meaningful oral-radiology AI requires both instruction data grounded in dental structure and pathology and an evaluation suite that measures panoramic-X-ray-specific reasoning rather than only general multimodal competence (Hao et al., 11 Sep 2025). A related implication, made explicit in later follow-up work, is that panoramic radiography became the initial anchor point for a broader dental multimodal benchmark lineage, with MMOral-OPG functioning as the panoramic component of that ecosystem (Hao et al., 27 Nov 2025).

A common misconception is to treat MMOral as only a benchmark. In the original formulation, it is simultaneously a large instruction corpus, a benchmark, and a baseline model-development resource. Another misconception is to view it as a new dental model architecture; the original paper does not introduce a new backbone architecture, but rather a domain-adapted LVLM, OralGPT, obtained by supervised fine-tuning of an existing vision-language model (Hao et al., 11 Sep 2025).

2. Dataset composition and annotation design

The original MMOral dataset contains 20,563 annotated panoramic X-ray images and 1.3 million instruction-following instances derived from two public image sources: 16,639 unique images from TED3 and 3,924 images from Hoang Viet Do et al. (Hao et al., 11 Sep 2025). The instruction data are partitioned into four sub-datasets that cover attribute extraction, report generation, VQA, and image-grounded dialogue.

Component	Content	Scale
MMOral-Attribute	Attribute extraction from panoramic radiographs	904k
MMOral-Report	Grounding caption and medical report	41k
MMOral-VQA	Closed-ended and open-ended visual question answering	965k
MMOral-Chat	Image-grounded, multi-turn assistant–patient dialogue	296k

The dataset design reflects the structural density of panoramic radiography. Each panoramic image contains an average of 44 bounding boxes, and the authors build ten visual specialist models that detect 49 anatomical/pathological categories with overlapping category spaces to improve reliability (Hao et al., 11 Sep 2025). MMOral-Attribute targets categories, locations, and relationships among anatomical structures; MMOral-Report provides both a grounding caption and a medical report for each image; MMOral-VQA includes both closed-ended and open-ended QA; and MMOral-Chat simulates multi-turn dialogue grounded in the image and report content.
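The figures quoted in this section can be cross-checked with simple arithmetic. The sketch below is only a consistency check: the source does not state how the 1.3 million total is composed, so the reading in the comments (that MMOral-Attribute counts roughly one entry per bounding box, that MMOral-Report holds two texts per image, and that the 1.3 million figure corresponds to Report + VQA + Chat) is an assumption, not a claim made by the authors.

```python
# Figures quoted in the text. How they combine is an ASSUMPTION of this sketch.
IMAGE_SOURCES = {"TED3": 16_639, "Hoang Viet Do et al.": 3_924}
AVG_BOXES_PER_IMAGE = 44  # reported average, presumably rounded
SUBSET_SCALE_K = {  # scales in thousands, as listed in the table
    "MMOral-Attribute": 904,
    "MMOral-Report": 41,
    "MMOral-VQA": 965,
    "MMOral-Chat": 296,
}

# The two image sources should add up to the headline image count.
total_images = sum(IMAGE_SOURCES.values())

# Assumed reading: about one attribute entry per bounding box.
approx_boxes = total_images * AVG_BOXES_PER_IMAGE

# Assumed reading: one grounding caption plus one medical report per image.
report_texts = 2 * total_images

# Assumed reading: the 1.3M instruction instances exclude MMOral-Attribute.
instruction_k = sum(
    SUBSET_SCALE_K[name]
    for name in ("MMOral-Report", "MMOral-VQA", "MMOral-Chat")
)

print(f"images: {total_images}")
print(f"approx. boxes: {approx_boxes} (table lists 904k for Attribute)")
print(f"report texts: {report_texts} (table lists 41k for Report)")
print(f"Report + VQA + Chat: {instruction_k}k (text quotes 1.3 million)")
```

Under these assumptions the numbers line up closely, but because 44 is an average the bounding-box product is approximate.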

Instruction generation combines template-based generation and LLM-based generation, with QA pairs and dialogues grounded in extracted anatomical structures and medical reports (Hao et al., 11 Sep 2025). The paper also describes a two-stage report generation and correction process intended to reduce label noise; 95.45% of reports are successfully corrected in the second LLM-based refinement stage. At the same time, the authors state a key limitation: the ground-truth labels come from public datasets and have not been independently validated by a third party.
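As a rough illustration of the template-based half of this pipeline, the sketch below turns structured findings into closed-ended question-answer pairs. Everything in it is hypothetical: the `Finding` record, the template wording, the category names, and the tooth identifiers are invented for illustration and are not taken from the MMOral paper, whose actual templates and schema are not described in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One detected structure; a hypothetical schema, not MMOral's own."""
    category: str                      # e.g. "caries" (illustrative label)
    tooth: str                         # illustrative tooth identifier
    box: tuple[int, int, int, int]     # x_min, y_min, x_max, y_max in pixels


# Illustrative templates; the paper's real templates are not given here.
QUESTION_TEMPLATE = "Is there any {category} visible in this panoramic X-ray?"
YES_TEMPLATE = "Yes. Affected teeth: {teeth}."
NO_TEMPLATE = "No {category} is visible."


def closed_ended_pairs(findings, vocabulary):
    """Build one yes/no QA pair per category in `vocabulary`."""
    teeth_by_category = {}
    for finding in findings:
        teeth_by_category.setdefault(finding.category, []).append(finding.tooth)

    pairs = []
    for category in vocabulary:
        teeth = sorted(teeth_by_category.get(category, []))
        if teeth:
            answer = YES_TEMPLATE.format(teeth=", ".join(teeth))
        else:
            answer = NO_TEMPLATE.format(category=category)
        pairs.append({
            "question": QUESTION_TEMPLATE.format(category=category),
            "answer": answer,
        })
    return pairs


example = [
    Finding("caries", "36", (410, 220, 460, 290)),
    Finding("caries", "16", (150, 120, 200, 180)),
]
for pair in closed_ended_pairs(example, ["caries", "implant"]):
    print(pair["question"], "->", pair["answer"])
```

Grounding both positive and negative answers in the extracted findings is what keeps template output tied to the annotations; the LLM-based stage described above would then add more varied phrasing.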

This curation strategy is significant because MMOral treats dental multimodal learning as more than captioning. It formalizes a spectrum ranging from fine-grained structural extraction to report synthesis and patient-facing conversational behavior. A plausible implication is that the dataset was designed to expose failure modes at multiple abstraction levels, from local tooth status to global clinical explanation.

3. MMOral-Bench and its diagnostic axes

To evaluate panoramic-X-ray reasoning, the original work introduces MMOral-Bench, a curated benchmark built from 100 images with 500 closed-ended questions and 600 open-ended questions (Hao et al., 11 Sep 2025). The benchmark was manually selected and validated to ensure image quality and answerability, and incorrect QA pairs were re-annotated. Its purpose is to provide a clinically grounded test suite for panoramic interpretation rather than a generic VQA leaderboard.

MMOral-Bench is organized around five diagnostic dimensions: Teeth, Patho, HisT, Jaw, and SumRec (Hao et al., 11 Sep 2025). These correspond respectively to the condition of teeth, pathological findings, historical treatments, jawbone observations, and clinical summary/recommendation. The dimension-wise structure is important because it separates coarse structural recognition from fine-grained pathology and treatment recognition. The paper explicitly notes that this design reveals whether models are stronger on larger anatomical structures, such as mandibular canals and maxillary sinuses, than on tooth-level abnormalities and prior interventions.

The benchmark’s mixed closed-ended and open-ended design is also methodologically consequential. Closed-ended questions probe constrained recognition, while open-ended questions demand explanation and synthesis. The original study reports that open-ended questions are substantially harder than closed-ended ones, which suggests that clinical fluency in free-form dental reasoning remains a more demanding problem than constrained selection or short-form recognition (Hao et al., 11 Sep 2025).
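To make the dimension-wise reporting concrete, the following sketch computes closed-ended accuracy per diagnostic dimension. It assumes each question carries exactly one dimension tag and that answers are compared by case-insensitive exact match; both are simplifications chosen for illustration, and the benchmark's actual scoring code may differ.

```python
from collections import defaultdict

# The five diagnostic dimensions named for MMOral-Bench.
DIMENSIONS = ("Teeth", "Patho", "HisT", "Jaw", "SumRec")


def per_dimension_accuracy(records):
    """Return accuracy in percent for each dimension that has questions.

    `records` is an iterable of (dimension, predicted, gold) tuples.
    The single-tag, exact-match setup is an assumption of this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, predicted, gold in records:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension!r}")
        total[dimension] += 1
        if predicted.strip().upper() == gold.strip().upper():
            correct[dimension] += 1
    return {
        dimension: 100.0 * correct[dimension] / total[dimension]
        for dimension in DIMENSIONS
        if total[dimension]
    }


toy_records = [
    ("Jaw", "A", "A"),
    ("Jaw", "b", "B"),
    ("Teeth", "C", "A"),
    ("Teeth", "A", "A"),
]
print(per_dimension_accuracy(toy_records))
```

Reporting the dimensions separately, rather than only a pooled score, is what lets a gap between Jaw-level and tooth-level questions show up at all.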

Later work reframes this original panoramic benchmark as MMOral-OPG and uses it as an external panoramic evaluation set for broader dental MLLMs (Hao et al., 27 Nov 2025). That renaming indicates continuity rather than replacement: MMOral-Bench becomes the panoramic component within an expanding dental benchmark family.

4. Empirical findings and the OralGPT baseline

The original evaluation tests 64 LVLMs zero-shot on MMOral-Bench, spanning proprietary models, general-purpose open-source models, and medical-specific models (Hao et al., 11 Sep 2025). The headline result is that even GPT-4o, the best-performing model in the study, achieves only 41.45% overall. The paper further reports that many general medical LVLMs score below 40% average, that medical-specific models do not clearly outperform general-purpose models, and that some proprietary systems sometimes refuse to answer because of safety filters.

Performance is not uniform across categories. Models tend to do better on Jaw-related questions and worse on Teeth, Patho, and HisT, indicating a strong category bias toward larger structures and away from finer-grained dental pathology and treatment reasoning (Hao et al., 11 Sep 2025). This pattern supports the paper’s claim that panoramic interpretation remains difficult even for state-of-the-art LVLMs, especially when the task requires tooth-level discrimination, recognition of subtle lesions, or recovery of treatment history.

To demonstrate the utility of the instruction corpus, the authors introduce OralGPT, which performs supervised fine-tuning on Qwen2.5-VL-7B using the LLaMA-Factory framework with default hyperparameters for one epoch (Hao et al., 11 Sep 2025). OralGPT is therefore a domain-adapted LVLM rather than a new architecture. The training uses combinations of MMOral-Report, MMOral-VQA, and MMOral-Chat, with the strongest configuration using all three.

Fine-tuning data	MMOral-Bench score
Base Qwen2.5-VL-7B	21.46%
MMOral-Report alone	31.81%
MMOral-VQA alone	39.67%
Report + VQA	44.53%
Report + VQA + Chat	46.19%

The final configuration improves the average score from 21.46% to 46.19%, which the paper describes as a 24.73% improvement after a single epoch of SFT; the figure is an absolute gain of 24.73 percentage points, not a relative increase (Hao et al., 11 Sep 2025). The most direct interpretation is that high-quality, domain-specific instruction data provide a strong adaptation signal even without architectural change. The paper also notes that human-evaluation analysis suggests GPT-4-turbo scoring aligns reasonably well with dentist judgments for open-ended responses, supporting the use of LLM-based evaluation in this setting.
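The ablation table can be read as gains over the base model. The short sketch below recomputes them from the scores listed in the table, separating the absolute gain in percentage points from the relative gain; the variable names are illustrative.

```python
# MMOral-Bench scores (percent) copied from the ablation table.
SCORES = {
    "Base Qwen2.5-VL-7B": 21.46,
    "MMOral-Report alone": 31.81,
    "MMOral-VQA alone": 39.67,
    "Report + VQA": 44.53,
    "Report + VQA + Chat": 46.19,
}

base = SCORES["Base Qwen2.5-VL-7B"]
best = SCORES["Report + VQA + Chat"]

# Absolute gain of each configuration over the base model, in points.
gains = {
    name: round(score - base, 2)
    for name, score in SCORES.items()
    if name != "Base Qwen2.5-VL-7B"
}

absolute_gain = round(best - base, 2)               # percentage points
relative_gain = round(100 * (best - base) / base, 1)  # percent of base score

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
print(f"best vs. base: +{absolute_gain} points, i.e. +{relative_gain}% relative")
```

Read this way, the score slightly more than doubles relative to the base model, with MMOral-VQA contributing the largest single-source gain.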

5. Expansion from panoramic MMOral to a broader dental benchmark family

Later papers substantially broaden the meaning of MMOral by extending the original panoramic focus into a multi-benchmark ecosystem. In "OralGPT-Omni: A Versatile Dental Multimodal Large Language Model," the earlier panoramic benchmark is referred to as MMOral-OPG, and a new benchmark, MMOral-Uni, is introduced as the first unified multimodal benchmark for dental image analysis (Hao et al., 27 Nov 2025). MMOral-Uni contains 2,809 open-ended question-answer pairs spanning five modalities and five tasks; its inputs include intraoral photographs, periapical radiographs, cephalometric radiographs, pathological images, intraoral videos, and interleaved image-text inputs for treatment planning. Its task set includes abnormality diagnosis, CVM stage prediction, treatment planning, tooth localization and counting, and dental treatment video comprehension.

The OralGPT-Omni paper places MMOral within a broader training-and-evaluation ecosystem that also includes TRACE-CoT, a clinically grounded chain-of-thought dataset, and a four-stage training paradigm: dental knowledge injection, dental caption alignment, supervised fine-tuning, and reinforcement learning tuning (Hao et al., 27 Nov 2025). On MMOral-Uni, OralGPT-Omni achieves an overall score of 51.84, compared with 36.42 for GPT-5; on MMOral-OPG it achieves 45.31, compared with 42.42 for GPT-5. The authors also report a relative weakness in treatment planning and in panoramic report generation, attributing these weaknesses to limited treatment-planning data and the complexity of dense panoramic anatomy.

A further development appears in "OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis," which introduces MMOral-X as the first benchmark for comprehensive diagnosis from a single panoramic radiograph (Fan et al., 6 Mar 2026). MMOral-X contains 300 open-ended question-answer pairs and 686 bounding boxes, and it is explicitly contrasted with MMOral-OPG: MMOral-OPG focuses on localized, lesion-specific queries, whereas MMOral-X targets whole-image holistic diagnosis. The benchmark covers disease classification, grounding, tooth-level assessments, regional evaluations, abnormality flags, and recognition of clinically relevant non-pathological findings such as orthodontic appliances, restorations, and surgical devices. It is also stratified into Simple, Moderate, and Complex subsets, each with 100 images.

This progression indicates a clear trajectory in the MMOral line of research. The original work emphasizes domain-specific instruction tuning and benchmark construction for panoramic VQA and reporting; the later benchmark family adds unified multimodality, explicit reasoning supervision, and eventually agentic, symmetry-aware, tool-using diagnosis (Hao et al., 11 Sep 2025, Hao et al., 27 Nov 2025, Fan et al., 6 Mar 2026). A plausible implication is that MMOral has evolved from a panoramic dental dataset into a naming convention for a broader suite of dental multimodal evaluation problems.

6. Limitations, methodological issues, and significance

The original MMOral paper identifies limited imaging-modality diversity as a primary constraint because the dataset focuses on panoramic X-rays rather than the full spectrum of oral imaging (Hao et al., 11 Sep 2025). The authors explicitly state that future work should expand to periapical X-rays, intraoral photographs, cephalometric radiographs, CBCT, and MRI. They also acknowledge that the labels come from public datasets without independent third-party validation, even though they attempt to reduce noise through overlapping specialist models, post-processing, and report refinement.

Later MMOral-family papers retain related methodological tensions. MMOral-Uni and MMOral-X both adopt LLM-as-judge evaluation for open-ended scoring, using GPT-5-mini and emphasizing synonym tolerance, clinically equivalent wording, and graded correctness rather than exact match (Hao et al., 27 Nov 2025, Fan et al., 6 Mar 2026). Those papers report stability analyses and dentist-comparison experiments to justify the approach, but they also acknowledge randomness and subjective variability in LLM-based judging. This means that MMOral-style evaluation is not a simple deterministic metric pipeline; it is an attempt to balance clinical-semantic fidelity with the practical need to score open-ended radiological reasoning.
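One common way to handle judge randomness is to score each response several times and report both the mean and the spread. The sketch below illustrates that idea only; it is not the protocol from the MMOral-family papers. The `Judge` signature, the 0-to-1 score range, and the stub judge (which replays fixed scores instead of calling a model) are all assumptions made so the example runs offline.

```python
import statistics
from typing import Callable, Iterable

# Hypothetical judge interface: (question, reference, response) -> score in [0, 1].
Judge = Callable[[str, str, str], float]


def make_stub_judge(scores: Iterable[float]) -> Judge:
    """Return a stand-in judge that replays fixed scores (no LLM involved)."""
    iterator = iter(scores)

    def judge(question: str, reference: str, response: str) -> float:
        return next(iterator)

    return judge


def judged_score(judge: Judge, question: str, reference: str,
                 response: str, runs: int = 3) -> tuple[float, float]:
    """Query the judge `runs` times; return (mean, population std dev)."""
    scores = []
    for _ in range(runs):
        score = judge(question, reference, response)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"judge returned out-of-range score: {score}")
        scores.append(score)
    return statistics.mean(scores), statistics.pstdev(scores)


# Simulate a judge that gives partial credit on one of three runs.
stub = make_stub_judge([1.0, 0.5, 1.0])
mean, spread = judged_score(
    stub,
    question="Which tooth shows a periapical lesion?",
    reference="The lower left first molar.",
    response="Tooth 36.",
)
print(f"mean={mean:.3f} spread={spread:.3f}")
```

A non-zero spread on identical inputs is the kind of variability the papers acknowledge, which is why a single judged score is best read as an estimate.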

From a research perspective, MMOral is significant because it exposes a gap between general medical multimodal performance and dental-specific competence. The original benchmark shows that panoramic radiograph interpretation is difficult for even the strongest LVLMs tested, and the later benchmarks show that difficulty increases further when the task shifts from localized questioning to unified multimodal dentistry or holistic panoramic diagnosis (Hao et al., 11 Sep 2025, Hao et al., 27 Nov 2025, Fan et al., 6 Mar 2026). This suggests that transfer from generic medical benchmarks is insufficient for oral radiology.

The broader significance of MMOral lies in how it operationalizes clinically grounded dental AI. It ties together instruction following, report generation, VQA, dialogue, benchmark construction, and later reasoning-oriented and agentic extensions. In that sense, MMOral functions both as a resource and as a research program: dental multimodal systems should be trained on dental-specific data, evaluated on dental-specific reasoning tasks, and judged not only by generic multimodal fluency but by their ability to support clinically meaningful interpretation of oral imaging (Hao et al., 11 Sep 2025).