MMOral-OPG-Bench: Dental LVLM Benchmark
- MMOral-OPG-Bench is a specialized multimodal evaluation suite that assesses large vision-language models on panoramic dental radiograph interpretation using clinical diagnostic criteria.
- It incorporates a rigorously annotated dataset with 100 OPG images and 1100 VQA items, ensuring zero-shot evaluation and clear separation from training data.
- Benchmark results reveal robust jaw analysis but highlight challenges with tooth-level detection and free-text report synthesis, driving targeted improvements in dental AI.
MMOral-OPG-Bench is a domain-specific, multimodal benchmark and evaluation suite designed to assess large vision-language models (LVLMs) on panoramic dental radiograph (orthopantomogram, OPG) interpretation. It directly models the interpretative workflow of oral radiology specialists, addressing both gross morphological assessment and subtle pathological reasoning, neither of which is comprehensively covered by prior general medical vision-language resources (Hao et al., 11 Sep 2025, Hao et al., 27 Nov 2025).
1. Benchmark Scope and Rationale
MMOral-OPG-Bench was established to address the unique complexity of panoramic dental X-rays. These images capture a comprehensive, densely structured view of maxillofacial anatomy, requiring reasoning about individual teeth, bone integrity, previous interventions, and subtle disease manifestations. Its design objectives are:
- To evaluate LVLM performance on tasks that emulate real diagnostician workflows, using OPGs.
- To analyze model competence across five core diagnostic dimensions: Teeth (identification, numbering, alignment), Patho (caries, lesions, cysts), HisT (historical treatments such as fillings or implants), Jaw (bone loss, anatomical architecture), and SumRec (summaries and recommendations).
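For orientation, these five dimensions can be captured as a small lookup structure. The sketch below is illustrative only; the keys and descriptions follow the text above, not the benchmark's released schema:

```python
# Minimal sketch of the five MMOral-OPG-Bench diagnostic dimensions.
# Keys and descriptions follow the text above; the released benchmark
# may name these fields differently.
DIAGNOSTIC_DIMENSIONS = {
    "Teeth":  "tooth identification, numbering, and alignment",
    "Patho":  "pathological findings such as caries, lesions, and cysts",
    "HisT":   "historical treatments such as fillings or implants",
    "Jaw":    "bone loss and overall anatomical architecture",
    "SumRec": "summaries and clinical recommendations",
}

def describe(dimension: str) -> str:
    """Return the human-readable description of a diagnostic dimension."""
    return DIAGNOSTIC_DIMENSIONS[dimension]
```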
This framework enables a fine-grained, clinically grounded understanding of where current multimodal models succeed or fail in oral radiology (Hao et al., 11 Sep 2025).
2. Dataset Construction and Organization
MMOral-OPG-Bench is extracted as the evaluation split of the broader MMOral resource, which contains 20,563 annotated panoramic radiographs and 1.3 million instruction-following QA pairs (Hao et al., 11 Sep 2025). Key composition details include:
- Test split: 100 independent OPG images, each sourced from a high-quality, publicly available dental radiograph dataset.
- Tasks: 500 closed-ended (multiple-choice) VQA items and 600 open-ended VQA items linked to these images, ensuring each diagnostic dimension is probed.
- No Train/Val Leakage: All patient duplication is eliminated; none of the 100 benchmark images or their QA pairs overlap with model training data, supporting strict zero-shot evaluation protocols.
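A minimal sketch of how such zero-shot separation can be checked programmatically is shown below; the manifest file names and the `image_id`/`type` fields are hypothetical illustrations, not the released file layout:

```python
import json

# Hypothetical manifests: each entry is assumed to carry "image_id" and "type"
# fields. File names and field names are illustrative only.
with open("mmoral_train_manifest.json") as f:
    train_ids = {item["image_id"] for item in json.load(f)}

with open("mmoral_opg_bench_manifest.json") as f:
    bench = json.load(f)

bench_ids = {item["image_id"] for item in bench}

# Zero-shot protocol: the 100 benchmark OPGs must be disjoint from training data.
assert len(bench_ids) == 100, f"expected 100 benchmark images, got {len(bench_ids)}"
assert not (bench_ids & train_ids), "benchmark/train overlap detected"

# 500 closed-ended and 600 open-ended VQA items are expected in total.
n_closed = sum(1 for item in bench if item.get("type") == "closed")
n_open = sum(1 for item in bench if item.get("type") == "open")
print(f"closed-ended: {n_closed}, open-ended: {n_open}")
```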
Additional multimodal oro-dental datasets, such as the one in (Lv et al., 7 Nov 2025), extend this structure with thousands of checkups, intraoral photographs, CBCT-based radiographs, and fully structured bilingual textual records, supporting broader diagnostic and generative tasks.
3. Task Definitions and Input-Output Protocols
MMOral-OPG-Bench structures its evaluation across four task modalities, with each mapping to targeted diagnostic operations:
- Attribute Extraction: Input is a single OPG; output is a list of bounding boxes, class categories, and confidence values for up to 49 fine-grained subcategories (tooth numbers, caries, implants).
- Report Generation: Input is a programmatically generated "grounding caption" listing all detected attributes; output is a multi-section clinical report comprising teeth observations, jaw findings, and summary recommendations.
- Visual Question Answering (VQA):
- Closed-ended: Given an image and a question, the output is a categorical answer (multiple-choice: A/B/C/D).
- Open-ended: Given an image and a question, the output is free-form natural language, scored on a [0,1] scale by an LLM judge.
- Image-Grounded Dialogue: Interactive scenario (“chat”) between a user (patient or clinician) and model, wherein the model’s answers must be grounded in OPG findings and phrased in accessible language (Hao et al., 11 Sep 2025).
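The closed- and open-ended VQA protocols can be made concrete with a short sketch. The item fields, prompt wording, and judge rubric below are hypothetical illustrations of the input-output contract described above, not the benchmark's released prompts:

```python
from dataclasses import dataclass

@dataclass
class VQAItem:
    image_path: str          # path to the OPG image
    question: str
    task_type: str           # "closed" or "open"
    options: dict | None     # {"A": ..., "B": ..., "C": ..., "D": ...} for closed items
    reference: str           # gold answer (option letter or free-text reference)
    dimension: str           # "Teeth", "Patho", "HisT", "Jaw", or "SumRec"

def build_prompt(item: VQAItem) -> str:
    """Assemble the text prompt that accompanies the OPG image."""
    if item.task_type == "closed":
        choices = "\n".join(f"{k}. {v}" for k, v in item.options.items())
        return (f"{item.question}\n{choices}\n"
                "Answer with a single letter (A, B, C, or D).")
    return item.question  # open-ended: free-form natural-language answer expected

def judge_prompt(item: VQAItem, model_answer: str) -> str:
    """Rubric given to the LLM judge for open-ended items (illustrative wording)."""
    return (
        "You are grading a dental OPG interpretation.\n"
        f"Question: {item.question}\n"
        f"Reference answer: {item.reference}\n"
        f"Model answer: {model_answer}\n"
        "Return a single score between 0 and 1 reflecting clinical correctness."
    )
```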
4. Evaluation Metrics and Scoring Procedures
MMOral-OPG-Bench employs two primary evaluation paradigms:
- Closed-Ended Accuracy: the fraction of multiple-choice items for which the predicted option matches the gold option,

  $$\mathrm{Acc} = \frac{1}{N_{\mathrm{closed}}}\sum_{i=1}^{N_{\mathrm{closed}}} \mathbb{1}\!\left[\hat{a}_i = a_i\right].$$

- Open-Ended LLM-Assisted Scoring: for each item $i$, a pretrained LLM (e.g., GPT-4-turbo or GPT-5-mini) assigns a score $s_i \in [0,1]$ to the model's generated answer. Scores are averaged as

  $$\bar{s} = \frac{1}{N_{\mathrm{open}}}\sum_{i=1}^{N_{\mathrm{open}}} s_i,$$

  and, for a particular diagnostic category $c$ with item set $I_c$,

  $$\bar{s}_c = \frac{1}{|I_c|}\sum_{i \in I_c} s_i.$$

Standard classification/detection metrics (e.g., precision, recall, F1, mAP) are referenced for completeness but not directly applied in MMOral-OPG-Bench VQA tasks.
Per-category and micro-averaged global results are reported to delineate performance strengths and failure modes (Hao et al., 11 Sep 2025, Hao et al., 27 Nov 2025).
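As a concrete reference, the sketch below computes closed-ended accuracy, per-category open-ended judge scores, and a micro-averaged global score following the definitions above; the record structure is an assumption, not the official evaluation script:

```python
from collections import defaultdict

def evaluate(records):
    """records: list of dicts with keys 'type' ("closed"/"open"), 'dimension',
    and either 'correct' (bool, closed items) or 'judge_score' (float in [0,1],
    open items). Field names are illustrative, not the official schema."""
    closed = [r for r in records if r["type"] == "closed"]
    open_ = [r for r in records if r["type"] == "open"]

    # Closed-ended accuracy: fraction of correctly answered multiple-choice items.
    closed_acc = sum(r["correct"] for r in closed) / len(closed)

    # Open-ended: mean LLM-judge score overall and per diagnostic category.
    open_mean = sum(r["judge_score"] for r in open_) / len(open_)
    by_cat = defaultdict(list)
    for r in open_:
        by_cat[r["dimension"]].append(r["judge_score"])
    per_category = {c: sum(v) / len(v) for c, v in by_cat.items()}

    # Micro-averaged global score over all items (closed items scored 0/1).
    all_scores = [float(r["correct"]) for r in closed] + [r["judge_score"] for r in open_]
    micro_avg = sum(all_scores) / len(all_scores)

    return {"closed_acc": closed_acc, "open_mean": open_mean,
            "per_category": per_category, "micro_avg": micro_avg}
```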
5. Baseline and Fine-Tuned Model Performance
A large-scale zero-shot comparison revealed persistent challenges for generalist and medical-specialized LVLMs. Representative findings include:
- GPT-4o (Nov 2024) achieved the highest baseline: 45.40% accuracy (closed-ended) and 37.50% (open-ended), resulting in an overall 41.45% (Hao et al., 11 Sep 2025).
- HealthGPT-XL32 (medical-specialized) obtained 39.59% overall; no proprietary or open-source generalist LVLM exceeded GPT-4o’s performance.
- Models demonstrated higher scores on the “Jaw” dimension (e.g., bone pattern, canals) but were systematically weaker on “Teeth,” “Patho,” and “HisT,” indicating difficulties with tooth-level and fine-grained pathology reasoning.
Supervised finetuning with MMOral’s instruction-following data resulted in substantial improvements:
- Base Qwen 2.5-VL-7B (zero-shot) overall: 21.46%.
- After one epoch of SFT (report, VQA, chat data): 46.19% (+24.73 pp).
OralGPT-Omni, leveraging the TRACE-CoT reasoning corpus and a staged training design, achieved a new state of the art of 45.31 overall, with notable gains in the Patho and HisT dimensions (Hao et al., 27 Nov 2025).
| Model | Teeth | Patho | HisT | Jaw | SumRec | Report | Overall |
|---|---|---|---|---|---|---|---|
| GPT-5 | 39.77 | 29.32 | 44.05 | 78.56 | 40.12 | 28.20 | 42.42 |
| GPT-4V | 31.46 | 23.79 | 39.51 | 69.81 | 34.29 | 43.70 | 39.38 |
| HealthGPT-XL32 | 30.64 | 25.83 | 27.98 | 51.12 | 17.02 | 8.00 | 27.80 |
| OralGPT-Omni | 37.26 | 43.94 | 55.34 | 70.50 | 38.57 | 37.90 | 45.31 |
These quantitative results illustrate both the progress from dental instruction finetuning and the persistent ceiling in free-form diagnostic language generation and tooth-level error sensitivity (Hao et al., 11 Sep 2025, Hao et al., 27 Nov 2025).
6. Challenges, Limitations, and Future Perspectives
Persistent limitations are evident in all models, including state-of-the-art dental-specialized MLLMs:
- Free-text report synthesis remains a major challenge, with all models struggling to generate coherent, structured multi-section diagnostic narratives from a single OPG.
- Panoramic radiographs uniquely stress models due to anatomical overlap, variation in patient pose, and complex local/global pathology integration.
- The LLM-judge paradigm shows stable inter-score variation (CV < 1%) but is inherently a proxy for human expert consensus, not a replacement.
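The judge-stability figure (coefficient of variation under 1%) corresponds to a simple calculation over repeated judge runs, as in the sketch below; the run scores are placeholders, not published numbers:

```python
import statistics

# Hypothetical overall scores from repeated LLM-judge scoring runs of the same
# model outputs (placeholder values, not published results).
repeated_run_scores = [45.2, 45.4, 45.3, 45.5, 45.3]

mean = statistics.mean(repeated_run_scores)
std = statistics.stdev(repeated_run_scores)    # sample standard deviation
cv_percent = 100 * std / mean                  # coefficient of variation

print(f"mean={mean:.2f}, std={std:.3f}, CV={cv_percent:.2f}%")
# A CV below 1% indicates that repeated judge runs produce nearly identical scores.
```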
Proposed research directions include:
- Modal diversity: Extending the benchmark to periapical films, intraoral photos, cephalometric projections, and 3D CBCT/MRI for generalist dental AI.
- Spatial supervision: Incorporating region-level and structured templates to enhance spatial grounding and template-driven reporting.
- Consensus and standardization: Integrating multi-expert human grading, established dental ontologies (e.g., FDI tooth numbering, WHO pathology codes), and hybrid classification/VQA/detection metrics.
- Context-rich evaluation: Fusing radiograph data with patient metadata and clinical textual notes to more closely mimic full digital dentistry scenarios (Hao et al., 11 Sep 2025, Hao et al., 27 Nov 2025).
A plausible implication is that future benchmarks will adopt hybrid endpoints—combining region-localized detection, structured report evaluation, and contextualized dialogue—for comprehensive, clinically relevant assessment.
7. Significance in the Advancement of Dental AI
MMOral-OPG-Bench, alongside its released datasets and baseline models, defines the current gold standard for panoramic X-ray LVLM evaluation in dentistry (Hao et al., 11 Sep 2025, Lv et al., 7 Nov 2025). Its rigorously annotated, multiperspective design supports reproducible, zero-shot assessment and fine-tuning of both generalist and domain-specialized large models. The pronounced performance gap, especially in nuanced clinical reasoning and free-form reporting, establishes an actionable target for the next generation of multimodal dental AI systems. The benchmark’s continued evolution, towards richer imaging inclusion and expert-validated ground truth, is expected to play a pivotal role in the development of reliable, high-impact dental decision-support tools.