
DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry (2512.11558v1)

Published 12 Dec 2025 in cs.CV, cs.AI, and cs.CL

Abstract: Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal LLMs (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features. Training on this dataset significantly enhances the MLLM's visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.

Summary

  • The paper presents DentalGPT, a multimodal language model specialized for dental diagnostics, substantially improving accuracy over generalist models.
  • DentalGPT employs a two-stage training process: extensive domain-specific data pretraining followed by reinforcement learning to enhance complex reasoning abilities.
  • Significant improvements are reported in diagnostic accuracy, with DentalGPT achieving up to 84.0% on benchmarks, outperforming larger general models.

DentalGPT: Domain-Specialized Multimodal Reasoning for Automated Dental Diagnostics

Introduction and Motivation

Recent developments in Multimodal LLMs (MLLMs) have enabled the integration of vision and language for a variety of medical applications. Nevertheless, effective domain adaptation remains a significant challenge, particularly for fields such as dentistry, where reliable interpretation depends on extracting fine-grained visual cues from dental images and accurately mapping them to domain-appropriate reasoning chains. Existing general-purpose MLLMs consistently underperform on dental multimodal tasks, failing to achieve clinician-level accuracy in disease classification and diagnosis. These suboptimal results stem from limited domain-relevant pretraining data and a lack of tailored training routines to incentivize deep, reflective visual reasoning.

Figure 1: Examples of top-tier general-purpose MLLMs analyzing a dental image task. Red indicates incorrect analysis, green indicates correct analysis, and orange highlights reflective turns in the complex reasoning process.

This motivates the introduction of DentalGPT, a highly specialized MLLM for dentistry. DentalGPT is constructed through a two-stage pipeline: large-scale domain data pretraining for enhancing multimodal understanding, and a subsequent reinforcement learning (RL) phase to unlock complex, clinically relevant reasoning in dental diagnostics (2512.11558). The model’s efficacy is empirically validated on clinically curated intraoral and panoramic datasets as well as dental-focused VQA benchmarks, demonstrating superiority to both open-source and proprietary models with significantly larger parameter footprints.

Model Architecture and Training Pipeline

Stage I: Multimodal Understanding Enhancement

DentalGPT’s backbone is initialized with Qwen2.5-VL-7B. The first stage aligns the model’s vision-language embedding space with dental domain knowledge by leveraging the largest dental image-language dataset to date, comprising more than 120,000 annotated images accompanied by systematically curated captions and QA pairs for downstream tasks.

Figure 2: The two-stage process of building DentalGPT. The Multimodal Understanding Enhancement stage uses a large dataset to align the model's medical knowledge with its multimodal understanding and prepare it for downstream tasks; Reinforcement Learning then strengthens complex reasoning ability.

All images are paired with detailed, diagnosis-oriented textual descriptions generated and verified by GPT-5, ensuring maximal domain accuracy and knowledge density. The curriculum further includes general-domain vision-language data to prevent catastrophic forgetting and overfitting.
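The paper does not publish its exact data-mixing recipe, but the idea of blending domain data with general vision-language data to prevent catastrophic forgetting can be sketched as follows. The `general_ratio` value and the sample schema are illustrative assumptions, not figures from the paper.

```python
import random

def build_stage1_mixture(dental_samples, general_samples, general_ratio=0.2, seed=0):
    """Mix domain-specific dental samples with general vision-language
    samples to reduce catastrophic forgetting during Stage I training.

    Each sample is assumed to be a dict like:
        {"image": <path>, "text": <caption or QA pair>}
    The ratio is illustrative; the paper does not specify the exact mix.
    """
    rng = random.Random(seed)
    n_general = int(len(dental_samples) * general_ratio)
    mixture = list(dental_samples) + rng.sample(general_samples, n_general)
    rng.shuffle(mixture)
    return mixture
```

In practice, the general-domain fraction would be tuned so that domain specialization improves without eroding the backbone's broad multimodal competence.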

Stage II: Reinforcement Learning for Clinical Reasoning

Subsequent to supervised alignment, DentalGPT undergoes RL with Group Relative Policy Optimization (GRPO) on a novel, independently sourced set of dental images and multiple-choice QA tasks. GRPO incentivizes chains of thought (CoT) reasoning by dynamically sampling and evaluating multiple response trajectories per prompt, rewarding both correctness and format adherence. This approach explicitly trains the model to generate reflective, multi-step diagnostic reasoning under domain constraints, a critical property for high-stakes clinical application.
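The core of GRPO is that each sampled trajectory's advantage is computed relative to the other trajectories drawn for the same prompt, so no learned value function is needed. A minimal sketch of the group-relative advantage and a composite correctness-plus-format reward, with illustrative weights not taken from the paper:

```python
import statistics

def grpo_advantages(rewards):
    """GRPO normalizes each sampled response's reward against the
    mean and standard deviation of its prompt group, yielding a
    per-trajectory advantage without a learned critic."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:  # all trajectories tied: no learning signal
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def trajectory_reward(is_correct, follows_format, w_acc=1.0, w_fmt=0.2):
    """Composite reward combining answer correctness with format
    adherence (e.g., a well-formed reasoning-then-answer template).
    The weights here are illustrative assumptions."""
    return w_acc * float(is_correct) + w_fmt * float(follows_format)
```

For example, in a group of four sampled responses where two are rewarded and two are not, the rewarded pair receives a positive advantage and the others a symmetric negative one, pushing probability mass toward correct, well-formatted reasoning.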

Data Engineering and Quality Control

A rigorous data engineering workflow consolidates open-source resources, hospital archives, and new expert-annotated samples to correct for bias and ensure comprehensive coverage of conditions and classes.

Figure 3: Annotation examples from different dental image collections.

The resulting dataset is subject to multi-stage quality assurance: synthetic caption/QA generation and expert label cross-validation, followed by automated multi-dimensional assessment using Gemini-2.5-Pro. Compared with direct LLM self-distillation, DentalGPT’s training corpus scores higher on description completeness, terminology consistency, and knowledge depth.

Figure 4: Gemini-2.5-Pro's multi-dimensional evaluation of GPT-5–distilled data and the training dataset of DentalGPT, highlighting gains in professional quality and knowledge density.
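A multi-dimensional quality gate of this kind can be sketched as a per-dimension threshold check over judge scores. The dimension names echo those evaluated above, but the score scale and thresholds are hypothetical assumptions for illustration:

```python
def passes_quality_gate(scores, thresholds=None):
    """Keep a training sample only if every quality dimension scored by
    an LLM judge clears its threshold. Dimension names follow the
    evaluation axes in the text; the 0-5 scale and cutoffs are
    illustrative assumptions."""
    if thresholds is None:
        thresholds = {
            "completeness": 3.5,
            "terminology_consistency": 3.5,
            "knowledge_depth": 3.0,
        }
    # A missing dimension counts as a failing score of 0.
    return all(scores.get(dim, 0.0) >= t for dim, t in thresholds.items())
```

Gating on every dimension (rather than an averaged score) prevents a sample with one glaring deficiency from slipping through on the strength of its other axes.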

Benchmark Construction and Experimental Setup

A comprehensive evaluation suite is constructed, comprising:

  • MMOral-OPG-Bench: High-quality panoramic X-ray VQA with expert labels.
  • DentalBench-Mixed: Aggregated tooth-image subsets from established medical VQA datasets.
  • Expert-annotated Benchmarks: Multi-label intraoral and panoramic benchmarks with strict cross-validation among board-certified dentists, ensuring robust clinical validity.

    Figure 5: Examples of Expert-annotated Benchmarks.

Cross-validation protocols filter out disagreements and uncertain cases, ensuring that only consensus-labeled samples are retained for evaluation.
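A consensus filter of this kind can be sketched as follows; `min_agreement` is a hypothetical parameter, with 1.0 corresponding to the unanimous-only retention described above:

```python
from collections import Counter

def consensus_label(annotations, min_agreement=1.0):
    """Return the consensus label for a sample if the required fraction
    of annotators agree; otherwise return None so the sample is dropped.
    With min_agreement=1.0 only unanimous samples are retained."""
    if not annotations:
        return None
    label, count = Counter(annotations).most_common(1)[0]
    return label if count / len(annotations) >= min_agreement else None
```

Relaxing `min_agreement` below 1.0 would admit majority-vote labels, trading some label certainty for larger benchmark coverage.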

Results

DentalGPT, with only 7B parameters, delivers pronounced improvements in diagnostic accuracy across all curated dental image understanding and VQA benchmarks, notably outperforming state-of-the-art open-source, commercial, and even much larger proprietary MLLMs.

  • On MMOral-OPG-Bench: 60.0% accuracy, significantly ahead of best generalist MLLMs.
  • On Intraoral and Panoramic Classification Benchmarks: Up to 84.0% accuracy, representing substantial gains over both the Qwen2.5-VL-7B-Instruct base and significantly larger competitors.

    Figure 6: Accuracy (%) of MLLMs with and without the complex reasoning mode on the MMOral-OPG-Bench.

Ablation analysis shows that both staged data alignment and RL are indispensable: staged enhancement with increasing domain-data scale raises the RL training ceiling, and RL further improves CoT-driven diagnostic reliability.

Figure 7: Accuracy reward (%) of MLLM during RL training under different scales of the Stage I dataset.

Case studies provide qualitative evidence of iterative improvement: DentalGPT with RL-augmented reasoning shows genuinely reflective analysis on challenging multimodal tasks, correctly arriving at nuanced diagnoses unreachable by baseline models.

Figure 8: Examples of DentalGPT and its backbones analyzing a multimodal task in dentistry. Red indicates incorrect analysis, green indicates correct analysis, and orange highlights reflective turns in the complex reasoning process.

Implications and Future Directions

DentalGPT sets a new paradigm for specialty-centric foundation models in medical AI. The demonstrated efficacy of staged domain alignment and explicit reasoning optimization indicates that parameter count is not the sole determinant of clinical utility for MLLMs. The results suggest that substantial gains are attainable through high-quality, domain-specific supervision and reinforcement objectives tailored to the reasoning demands of the target field.

Practical deployment in clinical decision support and real-time consultation is supported by the compact architecture. The framework is extensible to other medical modalities or granular subfields, provided analogous domain curation and training pipelines are applied. Theoretical implications include evidence for the upper bound of reasoning capability attainable via staged RL and the necessity of reflective CoT under expert supervision.

Areas for future exploration include integrating temporal imaging data (e.g., dental progression over visits), multimodal fusion with patient history or structured electronic records, and exploring RLHF variants for higher-quality diagnostic explanations. Further, benchmarking DentalGPT on international and rare-condition cohorts will be essential for real-world generalizability and safety assessment.

Conclusion

DentalGPT advances the state of the art in multimodal reasoning for automated dental diagnostics, illustrating that systematic domain data engineering and staged reinforcement can yield clinically proficient models even at modest model scales (2512.11558). This work underscores the importance of specialty-specific foundations for medical AI and sets the stage for broader applications in precision healthcare, trustworthy automated interpretation, and collaborative practitioner–AI workflows.

