Attainability of Med-AGI via Multimodal Foundation Models

Determine whether scaling large multimodal foundation models, including vision–language models applied to surgical image analysis, can lead to Medical Artificial General Intelligence (Med-AGI) capable of functioning in operative surgical settings.

Background

The paper evaluates state-of-the-art vision–language models (VLMs) on surgical tool detection and finds that zero-shot performance remains at or near trivial baselines despite scaling. Fine-tuning improves results but does not close the generalization gap, and a small specialized detector (YOLOv12-m) outperforms all VLM-based approaches.
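As a rough illustration of what "at or near trivial baselines" means in a classification-style evaluation (this is not the paper's actual protocol), the sketch below compares a model's per-frame tool predictions against a majority-class baseline. All labels, predictions, and numbers are hypothetical stand-ins:

```python
from collections import Counter

def majority_baseline(labels):
    """Trivial baseline: always predict the most frequent tool class."""
    most_common = Counter(labels).most_common(1)[0][0]
    return [most_common] * len(labels)

def accuracy(preds, labels):
    """Fraction of frames where the predicted tool matches ground truth."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical ground-truth tool labels for a handful of frames.
labels = ["grasper", "scissors", "grasper", "hook", "grasper", "scissors"]

# Hypothetical zero-shot model predictions (stand-in for real VLM output).
vlm_preds = ["grasper", "grasper", "grasper", "grasper", "grasper", "hook"]

baseline_acc = accuracy(majority_baseline(labels), labels)
vlm_acc = accuracy(vlm_preds, labels)

# A zero-shot model that fails to beat this baseline is performing
# at the trivial level the paper describes.
print(f"majority baseline: {baseline_acc:.2f}, zero-shot model: {vlm_acc:.2f}")
```

Here both scores come out equal (0.50), mimicking the reported pattern where scaling alone leaves zero-shot performance indistinguishable from a trivial predictor.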

Against this backdrop, the authors explicitly note that whether these broadly trained models will ultimately yield Medical Artificial General Intelligence (Med-AGI) remains unsettled, motivating further investigation into the limits of scaling versus domain specialization.

References

Despite progress on visual tasks in domains such as surgery, whether these models would lead to Med-AGI remains an open question.

A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI (2603.27341 - Skobelev et al., 28 Mar 2026), Section 1, Introduction