GI-Bench: Endoscopic MLLM Evaluation
- GI-Bench is a comprehensive evaluation platform for multimodal large language models in gastrointestinal endoscopy, integrating realistic clinical workflows.
- It combines anatomical, diagnostic, descriptive, and management tasks across 20 lesion categories using robust metrics like Macro-F1 and mIoU.
- The platform exposes critical limitations including spatial grounding bottlenecks and a fluency-accuracy paradox, setting a high standard for clinical AI translation.
GI-Bench is a comprehensive evaluation platform for Multimodal LLMs (MLLMs) in gastrointestinal endoscopy, operationalizing a five-stage clinical workflow to reveal the "knowledge-experience dissociation" between algorithmic and human performance. The benchmark addresses the need for panoramic, workflow-aligned, and clinically relevant assessment of MLLMs by integrating anatomical, diagnostic, descriptive, and management tasks across twenty lesion categories. GI-Bench’s dynamic leaderboard and rigorous task metrics position it as a reference standard for both model development and clinical translation in endoscopic AI (Zhu et al., 13 Jan 2026).
1. Motivation and Conceptual Scope
GI-Bench was developed to bridge critical gaps in MLLM evaluation for endoscopic practice. Conventional benchmarks in medical computer vision tend to isolate singular classification or visual question-answering challenges, omitting the multistep reasoning and spatial judgment required in clinical workflows. GI-Bench emulates the full cognitive pipeline of endoscopy, asking whether an MLLM can not only recall textbook knowledge but also apply spatial precision and interpretative expertise at the level of practicing clinicians.
The five-stage workflow encoded in GI-Bench comprises:
- Q1 – Anatomical Localization: Determination of the gastrointestinal segment (e.g., esophagus, stomach, duodenum, colorectum).
- Q2 – Lesion Identification (Spatial Grounding): Localization and bounding box annotation of pathology.
- Q3 – Diagnosis: Selection of the most likely lesion category from a closed set.
- Q4 – Findings Description: Generation of a clinically valid, concise report on observed endoscopic features.
- Q5 – Management Recommendations: Suggestion of next steps consistent with established clinical guidelines.
This workflow-aligned framework directly matches test protocols to endoscopic reality and sets a higher bar for assessing MLLM utility.
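The five stages map naturally onto a per-image record. The sketch below shows one plausible schema for a GI-Bench item; the field names and example values are illustrative assumptions, not the benchmark's actual data format.

```python
from dataclasses import dataclass


@dataclass
class GIBenchItem:
    """One endoscopic image with its five workflow tasks (hypothetical schema)."""
    image_id: str
    q1_segment: str                      # Q1: GI segment, e.g. "stomach"
    q2_bbox: tuple[int, int, int, int]   # Q2: lesion box (x_min, y_min, x_max, y_max)
    q3_diagnosis: str                    # Q3: lesion class from the 20-category closed set
    q4_findings: str                     # Q4: free-text findings description
    q5_management: str                   # Q5: guideline-consistent next step


item = GIBenchItem(
    image_id="img_0001",
    q1_segment="colorectum",
    q2_bbox=(120, 84, 312, 260),
    q3_diagnosis="adenomatous polyp",
    q4_findings="Sessile polyp with regular pit pattern on white-light imaging.",
    q5_management="Endoscopic resection with histopathological evaluation.",
)
```

Structuring each item this way keeps the five tasks jointly indexed per image, which is what enables the stage-by-stage human-versus-model comparison described below.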
2. Benchmark Structure and Annotation Protocol
GI-Bench’s image bank is curated from both gastroscopy and colonoscopy archives at a tertiary center, under STARD-AI guideline-driven quality control. It comprises twenty meticulously stratified lesion classes, representing a spectrum of pathologies in the esophagus, stomach, and colorectum. Each class is initially populated with fifty high-resolution, white-light images; the intersection of valid model outputs yields a benchmarking pool of approximately 947 images.
A dedicated "Human–AI Comparison" subset (60 images, evenly sampled per class) enables direct head-to-head evaluation with three junior endoscopists and three residency trainees.
Key annotation protocols include:
- Dual-track, expert adjudication of disease categories, referencing pathology or clinical reports.
- Reconciled consensus bounding box creation via LabelMe annotation and box difference resolution.
- Independent dataset splits and audit logs to enable reproducibility and standardized assessment.
The exclusive use of static, de-identified endoscopic images confines the benchmark to single-frame inference, avoiding temporal confounds.
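The consensus bounding-box protocol above can be illustrated with a minimal reconciliation rule. This is a hypothetical sketch, not the paper's exact resolution procedure: when two annotators' boxes overlap above an IoU threshold, average their coordinates; otherwise escalate to expert adjudication.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def reconcile(box_a, box_b, threshold=0.5):
    """Average two annotators' boxes if they agree above the IoU threshold.

    Returns None when agreement is too low, flagging the image for
    expert consensus review (assumed escalation path).
    """
    if iou(box_a, box_b) >= threshold:
        return tuple((p + q) / 2 for p, q in zip(box_a, box_b))
    return None
```

For example, `reconcile((10, 10, 100, 100), (12, 8, 98, 104))` yields an averaged consensus box, while two disjoint boxes return `None` and would be routed to adjudication.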
3. Metrics and Rating Procedures
GI-Bench employs multimodal metrics tailored to the target task type:
- Classification/Localization Tasks (Q1, Q2, Q3):
- Macro-F1 Score: For $C$ classes, with per-class F1 given by $F1_c = \frac{2\,P_c R_c}{P_c + R_c}$, the overall Macro-F1 is $\text{Macro-F1} = \frac{1}{C}\sum_{c=1}^{C} F1_c$.
- Mean Intersection-over-Union (mIoU): For $N$ images with predicted bounding box $B_i^{\text{pred}}$ and ground-truth $B_i^{\text{gt}}$, $\text{IoU}_i = \frac{|B_i^{\text{pred}} \cap B_i^{\text{gt}}|}{|B_i^{\text{pred}} \cup B_i^{\text{gt}}|}$; overall $\text{mIoU} = \frac{1}{N}\sum_{i=1}^{N} \text{IoU}_i$.
- Generative/Descriptive Tasks (Q4, Q5):
- Five-Dimension Likert Scale (1-5): Each output is rated by blinded clinical experts (validated by a GPT-5 adjudicator, intraclass correlation >0.91) in: linguistic quality/readability, evidence grounding/feature coverage, factual and clinical correctness, actionability/guideline alignment, safety/risk management.
All metric calibrations are statistically analyzed, including significance (p-values) for model-versus-human comparisons.
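Both closed-form metrics can be sketched directly from their definitions. The implementation below is a minimal reference version (not the benchmark's official scoring code), with boxes in `(x_min, y_min, x_max, y_max)` form.

```python
def macro_f1(y_true, y_pred):
    """Macro-F1: unweighted mean of per-class F1 over the classes in y_true."""
    classes = set(y_true)
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); zero when the class is never correctly predicted
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(classes)


def mean_iou(pred_boxes, gt_boxes):
    """mIoU over aligned (prediction, ground-truth) box pairs."""
    def iou(a, b):
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union else 0.0
    return sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)
```

Because Macro-F1 averages per-class scores unweighted, it penalizes models that neglect rare lesion categories, which matters for a benchmark stratified across twenty classes.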
4. Model Benchmarking and Comparative Results
GI-Bench systematically evaluated twelve advanced MLLMs (proprietary, general open-source, and medical-tuned open-source) against clinician baselines. Gemini-3-Pro emerged as the top model across major tasks. Key results include:
- Anatomical Localization (Q1, Macro-F1): Best model ~0.231 (Gemini-3-Pro), statistically on par with trainees (0.226) and juniors (0.251).
- Lesion Localization (Q2, mIoU): Top model (Gemini-2.5-Pro, 0.385) fell distinctly below both trainees (0.506) and juniors (0.543).
- Diagnosis (Q3, Macro-F1): Gemini-3-Pro (0.641) significantly exceeded trainees (0.492) and performed comparably to juniors (0.727).
- Findings Description (Q4, Likert): Gemini-3-Pro (2.557 ± 0.750) matched human readability, with no significant difference.
- Management Recommendations (Q5, Likert): Gemini-3-Pro (2.867 ± 1.039) significantly outperformed both trainees and juniors in guideline-aligned actionability.
The following table summarizes selected comparative results:
| Task | Best Model Score | Trainee Avg | Junior Endoscopist Avg |
|---|---|---|---|
| Q1 (Macro-F1) | ~0.231 | 0.226 | 0.251 |
| Q2 (mIoU) | 0.385 | 0.506 | 0.543 |
| Q3 (Macro-F1) | 0.641 | 0.492 | 0.727 |
| Q4 (Likert) | 2.557 ± 0.750 | 2.371 ± 0.368 | 2.442 ± 0.504 |
| Q5 (Likert) | 2.867 ± 1.039 | 2.126 ± 0.682 | 1.950 ± 0.263 |
Notation: * denotes statistical significance versus the residency-trainee average; † versus the junior-endoscopist average.
5. Analytical Insights: Bottlenecks and Paradoxes
GI-Bench revealed two major limitations in current MLLM paradigms:
Spatial Grounding Bottleneck: All evaluated MLLMs are substantially inferior to humans in pixel-level lesion localization; the best model's mIoU (0.385) falls statistically below both human baselines (>0.50). This deficit reflects vision transformers' proficiency at capturing global semantics coupled with their failure to resolve the high-resolution spatial detail needed to pinpoint biopsy targets and plan procedures.
Fluency-Accuracy Paradox: Several MLLMs, notably GPT-4o, systematically produce endoscopic reports that are more linguistically fluent than those of human experts. This fluency, however, coexists with reduced factual and clinical accuracy. Qualitative error modes include "over-interpretation"—hallucinating vascular or mucosal patterns not actually present—and "defensive medicine," wherein models enumerate excessive procedural options instead of tailoring recommendations. The plausible implication is a hidden risk in clinical deployment, as convincing yet incorrect narratives could mislead users without expert oversight.
6. Platform Architecture and Leaderboard Dynamics
GI-Bench is underpinned by a public, continuously updated leaderboard (https://roterdl.github.io/GIBench/), tracking model and human performance per task. The evaluation workflow leverages community submissions via GitHub forks: users upload outputs in a prescribed JSON template, triggering CI-based automated benchmarking and leaderboard refresh. This infrastructure supports rapid evolutionary tracking of advances in endoscopic MLLM capability and democratizes performance reporting.
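The authoritative JSON template lives in the GI-Bench GitHub repository; the snippet below is only a hypothetical illustration of what a per-image prediction record in such a submission might look like, with all field names assumed rather than taken from the official schema.

```python
import json

# Hypothetical submission payload; the real template in the GI-Bench
# repository may differ in field names and structure.
submission = {
    "model_name": "example-mllm-v1",
    "predictions": [
        {
            "image_id": "img_0001",
            "q1_segment": "stomach",
            "q2_bbox": [120, 84, 312, 260],
            "q3_diagnosis": "early gastric cancer",
            "q4_findings": "Irregular depressed lesion with marginal elevation.",
            "q5_management": "Endoscopic resection evaluation after biopsy confirmation.",
        }
    ],
}

payload = json.dumps(submission, indent=2)
```

A machine-readable template of this kind is what lets the CI pipeline score forked submissions automatically and refresh the leaderboard without manual review.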
7. Clinical Relevance and Prospective Development
The GI-Bench framework equips the research community to explicitly quantify gaps between algorithmic and experiential cognition in endoscopic AI. Future research is steered toward:
- Enhancing spatial grounding via patch-level embeddings and retrieval-augmented pretraining.
- Extending benchmarking to video, capturing motion and temporal cues integral to live procedures.
- Diversifying datasets across centers and device vendors to reinforce generalizability.
In current practice, GI-Bench advocates a "Human-Directed, AI-Refined" paradigm, where clinicians maintain ultimate control and judgment, with MLLMs positioned as copilots for report generation and guideline adherence. This approach facilitates workflow efficiency augmentation without sacrificing patient safety (Zhu et al., 13 Jan 2026).