
EndoSight AI: Modular AI for GI Endoscopy

Updated 24 November 2025
  • EndoSight AI is an integrated, modular AI platform for gastrointestinal endoscopy and minimally invasive surgery that employs deep neural networks, interpretability modules, and robotics.
  • It leverages diverse architectures such as EfficientNet, YOLO models, and transformers to achieve high performance, with metrics like mAP up to 97.8% and real-time inference speeds exceeding 35 FPS.
  • Designed for seamless clinical integration, EndoSight AI provides real-time procedural guidance, transparent explainability, and adaptive workflows to enhance diagnostic accuracy and surgical education.

EndoSight AI is an integrated, modular artificial intelligence platform for gastrointestinal endoscopy and minimally invasive surgery (MIS), combining deep neural networks, interpretability modules, robotics integration, and agent-based multimodal workflow orchestration. The system is designed to deliver high-accuracy lesion detection and classification, procedural guidance, surgical instrument tracking, visual reasoning, and real-time feedback, underpinned by rigorous dataset curation, scalable architectures, and transparent explainability mechanisms. EndoSight AI amalgamates advances in vision transformers, convolutional networks, foundation models, memory-guided agents, and optimized deployment protocols to meet the demands of clinical practice, surgical education, and autonomous navigation.

1. Model Architectures and Core Modules

EndoSight AI employs a diverse array of architectures targeted to distinct procedural and diagnostic tasks:

  • Deep Classification and Interpretability: For GI lesion recognition, EndoSight utilizes an EfficientNet-B3 backbone with a custom regularized dense head incorporating BatchNormalization, dropout, and L1/L2 penalties. The full model comprises ≈11.1 million parameters and addresses classification across eight Kvasir-V2 categories without heavy data augmentation. Interpretability is provided via LIME, which generates superpixel-based saliency maps identifying the regions most influential for each prediction (Kamble et al., 2 Mar 2025); a minimal sketch of the backbone, head, and LIME explanation appears after this list.
  • Polyp Detection and Segmentation: A dual-stage pipeline integrates YOLOv8n (nano) for real-time object localization and a custom U-Net for boundary segmentation. The detection module leverages anchor-free, grid-based prediction, while segmentation employs a compound Dice/BCE loss (sketched after this list). A bespoke thermal-aware training protocol regulates GPU loads during intensive training. The system achieves mAP@0.5 = 88.3% and mean Dice of 69% on the Hyper-Kvasir dataset, sustaining >35 FPS on commodity GPUs (Cavadia, 17 Nov 2025).
  • Surgical Instrument Tracking: A YOLOv5-based framework undergoes systematic ablation, replacing CSPDarknet53 with VGG-11+SPP, FPN or Bi-FPN necks, and auto-anchors to optimize speed and precision for instrument detection. Variants achieve up to 97.8% mAP@0.5 at 3.6 ms inference per frame. The system manages eight instrument classes, with labeling focused on tip localization (Onyeogulu et al., 2022).
  • Interventional US Guidance: In ICE manipulation, EndoSight fuses clinical sequences and synthetic tip overlays, using a pretrained US foundation model coupled to an 8-layer transformer for windowed temporal inference. The system predicts tip passing points and incident angles with mean errors of 3.32° (entry) and 12.76° (rotation), with bounding box IoU=0.66, enabling closed-loop robotic tip visibility control at 25 Hz (Huh et al., 8 May 2025).
  • Foundation Model Fusion for Cancer Staging: For EGJA diagnosis, EndoSight adopts a mixture-of-experts design, blending global (DINOv2 ViT-S/14) and local (ResNet-50) features via a gating network followed by a softmax classifier (a fusion sketch appears after this list). This yields test accuracy up to 0.9256, AUC = 0.9818, and patient-level accuracy of 0.9464 on large multicenter datasets, outperforming both classical CNNs and human experts (Ma et al., 22 Sep 2025).
  • Memory-Guided Agent Orchestration: A dual-memory, reflective agent (originally EndoAgent) coordinates a suite of expert tools (e.g., YOLOv8 detection, UniMed segmentation, ColonGPT VQA, GPT-4o reporting) through iterative reasoning over short-term and long-term memory. The actor LLM adaptively selects tools based on the context, leveraging feedback and memory-gating at each round. On the EndoSight Bench, this agent attains visual task accuracies >85% and language generation scores >95%, substantially exceeding contemporaneous medical MLLMs (Tang et al., 10 Aug 2025).
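
The classification module can be sketched roughly as a pretrained EfficientNet-B3 feature extractor, a BatchNorm/dropout dense head with an explicit L1/L2 penalty, and a LIME superpixel explanation. This is a minimal PyTorch/timm illustration rather than the published implementation; the head width, dropout rate, penalty weights, image size, and placeholder input are assumptions.

```python
import numpy as np
import timm
import torch
import torch.nn as nn
from lime import lime_image

NUM_CLASSES = 8  # Kvasir-V2 categories

class LesionClassifier(nn.Module):
    def __init__(self, num_classes: int = NUM_CLASSES):
        super().__init__()
        # Pretrained EfficientNet-B3 backbone, classifier removed (pooled features out).
        self.backbone = timm.create_model("efficientnet_b3", pretrained=True, num_classes=0)
        feat_dim = self.backbone.num_features
        # Regularized dense head: BatchNorm + dropout (L1/L2 added explicitly to the loss).
        self.head = nn.Sequential(
            nn.BatchNorm1d(feat_dim),
            nn.Dropout(0.4),
            nn.Linear(feat_dim, 256),
            nn.ReLU(inplace=True),
            nn.BatchNorm1d(256),
            nn.Dropout(0.4),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.head(self.backbone(x))

def l1_l2_penalty(module: nn.Module, l1: float = 1e-5, l2: float = 1e-4) -> torch.Tensor:
    """Explicit L1/L2 penalty on the dense-head weights, added to the training loss."""
    l1_term = sum(p.abs().sum() for p in module.parameters())
    l2_term = sum((p ** 2).sum() for p in module.parameters())
    return l1 * l1_term + l2 * l2_term

model = LesionClassifier().eval()

def predict_fn(images) -> np.ndarray:
    """LIME passes a batch of HxWx3 arrays; return per-class probabilities."""
    batch = torch.stack(
        [torch.from_numpy(im).permute(2, 0, 1).float() / 255.0 for im in images]
    )
    with torch.no_grad():
        return torch.softmax(model(batch), dim=1).numpy()

# Superpixel-based saliency for one frame (random placeholder instead of a real endoscopy still).
frame = np.random.randint(0, 255, (300, 300, 3), dtype=np.uint8)
explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(frame, predict_fn, top_labels=1, num_samples=200)
overlay, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5
)
```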
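The compound Dice/BCE segmentation loss referenced above can be expressed compactly as below; the 0.5/0.5 weighting and the smoothing constant are illustrative assumptions, not the published hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiceBCELoss(nn.Module):
    """Weighted sum of binary cross-entropy and soft Dice loss for mask prediction."""
    def __init__(self, dice_weight: float = 0.5, bce_weight: float = 0.5, smooth: float = 1.0):
        super().__init__()
        self.dice_weight = dice_weight
        self.bce_weight = bce_weight
        self.smooth = smooth

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # BCE computed on raw logits for numerical stability.
        bce = F.binary_cross_entropy_with_logits(logits, targets)
        # Soft Dice computed on probabilities, per sample, then averaged.
        probs = torch.sigmoid(logits).flatten(1)
        flat_targets = targets.flatten(1)
        intersection = (probs * flat_targets).sum(dim=1)
        dice = (2.0 * intersection + self.smooth) / (
            probs.sum(dim=1) + flat_targets.sum(dim=1) + self.smooth
        )
        return self.bce_weight * bce + self.dice_weight * (1.0 - dice.mean())

# Usage: criterion = DiceBCELoss(); loss = criterion(mask_logits, ground_truth_masks)
```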
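The gated global/local fusion for EGJA staging can be sketched as a two-expert mixture: a gating network computes mixing weights from the concatenated DINOv2 (global) and ResNet-50 (local) embeddings, and the blended feature feeds a linear classifier with softmax applied in the loss. The feature dimensions below are the standard backbone widths; the gating MLP width and class count are assumptions.

```python
import torch
import torch.nn as nn

class GatedExpertFusion(nn.Module):
    def __init__(self, global_dim: int = 384, local_dim: int = 2048,
                 hidden: int = 512, num_classes: int = 3):
        super().__init__()
        # Project both experts into a shared embedding space.
        self.proj_global = nn.Linear(global_dim, hidden)
        self.proj_local = nn.Linear(local_dim, hidden)
        # Gating network produces per-expert mixing weights that sum to one.
        self.gate = nn.Sequential(
            nn.Linear(global_dim + local_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),
            nn.Softmax(dim=-1),
        )
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, feat_global: torch.Tensor, feat_local: torch.Tensor) -> torch.Tensor:
        weights = self.gate(torch.cat([feat_global, feat_local], dim=-1))  # (B, 2)
        fused = (
            weights[:, :1] * self.proj_global(feat_global)
            + weights[:, 1:] * self.proj_local(feat_local)
        )
        return self.classifier(fused)  # logits; CrossEntropyLoss applies the softmax

# Usage with (assumed frozen) backbones: logits = fusion(dinov2_cls_token, resnet50_pooled)
```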

2. Dataset Curation, Preprocessing, and Training Protocols

EndoSight AI modules are trained and validated on rigorously curated, task-specific datasets:

  • GI Lesion Classification: Kvasir-V2, 8,000 images evenly distributed over eight classes, with a strict train/val/test (80/10/10) split and minimal augmentation (resize, normalization, horizontal flipping); see the preprocessing sketch after this list (Kamble et al., 2 Mar 2025).
  • Polyp Detection/Segmentation: Hyper-Kvasir (1,000 images with masks and bounding boxes), split 70/15/15 with stratified randomization. Preprocessing for detection involves letterbox resizing and YOLO-format conversion (a letterbox sketch follows this list); segmentation images are center-padded and thresholded. Minimal augmentation is used to maintain clinical realism. Optimization relies on Adam/AdamW, chunked epochs, early stopping, and the thermal-aware training protocol (Cavadia, 17 Nov 2025).
  • Instrument Detection: ~3.5K frames extracted from two porcine laparoscopic videos, labeled for eight instrument types. Augmentation includes flipping, mosaic, HSV jitter, and random scaling (Onyeogulu et al., 2022).
  • US Catheter Guidance: 5,698 ICE-tip frame pairs, constructed via hybrid fusion of clinical cine loops and synthetic overlays, using precise EM-tracked ground truth mappings. Split: 5,400 train, 48 validation, 250 test (Huh et al., 8 May 2025).
  • EGJA Staging: 12,302 images from 1,546 patients across seven hospitals, with images standardized and annotated by teams of experienced endoscopists/pathologists. Preprocessing employs resizing, normalization, online crops, and color jittering (Ma et al., 22 Sep 2025).
  • Agent Benchmarks: 5,709 visual QA pairs (EndoSight Bench), combining private and six public datasets. Distribution covers lesion classification, quantification, grounding, captioning, and reporting (Tang et al., 10 Aug 2025).
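
A minimal sketch of the Kvasir-V2 preprocessing and 80/10/10 split described above, using only resize, normalization, and horizontal flipping; the image size, normalization statistics, directory layout, and the use of a simple random (rather than per-class stratified) split are assumptions.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

train_tf = transforms.Compose([
    transforms.Resize((300, 300)),                 # assumed input resolution
    transforms.RandomHorizontalFlip(p=0.5),        # the only geometric augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Kvasir-V2 arranged as one folder per class (8 classes, 8,000 images total).
dataset = datasets.ImageFolder("kvasir-v2/", transform=train_tf)

n = len(dataset)
n_train, n_val = int(0.8 * n), int(0.1 * n)
train_set, val_set, test_set = random_split(
    dataset,
    [n_train, n_val, n - n_train - n_val],         # 80/10/10 split
    generator=torch.Generator().manual_seed(42),   # fixed seed for reproducibility
)
```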
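The letterbox step used for the YOLO-format detection inputs can be sketched as an aspect-preserving resize followed by padding to a square canvas; the 640-pixel target size and gray pad value are conventional defaults, not confirmed settings.

```python
import cv2
import numpy as np

def letterbox(img: np.ndarray, size: int = 640, pad_value: int = 114):
    """Resize while preserving aspect ratio, then pad to a size x size canvas."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(img, (new_w, new_h))
    canvas = np.full((size, size, 3), pad_value, dtype=img.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    # Return the padded image plus scale/offset needed to map boxes back.
    return canvas, scale, (left, top)
```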

Training regimes include staged fine-tuning, use of advanced optimizers (Adam, AdamW), dropout, weight decay, and learning rate scheduling, tailored to each submodule. In high-throughput settings, inference optimization via mixed-precision, ONNX export, and TensorRT is documented.
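
A minimal sketch of this deployment path, assuming a PyTorch submodule: run inference under autocast for mixed precision, then export the graph to ONNX, which can subsequently be compiled into a TensorRT engine (e.g., with trtexec). The model, input resolution, and opset version are placeholders.

```python
import timm
import torch

# Stand-in for any trained EndoSight submodule.
model = timm.create_model("efficientnet_b3", pretrained=False, num_classes=8).eval()
dummy = torch.randn(1, 3, 300, 300)

# Mixed-precision inference via autocast (GPU only).
if torch.cuda.is_available():
    with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
        _ = model.cuda()(dummy.cuda())

# ONNX export for downstream TensorRT engine building, e.g.:
#   trtexec --onnx=endosight_classifier.onnx --fp16
torch.onnx.export(
    model.cpu(), dummy, "endosight_classifier.onnx",
    input_names=["image"], output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}, "logits": {0: "batch"}},
    opset_version=17,
)
```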

3. Performance, Evaluation, and Comparative Outcomes

EndoSight AI exhibits state-of-the-art performance in multiple clinical domains:

  • GI Lesion Classification: Macro-average accuracy 94.25%, precision 94.29%, recall 94.24%, specificity 99.18%, with diagonal confusion matrix entries exceeding 90% for all classes (Kamble et al., 2 Mar 2025); a macro-averaging sketch follows this list.
  • Polyp Detection/Segmentation: mAP@0.5 = 88.3%, Dice = 69%, IoU = 57.7%, with real-time inference speeds >35 FPS, supporting multi-polyp tracking and risk stratification (Cavadia, 17 Nov 2025).
  • Instrument Detection: mAP@0.5 = 0.978 (fastest variant), inference time as low as 3.6 ms/frame. Comparative studies show it strongly outperforms YOLOv7, YOLOR, and Scaled-YOLOv4 on custom datasets (Onyeogulu et al., 2022).
  • EUS Station Recognition: Balanced accuracy 89.0%, weighted precision 90.0%, recall 88.9% (DenseNet161 + denoising). Grad-CAM overlays confirm alignment with anatomical practice (Ramesh et al., 2023).
  • ICE Catheter Tip Tracking: Entry angle error 3.32°±2.1°, rotation error 12.76°±8.3°, bounding box IoU=0.66. Real-time prediction (25 Hz) supports closed-loop robotic adjustment (Huh et al., 8 May 2025).
  • EGJA Staging: Accuracy 0.9256 (held-out), 0.8895 (external), 0.8963 (prospective). Endoscopist accuracy with AI assistance improves significantly across trainee, competent, and expert groups. EndoSight achieves higher sensitivity/consistency than all comparators (Ma et al., 22 Sep 2025).
  • Reflective Agent: On EndoSight Bench, visual average 85.36%, language average 98.11%; consistently superior to LLaVA-Med, HuatuoGPT-Vision, Qwen-VL-Plus, Gemini 2.5 Pro, and GPT-4o (Tang et al., 10 Aug 2025).
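
For context on how the macro-averaged figures above are typically computed, the sketch below derives per-class precision, recall (sensitivity), and specificity from a multi-class confusion matrix and averages them over the eight classes; the matrix itself is a random placeholder rather than reported results.

```python
import numpy as np

def macro_metrics(cm: np.ndarray) -> dict:
    """cm[i, j] = number of class-i samples predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "macro_precision": precision.mean(),
        "macro_recall": recall.mean(),
        "macro_specificity": specificity.mean(),
    }

cm = np.random.randint(1, 100, (8, 8))   # placeholder 8-class confusion matrix
print(macro_metrics(cm))
```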

4. Clinical Integration, Interpretability, and Workflow Adaptation

EndoSight AI is engineered for seamless deployment in real-world endoscopy and MIS environments, combining real-time procedural guidance, transparent explainability (LIME saliency maps and Grad-CAM overlays), and adaptive, agent-orchestrated workflows intended to enhance diagnostic accuracy and surgical education.

5. Regulatory, Validation, and Limitations

EndoSight AI development reflects meticulous regulatory and scientific validation:

  • Prospective and Multicenter Evaluations: Datasets sourced across multiple hospitals and clinical scenarios, with external and prospective splits to assess generalizability (Ma et al., 22 Sep 2025).
  • Clinical Trials and User Studies: EndoSight is subject to preclinical animal studies, in-vivo human validation (e.g., ICE/robotics), and surgical education user trials measuring safety, efficacy, and retention (Ma et al., 22 Sep 2025, Huh et al., 8 May 2025).
  • Auditability and QA: Comprehensive logging of memory, reflections, and outputs, designed to meet medicolegal standards and facilitate FDA 510(k)/CE-mark software-as-medical-device submissions (Tang et al., 10 Aug 2025, Huh et al., 8 May 2025).
  • Limitations: Dataset size and diversity, the computational cost of explainability for some methods (e.g., LIME), single-modality data constraints, domain shift across centers and devices, and evolving real-time requirements are cited as areas of ongoing work (Kamble et al., 2 Mar 2025, Tang et al., 10 Aug 2025, Cavadia, 17 Nov 2025).

6. Future Directions

Ongoing and proposed enhancements for EndoSight AI include:

  • Data Expansion/Generalization: Scaling to broader, multi-modality, multi-center datasets; incorporation of underrepresented anatomical classes; and domain-adaptive transfer learning (Cavadia, 17 Nov 2025, Ma et al., 22 Sep 2025).
  • Model Extensions: Multi-task learning for pathological sub-types (e.g., adenoma vs. hyperplastic polyps), integration of video/temporal sequence models (CNN-LSTM), and adoption of hardware-accelerated, quantized inference (Kamble et al., 2 Mar 2025, Cavadia, 17 Nov 2025).
  • Advanced XAI: Evaluation and deployment of Grad-CAM, SHAP, and tailored saliency mapping for real-time, clinician-facing explanations (Kamble et al., 2 Mar 2025, Ramesh et al., 2023).
  • Clinical Workflow Integration: Embedding in EHR, check-listing automation, video analytics for quality improvement, and decision support across pathology, genomics, and interventional robotics (Tang et al., 10 Aug 2025, Ma et al., 22 Sep 2025).
  • Educational Platforms: Mixed-reality visualization and simulation-driven skill acquisition, leveraging agent-generated 3D cues and feedback (Liu et al., 4 Nov 2025).

EndoSight AI thus represents a comprehensive, interpretable, and modular AI platform spanning diagnostic, interventional, and educational frontiers in endoscopy, with validated performance, clinical readiness, and expansion pathways for next-generation intelligent medical systems.
