
MedEyes: Advanced Ocular Diagnostics

Updated 4 December 2025
  • MedEyes are machine learning–enabled digital health systems that combine ocular imaging, eye tracking, and vision-language models for accurate, scalable diagnostic support.
  • They employ robust CNN architectures, reinforcement learning, and multimodal fusion to achieve high diagnostic accuracy in tasks from AMD screening to cognitive impairment detection.
  • MedEyes extend across diverse applications, from integrated OCT segmentation to smartphone-based 3D corneal screening, enhancing telemedicine and accessibility in resource-limited settings.

MedEyes denotes a set of machine learning–enabled digital health systems targeting ocular disease detection, progressive diagnostic reasoning, low-cost eye-tracking, gaze modeling, multimodal triage, and clinical workflow integration. The term encompasses both focused disease classifiers (e.g., for AMD) and broad, multi-disease or multi-modal frameworks, including hardware-software co-designs for resource-limited clinical environments and advanced vision-LLMs for automated decision support.

1. Foundational Disease Detection: Retinal Imaging–Based MedEyes Systems

Initial MedEyes pipelines leveraged standard convolutional neural networks (ResNet-18) to discriminate age-related macular degeneration (AMD) from normal cases using fundus images. The principal dataset was ODIR-2019, comprising ≈5,000 images with a highly imbalanced class distribution (90% normal, 10% AMD). Images were resized to 224×224, mean–std normalized, and split into stratified training/validation/test sets (80/10/10). The model architecture featured a sequence of convolutional and residual blocks culminating in a two-class softmax.

Hyperparameter sensitivity was addressed by an exhaustive learning-rate sweep (α ∈ {1, 0.1, …, 10⁻⁶}). Optimal AMD classification was achieved at α = 10⁻³, yielding F1 = 0.74, ROC AUC = 0.88, precision = 0.83, and recall = 0.66 on held-out test data. Class imbalance was mitigated by per-batch class balancing and an inverse class-frequency weighted loss. The resulting ResNet-18 model required 1.8 GFLOPs (≤300 ms on CPU) and was quantized to 8-bit for low-resource deployment. The design prioritized deployment in under-resourced clinics, supporting integration with slit-lamp cameras, risk-stratified outputs, and regulatory audit trails (Dua et al., 2022).
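The weighted-loss setup can be sketched in a few lines of PyTorch; the class counts, optimizer, and single dummy batch below are illustrative assumptions rather than the paper's exact training configuration.

```python
# Minimal sketch of the AMD-vs-normal classifier described above (PyTorch assumed;
# the paper's exact augmentation, balancing, and quantization settings may differ).
import torch
import torch.nn as nn
from torchvision import models

def build_amd_classifier(class_counts=(4500, 500), lr=1e-3):
    """ResNet-18 with a two-class head and an inverse class-frequency weighted loss."""
    model = models.resnet18(weights=None)          # train from scratch or load pretrained weights
    model.fc = nn.Linear(model.fc.in_features, 2)  # two-class softmax head (normal, AMD)

    # Inverse class-frequency weights mitigate the ~90/10 normal/AMD imbalance.
    counts = torch.tensor(class_counts, dtype=torch.float32)
    weights = counts.sum() / (len(counts) * counts)
    criterion = nn.CrossEntropyLoss(weight=weights)

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # lr = 1e-3 was optimal in the sweep
    return model, criterion, optimizer

model, criterion, optimizer = build_amd_classifier()
dummy = torch.randn(8, 3, 224, 224)                # batch of 224x224 normalized fundus crops
loss = criterion(model(dummy), torch.randint(0, 2, (8,)))
loss.backward()
```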

2. Low-Cost Eye Tracking, Gaze Estimation, and MCI Protocols

MedEyes systems have also been developed for non-invasive eye-tracking and cognitive screening. One implementation uses standard webcam hardware and open-source software for Visual Paired Comparison (VPC) protocols—a validated experimental paradigm for early mild cognitive impairment (MCI) detection. The architecture comprises a Measurement Sub-System (stimulus presentation, video and PPG acquisition) and a Test Management Sub-System (protocol configuration, real-time gaze/HRV monitoring, SQL data storage).

Eye-tracking is achieved by Haar-cascade face/eye detection and a neural gaze regressor (32×32 eye crops, two hidden ReLU layers, sigmoid output for horizontal gaze prediction). Calibration achieved a mean absolute error of 3.0% of screen width with 100% left/right discriminability. VPC-derived metrics, Novelty Preference (NP) and Familiarity Index (FI), are computed from fixation dwell times (NP = T_n / (T_n + T_o)). Cognitive stress confounds are monitored via HRV (RMSSD, LF/HF) extracted from webcam PPG; a stress index SI = w_1·(LF/HF) − w_2·RMSSD flags elevated load.
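These metrics reduce to simple arithmetic on dwell times and HRV readings; in the sketch below the weights w_1 and w_2 and the complementary definition of FI are assumptions, not values taken from the system description.

```python
# Illustrative computation of the VPC and stress metrics defined above (a sketch only).
def novelty_preference(t_novel: float, t_old: float) -> float:
    """NP = T_n / (T_n + T_o): fraction of dwell time spent on the novel stimulus."""
    return t_novel / (t_novel + t_old)

def familiarity_index(t_novel: float, t_old: float) -> float:
    """FI taken here as the complementary dwell fraction on the familiar stimulus (assumed)."""
    return t_old / (t_novel + t_old)

def stress_index(lf_hf: float, rmssd: float, w1: float = 1.0, w2: float = 0.01) -> float:
    """SI = w1*(LF/HF) - w2*RMSSD; higher values flag elevated cognitive load."""
    return w1 * lf_hf - w2 * rmssd

np_score = novelty_preference(t_novel=3.2, t_old=1.8)   # e.g. 0.64
si = stress_index(lf_hf=2.4, rmssd=35.0)                # example HRV readings
```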

The prototype cost (<$1000) undercuts commercial trackers by more than 10×. Anticipated clinical AUC for MCI discrimination is ≥0.85, with full evaluation on ROC curve–based endpoints (Greco et al., 25 Jul 2024).

3. Vision-LLMs and Multi-Task Diagnostic Reasoning

MedEyes is also instantiated by foundation models integrating vision and language, designed for primary eye care triage and multi-disease assessment. Meta-EyeFM—architected as an LLM “router” with eight task-specific Vision Foundation Models (VFMs; ViT-large encoders, ViT-small decoders)—dispatches queries to experts by token-based routing (binary cross-entropy–trained gating head). LoRA fine-tuning enables lightweight adaptation.

Supported tasks include five-class ocular disease detection (DR, AMD, MMD, glaucoma, cataract), disease severity staging, sign identification (e.g., AMD drusen, DR microaneurysms), and systemic risk prediction (diabetes, hypertension). Accuracy is ≥82.2% and AUC ≥91.2% internally, with external validation ≥80% for major diseases. The model outperforms the Gemini-1.5-flash and ChatGPT-4o LMMs (e.g., +25–26.5% accuracy on AMD, +30–32% on glaucoma) and approaches ophthalmologist-level F1 (0.853 vs. human 0.787–0.857). The interface supports natural-language triage by image upload and question, with decision support (Soh et al., 13 May 2025).
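The router-plus-experts layout can be sketched as a gating head that scores each task and dispatches to the matching expert; the dimensions, linear expert stand-ins, and argmax dispatch below are assumptions, not Meta-EyeFM's actual interfaces.

```python
# Conceptual sketch of token-based routing to task-specific experts (names and sizes assumed).
import torch
import torch.nn as nn

class TaskRouter(nn.Module):
    """Gating head mapping a pooled query-token embedding to per-task logits,
    trained with binary cross-entropy against the task label(s)."""
    def __init__(self, hidden_dim: int = 1024, num_tasks: int = 8):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_tasks)

    def forward(self, query_embedding: torch.Tensor) -> torch.Tensor:
        return self.gate(query_embedding)  # logits; apply sigmoid + BCE during training

router = TaskRouter()
experts = nn.ModuleList([nn.Linear(1024, 5) for _ in range(8)])  # stand-ins for the eight VFMs

query = torch.randn(1, 1024)                   # pooled LLM embedding of the user query
task_id = router(query).argmax(dim=-1).item()  # dispatch to the highest-scoring expert
prediction = experts[task_id](query)           # the selected VFM handles the actual task
```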

4. Gaze Estimation, Clinician Visual Behavior, and Interpretability

MedEyes integrates human-in-the-loop interpretability through gaze modeling. The GEM approach formulates gaze estimation for medical diagnostics as a mapping (X_I, X_T) → Y from image and text cue to fixations, trained by minimizing the squared error ||Ŷ − Y||₂². The modular GEM network fuses ResNet-50/CLIP-derived features across scales, applies cross-modal and self-attention, and predicts gaze via a final regressor. Visual behavior is further modeled by constructing soft graph correspondences between expert and predicted fixations, with GCN and AIS modules enforcing high-order structural matching.
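A minimal sketch of that regression objective, assuming precomputed ResNet-50/CLIP features and a toy fusion head in place of GEM's attention and graph-matching modules:

```python
# Toy gaze regressor for the (X_I, X_T) -> Y mapping with an L2 objective (sketch only).
import torch
import torch.nn as nn

class GazeRegressor(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=512, hidden=256, num_fixations=8):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_fixations * 2),   # (x, y) per predicted fixation
        )

    def forward(self, img_feat, txt_feat):
        return self.fuse(torch.cat([img_feat, txt_feat], dim=-1))

model = GazeRegressor()
img_feat, txt_feat = torch.randn(4, 2048), torch.randn(4, 512)  # ResNet-50 / CLIP features
pred = model(img_feat, txt_feat)
target = torch.rand(4, 16)                          # expert fixation coordinates Y
loss = nn.functional.mse_loss(pred, target)         # || Y_hat - Y ||_2^2 objective
```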

Empirically, GEM achieves MSE = 0.0320 and 50.5% on its 0.5-threshold hit metric on the MIMIC-Eye test set, outperforming CLAMP, HGTTR, and TransVG. Ablations show that both context-awareness and graph matching drive the improvements. Predicted search paths are visualizable as heatmaps and trajectory graphs, bolstering fairness, explainability, and clinical learning. Generalizability is confirmed by zero-shot transfer to phrase grounding and multi-modal tasks (Liu et al., 10 Aug 2024).

5. Progressive Visual Reasoning via Dynamic Attention

MedEyes further advances dynamic diagnostic reasoning through reinforcement learning frameworks that model clinical workflows as Markov decision processes (MDPs). At each timestep t, the MedEyes agent's state is (I, q, n_{<t}, G_t): the image, the question, the past narrative, and the current attention region. Actions are grounded in gaze, either selecting image regions or producing answers. The reward integrates accuracy (λ_acc), grammar (λ_grammar), and region-diversity (λ_div) shaping terms.
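The reward shaping can be illustrated as a simple weighted sum; the weight values and component scorers below are assumptions for illustration, not the paper's definitions.

```python
# Illustrative shaped reward for the MDP formulation above (sketch only).
def shaped_reward(answer_correct: bool, grammar_score: float, visited_regions: set,
                  lam_acc: float = 1.0, lam_grammar: float = 0.2, lam_div: float = 0.1) -> float:
    """Combine accuracy, grammar, and region-diversity terms into a scalar reward."""
    r_acc = 1.0 if answer_correct else 0.0
    r_div = len(visited_regions)          # reward for covering distinct image regions
    return lam_acc * r_acc + lam_grammar * grammar_score + lam_div * r_div

r = shaped_reward(answer_correct=True, grammar_score=0.9,
                  visited_regions={"optic_disc", "macula"})
```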

Off-policy expert gaze trajectories supplement on-policy exploration: the Gaze-guided Reasoning Navigator (GRN) alternates between global scanning for abnormalities and local drilling for region analysis. Confidence Value Sampler (CVS) uses nucleus sampling to diversify exploration, with adaptive termination. Dual-Stream GRPO unifies PPO-based on-policy gradients and importance-weighted off-policy (expert) supervision with distinct advantage normalization.
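The nucleus-sampling step used by the CVS can be sketched as below; the top-p value and the plain softmax over candidate actions are illustrative assumptions.

```python
# Sketch of nucleus (top-p) sampling over candidate actions or regions (assumed setup).
import torch

def nucleus_sample(logits: torch.Tensor, top_p: float = 0.9) -> int:
    """Sample from the most likely options whose cumulative probability stays within top_p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    keep = sorted_probs.cumsum(dim=-1) <= top_p
    keep[0] = True                                    # always keep the single most likely option
    kept = sorted_probs * keep
    choice = torch.multinomial(kept / kept.sum(), 1)  # renormalize and sample
    return sorted_idx[choice].item()

action = nucleus_sample(torch.randn(32))              # pick one of 32 candidate actions/regions
```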

Evaluated on five medical VQA benchmarks, this MedEyes formulation obtains 65.9% average accuracy (+8.5% over the previous best), with ablations showing –5.5% to –10.5% loss if expert data, diversity, or reasoning modules are removed. Attention maps show stepwise refinement mimicking clinician search (Zhu et al., 27 Nov 2025).

6. Multi-Modal Pipelines and Smart Screening

MedEyes frameworks are extended to lightweight, mobile-compatible applications incorporating multimodal fusion and end-to-end pipelines. InSight employs a three-stage process: CNN-based real-time fundus image quality checker, ResNet-18 plus MetaFusion–based disease predictor (combining image and patient metadata), and DR severity grading. MetaFusion fuses embeddings by element-wise and projection-based mutual correction. Pretraining leverages 130,000 images with combined supervised (disease labels) and self-supervised (MSE) objectives.
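The mutual-correction idea can be sketched as a pair of gated projections; the dimensions and gating mechanism below are assumptions about MetaFusion rather than its published form.

```python
# Simplified sketch of fusing image and patient-metadata embeddings by mutual correction.
import torch
import torch.nn as nn

class MetaFusion(nn.Module):
    def __init__(self, img_dim=512, meta_dim=32, hidden=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.meta_proj = nn.Linear(meta_dim, hidden)   # project metadata into the shared space

    def forward(self, img_emb, meta_emb):
        img_h, meta_h = self.img_proj(img_emb), self.meta_proj(meta_emb)
        # Element-wise gating lets each modality correct the other before fusion.
        img_corr = img_h * torch.sigmoid(meta_h)
        meta_corr = meta_h * torch.sigmoid(img_h)
        return img_corr + meta_corr

fusion = MetaFusion()
fused = fusion(torch.randn(2, 512), torch.randn(2, 32))  # image + metadata embeddings
```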

On Brazilian BRSET/mBRSET datasets across five diseases, the full MedEyes pipeline achieves 6–17% balanced accuracy gains over image-only models (e.g., 96% for DR, 92% for AMD), filters poor-quality images with 99.5% accuracy, and is five times more efficient (single multi-task backbone). Inference is ~200 ms on mobile CPUs, supporting field deployment (Raghu et al., 16 Jul 2025).

7. Expansion to Cross-Modal, Hardware-Enabled, and Anatomical-Structure Screening

MedEyes integrates into broader ocular telemedicine through web-based, end-to-end ensemble prediction systems for OCT scans and smartphone-based 3D corneal topography. OCT MedEyes uses U-Net for retinal segmentation and an ensemble (InceptionV3, Xception, attention) for 4-way disease classification (CNV, DME, Drusen, Normal), yielding 96.7% accuracy (F1=0.967), rapid inference (~210 ms), and direct patient/clinician access through encrypted web interfaces (Naik et al., 2023).
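The segment-then-ensemble flow can be sketched as follows, with placeholder modules standing in for the U-Net and the InceptionV3/Xception/attention classifiers.

```python
# Sketch of the two-stage OCT pipeline: segment the retina, then average ensemble predictions.
import torch
import torch.nn as nn

def ensemble_predict(oct_image: torch.Tensor, segmenter: nn.Module,
                     classifiers: list) -> torch.Tensor:
    """Apply retinal segmentation, then average softmax outputs over the ensemble."""
    mask = segmenter(oct_image)                       # U-Net-style retinal layer mask
    roi = oct_image * mask                            # focus classifiers on the segmented retina
    probs = [clf(roi).softmax(dim=-1) for clf in classifiers]
    return torch.stack(probs).mean(dim=0)             # 4-way: CNV, DME, Drusen, Normal

segmenter = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.Sigmoid())          # stand-in U-Net
classifiers = [nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, 4)) for _ in range(3)]
probs = ensemble_predict(torch.randn(1, 1, 224, 224), segmenter, classifiers)
```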

SmartKC merges 3D-printed optics, an LED ring, and AR-guided smartphone capture to reconstruct the anterior corneal surface by Arc-Step cubic spline fitting and Zernike polynomial expansion. Validated in 101 eyes, SmartKC achieves 94.1% sensitivity and 100% specificity for keratoconus, with sim-K curvature estimates correlating at r = 0.78 (p < 0.01) with a $10,000 reference topographer, demonstrating that MedEyes modules can democratize corneal screening in low-resource settings (Gairola et al., 2021).
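Zernike-based surface fitting reduces to a linear least-squares problem once a basis is chosen; the toy sketch below fits a few low-order terms to synthetic elevation data and omits SmartKC's Arc-Step reconstruction entirely.

```python
# Toy least-squares fit of corneal elevation samples to low-order Zernike terms (sketch only;
# normalization constants and SmartKC's actual term set are omitted).
import numpy as np

def zernike_basis(rho: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """A few low-order Zernike polynomials: piston, tilt, defocus, astigmatism."""
    return np.stack([
        np.ones_like(rho),                 # Z(0,0)  piston
        rho * np.cos(theta),               # Z(1,1)  tilt x
        rho * np.sin(theta),               # Z(1,-1) tilt y
        2 * rho**2 - 1,                    # Z(2,0)  defocus
        rho**2 * np.cos(2 * theta),        # Z(2,2)  astigmatism
    ], axis=-1)

rng = np.random.default_rng(0)
rho, theta = rng.uniform(0, 1, 500), rng.uniform(0, 2 * np.pi, 500)
elevation = 0.8 * (2 * rho**2 - 1) + 0.05 * rng.normal(size=500)    # synthetic corneal surface
coeffs, *_ = np.linalg.lstsq(zernike_basis(rho, theta), elevation, rcond=None)
```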


MedEyes thus denotes a spectrum of state-of-the-art, modular, and extensible systems for vision-based ophthalmic and neurocognitive diagnostics, integrating algorithmic advances in deep learning, reinforcement learning, gaze modeling, multimodal fusion, and cost-effective hardware across a range of clinical tasks, populations, and environments.
