DeepEyeNet Dataset

Updated 29 December 2025
  • DeepEyeNet is a multi-modal retinal image dataset of 15,710 images paired with expert free-form captions and diagnostic keywords.
  • It offers diverse imaging modalities, including fundus photography and fluorescein angiography, to support robust evaluation of captioning models using metrics like BLEU, ROUGE-L, and CIDEr.
  • The dataset fosters multi-task learning in clinical ophthalmology by enabling research on feature fusion, zero-shot keyword expansion, and multi-label classification.

DeepEyeNet is a large-scale, multi-modal retinal image dataset constructed for the investigation and benchmarking of vision–language models in clinical ophthalmology, with a primary focus on the automatic generation of medical descriptions (captioning) from retinal images. Distinguished by its size, clinical diversity, and dual annotation schema (expert-written captions and structured diagnostic keywords), DeepEyeNet has emerged as the primary dataset for comparative evaluation in retinal image captioning research (Cherukuri et al., 2024, Shaik et al., 2024).

1. Dataset Composition and Coverage

DeepEyeNet comprises a total of 15,710 retinal images, each associated with one expert-written clinical description and an accompanying set of diagnostic keywords (typically 5–10 keywords, with some images containing up to 15). The overall controlled vocabulary consists of 609 unique diagnostic keywords, mapped onto 265 distinct retinal diseases encompassing a wide clinical spectrum, from prevalent disorders such as diabetic retinopathy and age-related macular degeneration to rare entities like Goldmann–Favre syndrome and parafoveal telangiectasia.

The imaging modalities represented and their relative frequencies are given below:

| Modality | Image Count | Fraction |
| --- | --- | --- |
| Color Fundus Photography (Fundus) | 13,898 | ≈ 88.55% |
| Fluorescein Angiography (FA) | 1,811 | ≈ 11.53% |
| Optical Coherence Tomography (OCT) | Not specified | Not specified |
| Multi-modality grids (combinations) | Present | Not specified |

All images are preprocessed to 356×356 pixels with three color channels (RGB). Original device, resolution, and bit-depth information is not documented (Cherukuri et al., 2024, Shaik et al., 2024).

2. Annotation Protocol

Annotations are performed by expert ophthalmologists. For each image, a free-form clinical description (caption) averaging 5–10 words (with an upper bound of 50) is provided alongside a diagnostic keyword set (average 5–10 terms; up to 15). There is exactly one caption per image; no multi-annotator consensus or adjudication process is reported, nor are formal annotation guidelines or inter-annotator agreement statistics presented.

Keywords and captions are unconstrained in form—captions do not follow rigid templates or contain explicit "findings/impression" subsections. Keyword lists use a special “[SEP]” token as delimiter during preprocessing but otherwise serve as a flat set rather than a hierarchical taxonomy. No segmentation masks or region-level annotations are included, and clinical context (such as age or gender) is not explicitly structured beyond possible mentions in the free text (Cherukuri et al., 2024, Shaik et al., 2024).
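
As a concrete illustration of the flat keyword serialization described above, a minimal round-trip helper might look like this (the "[SEP]" delimiter is from the source protocol; the function names are illustrative and not part of any released dataset tooling):

```python
def serialize_keywords(keywords):
    """Join a flat keyword set into one string using the "[SEP]"
    delimiter described in the preprocessing protocol."""
    return " [SEP] ".join(keywords)

def parse_keywords(serialized):
    """Recover the flat keyword list from the delimited string."""
    return [k.strip() for k in serialized.split("[SEP]")]

kw = ["diabetic retinopathy", "macular edema", "hard exudates"]
s = serialize_keywords(kw)
assert parse_keywords(s) == kw  # lossless round trip
```

Because the keywords form a flat set rather than a hierarchy, a single delimiter token is sufficient; no nesting or ordering semantics need to be preserved.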

3. Data Partitioning and Preprocessing

The dataset is split into training, validation, and test cohorts according to a 60/20/20 ratio, yielding approximately 9,426 training, 3,142 validation, and 3,142 test images. Splitting is presumed random; no details on stratification protocol are provided.
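
A minimal sketch of the presumed random 60/20/20 partition follows; the seed and the helper name are assumptions, since the source studies do not document the splitting code or any stratification:

```python
import random

def split_dataset(ids, seed=0):
    """Plain shuffled 60/20/20 train/val/test split (stratification
    is not documented for DeepEyeNet, so none is applied here)."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

train, val, test = split_dataset(range(15710))
print(len(train), len(val), len(test))  # 9426 3142 3142
```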

Preprocessing Pipeline

  • Image Preprocessing: Uniform resizing to 356×356 pixels, center-cropping or padding for square aspect ratio, and conversion to 3-channel RGB. No additional normalization, color jitter, or augmentation is described.
  • Text Preprocessing: Cleaning (removal of non-alphabetic characters, lowercasing), restriction of caption/keyword length (≤50 words for captions, 5–50 for keywords), rare word replacement (<UNK>), and vocabulary capping at 5,000 tokens. Keyword lists use “[SEP]” tokens. Word embeddings employed in model training have 1,024 dimensions.
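
The text-preprocessing steps above can be sketched as follows. This is a minimal reconstruction under stated assumptions: tokenization is plain word-level (consistent with the presumption below), and the helper names are illustrative:

```python
import re
from collections import Counter

def clean_text(s):
    """Lowercase and strip non-alphabetic characters, then split
    on whitespace (word-level tokenization assumed)."""
    return re.sub(r"[^a-z\s]", "", s.lower()).split()

def build_vocab(captions, max_size=5000):
    """Cap the vocabulary at max_size tokens by frequency; all
    other words are later replaced by <UNK> (index 0)."""
    counts = Counter(t for c in captions for t in clean_text(c))
    keep = [w for w, _ in counts.most_common(max_size)]
    return {"<UNK>": 0, **{w: i + 1 for i, w in enumerate(keep)}}

def encode(caption, vocab, max_len=50):
    """Encode a caption, truncating at 50 words and mapping
    out-of-vocabulary words to <UNK>."""
    toks = clean_text(caption)[:max_len]
    return [vocab.get(t, vocab["<UNK>"]) for t in toks]

caps = ["Mild diabetic retinopathy, both eyes.", "Normal fundus."]
vocab = build_vocab(caps)
print(encode("Severe diabetic retinopathy.", vocab))  # "severe" maps to 0
```

The 1,024-dimensional word embeddings mentioned above would then be looked up by these integer indices during model training.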

Tokenization algorithms are not specified but presumed to be standard word-level. Details such as stop-word removal or lemmatization are unreported (Cherukuri et al., 2024, Shaik et al., 2024).

4. Clinical and Demographic Properties

The 265 disease classes span a broad range of retinal pathology types, systematically mapped from both high-incidence and rare disorders. Specific prevalence statistics per class or distributional summaries are not publicly detailed. Patient demographics (age, gender, ethnicity) are not tabulated, though unstructured captions may refer to such attributes incidentally.

Compared to prior datasets such as those used in DeepOpht, DeepEyeNet's coverage—both in terms of disease spectrum and imaging modalities—is more extensive. Unlike corpora focused exclusively on a single disease (e.g., diabetic retinopathy), DeepEyeNet enables modeling across diverse clinical presentations, thus facilitating multi-task and multi-label learning scenarios (Cherukuri et al., 2024, Shaik et al., 2024).

5. Evaluation Metrics and Experimental Usage

Standard evaluation metrics for image captioning are used in benchmarking on DeepEyeNet:

  • BLEU@N (N=1,2,3,4): n-gram precision with brevity penalty,

$$\text{BLEU@}N = \exp\left(\sum_{k=1}^{N} w_k \log p_k\right) \times BP$$

with $BP = 1$ if $c > r$ and $BP = \exp(1 - r/c)$ if $c \leq r$, where $c$ and $r$ are the candidate and reference lengths.

  • ROUGE-L: Measures longest common subsequence between candidate and reference:

$$\text{ROUGE-L} = \frac{(1 + \beta^2)\, P_{LCS}\, R_{LCS}}{R_{LCS} + \beta^2 P_{LCS}}$$

where $\beta = 1$.

  • CIDEr: Consensus-based tf–idf weighted n-gram similarity:

$$\text{CIDEr} = \frac{1}{M} \sum_{i=1}^{M} \text{CIDEr}_i$$

  • METEOR: Not reported as an official metric in the primary studies, though commonly included in standard captioning practice.

No modifications to standard metric weighting or custom metrics are reported in the source studies (Cherukuri et al., 2024, Shaik et al., 2024).
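
For illustration, the BLEU and ROUGE-L definitions above can be computed directly. The following is a toy single-reference sketch with uniform weights $w_k = 1/N$, not the evaluation code used in the source studies (which would rely on standard toolkits):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """BLEU@N over token lists: clipped n-gram precision with
    uniform weights and the brevity penalty BP defined above."""
    c, r = len(candidate), len(reference)
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(c - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(r - n + 1))
        clipped = sum(min(v, ref[g]) for g, v in cand.items())
        if clipped == 0:
            return 0.0  # no smoothing in this toy version
        log_p += math.log(clipped / max(c - n + 1, 1)) / max_n
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return math.exp(log_p) * bp

def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score from the longest common subsequence."""
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if candidate[i] == reference[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    p, r = dp[m][n] / m, dp[m][n] / n
    return (1 + beta**2) * p * r / (r + beta**2 * p) if p + r else 0.0

ref = "mild diabetic retinopathy both eyes".split()
print(bleu(ref, ref), rouge_l(ref, ref))  # identical strings -> 1.0 1.0
```

Note that an exact candidate scores 1.0 on both metrics, and that this unsmoothed BLEU collapses to 0 whenever any n-gram order has no match, which is why smoothed variants are common for short clinical captions.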

In comparative studies, DeepEyeNet is used as the primary evaluation corpus for novel models such as the Multi-modal Medical Transformer (M3T) and the Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer (GCS-M3VLT). For instance, M3T achieved a BLEU@4 of 0.208, a 13.5% absolute improvement over the best evaluated baseline (Contextualized Keywords, BLEU@4 = 0.073) (Shaik et al., 2024). GCS-M3VLT demonstrated a further 0.023 BLEU@4 improvement, underscoring the incremental benefit of feature fusion strategies (Cherukuri et al., 2024).

6. Comparative Scope and Limitations

Relative to previously used datasets, DeepEyeNet offers several distinctive advantages:

  • Scale: With over 15,700 images, DeepEyeNet is among the largest publicly reported end-to-end retinal captioning datasets. Previous datasets were typically smaller and constrained to a single imaging technique or disease.
  • Semantic Breadth: The inclusion of both free-form clinical descriptions and controlled keyword sets facilitates the development and evaluation of multi-modal and multitask models (e.g., image-to-keyword and image-to-caption objectives).
  • Multi-modality: The dataset's explicit support for both fundus and angiography images (with OCT present but not fully quantified) enables cross-modal reasoning not feasible with single-modality datasets.

Key limitations include:

  • Incomplete keyword coverage: Some images lack associated keywords. The primary studies suggest ameliorating this through zero-shot learning or by extending the keyword taxonomy.
  • Lack of pixel-level or region-level annotation: No segmentation, bounding box, or attribute localization is provided.
  • Metadata gaps: Original device types, resolutions, patient demographics, and clinical sites are undocumented. No indication is given of multi-center or multi-scanner acquisition, nor are advanced preprocessing or normalization schemes discussed.
  • A plausible implication is that standardizing these features could further broaden DeepEyeNet's impact for tasks requiring higher granularity or domain adaptation.

7. Prospects and Research Directions

Ongoing research focuses on several axes suggested by the primary dataset authors:

  • Keyword Expansion and Zero-Shot Learning: Addressing missing keywords and unseen disease classes via external lexical expansion or zero-shot transfer paradigms.
  • Multi-modal Embedding Alignment: More tightly coupling image and text representations in a shared latent space is advocated as a route to improved captioning and retrieval performance.
  • Dataset Extension: While no explicit plans to increase scale, add new devices, or include pixel-level annotation are documented, such directions align with broader trends in medical vision-language learning.

DeepEyeNet has cemented its role as a benchmark for innovation at the intersection of computer vision, natural language processing, and clinical ophthalmology, providing a rigorous testbed for both incremental and conceptual advances in automated medical report generation (Cherukuri et al., 2024, Shaik et al., 2024).
