Retinal Image Captioning Advances

Updated 1 May 2026

Retinal image captioning is the automated generation of detailed, clinically relevant diagnostic reports from retinal images using multi-modal vision-language architectures.
Advanced methods integrate guided self-attention and cross-modal fusion to accurately align high-dimensional visual features with expert diagnostic keywords.
Evaluations on datasets like DeepEyeNet show significant gains in BLEU and CIDEr metrics, supporting improved screening and telemedicine applications.

Retinal image captioning is the automatic generation of descriptive, clinically relevant textual reports conditioned on retinal images: fundus photography, fluorescein angiography (FA), or optical coherence tomography (OCT). This task extends traditional medical image analysis by aiming to produce structured, multi-sentence diagnostic narratives rather than single-label predictions. Retinal image captioning models often adopt multi-modal architectures that fuse high-dimensional image features with external linguistic context (notably, expert-defined diagnostic keywords) to address the variability and subtlety of ophthalmic pathology. Approaches in this domain combine advances from vision-language modeling, self-attention mechanisms, and clinical language modeling, and are evaluated predominantly on datasets such as DeepEyeNet using metrics that balance n-gram precision and clinical semantic fidelity.

1. Multi-Modal Architectures

A defining feature across state-of-the-art retinal image captioning systems is the explicit integration of visual and textual signals within an end-to-end, differentiable architecture. Three influential architectures exemplify the evolution of this paradigm:

Guided Context Self-Attention Vision-Language Transformers (GCS-M3VLT): This model consists of a vision encoder equipped with a Guided Context Attention (GCA) block, a Transformer-style keyword language encoder, a vision-language TransFusion encoder for cross-modal fusion, and a GPT-2-inspired autoregressive decoder. GCA combines spatial context pooling and channel context fusion, followed by an attention gating mechanism to highlight lesion regions, even under image quality variation. The TransFusion encoder leverages cross-attention to inject clinical guidance from diagnostic keywords into the fused multi-modal representation, enhancing clinical specificity in the generated captions (Cherukuri et al., 2024).
Contextualized Keyword Representations Approaches: These models encode expert-specified keywords using masked self-attention and decoder-like transformer stacks to construct contextualized keyword representations. Visual features (from CNNs such as VGG16/VGG19) and keyword vectors are concatenated and supplied to a Bi-LSTM caption decoder. This enables the modeling of both local image content and the global, expert-provided context in the resulting caption (Huang et al., 2021).
Deep Context-Encoding Networks: This architecture fuses input image features and diagnostic keywords sequentially via an LSTM-based encoder, generating an interactive latent representation. The decoder is a Bi-LSTM that consumes the fused vector, the current word embedding, and the image features at each time step, generating the output report with improved incorporation of clinically significant terms (Huang et al., 2021).

The architectural progression—from context fusion via LSTM stacking, to explicit self-attention/cross-attention—parallels broader trends in multi-modal vision-language research.

2. Visual and Textual Feature Encoding

The image encoder typically uses a CNN backbone (EfficientNetV2, VGG16/VGG19, or ResNet/GoogLeNet pretrained on ImageNet), transforming input scans ($356\times356\times3$) into dense feature maps (e.g., $12\times12\times1280$ in GCS-M3VLT). Within GCS-M3VLT, the Guided Context Attention mechanism operates in three stages:

Spatial Context Pooling: Computes spatial descriptors by weighting spatial locations via a point-wise convolution, emphasizing global spatial patterns.
Channel Context Fusion: Applies gated fusion of channel features, enabling the system to amplify salient lesion-specific responses.
Attention Gating: Employs learned gating to combine query, key, and value representations, outputting spatially and channel-wise recalibrated features.
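
The three stages above can be read as the following minimal pure-Python sketch. The function names, the softmax over spatial positions, the sigmoid channel gate, and the weight vectors are illustrative assumptions; the published GCA block, which also uses query, key, and value projections, may differ in its details.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_context_pooling(features, w_spatial):
    # features: N spatial positions x C channels (a flattened feature map).
    # w_spatial: C weights standing in for a point-wise (1x1) convolution.
    scores = [sum(w * x for w, x in zip(w_spatial, pos)) for pos in features]
    weights = softmax(scores)  # one weight per spatial position
    channels = len(features[0])
    context = [sum(a * pos[c] for a, pos in zip(weights, features))
               for c in range(channels)]
    return weights, context

def channel_context_fusion(context, w_channel):
    # Assumed form of the gate: a per-channel sigmoid over the pooled context.
    return [1.0 / (1.0 + math.exp(-w * c)) for w, c in zip(w_channel, context)]

def attention_gating(features, channel_gate):
    # Simplification: rescale every position by the channel gate only.
    return [[x * g for x, g in zip(pos, channel_gate)] for pos in features]

# Toy input: 4 spatial positions, 3 channels (hypothetical values).
features = [[0.1, 0.5, 0.2], [0.9, 0.1, 0.4], [0.3, 0.3, 0.3], [0.0, 0.8, 0.6]]
weights, context = spatial_context_pooling(features, [1.0, -0.5, 0.25])
gate = channel_context_fusion(context, [2.0, 2.0, 2.0])
recalibrated = attention_gating(features, gate)
```

The output has the same shape as the input, so a block of this kind could sit between the CNN backbone and the fusion encoder without changing downstream dimensions.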

The language encoder processes diagnostic keywords, mapping them into high-dimensional embeddings (e.g., 1024-dim) followed by multi-head self-attention transformers or LSTM stacks, promoting contextualization and disambiguation of clinical terminology. Masked self-attention ensures autoregressive ordering in keyword processing (Huang et al., 2021).
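
As a rough illustration of how masking enforces ordering over keywords, the sketch below applies single-head causal self-attention to a short list of keyword embeddings. It assumes identity query, key, and value projections and toy 2-dimensional embeddings; a real encoder would use learned projections, multiple heads, and much larger dimensions.

```python
import math

def masked_self_attention(embeddings):
    # embeddings: T keyword vectors of dimension D.
    # Position i may only attend to positions 0..i (causal mask).
    dim = len(embeddings[0])
    outputs, attention = [], []
    for i, query in enumerate(embeddings):
        visible = embeddings[: i + 1]
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in visible]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Masked (future) positions get exactly zero weight.
        attention.append(weights + [0.0] * (len(embeddings) - i - 1))
        outputs.append([sum(w * v[c] for w, v in zip(weights, visible))
                        for c in range(dim)])
    return outputs, attention

# Hypothetical embeddings for three diagnostic keywords.
keywords = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextualized, attention = masked_self_attention(keywords)
```

The first keyword can only see itself, so its output equals its input; later keywords mix in the ones before them.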

3. Multi-Modal Fusion and Caption Generation

Multi-modal fusion strategies vary:

Concatenative Fusion: In baseline systems, image feature vectors and keyword representations are concatenated as the input to a sequence decoder.
Cross-Attention: More advanced methods (e.g., GCS-M3VLT) employ cross-attention, where visual queries attend to keyword keys/values, aligning visual features with clinically relevant context and enabling a richer entanglement of modalities (Cherukuri et al., 2024).
Interactive Encoding: Sequential context fusion with LSTMs combines image and textual signals in order-preserving fashion, capturing higher-order interactions (Huang et al., 2021).
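
The difference between the first two strategies can be sketched in a few lines of pure Python. This is an illustrative toy, not the TransFusion encoder itself: the projections are taken to be the identity, there is a single head, and the vectors are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def concat_fusion(image_vec, keyword_vec):
    # Concatenative fusion: the decoder receives one long vector.
    return list(image_vec) + list(keyword_vec)

def cross_attention(visual_queries, keyword_keys, keyword_values):
    # Each visual query attends over all keyword keys and values.
    scale = math.sqrt(len(keyword_keys[0]))
    fused = []
    for query in visual_queries:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in keyword_keys]
        weights = softmax(scores)
        fused.append([sum(w * v[c] for w, v in zip(weights, keyword_values))
                      for c in range(len(keyword_values[0]))])
    return fused

visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three visual tokens
keywords = [[2.0, 0.0], [0.0, 2.0]]            # two keyword embeddings
fused = cross_attention(visual, keywords, keywords)
```

Concatenation leaves the decoder to discover which keyword matters for which image region, whereas cross-attention produces one keyword-weighted vector per visual token.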

Caption generation is typically autoregressive, using decoders modeled after GPT-2 (transformer blocks) or Bi-LSTMs, with each prediction conditioned on previous tokens and fused multi-modal context.
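
A minimal greedy decoding loop illustrates the autoregressive conditioning. Everything here is a stand-in: `toy_scorer` is a hypothetical replacement for a trained decoder, the special tokens are assumed names, and the default `max_len=50` simply mirrors the 50-word caption limit reported for DeepEyeNet.

```python
def greedy_decode(next_token_scores, context, bos="<bos>", eos="<eos>", max_len=50):
    # next_token_scores(prefix, context) returns a dict mapping token -> score.
    tokens = [bos]
    while len(tokens) - 1 < max_len:
        scores = next_token_scores(tokens, context)
        best = max(scores, key=scores.get)  # pick the highest-scoring token
        if best == eos:
            break
        tokens.append(best)
    return tokens[1:]  # drop the start token

def toy_scorer(prefix, context):
    # Hypothetical decoder: emits the context words in order, then <eos>.
    step = len(prefix) - 1
    target = context[step] if step < len(context) else "<eos>"
    vocabulary = set(context) | {"<eos>"}
    return {tok: (1.0 if tok == target else 0.0) for tok in vocabulary}

caption = greedy_decode(toy_scorer, ["macular", "edema"])
```

Beam search, which the table of results mentions for one system (beam=3), would keep several candidate prefixes per step instead of only the best one.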

4. Training Regimes and Evaluation Metrics

Supervised model training relies on the DeepEyeNet dataset, which comprises 15,710 images spanning multiple modalities, 265 disease classes, 609 unique diagnostic keywords, and expert-generated captions (maximum 50 words, vocabulary size 5,000). Dataset splits are 60% training, 20% validation, 20% test.
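
For orientation, the nominal split sizes implied by these percentages can be computed directly. This is plain arithmetic on the figures above; the published split may differ by a few images, and the helper name is hypothetical.

```python
def split_sizes(total, percents):
    # Integer arithmetic avoids floating-point rounding surprises.
    sizes = [total * p // 100 for p in percents]
    sizes[-1] += total - sum(sizes)  # assign any remainder to the last split
    return tuple(sizes)

train, val, test = split_sizes(15710, [60, 20, 20])
print(train, val, test)  # 9426 3142 3142
```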

Data preprocessing consists of resizing images to $356\times356$ and truncating or replacing rare keyword tokens; no auxiliary pretraining is used beyond the base CNN weights. Key training hyperparameters are embedding dimensions (typically 1024), 6-layer transformer/decoder blocks, 8 attention heads, batch size 64, the Adam optimizer with $\beta_1=0.9$ and $\beta_2=0.999$, learning rate $1\times10^{-4}$, and 100 total epochs (for GCS-M3VLT).
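
The hyperparameters listed above can be collected into a small configuration object, shown here with a scalar Adam update to make the roles of $\beta_1$, $\beta_2$, and the learning rate concrete. The class and function names are hypothetical, `eps` is an assumed value that the text does not state, and the per-head dimension assumes the 1024-dimensional embedding is split evenly across the 8 heads.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    embed_dim: int = 1024
    num_layers: int = 6
    num_heads: int = 8
    batch_size: int = 64
    lr: float = 1e-4
    beta1: float = 0.9
    beta2: float = 0.999
    eps: float = 1e-8  # assumption: not given in the text
    epochs: int = 100

    @property
    def head_dim(self):
        return self.embed_dim // self.num_heads

def adam_step(param, grad, m, v, t, cfg):
    # Standard Adam update for a single scalar parameter at step t (t >= 1).
    m = cfg.beta1 * m + (1 - cfg.beta1) * grad
    v = cfg.beta2 * v + (1 - cfg.beta2) * grad * grad
    m_hat = m / (1 - cfg.beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - cfg.beta2 ** t)  # bias-corrected second moment
    param = param - cfg.lr * m_hat / (math.sqrt(v_hat) + cfg.eps)
    return param, m, v

cfg = TrainConfig()
param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1, cfg=cfg)
```

On the first step the bias correction makes the update size approximately equal to the learning rate, regardless of the gradient's magnitude.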

The standard training objective is cross-entropy loss over ground-truth tokens. No reinforcement learning or CIDEr-optimized reward is applied in these models.
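
A token-level cross-entropy of this kind can be written out as follows. This is a generic sketch of the loss, with made-up logits and a 4-word vocabulary, not code taken from any of the cited systems.

```python
import math

def token_cross_entropy(logits_per_step, target_ids):
    # Mean negative log-likelihood of the ground-truth token at each step.
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[target]
    return total / len(target_ids)

uniform_loss = token_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
confident_loss = token_cross_entropy([[0.0, 0.0, 20.0, 0.0]], [2])
```

With uniform logits the loss equals the log of the vocabulary size; it approaches zero as the model concentrates probability on the correct token.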

Primary evaluation metrics include BLEU-1 through BLEU-4, CIDEr, and ROUGE-L, which gauge n-gram overlap and consensus with reference reports. GCS-M3VLT achieves a BLEU-4 of 0.231 and a CIDEr of 0.559, improving on the previous M3 Transformer baseline by 0.023 BLEU-4 (Cherukuri et al., 2024). Earlier multi-modal and context-driven systems report BLEU-avg increases of up to 53.2% and CIDEr gains of 18.6% relative to image-only baselines (Huang et al., 2021). Deep context-encoding networks demonstrate improvements of 16% in BLEU-avg and 10.2% in CIDEr (Huang et al., 2021).
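
To make the n-gram overlap idea concrete, here is a simplified single-reference, sentence-level BLEU with clipped n-gram precision and a brevity penalty. It is a teaching sketch without smoothing; reported scores in the cited papers are presumably computed with standard evaluation toolkits at corpus level, so values from this function are not comparable with them. The example sentences are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precision += math.log(overlap / total) / max_n
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(log_precision)

reference = "focal leakage consistent with neovascular amd".split()
truncated = "focal leakage consistent".split()
```

A caption identical to the reference scores 1.0, while a truncated but otherwise correct caption is penalized only through the brevity term.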

Model	BLEU-4	CIDEr	ROUGE-L
GCS-M3VLT (Cherukuri et al., 2024)	0.231	0.559	0.497
M3 Transformer (Cherukuri et al., 2024)	0.208	0.537	0.429
Expert Transformer (Cherukuri et al., 2024)	0.186	0.472	0.413
Context. Keywords (VGG16, beam=3) (Huang et al., 2021)	0.118*	0.351*	—
Deep Context-Encoding (Huang et al., 2021)	0.116*	0.398*	0.252

*BLEU-avg reported for these models.
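
The gaps between the top two rows can be checked with a few lines of arithmetic. The numbers are copied from the table; the helper name is hypothetical, and the relative percentages are derived here rather than quoted from the papers.

```python
scores = {
    "GCS-M3VLT": {"BLEU-4": 0.231, "CIDEr": 0.559, "ROUGE-L": 0.497},
    "M3 Transformer": {"BLEU-4": 0.208, "CIDEr": 0.537, "ROUGE-L": 0.429},
}

def gains(model, baseline):
    # Returns (absolute gain, relative gain in percent) per metric.
    return {
        metric: (
            round(model[metric] - baseline[metric], 3),
            round(100 * (model[metric] - baseline[metric]) / baseline[metric], 1),
        )
        for metric in model
    }

delta = gains(scores["GCS-M3VLT"], scores["M3 Transformer"])
print(delta["BLEU-4"])  # (0.023, 11.1)
```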

5. Qualitative Findings and Clinical Relevance

Qualitative evaluation demonstrates that multi-modal fusion significantly enhances the salience and specificity of automatically generated captions. Examples include:

Model-generated captions accurately incorporating nuanced clinical findings such as “subretinal neovascularization and focal leakage consistent with neovascular AMD,” closely mirroring ground-truth reports and outperforming baselines, which tend to omit crucial details or confuse imaging modalities (Cherukuri et al., 2024).
Context-driven and keyword-fusion systems display improved anchoring of specialized diagnostic terms (e.g., “x-linked retinoschisis,” “diffuse neuroretinitis”) and mitigate common errors observed in image-only captioners, such as omitting clinically salient findings or introducing extraneous diagnoses (Huang et al., 2021).

These capabilities are significant for automated preliminary screening, triage, and the reduction of ophthalmologist workload, provided generated captions are consistently accurate.

6. Limitations and Prospective Directions

Data and Domain Limitations: Performance may degrade on extremely low-contrast images or out-of-distribution modalities; trained models’ transferability to alternative datasets or imaging devices remains unproven (Cherukuri et al., 2024).
Fusion Mechanisms: Early fusion approaches lack explicit cross-modal attention, potentially limiting fine-grained alignment between pixel-level signals and clinical descriptors (Huang et al., 2021).
Clinical Report Scope: Current models cannot generate patient demographic attributes that are absent from the images. Extension to structured fields or metadata conditioning is nontrivial (Huang et al., 2021).
Future Work: Proposed directions include aligning multi-modal embeddings via contrastive or adversarial training to improve robustness, and incorporating zero-shot learning to handle unseen diagnostic terminology and rare pathologies without additional labels (Cherukuri et al., 2024). More advanced fusion (bilinear pooling, co-attention) and stronger text modeling (pretraining on ophthalmic corpora) are also suggested (Huang et al., 2021).

7. Summary and Impact

Retinal image captioning uses multi-modal neural architectures to synthesize structured ophthalmic reports from heterogeneous imaging inputs with supplementary clinical context. The progression from LSTM-based context fusion to transformer-level cross-modal attention reflects increasing sophistication in aligning image and language domains. Recent systems, notably GCS-M3VLT, achieve state-of-the-art performance on DeepEyeNet in both quantitative metrics and qualitative clinical fidelity. These advances support future translation to screening and telemedicine workflows, with the caveat that rigorous validation across imaging domains and explicit handling of unseen terms remain essential for deployment (Cherukuri et al., 2024; Huang et al., 2021; Huang et al., 2021).