Retinal Image Quality Assessment (RIQA)
- Retinal image quality assessment is the algorithmic evaluation of retinal images to ensure they meet clinical and diagnostic standards, covering both color fundus photography and OCTA.
- Modern approaches employ deep neural architectures, multi-color-space fusion, and transformer-based models to achieve high accuracy on datasets such as EyeQ and FQS.
- RIQA plays a critical role in clinical pipelines by filtering suboptimal images, enhancing diagnostic accuracy in tasks such as diabetic retinopathy grading and lesion segmentation.
Retinal image quality assessment (RIQA) is the algorithmic evaluation of the suitability of retinal images for clinical diagnosis or automated analysis, encompassing both color fundus photography and retinal imaging modalities such as optical coherence tomography angiography (OCTA). Accurate RIQA is critical, as suboptimal image quality may obscure anatomical landmarks and pathological features, degrade the reliability of downstream tasks (e.g., diabetic retinopathy grading, lesion segmentation), and impair clinical workflows. Modern RIQA leverages large-scale datasets, deep neural architectures, classical image priors, multi-attribute frameworks, and semi-supervised paradigms to provide robust, interpretable, and clinically actionable assessments.
1. Datasets, Quality Annotation Protocols, and Schemes
Datasets for RIQA are constructed to reflect the diversity of acquisition devices, pathologies, and operator skill, with expert-annotated quality labels at varying granularity. Large, multi-source benchmarks such as EyeQ (28,792 images, triple-grade Good/Usable/Reject), FQS (2,246 images, continuous mean opinion score (MOS) 0–100 and three-level label), and OCTA-25K-IQA-SEG (25,665 images, three classes for OCTA) have established new standards for scale and consensus annotation (Fu et al., 2019, Gong et al., 19 Nov 2024, Wang et al., 2021). Annotation protocols typically discard inter-observer conflicts to maximize label reliability (EyeQ), or aggregate MOS using weighted consensus from senior and junior ophthalmologists (FQS) (Gong et al., 19 Nov 2024).
Quality criteria reflect clinical use-cases. For color fundus, key image attributes include field coverage, visibility/clarity of the optic disc and macula, vessel discernibility, illumination homogeneity, sharpness/focus, color fidelity, and absence of artifacts. Structured frameworks such as FundaQ-8 operationalize this as an eight-attribute, Likert-scale checklist (resolution, field of view, color, artifacts, and vessel/macula/disc/cup visibility), with the overall quality normalized as Q = S/16, where S is the sum of the per-attribute scores (Zun et al., 25 Jun 2025). OCTA datasets define quality via gradability (blurring, noise, centering, vessel contrast), with three discrete levels (Wang et al., 2021).
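The composite score is simple to compute once the checklist is filled in. The sketch below assumes each of the eight attributes is rated on a 0–2 Likert scale (so the maximum sum is 16); this is an illustrative reading of Q = S/16, not the reference implementation.

```python
# Illustrative FundaQ-8-style composite score; the 0-2 per-attribute scale is an
# assumption made for this sketch, not a quotation of the published protocol.
FUNDAQ8_ATTRIBUTES = (
    "resolution", "field_of_view", "color", "artifacts",
    "vessel_visibility", "macula_visibility", "disc_visibility", "cup_visibility",
)

def fundaq8_score(ratings: dict) -> float:
    """Normalize the summed per-attribute ratings: Q = S / 16, giving Q in [0, 1]."""
    missing = set(FUNDAQ8_ATTRIBUTES) - set(ratings)
    if missing:
        raise ValueError(f"missing attribute ratings: {sorted(missing)}")
    s = sum(ratings[a] for a in FUNDAQ8_ATTRIBUTES)
    return s / 16.0

# Example: a well-captured image docked one point for a minor artifact.
example = {a: 2 for a in FUNDAQ8_ATTRIBUTES}
example["artifacts"] = 1
print(fundaq8_score(example))  # 0.9375
```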
Table: Major RIQA Datasets and Annotation Protocols
| Dataset | Images | Label Granularity | Annotation Scheme |
|---|---|---|---|
| EyeQ | 28,792 | Good / Usable / Reject | Dual-expert consensus, ambiguous removed |
| FQS | 2,246 | MOS (0–100), 3 classes | 6 raters, weighted mean, 10× CV |
| OCTA-25K-IQA-SEG | 25,665 | Ungradable / Gradable / Outstanding | 3-expert consensus, adjudication |
| RIQA-RFMiD | 2,560 | Good / Usable / Reject | Ophthalmologist re-annotation |
| EyeQ-D (subset) | 160 | Per-detail (illumination/clarity/contrast) | 8 ophthalmologists, majority vote |
2. Model Architectures and Image Priors
Early RIQA relied on hand-crafted features (blur, contrast, field-of-view metrics) and shallow classifiers. Deep learning methods now predominate, including convolutional neural networks (CNNs), color-space fusion networks, and transformer-based regression frameworks.
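As a concrete example of the classical hand-crafted route, a generic blur/sharpness cue such as the variance of the Laplacian can feed a shallow classifier; the measure below is illustrative and not the specific feature used by any cited system.

```python
import cv2
import numpy as np

# Variance of the Laplacian as a generic sharpness/blur cue (illustrative only).
def laplacian_sharpness(image_bgr: np.ndarray) -> float:
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

img = np.zeros((256, 256, 3), dtype=np.uint8)  # placeholder fundus crop
print(laplacian_sharpness(img))                # 0.0 for a flat, featureless image
```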
Color Fundus
State-of-the-art models include:
- MCF-Net: Multi-color-space fusion, combining RGB, HSV, and LAB representations within a DenseNet-121 backbone; DenseNet121-MCF attains 91.75% accuracy and macro F1=0.8551 on EyeQ (Fu et al., 2019). A minimal sketch of this fusion idea appears after this list.
- GuidedNet: Incorporates fixed dark/bright channel priors in a DenseNet-121, improving discrimination of images with uneven illumination; achieves 89.23% accuracy, F1=88.03% on EyeQ, outperforming prior RGB-only designs (Xu et al., 2020).
- SalStructIQA: Augments CNNs with large-size (optic disc, exudates) and tiny-size (vessel) salient structure priors, yielding dual- or single-branch designs; dual-branch DenseNet-121 reaches 88.97% accuracy, F1=87.23% (Xu et al., 2020).
- QuickQual: Utilizes a frozen DenseNet-121 as perceptual feature extractor, with SVM or ten-parameter “MEga Minified Estimator” (MEME) classifier; SVM achieves 88.63% accuracy, AUC=0.9687, outperforming larger models (Engelmann et al., 2023).
- FundaQ-8: Fine-tuned ResNet-18 regresses a continuous composite quality score Q, driven by explicit multi-attribute labels; RMSE=0.1473, R²=0.7734 (Zun et al., 25 Jun 2025).
- FTHNet: Transformer-based backbone and hypernetwork predict continuous MOS, with PLCC=0.9442, SRCC=0.9358, outperforming prior regression/classification baselines (Gong et al., 19 Nov 2024).
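The sketch below illustrates the multi-color-space fusion idea referenced above, assuming one DenseNet-121 branch per color space and simple prediction-level averaging; the published MCF-Net fuses at both the feature and prediction levels, so this is a simplified reading, not the authors' architecture.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as models

def to_color_spaces(rgb: np.ndarray) -> list[torch.Tensor]:
    """HxWx3 uint8 RGB image -> normalized (3, H, W) tensors in RGB, HSV, and LAB."""
    spaces = [rgb,
              cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV),
              cv2.cvtColor(rgb, cv2.COLOR_RGB2LAB)]
    return [torch.from_numpy(s.astype(np.float32) / 255.0).permute(2, 0, 1) for s in spaces]

class ColorSpaceFusionNet(nn.Module):
    """Simplified MCF-style model: one DenseNet-121 branch per color space."""
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.branches = nn.ModuleList(
            [models.densenet121(num_classes=num_classes) for _ in range(3)]
        )

    def forward(self, rgb, hsv, lab):
        logits = [branch(x) for branch, x in zip(self.branches, (rgb, hsv, lab))]
        return torch.stack(logits).mean(dim=0)   # prediction-level fusion (assumption)

model = ColorSpaceFusionNet()
rgb, hsv, lab = (t.unsqueeze(0) for t in to_color_spaces(
    np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)))
print(model(rgb, hsv, lab).shape)  # torch.Size([1, 3]) -> Good / Usable / Reject logits
```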
OCTA
The best-performing classifiers are ImageNet-pretrained deep networks (Swin Transformer-Large for 3×3 mm² scans, SE-ResNeXt-101 for 6×6 mm² scans); Swin reaches accuracy 0.91 and AUC=0.98 on the 3×3 mm² subset (Wang et al., 2021).
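A minimal fine-tuning setup for such a classifier might look like the sketch below; the timm model identifier, input resolution, and head size are assumptions for illustration, not the published training configuration.

```python
import timm
import torch

# Assumed timm identifier for Swin Transformer-Large at 224x224 input; adjust to the
# installed timm version. pretrained=True downloads ImageNet weights.
model = timm.create_model(
    "swin_large_patch4_window7_224",
    pretrained=True,
    num_classes=3,  # ungradable / gradable / outstanding
)

x = torch.randn(4, 3, 224, 224)  # batch of OCTA en-face images resized to 224x224
logits = model(x)                # shape: (4, 3)
```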
3. Quality Assessment Algorithms: Losses, Outputs, and Training Protocols
Training protocols are broadly standardized across deep learning RIQA: datasets are split into stratified train/validation/test folds, class imbalance is handled with weighted cross-entropy or hinge loss, and data augmentations include flips, rotations, and color jitter (a minimal loss-configuration sketch follows the list below).
- Classification Losses: Most architectures use cross-entropy for multiclass outputs; binary classifiers (e.g., AlexNet (Saha et al., 2017)) use hinge-loss.
- Regression Losses: FundaQ-8 employs mean squared error for continuous Q; FTHNet applies the smooth L¹ loss for MOS prediction (Zun et al., 25 Jun 2025, Gong et al., 19 Nov 2024).
- Multi-task and Semi-supervised: Recent paradigms inject pseudo-labels for quality details (illumination/clarity/contrast) alongside the overall quality label (ResNet-18 backbone), yielding statistically significant F1 improvements (MT-EyeQ F1=0.875 vs ST-EyeQ F1=0.863, p<0.05) (Telesco et al., 17 Nov 2025).
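The loss configurations named above can be set up in a few lines; the class weights, label ranges, and tensor shapes in the sketch below are assumptions for illustration, not any paper's exact training code.

```python
import torch
import torch.nn as nn

# (a) Weighted cross-entropy for three-class quality labels (Good / Usable / Reject);
#     the weights below are illustrative, e.g. inverse class-frequency estimates.
class_weights = torch.tensor([1.0, 2.0, 4.0])
ce_loss = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 3)             # model outputs for a batch of 8 images
labels = torch.randint(0, 3, (8,))     # ground-truth quality classes
loss_cls = ce_loss(logits, labels)

# (b) Smooth L1 loss for continuous MOS regression on a 0-100 scale, in the spirit of
#     FTHNet-style regressors.
smooth_l1 = nn.SmoothL1Loss()
pred_mos = torch.rand(8, 1) * 100
true_mos = torch.rand(8, 1) * 100
loss_reg = smooth_l1(pred_mos, true_mos)
```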
Parameter counts, FLOPs, and batch sizes vary (DenseNet-121: ~7M parameters; FTHNet-L: 14.88M). Vision transformers offer global context modeling for complex degradations (Gong et al., 19 Nov 2024), while SVMs/logistic regressors on frozen features yield ultra-lightweight deployments (Engelmann et al., 2023).
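A QuickQual-style frozen-feature pipeline is straightforward to assemble; the feature layer, preprocessing, and SVM hyperparameters below are assumptions for illustration (a recent torchvision is assumed for the weights enum), not the published configuration.

```python
import numpy as np
import torch
import torchvision.models as models
from sklearn.svm import SVC

# 1. Frozen ImageNet-pretrained DenseNet-121 as a perceptual feature extractor.
backbone = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
backbone.classifier = torch.nn.Identity()   # expose the 1024-d pooled features
backbone.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> np.ndarray:
    """images: (N, 3, 224, 224) tensor, already normalized to ImageNet statistics."""
    return backbone(images).cpu().numpy()

# 2. Lightweight classifier on top of the frozen features.
images = torch.randn(16, 3, 224, 224)       # placeholder batch of fundus images
labels = np.random.randint(0, 3, size=16)   # placeholder Good/Usable/Reject labels
features = extract_features(images)

clf = SVC(kernel="rbf", probability=True)   # hyperparameters are assumptions
clf.fit(features, labels)
print(clf.predict(features[:4]))
```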
4. Evaluation Protocols and Quantitative Benchmarks
Performance metrics include overall accuracy, classwise precision/recall/F1, ROC AUC, RMSE, PLCC, SRCC, regression R², and explained variance.
| Model | Dataset | Accuracy | F1 | AUC | RMSE | PLCC | SRCC |
|---|---|---|---|---|---|---|---|
| MCF-Net (DenseNet121-MCF) | EyeQ | 91.75% | 0.8551 | - | - | - | - |
| GuidedNet | EyeQ | 89.23% | 0.8803 | - | - | - | - |
| SalStructIQA-dual | EyeQ | 88.97% | 0.8723 | - | - | - | - |
| QuickQual (SVM) | EyeQ | 88.63% | 0.8675 | 0.9687 | - | - | - |
| FundaQ-8 (ResNet18) | FundaQ-8 | - | - | - | 0.1473 | - | - |
| FTHNet-L | FQS | - | - | - | 7.024 | 0.9442 | 0.9358 |
| Swin-Transformer-L | OCTA-25K 3×3 | 91% | 0.91 | 0.98 | - | - | - |
For continuous MOS/quality scores, PLCC measures linear agreement with the reference scores while SRCC captures rank consistency, together giving a robust assessment in regression settings (FTHNet, FundaQ-8) (Gong et al., 19 Nov 2024, Zun et al., 25 Jun 2025).
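These correlation metrics are typically computed directly between predicted and reference MOS vectors, as in the sketch below (illustrative data, not any paper's evaluation script).

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

pred_mos = np.array([72.1, 55.4, 88.0, 30.2, 64.7])   # model predictions
true_mos = np.array([70.0, 58.0, 90.0, 28.0, 66.0])   # reference MOS

plcc, _ = pearsonr(pred_mos, true_mos)    # linear correlation
srcc, _ = spearmanr(pred_mos, true_mos)   # rank correlation
rmse = float(np.sqrt(np.mean((pred_mos - true_mos) ** 2)))
print(f"PLCC={plcc:.4f}  SRCC={srcc:.4f}  RMSE={rmse:.3f}")
```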
A consistent trend is that advanced prior-guided, fusion, or transformer-based methods outperform baseline RGB-only CNNs and classical non-deep approaches (e.g., BRISQUE PLCC=0.9220 vs FTHNet-L PLCC=0.9442) (Gong et al., 19 Nov 2024).
5. Integration into Clinical and Automated Pipelines
Quality gates are positioned as upstream blocks in automated diabetic retinopathy (DR) classifiers or OCTA analysis. Empirical results confirm that restricting analysis to "Good" quality images improves DR detection and segmentation (DR accuracy rises from 0.5464 on poor-quality images to 0.7357 on good-quality images per FundaQ-8 (Zun et al., 25 Jun 2025); FAZ segmentation Dice jumps from 0.80 to 0.95 after filtering ungradable OCTA (Wang et al., 2021)). Some models provide interpretable outputs (e.g., FundaQ-8's eight axes, SalStructIQA's interpretable attention maps, multi-task defect prediction of illumination, clarity, and contrast (Telesco et al., 17 Nov 2025)), enabling actionable feedback for operators.
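A quality gate of this kind reduces to a simple routing step in the screening pipeline; the function names and accepted-label policy in the sketch below are hypothetical, standing in for whichever RIQA classifier and DR grader a given system uses.

```python
from typing import Callable, Iterable, List, Tuple

QUALITY_CLASSES = ("Good", "Usable", "Reject")

def quality_gate(
    images: Iterable,
    predict_quality: Callable[[object], str],   # e.g., an EyeQ-trained classifier (hypothetical hook)
    grade_dr: Callable[[object], int],          # downstream DR grader (hypothetical hook)
    accepted: Tuple[str, ...] = ("Good",),
) -> Tuple[List, List]:
    """Route only sufficiently good images to the DR grader; flag the rest for recapture."""
    graded, recapture = [], []
    for img in images:
        label = predict_quality(img)
        if label in accepted:
            graded.append((img, grade_dr(img)))
        else:
            recapture.append((img, label))      # actionable feedback for the operator
    return graded, recapture
```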
Real-time applicability varies: the ultra-compact QuickQual-MEME runs in ≈15 ms per image and FTHNet-L in ≈56 ms, supporting on-the-fly acquisition feedback (Engelmann et al., 2023, Gong et al., 19 Nov 2024). State-of-the-art transformer-based and DenseNet classifiers are being embedded into commercial and research screening devices.
6. Current Limitations, Controversies, and Future Directions
Generalization to new populations, cameras, and extreme pathology or artifact cases is an ongoing concern, with most models trained on select clinical sources or competitions (Fu et al., 2019, Gong et al., 19 Nov 2024). Some priors (dark/bright channel, SalStructIQA) do not capture all clinically relevant distortions (e.g., blur, motion, artifacts), motivating hybrid frameworks and multi-attribute or defect-specific supervision (Xu et al., 2020, Xu et al., 2020). Inter-rater consistency, MOS granularity, and label noise remain open questions—FTHNet and FundaQ-8 point to future directions with denser, more diverse MOS annotation and domain-adaptive training (Gong et al., 19 Nov 2024, Zun et al., 25 Jun 2025). Multi-task, semi-supervised, and regression approaches promise improvements in both interpretability and data efficiency (Telesco et al., 17 Nov 2025).
Open development and public release of datasets (EyeQ, FQS, RFMiD, OCTA-25K-IQA-SEG) are accelerating cross-comparisons and external validation (Gong et al., 19 Nov 2024, Wang et al., 2021). Emerging topics include multi-modal/multi-task RIQA, integrated image enhancement, device-level QA deployments, and federated or adversarial methods for domain adaptation.