Scalable Binary DR Screening Workflows

Updated 2 January 2026

Modular workflows integrate preprocessing, CNN-based grading, and explainability to achieve high sensitivity and specificity in DR detection.
Scalable binary DR screening pipelines leverage model fusion and uncertainty estimation to deliver robust, high-throughput performance in diverse clinical settings.
Resource-efficient designs with modular integration reduce clinician workload while enhancing real-world auditability and telemedicine interoperability.

Scalable binary diabetic retinopathy (DR) screening workflows are algorithmic systems and integrated pipelines designed for automated, high-throughput, and resource-efficient identification of referable DR on retinal fundus images. These workflows enable mass screening for diabetic eye disease, particularly in contexts where ophthalmological expertise is limited. Key developments leverage deep learning—often combined with model fusion, robust preprocessing, and explainability tools—to optimize sensitivity, specificity, interpretability, and scalability across both high-volume and resource-constrained environments.

1. Workflow Architecture and System Components

Contemporary scalable binary DR screening systems share a modular architecture comprising standardized input acquisition, preprocessing, automated grading/classification, explainability and quality assurance, referral decision logic, and integration into clinical or telemedicine environments (Dey et al., 10 Jan 2025, Pinto et al., 2024). A typical high-level pipeline comprises:

Image Acquisition: Portable non-mydriatic fundus cameras capture high-resolution macula-centric retinal images.
Preprocessing: Images are standardized in size, normalized in color, and enhanced for lesion visibility (e.g. via CLAHE or intensity stretching) (Dey et al., 10 Jan 2025, Pinto et al., 2024).
Automated DR Grading: CNN-based models (e.g. EfficientNet-B0, ResNet34, custom Inception variants) classify images as referable vs. non-referable DR (ICDR thresholding) or no-DR vs. DR, optionally incorporating additional gradability or field selection modules (Ahmed et al., 23 Jul 2025, Pinto et al., 2024).
Quality Control: Automated detection of ungradable images leads to recapture or expert triage.
Explainability and Reporting: Techniques such as attention maps, integrated gradients, or hybrid feature explanations enhance clinical transparency (Araújo et al., 2019, Pinto et al., 2024).
Referral and Data Logging: Results (binary decisions, probability/confidence, and rationale) are stored and routed to clinical endpoints for human review or direct patient management.

Component	Model/Algorithm	Input Size (px)
Field Classifier (FC)	ResNet34	150×150
Gradability (GC)	ResNet34	400×400
DR Classifier (DRC)	ResNet34	700×700

Decision logic merges gradability and DR predictions via calibrated scores and a piecewise linear mapping, resulting in a unified screening score with fixed sensitivity/specificity properties.
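The merging step described above can be pictured with a small sketch. The breakpoints, the function names, and the rule that an ungradable image receives the maximum score are illustrative assumptions; the source does not specify the actual mapping.

```python
from bisect import bisect_right


def piecewise_linear(x, xs, ys):
    """Linearly interpolate x over breakpoints (xs, ys); clamp outside the range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


# Hypothetical breakpoints: the calibrated DR probability at the chosen
# operating point (0.30 here) is mapped to a screening score of 0.5.
DR_XS = [0.0, 0.30, 1.0]
DR_YS = [0.0, 0.50, 1.0]


def screening_score(p_gradable, p_dr, gradable_threshold=0.5):
    """Unified score in [0, 1]; ungradable images get the maximum score (assumption)."""
    if p_gradable < gradable_threshold:
        return 1.0
    return piecewise_linear(p_dr, DR_XS, DR_YS)
```

Anchoring the operating point at a fixed score such as 0.5 means a single cut-off can be communicated to clinicians even when the underlying calibrated threshold changes between model versions.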

2. Model Architectures, Training, and Fusion Approaches

Various model paradigms are deployed to meet the scalability and performance requirements of binary DR screening.

Convolutional Neural Networks (CNNs)

EfficientNet-B0 and ResNet34 are widely used for their balance of accuracy and inference speed (Ahmed et al., 23 Jul 2025, Islam et al., 26 Dec 2025):

EfficientNet-B0: ≈5.3M parameters, ≈0.39 GFLOPs, 4–6 ms/image GPU inference; AUC ≈98.6%
ResNet34: ≈21.8M parameters, ≈3.7 GFLOPs, 12–15 ms/image GPU inference; AUC ≈99.4%

Model Fusion

Feature-level fusion of complementary backbones (e.g., EfficientNet-B0 + DenseNet121) enhances generalization, yielding accuracy ≈82.9% for that pairing, balanced class-wise F₁-scores, and per-image GPU latency ≈2.3 ms for high-throughput scenarios (Islam et al., 26 Dec 2025).
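As a minimal illustration of feature-level fusion, the sketch below concatenates two embedding vectors and passes the result through a logistic head. The toy vectors, weights, and function names are assumptions made for the example; in practice the embeddings would be pooled backbone features and the head would be trained.

```python
import math


def fuse_features(feat_a, feat_b):
    """Feature-level fusion by concatenating two embedding vectors."""
    return list(feat_a) + list(feat_b)


def binary_head(features, weights, bias):
    """Logistic head: sigmoid(w . x + b) gives p(DR) for the fused vector."""
    if len(features) != len(weights):
        raise ValueError("weights must match the fused feature length")
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy 3-d and 2-d embeddings standing in for pooled backbone features.
emb_a = [0.2, -0.1, 0.7]
emb_b = [1.5, 0.3]
fused = fuse_features(emb_a, emb_b)
p_dr = binary_head(fused, weights=[0.5, 0.0, 1.0, 0.2, -0.4], bias=-0.5)
```

Concatenation keeps each backbone's representation intact and leaves it to the head to weight them, which is the simplest of several possible fusion schemes.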

Hybrid and Interpretable Architectures

HOG-CNN integrates Histogram of Oriented Gradients (HOG) features with CNN embeddings, resulting in high accuracy (98.5%) and AUC (99.2%) on binary DR with minimal hardware footprint and suitability for edge devices (Ahmed, 29 Jul 2025).

Uncertainty and Explainability

DR|GRADUATE augments CNNs with uncertainty estimation and lesion attention maps (Multiple Instance Learning), allowing uncertainty-based triage and improved QC/outlier detection. Binary referral is derived from cumulative grade probabilities ($P_{\mathrm{ref}}$) (Araújo et al., 2019).
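A referral probability of this kind can be sketched as the probability mass on the referable grades. The example assumes five ICDR grades (0 to 4) and treats grade 2 or higher as referable, a common convention that the text above does not state explicitly; the function name is hypothetical.

```python
def referable_probability(grade_probs, referable_from=2):
    """Sum the probability mass on grades >= referable_from.

    grade_probs: per-grade probabilities for ICDR grades 0..4, summing to 1.
    referable_from: first grade counted as referable (assumed to be 2).
    """
    if abs(sum(grade_probs) - 1.0) > 1e-6:
        raise ValueError("grade probabilities must sum to 1")
    return sum(grade_probs[referable_from:])


# Example: most mass on grades 0 and 1, so the referral probability is low.
p_ref = referable_probability([0.6, 0.2, 0.1, 0.07, 0.03])
```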

3. Preprocessing, Augmentation, and Quality Control

Robust preprocessing and augmentation are essential for cross-site generalizability and error minimization:

Resizing: All models resize images to standard input (typically 224×224 or 640×640 px) (Islam et al., 26 Dec 2025, Ahmed, 29 Jul 2025).
Normalization: Color normalization to ImageNet statistics or dataset-specific means (Ahmed et al., 23 Jul 2025, Islam et al., 26 Dec 2025).
Augmentation: Balanced synthetic augmentation using geometric transformations, perspective, brightness/contrast jitter, occlusion, and more achieves class balance and reduces overfitting (Ahmed et al., 23 Jul 2025).
Quality Assessment: Automated detection of blur, field-of-view artefacts, and low signal-to-noise via QC networks or heuristic tests, with triage for re-acquisition or manual review (Pinto et al., 2024, Dey et al., 10 Jan 2025).
Ungradable Handling: Images determined non-gradable by QC modules are auto-referred or recaptured, ensuring high NPV (Dey et al., 10 Jan 2025).
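Two of the steps above, normalization and a heuristic blur test, can be sketched in plain Python. The per-channel constants are the widely used ImageNet mean and standard deviation; the variance-of-Laplacian sharpness test and its threshold are illustrative assumptions, not the QC method of any cited system.

```python
from statistics import pvariance

# Widely used ImageNet per-channel statistics (RGB, for inputs scaled to [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_rgb(pixel):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (c / 255.0 - m) / s for c, m, s in zip(pixel, IMAGENET_MEAN, IMAGENET_STD)
    )


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    gray: 2-D list of grayscale values, at least 3x3. Low values suggest blur.
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def is_gradable(gray, min_sharpness=100.0):
    """Heuristic QC gate; min_sharpness is a placeholder to tune per camera."""
    return laplacian_variance(gray) >= min_sharpness
```

An image that fails the gate would be sent for recapture or auto-referred, in line with the ungradable handling described above.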

4. Screening Decision Logic, Metrics, and Thresholding

All workflows ultimately convert per-image or per-eye predictions into binary decisions. Standard definitions (Dey et al., 10 Jan 2025, Barakat et al., 16 Sep 2025):

Sensitivity (Recall): $\mathrm{Sensitivity} = \frac{TP}{TP + FN}$
Specificity: $\mathrm{Specificity} = \frac{TN}{TN + FP}$
Accuracy: $\frac{TP + TN}{TP + TN + FP + FN}$
F₁-score: $2\,\frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$
AUC (AUROC): Area under ROC curve, via numerical integration (Barakat et al., 16 Sep 2025).
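The definitions above translate directly into code. The sketch below computes the count-based metrics from a confusion matrix and the AUROC by trapezoidal integration over the ROC curve, grouping tied scores into a single ROC point; function names are chosen for the example.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy, and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


def roc_auc(labels, scores):
    """AUROC via the trapezoidal rule; labels are 0/1, higher score = more DR."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    pairs = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    area = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume all samples sharing this score so ties form one ROC point.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        tpr, fpr = tp / pos, fp / neg
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return area
```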

Threshold optimization involves Youden’s J statistic or cost-aware minimization (e.g., $C_{\mathrm{FN}}=10\times C_{\mathrm{FP}}$) (Dey et al., 10 Jan 2025). Calibration techniques (beta, isotonic) are used to ensure clinical interpretability and transferability (Pinto et al., 2024).

Binary decision rules are typically:

Refer if model $p(\mathrm{Referable}) \geq \tau$ for chosen $\tau$
Compound logic: refer if gradability $<$ threshold or DR score $>$ threshold (Pinto et al., 2024)

In explainable systems, uncertainty or low-confidence cases can be triaged for expert review regardless of model output (Araújo et al., 2019).
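The rules above can be combined into a single decision function. This is a sketch under stated assumptions: the three-way outcome, the order in which checks are applied, and all default thresholds are placeholders rather than values from the cited systems.

```python
from enum import Enum


class Decision(Enum):
    NO_REFER = "no_refer"
    REFER = "refer"
    EXPERT_REVIEW = "expert_review"


def decide(p_gradable, p_referable, uncertainty,
           tau=0.5, gradable_min=0.5, max_uncertainty=0.2):
    """Compound screening rule with uncertainty-based triage.

    All thresholds are hypothetical defaults to be tuned on validation data.
    """
    # Low-confidence cases go to an expert regardless of the model output.
    if uncertainty > max_uncertainty:
        return Decision.EXPERT_REVIEW
    # Ungradable images are referred rather than silently passed.
    if p_gradable < gradable_min:
        return Decision.REFER
    if p_referable >= tau:
        return Decision.REFER
    return Decision.NO_REFER
```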

5. Scalability, Hardware Footprint, and Deployment

Scalability dimensions include computation, storage, workflow parallelization, and integration with real-world clinics:

Inference Speed: Top models (EfficientNet-B0, HOG-CNN) deliver GPU inference in under 6 ms/image or real-time inference on edge devices (Jetson Nano, Apple M2); batch processing yields >400 images/sec on standard GPUs (Islam et al., 26 Dec 2025, Ahmed, 29 Jul 2025).
Memory and Model Size: Lightweight architectures and quantization reduce RAM and disk to <100 MB, feasible for on-premise or edge deployment (Ahmed et al., 23 Jul 2025, Ahmed, 29 Jul 2025).
Batching & Throughput: Batch processing (32–1000 images) and GPU queueing maximize throughput for screening camps/mass uploads (Islam et al., 26 Dec 2025).
Resilience: Local inference removes cloud dependency; error handling, result logging, and continuous drift monitoring are recommended (Dey et al., 10 Jan 2025).
Integration: RESTful APIs, DICOM/JPEG+JSON interfacing, telemedicine dashboards, and human-in-the-loop feedback ensure interoperability (Dey et al., 10 Jan 2025, Pinto et al., 2024).
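Batching and JSON result logging might look like the sketch below. The `predict_batch` callable stands in for a real model, and the record fields and the `model_version` string are hypothetical; none of them comes from the cited systems.

```python
import json
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of up to batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def screen_in_batches(image_ids, predict_batch, batch_size=32, tau=0.5,
                      model_version="demo-0.1"):
    """Run a batch predictor and build JSON-serializable result records.

    predict_batch: hypothetical callable mapping a list of image ids to a
    list of p(referable) values of the same length.
    """
    records = []
    for batch in batched(image_ids, batch_size):
        probs = predict_batch(batch)
        if len(probs) != len(batch):
            raise ValueError("predictor returned a wrong-sized batch")
        for image_id, p in zip(batch, probs):
            records.append({
                "image_id": image_id,
                "p_referable": p,
                "refer": p >= tau,
                "threshold": tau,
                "model_version": model_version,
            })
    return records


def to_json_lines(records):
    """Serialize records one per line, suitable for append-only logging."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)
```

Storing the threshold and model version with every record keeps each decision reproducible during a later audit.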

In-field hybrid workflows can leverage two-stage “describe-then-refer” strategies with vision-language models (VLMs) such as MedGemma, enabling explainable outputs and cascading referral decisions for improved safety (Barakat et al., 16 Sep 2025).

6. Clinical Integration and Real-World Impact

Deployment of binary DR screening workflows in large-scale clinical settings has demonstrated:

Workload Reduction: Autonomous pre-screening with systems such as NaIA-RD can reduce clinician review burden by over 4× without loss of sensitivity for sight-threatening DR (Pinto et al., 2024).
Sensitivity and Specificity: State-of-the-art pipelines achieve sensitivity >92%, specificity >92% (NaIA-RD), with AUC often exceeding 0.98 across multiple public datasets (Dey et al., 10 Jan 2025, Pinto et al., 2024).
GP Workflow Enhancement: In clinical studies, AI-assisted screening increased GP sensitivity from 40–80% to >90%, with a measurable increase in downstream detection rates (Pinto et al., 2024).
Clinical Audit and Monitoring: Periodic audits on random samples of “no-refer” cases, versioning, and feedback loops are key for sustained safety and trust (Barakat et al., 16 Sep 2025, Pinto et al., 2024).
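The periodic audit of “no-refer” cases can be sketched as a reproducible random sample. The 5% audit fraction, the record layout (a dict with a boolean `refer` field), and the function name are assumptions made for illustration.

```python
import random


def sample_for_audit(records, fraction=0.05, seed=None):
    """Draw a random sample of no-refer records for manual review.

    records: list of dicts with a boolean "refer" field (hypothetical layout).
    fraction: share of no-refer cases to audit (placeholder value).
    seed: fix it to make the audit sample reproducible.
    """
    no_refer = [r for r in records if not r["refer"]]
    if not no_refer:
        return []
    k = max(1, round(len(no_refer) * fraction))
    rng = random.Random(seed)
    return rng.sample(no_refer, k)
```

Using a dedicated `random.Random` instance keeps the audit draw independent of any other random state in the pipeline.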

A plausible implication is that continuous integration of feedback from clinicians and real-world patient data, combined with modular and explainable AI, will increasingly homogenize screening standards, improve detection rates, and optimize resource use in both high- and low-resource settings.

7. Explainability, Trust, and Model Governance

Clinical acceptance relies on explainable outputs, uncertainty estimation, and transparent decision-making paths:

Descriptive Outputs: Rich lesion-level explanations (textual MedGemma outputs, attention maps, heatmaps) can both boost downstream model AUROC and increase clinical trust; e.g., GPT-4o's AUROC increased from ~0.78 (image) to ~0.96 (text description only) (Barakat et al., 16 Sep 2025).
Uncertainty Quantification: DR|GRADUATE provides an uncertainty score per image, flagging low-confidence predictions for manual review and supporting outlier/quality control (Araújo et al., 2019).
Human-in-the-Loop: Flagged uncertain or borderline images are routed to experts, and clinician overrides are directly reintegrated for periodic retraining and governance (Dey et al., 10 Jan 2025, Pinto et al., 2024).
Version Control and Audit: Model runs should be tagged, and drift-detection mechanisms deployed to trigger retraining when performance metrics change by >5 points (Barakat et al., 16 Sep 2025).
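A drift check of the kind described above can be sketched as a comparison between baseline and current metrics. The example interprets “points” as percentage points, which is an assumption, and the metric names and function names are hypothetical.

```python
def drifted_metrics(baseline, current, max_change_points=5.0):
    """Return metrics whose absolute change exceeds max_change_points.

    baseline, current: dicts mapping metric name to a value in percent.
    The result maps each drifted metric to its signed change.
    """
    return {
        name: current[name] - baseline[name]
        for name in baseline
        if name in current
        and abs(current[name] - baseline[name]) > max_change_points
    }


def needs_retraining(baseline, current, max_change_points=5.0):
    """True if any tracked metric moved by more than the allowed margin."""
    return bool(drifted_metrics(baseline, current, max_change_points))


# Example with invented numbers: sensitivity dropped by 6.5 points.
baseline = {"sensitivity": 93.0, "specificity": 92.0}
current = {"sensitivity": 86.5, "specificity": 91.0}
drift = drifted_metrics(baseline, current)
```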

Collectively, these explainable and governed workflows are foundational for regulatory approval, long-term safety, and transparent deployment of binary DR screening systems at scale.