Cross-Modal Fundus Registration
- CMFIR is the process of spatially aligning retinal images from differing modalities by leveraging anatomical features, particularly vascular structures.
- Advanced methods combine feature matching, deep neural networks, and contrastive representation learning to address challenges like intensity differences, FoV discrepancies, and nonrigid deformations.
- Integrating physiological cropping and hybrid registration techniques enhances accuracy, which is crucial for multimodal retinal disease assessment and automated analysis pipelines.
Cross-modal fundus image registration (CMFIR) is the problem of spatially aligning retinal images acquired in different imaging modalities, such as color fundus photography (CFP), fluorescein angiography (FA), and optical coherence tomography angiography (OCTA). CMFIR exploits anatomical correspondences, predominantly vascular structure, but is challenged by significant cross-modal appearance variation, field-of-view (FoV) discrepancies, and potential non-rigid tissue deformations. Accurate CMFIR is central for integrated multimodal retinal disease assessment, progression tracking, and automated analysis pipelines (Li et al., 14 Dec 2025, Sindel et al., 2022, Hervella et al., 2018, Sindel et al., 2022, Pielawski et al., 2020).
1. Problem Formulation and Mathematical Foundations
Given two images of the same retina acquired in different modalities, denote the source $I_s$ and the target $I_t$. The registration problem is to find a spatial transformation

$$T: \mathbb{R}^2 \rightarrow \mathbb{R}^2$$

that aligns each pixel in $I_s$ to its anatomically correct correspondent in $I_t$. For practical CMFIR, especially between modalities such as OCTA (source $I_s$) and wide-field CFP (target $I_t$), the registration must handle severe non-linearities and large FoV disparities (Li et al., 14 Dec 2025).
A widely adopted pipeline is feature-based: detect matched keypoint pairs

$$\{(\mathbf{p}_i, \mathbf{q}_i)\}_{i=1}^{N}$$

between $I_s$ and $I_t$, and estimate $T$ via least-squares or robust polynomial fitting, e.g. $\hat{T} = \arg\min_{T} \sum_{i=1}^{N} \lVert T(\mathbf{p}_i) - \mathbf{q}_i \rVert^2$. Alternatively, intensity-based and hybrid formulations (combining feature matches with image similarity metrics) are found in the literature (Hervella et al., 2018).
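To make this concrete, the following minimal sketch fits an affine $T$ to matched keypoints by least squares (NumPy; the helper name `fit_affine_lstsq` and the toy data are illustrative, not taken from the cited works; robust variants would wrap this in RANSAC):

```python
import numpy as np

def fit_affine_lstsq(p, q):
    """Least-squares affine fit of T from matched keypoints: T(p_i) ~= q_i.

    p, q: (N, 2) arrays of source/target coordinates.
    Returns a (2, 3) affine matrix A with q_i ~ A @ [p_i, 1].
    """
    n = p.shape[0]
    X = np.hstack([p, np.ones((n, 1))])         # homogeneous source coords, (N, 3)
    A, *_ = np.linalg.lstsq(X, q, rcond=None)   # solves X @ A = q, A is (3, 2)
    return A.T

# Toy usage: recover a known affine map from noisy matches.
rng = np.random.default_rng(0)
p = rng.uniform(0, 512, size=(50, 2))
A_true = np.array([[0.98, 0.05, 12.0], [-0.04, 1.01, -7.5]])
q = p @ A_true[:, :2].T + A_true[:, 2] + rng.normal(0, 0.3, size=(50, 2))
A_est = fit_affine_lstsq(p, q)                  # close to A_true
```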
2. Classical and Hybrid CMFIR Approaches
Initial work in CMFIR primarily utilized vessel bifurcations and cross-overs as landmarks, with vessel enhancement to mitigate cross-modal intensity differences. Hervella et al.'s framework employs:
- Vessel segmentation with multi-scale curvature operators (MLSEC-ST) distinguishing valleys (CFP) and ridges (FA).
- Extraction of domain-specific landmarks (bifurcations, cross-overs) with orientation and spatial consistency checks.
- RANSAC-based feature matching restricted to similarity or affine transforms for initial coarse alignment.
- An intensity-based refinement using vessel-enhanced normalized cross-correlation (VE-NCC), allowing further non-rigid fine-tuning via free-form deformation (FFD) under a composite regularized energy functional.
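As an illustration of the refinement criterion, here is a minimal vessel-enhanced NCC sketch (NumPy; it assumes the vesselness filtering, e.g. MLSEC-ST, has already been applied to both images upstream, and the function name is hypothetical):

```python
import numpy as np

def ve_ncc(src_enh, tgt_enh, eps=1e-8):
    """Normalized cross-correlation between two vessel-enhanced images.

    src_enh, tgt_enh: float arrays of identical shape that have already
    been passed through a vesselness filter. Returns a scalar in
    [-1, 1]; higher values indicate better vessel overlap.
    """
    a = src_enh - src_enh.mean()
    b = tgt_enh - tgt_enh.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps))
```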
The hybrid strategy attains the highest VE-NCC scores among the compared settings in healthy cases, outperforming pure feature- or intensity-based approaches. Feature-based registration (FBR) reliably produces a robust initial alignment, while the intensity-based refinement is necessary for fine registration, especially in the presence of pathologies (Hervella et al., 2018).
3. Deep Learning and Keypoint-Based CMFIR
Recent methodologies utilize deep neural networks trained to detect and describe vessel-structure keypoints invariant across modalities.
RetinaCraquelureNet is a fully convolutional neural network with a shared ResNet backbone, producing parallel confidence and descriptor maps. Key technical components include:
- Joint training on classification (vessel vs background) and bidirectional quadruplet descriptor losses, supervised via annotated cross-modal keypoint pairs.
- Homography model for geometric transformation, robustly estimated via RANSAC (a sketch of this step follows below).
- Achieves 100.0% mean- and max-error success rates under strict thresholds (SR_ME, SR_MAE) on public multimodal datasets, significantly outperforming SuperPoint, GLAMPoints, and classic SIFT-type methods (Sindel et al., 2022).
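A minimal sketch of the matching-plus-RANSAC-homography step, using OpenCV (the descriptor source, ratio-test threshold, and function name are assumptions; networks such as RetinaCraquelureNet supply the keypoints and descriptors):

```python
import cv2
import numpy as np

def ransac_homography(kp_src, kp_tgt, desc_src, desc_tgt,
                      ratio=0.8, reproj_thresh=3.0):
    """Match descriptors, then robustly estimate a homography with RANSAC.

    kp_*: (N, 2) arrays of keypoint coordinates; desc_*: (N, D) float
    descriptors (e.g. sampled from a keypoint CNN's descriptor map).
    Returns (3x3 homography H, inlier mask) or (None, None) on failure.
    """
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(desc_src.astype(np.float32),
                             desc_tgt.astype(np.float32), k=2)
    # Lowe-style ratio test discards ambiguous matches.
    good = [p[0] for p in pairs if len(p) == 2
            and p[0].distance < ratio * p[1].distance]
    if len(good) < 4:  # findHomography needs at least 4 correspondences
        return None, None
    src = np.float32([kp_src[m.queryIdx] for m in good]).reshape(-1, 1, 2)
    tgt = np.float32([kp_tgt[m.trainIdx] for m in good]).reshape(-1, 1, 2)
    return cv2.findHomography(src, tgt, cv2.RANSAC, reproj_thresh)
```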
This vessel-structure-aware detection paradigm is further extended in KPVSA-Net, which integrates end-to-end keypoint detection/description and matching using a SuperGlue-style Graph Neural Network. Synthetic homography-based self-supervision, photometric/geometric augmentations, and explicit vessel-structure cues are critical for robustness and generalization across modalities and datasets (Sindel et al., 2022). KPVSA-Net consistently delivers top accuracy on synthetic and real datasets, e.g., mean Euclidean error 1.50 ± 0.36 px and Dice 0.659 ± 0.09 in CFP–FA registration.
4. Contrastive Representation Learning for CMFIR
Contrastive Multimodal Image Representation (CoMIR) reduces multimodal registration to a monomodal task by projecting images from each modality into a learned common space. The core elements are:
- Independent U-Net-based encoders per modality, producing spatially dense "CoMIR" representations from local image patches.
- Training by InfoNCE contrastive loss with a mean-squared-error critic, applied to aligned patch pairs, with all other patches serving as negatives.
- Rotation equivariance enforced by random 90°-step rotations at both input and representation levels, with no extra hyperparameters.
- At inference, standard monomodal registration algorithms (e.g., α-AMD for intensity-based, SIFT+RANSAC for feature-based) are applied to the CoMIRs.
Empirically, CoMIRs enable accurate registration even between modalities with minimal apparent correlation and outperform GAN translation methods and application-specific handcrafted approaches (Pielawski et al., 2020). A plausible implication is that contrastive learned representations may be preferred when no dense correspondences are directly accessible in the raw data.
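A minimal sketch of the contrastive objective under stated assumptions (PyTorch; pooling the dense CoMIRs into per-patch vectors and the temperature value are assumptions, not details from the paper):

```python
import torch
import torch.nn.functional as F

def infonce_mse_critic(z_a, z_b, tau=0.1):
    """InfoNCE loss with a (negative) squared-distance critic.

    z_a, z_b: (B, D) pooled representations of B aligned patch pairs from
    the two modalities. Pair (i, i) is the positive; all (i, j), j != i,
    act as negatives. tau is a temperature (assumed value).
    """
    d2 = torch.cdist(z_a, z_b, p=2) ** 2          # (B, B) squared distances
    logits = -d2 / tau                             # MSE-style critic scores
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, labels)
```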
5. Cross-Modal FoV Disparity: The CARe Approach
The CARe ("Crop and Alignment for cross-modal fundus image Registration") framework explicitly addresses cases with large FoV disparity (e.g., OCTA vs. wide-field CFP) that defeat naive application of standard CMFIR techniques (Li et al., 14 Dec 2025). The pipeline consists of:
- Unified vessel segmentation using a U-Net trained on multiple fundus modalities to produce anatomically consistent vessel maps.
- Keypoint detection and description via SuperRetina, trained with an asymmetric morphological "opening" to minimize OCTA–CFP domain gaps, matched using brute-force nearest neighbor search on descriptor vectors.
- Physiology-driven "Crop" operation: RetinaNet first localizes the macula and optic disc (OD) in the wide-field target; a square around the macula with side $2r$, where $r$ is the distance from the macula to the OD edge, is extracted to spatially match the source FoV.
- A double-fitting "RAN-Poly" alignment: RANSAC homography to reject outliers, followed by least-squares fitting of degree-2 bivariate polynomials for each coordinate, accommodating nonlinear mapping from the spherical retinal surface to planar images.
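The following sketch illustrates the Crop and RAN-Poly steps under stated assumptions (OpenCV/NumPy; macula and OD-edge coordinates are taken as given, e.g. from a detector, and both function names are hypothetical):

```python
import cv2
import numpy as np

def crop_around_macula(img, macula_xy, od_edge_xy):
    """Physiology-driven Crop: square of side 2r centered on the macula,
    where r is the macula-to-OD-edge distance."""
    dx, dy = np.subtract(macula_xy, od_edge_xy)
    r = int(np.hypot(dx, dy))
    x, y = map(int, macula_xy)
    return img[max(y - r, 0):y + r, max(x - r, 0):x + r]

def ran_poly_fit(src_pts, tgt_pts, reproj_thresh=3.0):
    """Double fitting: RANSAC homography rejects outliers, then a degree-2
    bivariate polynomial is least-squares fit to the inliers per coordinate.

    Returns (6, 2) coefficients C so that a source point (x, y) maps to
    [1, x, y, x*x, x*y, y*y] @ C in the target image.
    """
    _, mask = cv2.findHomography(
        src_pts.reshape(-1, 1, 2).astype(np.float32),
        tgt_pts.reshape(-1, 1, 2).astype(np.float32),
        cv2.RANSAC, reproj_thresh)
    inliers = mask.ravel().astype(bool)
    p, q = src_pts[inliers], tgt_pts[inliers]
    x, y = p[:, 0], p[:, 1]
    basis = np.stack([np.ones_like(x), x, y, x * x, x * y, y * y], axis=1)
    coeffs, *_ = np.linalg.lstsq(basis, q, rcond=None)
    return coeffs
```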
Experimental results on a dataset of 60 OCTA–wfCFP pairs ("OCTA60") show state-of-the-art performance: 0.0% failure, 1.7% inaccurate, 98.3% acceptable registrations, with AUC 0.920 (cumulative error curve) and Dice score 0.295. Ablation confirms the necessity of the Crop operation and the double-fitting: without Crop, acceptable registration drops to 1.7%; omitting polynomial refinement or RANSAC degrades both accuracy and robustness. Degree-2 polynomials offer the optimal bias-variance trade-off in this context.
Key limitations are dependency on precise macula/OD localization, suboptimal keypoint density in noise/pathology, and polynomial mapping's inability to fully account for highly local distortions. The method is compatible with standard OpenCV and PyTorch pipelines (Li et al., 14 Dec 2025).
6. Comparative Overview of Methods
| Approach | Key Principle | Robustness to Large FoV Disparity | Nonlinear Deformation Handling |
|---|---|---|---|
| Landmark Hybrid | Domain-specific keypoints + VE-NCC (Hervella et al., 2018) | Limited | Yes (FFD stage) |
| RetinaCraquelureNet | DNN keypoints & descriptors (Sindel et al., 2022) | Limited | No (homography only) |
| KPVSA-Net | DNN detection + SuperGlue matcher (Sindel et al., 2022) | Limited | No (homography only) |
| CoMIR | Contrastive rep. learning, monomodal registration (Pielawski et al., 2020) | Not directly tested | Depends on downstream method |
| CARe | Physiology-aligned Crop + RAN-Poly (Li et al., 14 Dec 2025) | Explicitly addressed | Partial (global polynomial) |
CARe uniquely targets large cross-modal FoV discrepancies by cropping the target down to the source's physiological region. In landmark-based and DNN-feature approaches, performance degrades when the FoV overlap is poor or when nonlinearities exceed the transformation model.
7. Outstanding Challenges and Future Directions
Unresolved issues across the field include:
- Accurate detection of physiological landmarks (macula, OD) is a limiting factor in physiologically constrained cropping.
- Vessel-structure-based keypoint detection remains susceptible to noise, imaging artifacts, and pathology, potentially causing sparse or unreliable matches.
- Polynomial and homography-based models do not capture fine, highly localized distortions resulting from acquisition, motion, or variable tissue compliance.
- End-to-end frameworks for nonrigid alignment, such as incorporating thin-plate spline or piecewise affine fields, remain an open area for investigation (Sindel et al., 2022, Sindel et al., 2022); see the sketch after this list.
- Domain adaptation and self-supervision (e.g., CycleGAN or contrastive learning) help address domain gaps but may propagate artifacts or style-transfer errors.
- Real-time implementation on resource-constrained hardware remains challenging for DNN-based methods and polynomial fitting at scale (Li et al., 14 Dec 2025).
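As one direction for the nonrigid-alignment gap noted above, here is a minimal thin-plate-spline warp sketch (SciPy's `RBFInterpolator`; the control-point correspondences are synthetic for illustration, not data from any cited work):

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Synthetic control-point correspondences standing in for matched keypoints.
src_ctrl = np.array([[50, 50], [450, 60], [60, 440], [440, 450], [250, 250]], float)
tgt_ctrl = src_ctrl + np.array([[3, -2], [-4, 1], [2, 5], [-1, -3], [6, 2]], float)

# Thin-plate-spline mapping: globally smooth, exact at the control points.
tps = RBFInterpolator(src_ctrl, tgt_ctrl, kernel='thin_plate_spline')
query = np.array([[200.0, 300.0]])
mapped = tps(query)  # (1, 2) target-space location of the query point
```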
A plausible implication is that future research will focus on domain-robust feature learning, locally adaptive nonrigid registration fields, and integrated reliability estimation, with physiological priors and interpretable vessel-structure matching central to robust clinical deployment.