XoFTR: Cross-modal Feature Matching Transformer

Updated 4 July 2026

XoFTR is a cross-modal, detector-free, transformer-based feature matcher designed for visible–thermal and SAR–optical registration.
It employs a coarse-to-fine matching pipeline with paired masked image modeling pre-training and pseudo-thermal augmentation to tackle severe modality gaps.
Sub-pixel refinement and many-to-many coarse assignment yield robust correspondences, achieving state-of-the-art performance on visible–thermal benchmarks.

XoFTR, short for Cross-modal Feature Matching Transformer, is a detector-free local feature matcher designed for cross-modal, cross-view correspondence estimation between visible and thermal infrared (TIR) images. It extends the LoFTR family with paired masked image modeling pre-training, pseudo-thermal augmentation, a redesigned coarse-to-fine matching pipeline, and sub-pixel refinement, with the explicit aim of handling viewpoint, scale, and texture variation under severe modality gaps. Subsequent work has treated XoFTR as an off-the-shelf baseline for SAR–optical remote-sensing registration, where it serves as a prominent test case for how far visible–thermal cross-modal supervision transfers to a substantially different sensing modality (Tuzcuoğlu et al., 2024).

1. Origin, scope, and problem setting

XoFTR was introduced for visible–TIR local feature matching in settings where both multimodality and multiview variation are present. The underlying problem is to recover correspondences between 2D locations in a visible image $I^A$ and a TIR image $I^B$, so that downstream geometric tasks such as relative pose estimation, homography estimation, structure-from-motion, and localization can be performed. The method is detector-free: rather than first detecting sparse keypoints and then describing them, it predicts semi-dense correspondences from deep features and transformer-based contextualization (Tuzcuoğlu et al., 2024).

The motivation is rooted in the mismatch between sensing physics and geometric variation. Visible imagery records reflected visible light, whereas TIR imagery records emitted thermal radiation in the $8\text{–}14\,\mu\mathrm{m}$ range. The resulting differences in texture, intensity statistics, field of view, and effective resolution make visible–TIR matching substantially harder than visible–visible matching. XoFTR was proposed in response to the limitations of both handcrafted descriptors and RGB-trained deep matchers, which had difficulty jointly handling non-linear modality differences and large viewpoint or scale changes.

Later remote-sensing studies repositioned XoFTR within a broader taxonomy of matchers. In SAR–optical registration, it is treated as a deep learning–based, detector-free, transformer-based cross-modal matcher and is typically evaluated without architectural changes, fine-tuning, or SAR-specific adaptation. This reuse is important because it isolates transfer: any observed success or failure can be attributed to the learned representation and inference protocol rather than to task-specific retraining (Zhang et al., 3 Feb 2025).

2. Architecture and matching pipeline

XoFTR follows a LoFTR-style coarse–fine transformer design, but its modifications are specifically targeted at cross-modal matching. At test time, the longer side of each image is resized to $640$ pixels. A ResNet-based CNN backbone extracts multi-scale features at $1/8$, $1/4$, and $1/2$ resolution, denoted $F^*_{1/8}$, $F^*_{1/4}$, and $F^*_{1/2}$ for each modality $* \in \{A, B\}$ (Tuzcuoğlu et al., 2024).

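The Coarse-Level Matching Module (CLMM) operates at $1/8$ scale. It applies a LoFTR-style transformer with linear self-attention and cross-attention to produce refined coarse features $\hat{F}^A_{1/8}$ and $\hat{F}^B_{1/8}$. The coarse similarity matrix is computed between every coarse location $i$ in $I^A$ and every coarse location $j$ in $I^B$; in the LoFTR family it takes the form of a temperature-scaled inner product,

$S(i, j) = \frac{1}{\tau} \left\langle \hat{F}^A_{1/8}(i), \hat{F}^B_{1/8}(j) \right\rangle$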

Row-wise and column-wise softmax probabilities are then used with an AdaMatcher-style many-to-one/one-to-many/one-to-one assignment. This is a decisive departure from the original LoFTR pipeline, because one-to-many coarse assignment is intended to absorb the scale and viewpoint discrepancies that arise when visible and thermal sensors differ substantially in field of view and resolution.
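
The coarse assignment described above can be illustrated with a minimal NumPy sketch. It is not the reference implementation: the function name `coarse_matches`, the temperature of 0.1, and the threshold of 0.3 are hypothetical choices made for illustration. A pair $(i, j)$ is kept when $j$ is the confident row-wise best for $i$ or $i$ is the confident column-wise best for $j$, which allows many-to-one and one-to-many matches.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coarse_matches(feat_a, feat_b, temperature=0.1, threshold=0.3):
    """feat_a: (N, C), feat_b: (M, C) coarse features.

    Returns sorted (i, j) index pairs. temperature and threshold are
    illustrative values, not the ones used by XoFTR.
    """
    sim = feat_a @ feat_b.T / temperature
    p_rows = softmax(sim, axis=1)  # for each i, a distribution over j
    p_cols = softmax(sim, axis=0)  # for each j, a distribution over i
    matches = set()
    for i, j in enumerate(p_rows.argmax(axis=1)):
        if p_rows[i, j] > threshold:
            matches.add((i, int(j)))
    for j, i in enumerate(p_cols.argmax(axis=0)):
        if p_cols[i, j] > threshold:
            matches.add((int(i), j))
    return sorted(matches)
```

With two near-identical locations in $I^A$ and a single counterpart in $I^B$, the row-wise rule keeps both pairs, whereas a strict mutual-nearest-neighbour rule would keep at most one.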

The Fine-Level Matching Module (FLMM) refines these coarse correspondences to $1/2$ scale. It fuses coarse transformer features with backbone features, extracts local windows of three different sizes around each coarse match, and processes them hierarchically with self-attention and cross-attention. Fine matching is then performed within local windows using dual-softmax probabilities, together with a second confidence threshold, yielding one-to-one fine correspondences.
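
A dual-softmax match inside one pair of local windows can be sketched as follows. The sketch assumes the LoFTR-style definition, in which the match probability is the product of the row-wise and column-wise softmax, and enforces one-to-one matches with a mutual-nearest-neighbour test. The function names, window contents, temperature, and threshold are hypothetical.

```python
import numpy as np

def dual_softmax(sim):
    """Product of row-wise and column-wise softmax of a similarity matrix."""
    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)
    return softmax(sim, 1) * softmax(sim, 0)

def fine_matches(window_a, window_b, threshold=0.1, temperature=0.1):
    """window_*: (P, C) features of the P positions in one local window.

    Returns one-to-one (i, j) pairs that are mutual nearest neighbours
    and exceed the (illustrative) confidence threshold.
    """
    prob = dual_softmax(window_a @ window_b.T / temperature)
    mutual = (prob == prob.max(axis=1, keepdims=True)) & \
             (prob == prob.max(axis=0, keepdims=True))
    ii, jj = np.nonzero(mutual & (prob > threshold))
    return [(int(i), int(j)) for i, j in zip(ii, jj)]
```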

The final stage is the Sub-pixel Refinement Module (SPRM). For each fine match, XoFTR concatenates local features from both modalities and regresses coordinate offsets with an MLP followed by an output nonlinearity. These offsets are added to the discrete $1/2$-scale coordinates to produce sub-pixel correspondences. The overall design can be summarized as a cross-modal LoFTR derivative in which coarse many-to-many tolerance, multi-scale local refinement, and sub-pixel regression are all explicitly tied to modality-robust matching.
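
The offset regression can be sketched in a few lines of NumPy. Everything specific here is an assumption made for illustration: the two-layer MLP with ReLU, the `tanh` output scaled to half a cell, the weight shapes, and the function name `refine_subpixel`. Only the overall pattern (concatenate features, regress offsets, add them to the discrete coordinates) follows the description above.

```python
import numpy as np

def refine_subpixel(coords, feat_a, feat_b, w1, w2, scale=2):
    """coords: (K, 4) integer (xa, ya, xb, yb) at 1/2 scale.
    feat_a, feat_b: (K, C) local features of each fine match.
    w1: (2C, H) and w2: (H, 4) are hypothetical MLP weights.

    Returns (K, 4) sub-pixel coordinates in full-resolution pixels.
    """
    x = np.concatenate([feat_a, feat_b], axis=1)  # (K, 2C)
    hidden = np.maximum(x @ w1, 0.0)              # ReLU
    offsets = 0.5 * np.tanh(hidden @ w2)          # bounded to half a cell
    return (coords + offsets) * scale             # 1/2 scale -> full resolution
```

Bounding the offset keeps each refined point inside the cell chosen by the fine matcher, so the regression can only adjust a match, not replace it.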

3. Pre-training, augmentation, and optimization

A central feature of XoFTR is that its cross-modal capability is not attributed solely to supervised matching loss. The model first undergoes paired masked image modeling (MIM) on the KAIST Multispectral Pedestrian Detection dataset, which provides paired RGB–TIR images. During MIM, images are cropped to a fixed region at the top of the frame, a fixed fraction of the image is masked patch-wise, masked locations are replaced by learnable mask tokens at feature scales, and the decoder reconstructs the masked regions using an MSE objective (Tuzcuoğlu et al., 2024).
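
The masking and reconstruction objective can be illustrated with a short NumPy sketch. The patch size, masking ratio, and function names below are hypothetical, and the sketch works directly on pixels rather than on mask tokens at feature scales; it only shows how a patch-wise mask is drawn and how the MSE is restricted to masked pixels.

```python
import numpy as np

def random_patch_mask(height, width, patch, ratio, rng):
    """Boolean (height, width) mask with True on masked patches.

    Assumes height and width are multiples of patch.
    """
    gh, gw = height // patch, width // patch
    n_masked = int(round(ratio * gh * gw))
    flat = np.zeros(gh * gw, dtype=bool)
    flat[rng.choice(gh * gw, size=n_masked, replace=False)] = True
    grid = flat.reshape(gh, gw)
    # Expand each grid cell to a patch x patch block of pixels.
    return grid.repeat(patch, axis=0).repeat(patch, axis=1)

def masked_mse(pred, target, mask):
    """Mean squared error over masked pixels only."""
    return float(((pred - target) ** 2)[mask].mean())
```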

The second ingredient is pseudo-thermal augmentation during matching fine-tuning on MegaDepth. For each RGB pair, the augmentation is applied randomly to one image. The procedure consists of color jitter, grayscale conversion, a randomized cosine transform with three sampled parameters, and random Gaussian blur. The purpose is to generate thermal-like intensity transformations without relying on learned image translation, thereby exposing the matcher to broad modality-like variation.
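
The augmentation chain can be approximated as below. This is a sketch under stated assumptions, not the published recipe: the jitter ranges, the cosine remapping $0.5\,(1 + \cos(2\pi f x + \phi))$, the sampling ranges for $f$ and $\phi$, and the blur radius are all hypothetical stand-ins chosen to show the order of operations.

```python
import numpy as np

def pseudo_thermal(rgb, rng, freq_range=(0.4, 1.0), blur_sigma_max=1.5):
    """rgb: (H, W, 3) float array in [0, 1]. Returns (H, W) floats in [0, 1].

    All ranges are illustrative assumptions.
    """
    # Brightness/contrast jitter, a crude stand-in for color jitter.
    gain, bias = rng.uniform(0.8, 1.2), rng.uniform(-0.1, 0.1)
    img = np.clip(gain * rgb + bias, 0.0, 1.0)
    # Grayscale conversion with ITU-R BT.601 luma weights.
    gray = img @ np.array([0.299, 0.587, 0.114])
    # Randomized cosine remapping: non-linear and possibly intensity-inverting.
    freq = rng.uniform(*freq_range)
    phase = rng.uniform(0.0, 2.0 * np.pi)
    out = 0.5 * (1.0 + np.cos(2.0 * np.pi * freq * gray + phase))
    # Random separable Gaussian blur with a 5-tap kernel (zero padding).
    sigma = rng.uniform(0.1, blur_sigma_max)
    k = np.exp(-0.5 * (np.arange(-2, 3) / sigma) ** 2)
    k /= k.sum()
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, out)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)
    return out
```

Because the remapping is non-monotonic, bright visible regions can become dark and vice versa, which mimics the contrast reversals seen between visible and thermal imagery.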

Matching fine-tuning is supervised by three losses: a coarse-level focal loss $\mathcal{L}_c$, a fine-level focal loss $\mathcal{L}_f$, and a sub-pixel symmetric epipolar loss $\mathcal{L}_{sub}$. The total loss is a weighted sum,

$\mathcal{L} = \lambda_c \mathcal{L}_c + \lambda_f \mathcal{L}_f + \lambda_{sub} \mathcal{L}_{sub}$

with scalar weights $\lambda_c$, $\lambda_f$, and $\lambda_{sub}$. Coarse and fine supervision are generated from depth maps and camera poses, while the sub-pixel term uses the essential matrix. The reported training protocol uses Adam, with separate learning rates and batch sizes for MIM pre-training and for matching fine-tuning, and fixed confidence thresholds for coarse and fine matching.
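
The sub-pixel term can be made concrete with the standard symmetric epipolar distance. The sketch below assumes normalized image coordinates and an essential matrix $E$ satisfying $x_B^\top E\, x_A = 0$ for a perfect match; the function names and the unit default weights in `total_loss` are hypothetical.

```python
import numpy as np

def symmetric_epipolar_distance(pts_a, pts_b, E):
    """pts_a, pts_b: (K, 2) normalized image coordinates; E: (3, 3).

    Returns the (K,) squared symmetric epipolar distances.
    """
    ones = np.ones((len(pts_a), 1))
    xa = np.hstack([pts_a, ones])
    xb = np.hstack([pts_b, ones])
    line_b = xa @ E.T  # row k is E @ xa_k, the epipolar line in image B
    line_a = xb @ E    # row k is E.T @ xb_k, the epipolar line in image A
    residual = np.sum(xb * line_b, axis=1)  # xb^T E xa
    return residual ** 2 * (
        1.0 / (line_b[:, 0] ** 2 + line_b[:, 1] ** 2)
        + 1.0 / (line_a[:, 0] ** 2 + line_a[:, 1] ** 2)
    )

def total_loss(l_c, l_f, l_sub, lam_c=1.0, lam_f=1.0, lam_sub=1.0):
    """Weighted sum of the three terms; the default weights are placeholders."""
    return lam_c * l_c + lam_f * l_f + lam_sub * l_sub
```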

4. Performance on visible–thermal benchmarks

XoFTR was evaluated on the METU-VisTIR benchmark, a visible–thermal dataset collected with a DJI Mavic 3 Thermal drone, whose thermal and RGB cameras differ in resolution and field of view. The benchmark contains six outdoor scenes, with separate cloudy–cloudy and cloudy–sunny splits (Tuzcuoğlu et al., 2024).

Benchmark	Metric	XoFTR result
METU-VisTIR cloudy–cloudy	AUC@20°	55.06
METU-VisTIR cloudy–sunny	AUC@20°	45.03
LGHD + FusionDN homography benchmark	AUC@20px	48.15

On relative pose estimation, XoFTR substantially outperformed the listed baselines on both METU-VisTIR splits. In cloudy–cloudy, its AUC@20° is 55.06, compared with 19.17 for DKM and 14.11 for LoFTR. In cloudy–sunny, where illumination changes compound the modality gap, XoFTR reaches 45.03 AUC@20°, again well ahead of DKM at 23.60 and LoFTR at 16.36. On homography estimation over the combined LGHD LWIR/RGB and FusionDN RoadScene benchmark with synthetic homographies, XoFTR reaches 48.15 AUC@20px, exceeding LoFTR at 30.23 and ASpanFormer at 36.42 at that threshold.
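
For readers unfamiliar with the AUC@20° metric, the sketch below follows the convention widely used in LoFTR-family evaluations: the per-pair pose error is taken as the larger of the rotation and translation angular errors, and the AUC is the normalized area under the recall-versus-error curve up to the threshold. That the cited numbers use exactly this convention is an assumption, and the function name `pose_auc` is illustrative.

```python
import numpy as np

def pose_auc(errors, threshold):
    """Area under the cumulative recall curve up to threshold, in [0, 1].

    errors: per-pair pose errors in degrees.
    """
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = np.arange(1, len(errors) + 1) / len(errors)
    errors = np.concatenate([[0.0], errors])
    recall = np.concatenate([[0.0], recall])
    last = np.searchsorted(errors, threshold)
    # Truncate the curve at the threshold, holding recall flat on the last segment.
    e = np.concatenate([errors[:last], [threshold]])
    r = np.concatenate([recall[:last], [recall[last - 1]]])
    area = np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0)  # trapezoid rule
    return float(area / threshold)
```

A reported value such as 55.06 corresponds to this quantity multiplied by 100.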

The ablation study identifies the principal contributors to this performance. On the cloudy–sunny split, removing MIM pre-training lowers AUC@20° from 45.03 to 42.93. Removing pseudo-thermal augmentation causes a much larger drop, from 45.03 to 14.94. Replacing one-to-many coarse assignment with one-to-one reduces AUC@20° to 38.73, and reverting to LoFTR’s original coarse-to-fine design reduces it to 26.77. Removing sub-pixel refinement changes AUC@20° from 45.03 to 44.68, indicating a modest but consistent benefit, while removing the fine-level threshold or positional bias yields 43.56 and 43.36 respectively. Runtime on an A5000 is reported as approximately 116 ms, compared with 102 ms for LoFTR.

5. Transfer to SAR–optical satellite registration

The main cross-domain question raised in later work is whether a matcher trained for visible–thermal invariance also transfers to SAR–optical satellite registration. In a 24-family zero-shot evaluation on SpaceNet9 and two additional SAR–optical benchmarks, XoFTR is the only matcher in the sweep described as being explicitly trained for cross-modal visible–thermal matching. It is grouped with other cross-modal or multimodal-pretrained methods such as MINIMA-XoFTR, MINIMA-RoMa, and MatchAnything-ELoFTR (Corley et al., 11 Apr 2026).

Under the main SpaceNet9 protocol—affine geometry via OpenCV estimateAffine2D, 512 px tiles with 256 px overlap, maximum long side 1024 px, RANSAC threshold 3.0 px, minimum 4 inliers, and percentile normalization for XoFTR—its performance is among the best reported: 3.0 px mean tie-point error, 78.4% Success@5, 90.5% Success@10, and 0.00 failure rate. This mean error is tied with RoMa, which also attains 3.0 px but does so without cross-modal training. MatchAnything-ELoFTR follows at 3.4 px, while LoFTR records 5.1 px mean error. The resulting comparison is central to the paper’s notion of asymmetric transfer: explicit cross-modal supervision helps, but it does not uniformly dominate strong single-modality or synthetic cross-modal alternatives.
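
Parts of this protocol can be sketched with NumPy alone. The sketch below covers tile placement (512 px tiles with 256 px overlap), an affine fit, and tie-point error. It substitutes a plain least-squares fit for the RANSAC-based `estimateAffine2D` used in the study, so the 3.0 px threshold and the 4-inlier gate are not modelled. The function names, and the reading of Success@k as the fraction of image pairs whose mean tie-point error is below k pixels, are assumptions.

```python
import numpy as np

def tile_origins(length, tile=512, overlap=256):
    """Start offsets of overlapping tiles covering [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile + 1, tile - overlap))
    if starts[-1] != length - tile:
        starts.append(length - tile)  # last tile sits flush with the border
    return starts

def fit_affine(src, dst):
    """Least-squares 2x3 affine A with dst ~= src @ A[:, :2].T + A[:, 2]."""
    X = np.hstack([src, np.ones((len(src), 1))])
    A_t, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return A_t.T

def tie_point_errors(A, tie_src, tie_dst):
    """Euclidean error in pixels of each tie point under the affine A."""
    pred = tie_src @ A[:, :2].T + A[:, 2]
    return np.linalg.norm(pred - tie_dst, axis=1)

def success_at(mean_errors, k):
    """Fraction of image pairs with mean tie-point error below k pixels."""
    return float(np.mean(np.asarray(mean_errors) < k))
```

In a full pipeline, correspondences from all tiles would be shifted by their tile origins, pooled, filtered geometrically, and only then passed to the affine fit.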

Protocol sensitivity is a major part of the interpretation. The four-stage evaluation pipeline includes preprocessing, tiled correspondence extraction, geometric filtering, and transform-based error computation on tie points. Across ablations, geometry model, tile size, overlap, RANSAC threshold, and inlier gating can move accuracy substantially. In the broader sweep, switching to affine geometry alone reduces mean error, reinforcing that deployment choices can rival or exceed matcher-to-matcher differences. XoFTR is reported as comparatively stable under threshold sweeps and tiled affine fitting: it is less fragile than repurposed 3D-reconstruction matchers such as MASt3R and DUSt3R, yet still materially affected by protocol design. On SRIF and SARptical, XoFTR remains competitive but is not highlighted as the leading method.

6. Multi-resolution SAR–optical evidence, limitations, and research directions

A distinct picture emerges from the MultiResSAR benchmark, a dataset of 10,850 SAR–optical pairs spanning Sentinel-1, HT1-A, GF-3, and Umbra data, with resolutions from 10 m down to 0.16 m and scenes including urban, rural, plains, hills, mountains, and water. Here XoFTR is evaluated as an off-the-shelf deep matcher using public code and author-recommended parameters, again without SAR-specific redesign or fine-tuning (Zhang et al., 3 Feb 2025).

On the full benchmark, XoFTR is reported as the best deep learning method by two of the four aggregate metrics: Success Rate (SR) 40.58% and RMSE 3.03 pixels, together with NCM 244.26 and TM 0.032 s per pair. These numbers place it ahead of other deep baselines such as RoMa (SR 35.26%, RMSE 3.15) and XFeat (SR 36.29%, RMSE 4.94). At the same time, the traditional RIFT method is more robust in aggregate, with SR 66.51%, though at higher RMSE (3.58 px) and much greater runtime (5.283 s). The resulting contrast is precise: XoFTR is fast and geometrically accurate when successful, but its cross-scene robustness remains limited relative to the strongest handcrafted SAR-aware baseline.

The most severe limitation appears at sub-meter SAR resolution. MultiResSAR includes an Umbra–GE subset of 850 pairs at 0.16 m SAR resolution. The study states that on sub-meter image pairs, nearly all algorithms fail, and in the four illustrated groups only RoMa achieves correct registration. XoFTR is therefore part of the broader failure mode in ultra-high-resolution SAR, where fine speckle structure, layover, shadow, and strong 3D effects break the assumptions of 2D cross-modal matching.

These results argue against a common simplification that cross-modal pretraining alone should solve remote-sensing transfer. A more defensible reading is that XoFTR validates the utility of explicit cross-modal supervision—particularly through its improvement over LoFTR and its strong SpaceNet9 results—while also revealing that robustness on SAR–optical data depends on additional factors: sensing-physics mismatch, geometric verification, dataset coverage, and scale. The cited studies converge on several open directions: noise suppression, SAR-aware feature learning, 3D geometric fusion, cross-view transformation modeling, and domain-specific deep learning optimization such as fine-tuning on datasets like MultiResSAR or SpaceNet9. Within that landscape, XoFTR occupies a specific position: a strong cross-modal transformer baseline whose design transfers surprisingly well beyond visible–thermal matching, but whose limitations remain pronounced in the hardest SAR regimes (Zhang et al., 3 Feb 2025).