
XoFTR: Cross-modal Feature Matching Transformer (2404.09692v1)

Published 15 Apr 2024 in cs.CV

Abstract: We introduce XoFTR, a cross-modal cross-view method for local feature matching between thermal infrared (TIR) and visible images. Unlike visible images, TIR images are less susceptible to adverse lighting and weather conditions but are difficult to match due to significant texture and intensity differences. Current hand-crafted and learning-based methods for visible-TIR matching fall short in handling viewpoint, scale, and texture diversity. To address this, XoFTR incorporates masked image modeling pre-training and fine-tuning with pseudo-thermal image augmentation to handle the modality differences. Additionally, we introduce a refined matching pipeline that adjusts for scale discrepancies and enhances match reliability through sub-pixel level refinement. To validate our approach, we collect a comprehensive visible-thermal dataset and show that our method outperforms existing methods on many benchmarks.
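The abstract mentions a matching pipeline that improves reliability through sub-pixel level refinement. As an illustration only, not the authors' implementation, the sketch below shows one common way such refinement can be done in coarse-to-fine matchers: correlating a query descriptor against a local window of fine-level features and taking a soft-argmax expectation over the resulting heatmap. All function names, tensor shapes, and the window/temperature parameters are assumptions made for this example.

```python
# Minimal sketch (not the authors' code): sub-pixel match refinement via a
# soft-argmax over a local correlation window, in the spirit of detector-free
# coarse-to-fine matching pipelines. Names and shapes are illustrative.
import torch
import torch.nn.functional as F

def subpixel_refine(feat_a, feat_b, coarse_xy, window=5, temperature=0.1):
    """Refine integer coarse matches in image B to sub-pixel precision.

    feat_a, feat_b: (C, H, W) fine-level feature maps of the two images.
    coarse_xy:      (N, 2) integer (x, y) coarse match locations.
    Returns:        (N, 2) refined (x, y) coordinates in image B.
    """
    C, H, W = feat_b.shape
    r = window // 2
    refined = []
    for (x, y) in coarse_xy.tolist():
        # Query descriptor in image A (assumed to sit at the same coarse cell).
        q = feat_a[:, y, x]                              # (C,)
        # Local window around the coarse location in image B, clamped to borders.
        x0, x1 = max(x - r, 0), min(x + r + 1, W)
        y0, y1 = max(y - r, 0), min(y + r + 1, H)
        patch = feat_b[:, y0:y1, x0:x1]                  # (C, h, w)
        # Correlation heatmap, then a soft-argmax expectation over the window.
        corr = torch.einsum('c,chw->hw', q, patch) / temperature
        prob = F.softmax(corr.flatten(), dim=0).view_as(corr)
        ys = torch.arange(y0, y1, dtype=torch.float32)
        xs = torch.arange(x0, x1, dtype=torch.float32)
        y_ref = (prob.sum(dim=1) * ys).sum()             # expectation over rows
        x_ref = (prob.sum(dim=0) * xs).sum()             # expectation over cols
        refined.append(torch.stack([x_ref, y_ref]))
    return torch.stack(refined)

# Usage with random features and two coarse matches.
fa, fb = torch.randn(64, 60, 80), torch.randn(64, 60, 80)
coarse = torch.tensor([[10, 12], [40, 25]])
print(subpixel_refine(fa, fb, coarse))
```

The softmax temperature trades off sharpness of the expectation against robustness to noisy correlations; a learned refinement head, as used in modern matchers, would replace the raw dot-product correlation shown here.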
