ETO: Efficient Transformer-based Local Feature Matching by Organizing Multiple Homography Hypotheses (2410.22733v4)
Abstract: We tackle the efficiency problem of learning local feature matching. Recent advances have given rise to purely CNN-based and transformer-based approaches, each augmented with deep learning techniques. CNN-based methods often excel in matching speed, while transformer-based methods tend to provide more accurate matches. We propose an efficient transformer-based network architecture for local feature matching. It is built on two techniques: constructing multiple homography hypotheses to approximate the continuous correspondence in the real world, and using uni-directional cross-attention to accelerate the refinement. On the YFCC100M dataset, our matching accuracy is competitive with LoFTR, a state-of-the-art transformer-based architecture, while inference is 4 times faster, even outperforming the CNN-based methods. Comprehensive evaluations on other open datasets such as MegaDepth, ScanNet, and HPatches demonstrate our method's efficacy, highlighting its potential to significantly enhance a wide array of downstream applications.
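To make the notion of a homography hypothesis concrete, the following is a minimal sketch, not the paper's implementation, of how a single 3x3 homography maps the pixels of a local patch from one image into the other. The function name `warp_points`, the matrix `H_patch`, and the sample coordinates are all hypothetical values chosen for illustration; the abstract does not specify how hypotheses are parameterized or selected.

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography H to an (N, 2) array of 2D points."""
    pts = np.asarray(pts, dtype=float)
    # Lift to homogeneous coordinates: (x, y) -> (x, y, 1).
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    # Divide by the last coordinate to return to Cartesian coordinates.
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical hypothesis for one patch: scale by 2, then translate by (10, 5).
H_patch = np.array([[2.0, 0.0, 10.0],
                    [0.0, 2.0, 5.0],
                    [0.0, 0.0, 1.0]])

# Three illustrative pixel positions inside the patch (assumed values).
patch_pixels = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
warped = warp_points(H_patch, patch_pixels)
print(warped)
```

Under this reading, one hypothesis per patch gives every pixel in that patch a correspondence at the cost of a single matrix product, which is one plausible way a set of local homographies could approximate a continuous correspondence field.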