
ETO: Efficient Transformer-based Local Feature Matching by Organizing Multiple Homography Hypotheses (2410.22733v4)

Published 30 Oct 2024 in cs.CV

Abstract: We tackle the efficiency problem of learned local feature matching. Recent advances have produced purely CNN-based and transformer-based approaches, each augmented with deep learning techniques. CNN-based methods often excel in matching speed, while transformer-based methods tend to provide more accurate matches. We propose an efficient transformer-based network architecture for local feature matching. It is built on two ideas: constructing multiple homography hypotheses to approximate the continuous correspondence in the real world, and using uni-directional cross-attention to accelerate the refinement. On the YFCC100M dataset, our matching accuracy is competitive with LoFTR, a state-of-the-art transformer-based architecture, while inference is 4 times as fast, even outperforming the CNN-based methods. Comprehensive evaluations on other open datasets such as MegaDepth, ScanNet, and HPatches demonstrate our method's efficacy, highlighting its potential to significantly enhance a wide array of downstream applications.
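
The two ingredients named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names `warp_points` and `cross_attention`, the example homographies, and the feature sizes are all assumptions made for illustration. The sketch also assumes that "uni-directional" means only the source features are updated by attending to the target features, with no update in the reverse direction.

```python
import numpy as np

def warp_points(H, pts):
    """Map N x 2 pixel coordinates through a 3 x 3 homography H."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coordinates
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]  # divide out the projective scale

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention; queries read from keys/values."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ values

# Hypothetical example: two patches, each with its own homography hypothesis,
# so the overall mapping is piecewise rather than one global transform.
H_translate = np.array([[1.0, 0.0, 5.0], [0.0, 1.0, -2.0], [0.0, 0.0, 1.0]])
H_scale = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]])
patch_a = np.array([[0.0, 0.0], [1.0, 1.0]])
patch_b = np.array([[10.0, 10.0], [12.0, 8.0]])
warped_a = warp_points(H_translate, patch_a)
warped_b = warp_points(H_scale, patch_b)

# Assumed reading of uni-directional cross-attention: source features are
# refined from target features, and the target features are left unchanged.
rng = np.random.default_rng(0)
feats_src = rng.normal(size=(4, 8))
feats_tgt = rng.normal(size=(6, 8))
updated_src = cross_attention(feats_src, feats_tgt, feats_tgt)
```

Skipping the reverse attention pass roughly halves the cross-attention cost per layer in this sketch, which is one plausible source of the speedup the abstract reports.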
