
Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed (2403.04765v2)

Published 7 Mar 2024 in cs.CV

Abstract: We present a novel method for efficiently producing semi-dense matches across images. The previous detector-free matcher LoFTR has shown remarkable matching capability under large viewpoint changes and in texture-poor scenarios, but suffers from low efficiency. We revisit its design choices and derive multiple improvements for both efficiency and accuracy. One key observation is that performing the transformer over the entire feature map is redundant due to shared local information; we therefore propose an aggregated attention mechanism with adaptive token selection for efficiency. Furthermore, we find that spatial variance in LoFTR's fine correlation module is adverse to matching accuracy, and propose a novel two-stage correlation layer that achieves accurate subpixel correspondences. Our efficiency-optimized model is $\sim 2.5\times$ faster than LoFTR and can even surpass the state-of-the-art efficient sparse matching pipeline SuperPoint + LightGlue. Moreover, extensive experiments show that our method achieves higher accuracy than competitive semi-dense matchers, with considerable efficiency benefits. This opens up exciting prospects for large-scale or latency-sensitive applications such as image retrieval and 3D reconstruction. Project page: https://zju3dv.github.io/efficientloftr.
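The second stage of the correlation layer described above refines a coarse match to subpixel precision. A common way to realize this (a minimal sketch, not the authors' exact implementation) is to take a small correlation patch centered on the coarse match and compute the softmax expectation over the 2D coordinate grid, which yields a continuous offset:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a flat array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def subpixel_refine(corr_patch):
    """Given an (h, w) local correlation patch centered on a coarse match,
    return a sub-pixel offset (dy, dx) as the softmax expectation of the
    coordinate grid. This is a generic expectation-based refinement, used
    here only to illustrate the idea of a second correlation stage."""
    h, w = corr_patch.shape
    p = softmax(corr_patch.ravel()).reshape(h, w)
    # Coordinate grids centered on the patch center (offset 0, 0).
    ys, xs = np.meshgrid(np.arange(h) - h // 2,
                         np.arange(w) - w // 2, indexing="ij")
    return float((p * ys).sum()), float((p * xs).sum())
```

A symmetric patch peaked at the center yields a zero offset, while a peak shifted toward a neighbor pulls the expectation fractionally in that direction; sharpening the correlation scores (e.g. by a temperature) trades smoothness for localization, which is why reducing spatial variance in the correlation map matters for accuracy.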

