Are Semi-Dense Detector-Free Methods Good at Matching Local Features? (2402.08671v3)
Abstract: Semi-dense detector-free approaches (SDF), such as LoFTR, are currently among the most popular image matching methods. While SDF methods are trained to establish correspondences between two images, their performance is almost exclusively evaluated using relative pose estimation metrics. The link between their ability to establish correspondences and the quality of the resulting estimated pose has therefore received little attention. This paper is a first attempt to study this link. We start by proposing a novel structured attention-based image matching architecture (SAM). It allows us to show a counter-intuitive result on two datasets (MegaDepth and HPatches): on the one hand, SAM either outperforms or is on par with SDF methods in terms of pose/homography estimation metrics; on the other hand, SDF approaches are significantly better than SAM in terms of matching accuracy. We then propose to restrict the computation of matching accuracy to textured regions, and show that in this case SAM often surpasses SDF methods. Our findings highlight a strong correlation between the ability to establish accurate correspondences in textured regions and the accuracy of the resulting estimated pose/homography. Our code will be made available.
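The abstract's key measurement, matching accuracy restricted to textured regions, is straightforward to prototype. Below is a minimal sketch (not the paper's actual code): it assumes matches are given as pixel coordinates in the first image together with a precomputed ground-truth reprojection error per match (e.g. derived from MegaDepth depth maps and poses), and it uses an illustrative gradient-magnitude test as the texture criterion. The function names, window size, and thresholds are assumptions made for illustration.

```python
import cv2
import numpy as np

def texture_mask(gray, window=15, grad_thresh=10.0):
    # Assumed texturedness test: local mean gradient magnitude over a
    # window; pixels above grad_thresh are considered textured.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag = cv2.blur(np.sqrt(gx * gx + gy * gy), (window, window))
    return mag > grad_thresh

def textured_matching_accuracy(gray0, kpts0, reproj_err, px_thresh=3.0):
    # Fraction of matches with reprojection error below px_thresh,
    # counted only at match locations falling in textured areas of image 0.
    mask = texture_mask(gray0)
    xy = np.round(kpts0).astype(int)   # (N, 2) array of (x, y) pixel coords
    textured = mask[xy[:, 1], xy[:, 0]]
    if not textured.any():
        return float("nan")            # no textured matches to evaluate
    return float((reproj_err[textured] < px_thresh).mean())
```

A full evaluation would additionally clip match coordinates to the image bounds and could apply the texture test symmetrically in both images; this sketch only illustrates the restriction of the accuracy metric to textured locations.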
- Map-free visual relocalization: Metric pose relative to a single image. In ECCV.
- HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In CVPR.
- ASpanFormer: Detector-free image matching with adaptive span transformer. In ECCV.
- D2-Net: A trainable CNN for joint description and detection of local features. In CVPR.
- DKM: Dense kernelized feature matching for geometry estimation. In CVPR.
- S2DNet: Learning image features for accurate sparse-to-dense matching. In ECCV.
- Neural reprojection error: Merging feature learning and camera pose estimation. In CVPR.
- Visual correspondence hallucination. In ICLR.
- TopicFM: Robust and interpretable topic-assisted feature matching. In AAAI.
- Reconstructing the World* in Six Days *(as Captured by the Yahoo 100 Million Image Dataset). In CVPR.
- Perceiver IO: A general architecture for structured inputs & outputs. In ICLR.
- Perceiver: General perception with iterative attention. In ICML.
- Image Matching across Wide Baselines: From Paper to Practice. IJCV.
- Transformers are RNNs: Fast autoregressive transformers with linear attention. In ICML.
- Dual-resolution correspondence networks. NeurIPS.
- MegaDepth: Learning single-view depth prediction from internet photos. In CVPR.
- Feature pyramid networks for object detection. In CVPR.
- Object recognition from local scale-invariant features. In ICCV.
- 3DG-STFM: 3D geometric guided student-teacher feature matching. In ECCV.
- PATS: Patch area transportation with subdivision for local feature matching. In CVPR.
- LF-Net: Learning local features from images. NeurIPS.
- R2D2: Reliable and repeatable detector and descriptor. NeurIPS.
- Efficient neighbourhood consensus networks via submanifold sparse convolutions. In ECCV.
- Neighbourhood consensus networks. NeurIPS.
- NCNet: Neighbourhood consensus networks for estimating image correspondences. PAMI.
- SuperGlue: Learning feature matching with graph neural networks. In CVPR.
- LaMAR: Benchmarking Localization and Mapping for Augmented Reality. In ECCV.
- Are large-scale 3D models really necessary for accurate visual localization? In CVPR.
- Structure-from-motion revisited. In CVPR.
- BAD SLAM: Bundle adjusted direct RGB-D SLAM. In CVPR.
- Double window optimisation for constant time visual SLAM. In ICCV.
- LoFTR: Detector-free local feature matching with transformers. In CVPR.
- City-scale localization for cameras with known vertical direction. PAMI.
- InLoc: Indoor visual localization with dense matching and view synthesis. In CVPR.
- Quadtree attention for vision transformers. In ICLR.
- GLU-Net: Global-local universal network for dense flow and correspondences. In CVPR.
- Learning accurate dense correspondences and when to trust them. In CVPR.
- Attention is all you need. NeurIPS.
- MatchFormer: Interleaving attention in transformers for feature matching. In ACCV.
- Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV.
- SegFormer: Simple and efficient design for semantic segmentation with transformers. NeurIPS.
- LIFT: Learned invariant feature transform. In ECCV.
- Reference pose generation for long-term visual localization via learned features and view synthesis. IJCV.
- Patch2Pix: Epipolar-guided pixel-level correspondences. In CVPR.
- PMatch: Paired masked image modeling for dense geometric matching. In CVPR.