SCENES: Subpixel Correspondence Estimation With Epipolar Supervision (2401.10886v1)
Abstract: Extracting point correspondences from two or more views of a scene is a fundamental computer vision problem with particular importance for relative camera pose estimation and structure-from-motion. Existing local feature matching approaches, trained with correspondence supervision on large-scale datasets, obtain highly accurate matches on the test sets. However, unlike classic feature extractors, they do not generalise well to new datasets with characteristics different from those they were trained on. Instead, they require finetuning, which assumes that ground-truth correspondences or ground-truth camera poses and 3D structure are available. We relax this assumption by removing the requirement of 3D structure, e.g., depth maps or point clouds, and only require camera pose information, which can be obtained from odometry. We do so by replacing correspondence losses with epipolar losses, which encourage putative matches to lie on the associated epipolar line. While weaker than correspondence supervision, we observe that this cue is sufficient for finetuning existing models on new data. We then further relax the assumption of known camera poses by using pose estimates in a novel bootstrapping approach. We evaluate on highly challenging datasets, including an indoor drone dataset and an outdoor smartphone camera dataset, and obtain state-of-the-art results without strong supervision.
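The abstract describes replacing correspondence losses with an epipolar loss that penalises putative matches for lying off the associated epipolar line. Below is a minimal sketch of one standard instance of such a loss, the Sampson distance to the epipolar constraint, assuming the relative pose (R, t) and intrinsics K1, K2 are known; the function names and signatures are illustrative, not the paper's released code, and the paper may use a different epipolar residual.

```python
import torch

def skew(t):
    """3x3 skew-symmetric matrix [t]_x such that [t]_x @ v = cross(t, v)."""
    tx, ty, tz = t.unbind()
    zero = torch.zeros((), dtype=t.dtype, device=t.device)
    return torch.stack([
        torch.stack([zero,  -tz,   ty]),
        torch.stack([  tz, zero,  -tx]),
        torch.stack([ -ty,   tx, zero]),
    ])

def fundamental_from_pose(K1, K2, R, t):
    """F = K2^{-T} [t]_x R K1^{-1}: maps a point in image 1 to its epipolar line in image 2."""
    E = skew(t) @ R                                    # essential matrix
    return torch.linalg.inv(K2).transpose(-1, -2) @ E @ torch.linalg.inv(K1)

def epipolar_loss(x1, x2, F, eps=1e-12):
    """Mean Sampson distance of putative matches to the epipolar constraint x2^T F x1 = 0.

    x1, x2: (N, 2) pixel coordinates of matched points in images 1 and 2.
    """
    ones = torch.ones_like(x1[:, :1])
    x1h = torch.cat([x1, ones], dim=1)                 # (N, 3) homogeneous points
    x2h = torch.cat([x2, ones], dim=1)

    Fx1 = x1h @ F.T                                    # epipolar lines in image 2
    Ftx2 = x2h @ F                                     # epipolar lines in image 1
    x2Fx1 = (x2h * Fx1).sum(dim=1)                     # algebraic residual x2^T F x1

    denom = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return (x2Fx1**2 / denom.clamp_min(eps)).mean()
```

In the paper's fully relaxed setting, (R, t) would come from pose estimates used in the bootstrapping procedure rather than from ground-truth poses.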
Authors: Dominik A. Kloepfer, João F. Henriques, Dylan Campbell