Residual Learning for Image Point Descriptors (2312.15471v1)
Abstract: Local image feature descriptors have had a tremendous impact on the development and application of computer vision methods. It is therefore unsurprising that significant efforts are being made to develop learning-based image point descriptors. However, the advantage of learned methods over handcrafted methods in real applications is subtle and more nuanced than expected. Moreover, handcrafted descriptors such as SIFT and SURF still localize points better in Structure-from-Motion (SfM) than many learned counterparts. In this paper, we propose a very simple and effective approach to learning local image descriptors by using a handcrafted detector and descriptor. Specifically, we choose to learn only the descriptors, supported by handcrafted descriptors, while discarding the point localization head. We optimize the final descriptor by leveraging the knowledge already present in the handcrafted descriptor. This optimization strategy lets us avoid relearning knowledge already present in non-differentiable functions such as handcrafted descriptors, and learn only the residual knowledge in the main network branch. This yields 50× faster convergence than the standard baseline architecture of SuperPoint, while at inference the combined descriptor outperforms both the learned and the handcrafted descriptors. This comes at a minor increase in computation over the baseline learned descriptor. Our approach has potential applications in ensemble learning and learning with non-differentiable functions. We perform experiments in matching, camera localization and Structure-from-Motion to showcase the advantages of our approach.
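The idea of pairing a fixed handcrafted descriptor with a learned residual can be illustrated with a minimal sketch. The abstract does not state how the two descriptors are combined, so the concatenation of L2-normalized parts below is an assumption, as are the function names, the `weight` parameter, the descriptor dimensions (128 and 64), and the random arrays standing in for SIFT-like descriptors and for the output of the learned branch. Mutual nearest-neighbor matching is used here only as a simple way to exercise the combined descriptor; it is not claimed to be the paper's evaluation protocol.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def combine_descriptors(handcrafted, residual, weight=1.0):
    """Build one descriptor per keypoint from two sources.

    Assumption: the parts are L2-normalized, concatenated, and normalized
    again. `weight` (hypothetical) scales the learned residual part.
    """
    parts = [l2_normalize(handcrafted), weight * l2_normalize(residual)]
    return l2_normalize(np.concatenate(parts, axis=1))


def mutual_nearest_neighbors(desc_a, desc_b):
    """Return (i, j) index pairs that are each other's nearest neighbor."""
    sim = desc_a @ desc_b.T                  # cosine similarity for unit rows
    nn_ab = sim.argmax(axis=1)               # best match in B for each row of A
    nn_ba = sim.argmax(axis=0)               # best match in A for each row of B
    idx_a = np.arange(desc_a.shape[0])
    keep = nn_ba[nn_ab] == idx_a             # keep only mutual agreements
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)


# Synthetic stand-ins: image B holds the same keypoints as image A,
# shuffled and slightly perturbed.
rng = np.random.default_rng(0)
n = 5
hand_a = rng.normal(size=(n, 128))           # placeholder handcrafted descriptors
res_a = rng.normal(size=(n, 64))             # placeholder learned residuals
perm = rng.permutation(n)
hand_b = hand_a[perm] + 0.01 * rng.normal(size=(n, 128))
res_b = res_a[perm] + 0.01 * rng.normal(size=(n, 64))

combined_a = combine_descriptors(hand_a, res_a)
combined_b = combine_descriptors(hand_b, res_b)
matches = mutual_nearest_neighbors(combined_a, combined_b)
```

In a training setup along these lines, the handcrafted part would be treated as a constant input, so only the residual branch would receive gradients; that is the sense in which the non-differentiable function needs no relearning.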