Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition (2402.14505v3)
Abstract: Recent studies show that vision models pre-trained on generic visual learning tasks with large-scale data can provide useful feature representations for a wide range of visual perception problems. However, few attempts have been made to exploit pre-trained foundation models in visual place recognition (VPR). Because the training objectives and data of model pre-training differ inherently from those of VPR, bridging this gap and fully unleashing the capability of pre-trained models for VPR remains a key open issue. To this end, we propose a novel method to realize seamless adaptation of pre-trained models for VPR. Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method that achieves both global and local adaptation efficiently, in which only lightweight adapters are tuned and the pre-trained model itself is left unchanged. In addition, to guide effective adaptation, we propose a mutual nearest neighbor local feature loss, which ensures that proper dense local features are produced for local matching and avoids time-consuming spatial verification in re-ranking. Experimental results show that our method outperforms state-of-the-art methods with less training data and training time, and uses only about 3% of the retrieval runtime of two-stage VPR methods with RANSAC-based spatial verification. It ranks 1st on the MSLS challenge leaderboard (at the time of submission). The code is released at https://github.com/Lu-Feng/SelaVPR.
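To make the idea of local matching without spatial verification concrete, the following is a minimal sketch of mutual nearest neighbor matching between two sets of local descriptors, with the match count used as a re-ranking score. It is an illustration under stated assumptions, not the released SelaVPR code: the function names `mutual_nn_matches` and `rerank`, the use of cosine similarity, and the use of the raw match count as the score are all hypothetical choices made here.

```python
import numpy as np

def mutual_nn_matches(desc_a, desc_b):
    """Return index pairs (i, j) where a_i and b_j are each other's nearest neighbor.

    desc_a: (Na, D) array, desc_b: (Nb, D) array of local descriptors.
    Cosine similarity is an assumption of this sketch.
    """
    # Normalize rows so the dot product equals cosine similarity.
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    sim = a @ b.T                   # (Na, Nb) similarity matrix
    nn_ab = sim.argmax(axis=1)      # best match in b for each descriptor in a
    nn_ba = sim.argmax(axis=0)      # best match in a for each descriptor in b
    idx_a = np.arange(sim.shape[0])
    mutual = nn_ba[nn_ab] == idx_a  # keep pairs that agree in both directions
    return np.stack([idx_a[mutual], nn_ab[mutual]], axis=1)

def rerank(query_desc, candidate_descs):
    """Order candidate indices by number of mutual matches, most matches first."""
    scores = [len(mutual_nn_matches(query_desc, c)) for c in candidate_descs]
    return sorted(range(len(candidate_descs)), key=lambda k: scores[k], reverse=True)

if __name__ == "__main__":
    query = np.eye(4)                 # four orthogonal toy descriptors
    uninformative = np.ones((4, 4))   # every row identical
    print(rerank(query, [uninformative, query]))
```

In this toy case the candidate with distinct, matching descriptors is ranked ahead of the candidate whose descriptors are all identical. Because the score only requires one similarity matrix and two `argmax` passes, no geometric model fitting such as RANSAC is involved.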
- Feng Lu
- Lijun Zhang
- Xiangyuan Lan
- Shuting Dong
- Yaowei Wang
- Chun Yuan