Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition (2402.14505v3)

Published 22 Feb 2024 in cs.CV and cs.AI

Abstract: Recent studies show that vision models pre-trained on generic visual learning tasks with large-scale data can provide useful feature representations for a wide range of visual perception problems. However, few attempts have been made to exploit pre-trained foundation models in visual place recognition (VPR). Due to the inherent difference in training objectives and data between model pre-training and VPR, how to bridge the gap and fully unleash the capability of pre-trained models for VPR remains a key issue. To this end, we propose a novel method to realize seamless adaptation of pre-trained models for VPR. Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method that achieves both global and local adaptation efficiently, in which only lightweight adapters are tuned without adjusting the pre-trained model. In addition, to guide effective adaptation, we propose a mutual nearest neighbor local feature loss, which ensures proper dense local features are produced for local matching and avoids time-consuming spatial verification in re-ranking. Experimental results show that our method outperforms state-of-the-art methods with less training data and training time, and uses only about 3% of the retrieval runtime of two-stage VPR methods with RANSAC-based spatial verification. It ranks 1st on the MSLS challenge leaderboard (at the time of submission). The code is released at https://github.com/Lu-Feng/SelaVPR.

Authors (6)
  1. Feng Lu
  2. Lijun Zhang
  3. Xiangyuan Lan
  4. Shuting Dong
  5. Yaowei Wang
  6. Chun Yuan

Summary

  • The paper introduces SelaVPR, a hybrid adaptation framework that adds lightweight global and local adaptation modules to bridge the gap between pre-trained foundation models and the VPR task.
  • It proposes a novel mutual nearest neighbor local feature loss that refines dense local feature extraction and improves matching efficiency.
  • The method outperforms state-of-the-art approaches on benchmarks like Tokyo24/7 and Pitts30k while needing significantly less training data and computational resources.

Seamless Adaptation of Pre-trained Foundation Models for Visual Place Recognition

The paper "Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition" addresses a significant challenge in leveraging pre-trained vision models for Visual Place Recognition (VPR). The primary focus is on bridging the gap between the pre-training tasks and VPR to harness the full potential of pre-trained models. This paper introduces a novel method, named SelaVPR, aimed at efficiently adapting foundation models like DINOv2 for the VPR task.

Key Contributions

  1. Hybrid Adaptation Framework:
    • The authors propose a hybrid adaptation mechanism that consists of global and local adaptation modules. The method employs lightweight adapters that facilitate global and local feature extraction without modifying the parameters of the pre-trained model. This hybrid adaptation efficiently utilizes the robust features of foundation models to enhance place recognition capabilities, focusing on salient landmarks crucial for identifying places.
  2. Mutual Nearest Neighbor Local Feature Loss:
    • A new loss function, the mutual nearest neighbor local feature loss, is introduced to guide the local adaptation module. It ensures that adaptation yields effective dense local features for local matching; by avoiding time-intensive spatial verification such as RANSAC, the method significantly reduces retrieval runtime (a minimal sketch of this loss follows this list).
  3. Performance and Efficiency:
    • The proposed SelaVPR method demonstrates superior performance on various VPR benchmarks, including Tokyo24/7, MSLS, and Pitts30k, outperforming several state-of-the-art methods. Notably, it achieves these results with substantially less training data and compute: for example, it requires only about 3% of the retrieval runtime of traditional two-stage VPR methods that rely on RANSAC-based geometric verification.
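
To make the loss concrete, here is a minimal, illustrative sketch of a mutual nearest neighbor local feature loss in PyTorch. It assumes L2-normalized dense patch descriptors from a query image and a matching positive image; the function name and the exact form (pulling mutual pairs' cosine similarity toward 1) are illustrative assumptions, not the paper's verbatim formulation.

```python
import torch
import torch.nn.functional as F

def mutual_nn_loss(feat_q: torch.Tensor, feat_p: torch.Tensor) -> torch.Tensor:
    """Illustrative mutual nearest neighbor local feature loss.

    feat_q: (Nq, D) dense local descriptors from the query image.
    feat_p: (Np, D) dense local descriptors from a positive (same-place) image.
    Pairs that are each other's nearest neighbor across the two sets are
    pulled toward cosine similarity 1; unmatched features are ignored.
    """
    q = F.normalize(feat_q, dim=-1)
    p = F.normalize(feat_p, dim=-1)
    sim = q @ p.t()                         # (Nq, Np) cosine similarities
    nn_q = sim.argmax(dim=1)                # best p-index for each q feature
    nn_p = sim.argmax(dim=0)                # best q-index for each p feature
    idx = torch.arange(sim.size(0), device=sim.device)
    mutual = nn_p[nn_q] == idx              # i -> j and j -> i: a mutual pair
    if not mutual.any():                    # degenerate case: no mutual pairs
        return sim.new_zeros(())
    matched = sim[idx[mutual], nn_q[mutual]]
    return (1.0 - matched).mean()           # drive mutual pairs toward sim = 1
```

At retrieval time, the same mutual nearest neighbor matching can score candidate images (e.g. by the count or total similarity of mutual matches), which is consistent with how the paper sidesteps RANSAC-style spatial verification during re-ranking.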

Methodology

The hybrid adaptation method leverages the Vision Transformer (ViT)-based pre-trained foundation model, DINOv2.
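
As a rough illustration of the parameter-efficient setup, the sketch below loads a DINOv2 backbone via `torch.hub` and freezes it so that only added modules would be trained. The hub entry point is the public DINOv2 API, but the module names and overall wiring are assumptions, not the released SelaVPR code.

```python
import torch

# Load a ViT-L/14 DINOv2 backbone from the official hub (public API).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")

# Freeze every pre-trained parameter: the foundation model is never adjusted.
for p in backbone.parameters():
    p.requires_grad = False

# After wrapping transformer blocks with adapters (hypothetical names),
# only those lightweight modules would be marked trainable, e.g.:
# for name, p in model.named_parameters():
#     p.requires_grad = ("adapter" in name) or ("local_head" in name)
```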

  • Global Adaptation:
    • This involves introducing adapters within transformer blocks to adjust the global feature extraction process, ensuring that the output representation is finely tuned to the VPR task.
  • Local Adaptation:
    • The local adaptation employs up-convolutional layers to upsample the ViT feature maps, producing the dense local features needed for re-ranking in the two-stage VPR pipeline. Both adaptation components are sketched below.
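
Below is a minimal sketch of what these two pieces could look like, assuming a standard bottleneck adapter (down-project, nonlinearity, up-project, residual) and a pair of transposed convolutions that upsample the coarse ViT patch grid. Widths, kernel sizes, and where the adapters sit inside each transformer block are illustrative assumptions rather than the paper's exact design.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, GELU, up-project, residual.
    Only these small layers train; the frozen ViT weights stay untouched."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):                    # x: (B, tokens, dim)
        return x + self.up(self.act(self.down(x)))

class LocalHead(nn.Module):
    """Up-convolutional head: upsamples the coarse ViT patch-feature grid
    into a denser local feature map used for re-ranking."""
    def __init__(self, dim: int, out_dim: int = 128):
        super().__init__()
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(dim // 2, out_dim, kernel_size=2, stride=2),
        )

    def forward(self, patch_feats):           # (B, dim, H, W) patch grid
        return self.upsample(patch_feats)     # (B, out_dim, 4H, 4W)
```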

Implications and Future Directions

The research presents a well-structured solution for fully exploiting pre-trained foundation models in VPR, efficiently bridging the gap between generic pre-training and the downstream task. This holds substantial implications for improving VPR systems, particularly in dynamic environments with changing conditions and viewpoints. The approach's efficiency and effectiveness pave the way for real-world, large-scale VPR deployment and could extend to other domain-specific recognition tasks.

Future work could enhance the robustness of local feature adaptation and integrate more advanced fine-tuning strategies that minimize the impact of domain shift between pre-training and target tasks. Further study of foundation models' behavior under varied environmental conditions could also yield deeper insights and broader applications across diverse VPR scenarios.