
CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition (2402.19231v2)

Published 29 Feb 2024 in cs.CV and cs.RO

Abstract: Over the past decade, most methods in visual place recognition (VPR) have used neural networks to produce feature representations. These networks typically produce a global representation of a place image using only this image itself and neglect the cross-image variations (e.g. viewpoint and illumination), which limits their robustness in challenging scenes. In this paper, we propose a robust global representation method with cross-image correlation awareness for VPR, named CricaVPR. Our method uses the attention mechanism to correlate multiple images within a batch. These images can be taken in the same place with different conditions or viewpoints, or even captured from different places. Therefore, our method can utilize the cross-image variations as a cue to guide the representation learning, which ensures more robust features are produced. To further facilitate the robustness, we propose a multi-scale convolution-enhanced adaptation method to adapt pre-trained visual foundation models to the VPR task, which introduces the multi-scale local information to further enhance the cross-image correlation-aware representation. Experimental results show that our method outperforms state-of-the-art methods by a large margin with significantly less training time. The code is released at https://github.com/Lu-Feng/CricaVPR.


Summary

  • The paper introduces a novel representation learning approach that integrates cross-image correlation to improve visual place recognition.
  • It employs a multi-scale convolution-enhanced adaptation to optimize pre-trained models for robust feature extraction.
  • Empirical results, including a 94.5% Recall@1 on Pitts30k with 512-dimensional features, highlight its superior efficiency and accuracy.

Enhancing Visual Place Recognition Through Cross-Image Correlation Awareness: A Deep Dive into CricaVPR

Introduction to CricaVPR

Visual Place Recognition (VPR) remains a challenging task in computer vision, and a pivotal one for applications such as augmented reality, robotics, and autonomous navigation. Traditional approaches generate a global representation of each image to identify its location; however, they often fail to cope with the complexities introduced by varying conditions, viewpoints, and perceptual aliasing. To mitigate these issues, this overview examines CricaVPR (Cross-image Correlation-aware Representation Learning for Visual Place Recognition), which builds robust global representations by leveraging cross-image correlation awareness.

Unveiling CricaVPR

CricaVPR pushes the boundaries of VPR by introducing a representation learning method that incorporates cross-image variations directly into the feature extraction process. It employs a self-attention mechanism to capture the correlation among multiple images within a batch, including images from the same location captured under different conditions or from varying viewpoints, as well as images from distinct locations. This methodology allows for the exploitation of cross-image variations as a guiding cue for representation learning, aiming to foster more robust and discriminative features.
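The core idea of correlating descriptors across a batch can be illustrated with a minimal sketch. This is a simplified, single-head self-attention over per-image global descriptors; the actual model uses learned projections and multi-head attention, so the function below is only a conceptual illustration, not the paper's implementation.

```python
import numpy as np

def cross_image_attention(descriptors):
    """Refine each image's global descriptor by attending to the other
    images in the batch (scaled dot-product self-attention, no learned
    weights -- a hedged sketch of the cross-image correlation idea)."""
    d = descriptors.shape[-1]
    scores = descriptors @ descriptors.T / np.sqrt(d)  # (B, B) pairwise correlation
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the batch
    return weights @ descriptors                       # each descriptor mixes in batch context

rng = np.random.default_rng(0)
batch = rng.standard_normal((8, 512))   # 8 images, 512-dim global descriptors
refined = cross_image_attention(batch)
print(refined.shape)  # (8, 512)
```

Because the attention weights are computed over the whole batch, images of the same place under different conditions can reinforce each other's descriptors, while images of different places provide contrastive context.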

Multi-Scale Convolution-Enhanced Adaptation

A standout innovation within CricaVPR is its multi-scale convolution-enhanced adaptation technique designed to tailor pre-trained visual foundation models specifically for the VPR task. By integrating multi-scale local information, this method significantly enhances cross-image correlation-aware representation, proving especially advantageous over existing practices that fail to fully adapt pre-trained models for the nuanced needs of VPR.
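The adaptation idea can be sketched as a lightweight bottleneck module with parallel convolutions at several kernel sizes, added beside the frozen transformer blocks. The hidden width, kernel sizes, token-grid size, and the plain residual sum below are illustrative assumptions, not the released CricaVPR configuration.

```python
import torch
import torch.nn as nn

class MultiScaleConvAdapter(nn.Module):
    """Bottleneck adapter with parallel multi-scale depth-wise convolutions,
    a hedged sketch of convolution-enhanced adaptation for a frozen ViT."""
    def __init__(self, dim=768, hidden=96, grid=16):
        super().__init__()
        self.grid = grid                    # ViT patch-token grid (e.g. 16x16)
        self.down = nn.Linear(dim, hidden)  # project tokens to a narrow width
        self.convs = nn.ModuleList([        # parallel convs at several scales
            nn.Conv2d(hidden, hidden, k, padding=k // 2, groups=hidden)
            for k in (1, 3, 5)
        ])
        self.up = nn.Linear(hidden, dim)    # project back to token width

    def forward(self, tokens):              # tokens: (B, N, dim), N = grid*grid
        b, n, _ = tokens.shape
        x = torch.relu(self.down(tokens))
        x = x.transpose(1, 2).reshape(b, -1, self.grid, self.grid)
        x = sum(conv(x) for conv in self.convs)   # fuse multi-scale local info
        x = x.flatten(2).transpose(1, 2)
        return tokens + self.up(x)          # residual beside the frozen block

tokens = torch.randn(2, 256, 768)           # 2 images, 16x16 patch tokens
out = MultiScaleConvAdapter()(tokens)
print(out.shape)  # torch.Size([2, 256, 768])
```

Only the adapter parameters are trained, which is what keeps adaptation of a large foundation model cheap while still injecting the local spatial cues that plain token-level adapters lack.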

Performance Benchmarks

Empirical results establish a clear advantage for CricaVPR over state-of-the-art methods across a range of challenging datasets. Notably, it achieves 94.5% Recall@1 on the Pitts30k dataset using only 512-dimensional compact global features, underscoring the method's efficiency and its ability to significantly reduce training time without compromising performance.
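For context, Recall@1 in VPR measures the fraction of queries whose single nearest database descriptor is a correct match. A minimal sketch of that metric, with toy data standing in for real query/database features:

```python
import numpy as np

def recall_at_1(query_feats, db_feats, ground_truth):
    """Recall@1: fraction of queries whose nearest database descriptor
    (cosine similarity on L2-normalised features) is a correct match.
    The toy data below is illustrative, not a real benchmark."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    top1 = (q @ db.T).argmax(axis=1)  # index of each query's nearest database image
    hits = [top1[i] in gt for i, gt in enumerate(ground_truth)]
    return float(np.mean(hits))

rng = np.random.default_rng(0)
db = rng.standard_normal((100, 512))                        # 100 database images, 512-dim
queries = db[:10] + 0.05 * rng.standard_normal((10, 512))   # perturbed copies of 10 of them
gt = [[i] for i in range(10)]                               # query i matches database image i
print(recall_at_1(queries, db, gt))  # 1.0
```

In a real evaluation, ground truth is typically defined by a geographic distance threshold around each query rather than an explicit index list.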

Implications and Future Directions

The introduction of CricaVPR not only marks a significant advancement in tackling VPR's inherent challenges but also opens avenues for future research. The utilization of cross-image correlation for feature enhancement has demonstrated potential far beyond the initial scope, suggesting its applicability across various tasks within computer vision where condition invariance and robustness against perceptual aliasing are crucial. Moreover, the multi-scale convolution-enhanced adaptation technique presents a novel approach for leveraging pre-trained models, encouraging further exploration into parameter-efficient transfer learning for domain-specific tasks.

Concluding Thoughts

In summary, CricaVPR represents a significant stride toward solving the intricate puzzle of Visual Place Recognition by adeptly addressing the critical challenges of condition variations, viewpoint changes, and perceptual aliasing. Through its innovative use of cross-image correlation and a multi-scale adaptation method, CricaVPR not only sets new benchmarks in VPR performance but also paves the way for future innovations in this dynamic field of research.