CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition (2402.19231v2)
Abstract: Over the past decade, most methods in visual place recognition (VPR) have used neural networks to produce feature representations. These networks typically produce a global representation of a place image using only the image itself and neglect cross-image variations (e.g., in viewpoint and illumination), which limits their robustness in challenging scenes. In this paper, we propose a robust global representation method with cross-image correlation awareness for VPR, named CricaVPR. Our method uses the attention mechanism to correlate multiple images within a batch. These images can be taken at the same place under different conditions or from different viewpoints, or even captured at different places. Our method can therefore use the cross-image variations as a cue to guide representation learning, which ensures that more robust features are produced. To further improve robustness, we propose a multi-scale convolution-enhanced adaptation method to adapt pre-trained visual foundation models to the VPR task, which introduces multi-scale local information to further enhance the cross-image correlation-aware representation. Experimental results show that our method outperforms state-of-the-art methods by a large margin with significantly less training time. The code is released at https://github.com/Lu-Feng/CricaVPR.
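The idea of correlating images within a batch through attention can be illustrated with a minimal sketch. This is not the released CricaVPR implementation. It assumes one descriptor vector per image, a single attention head, and a residual connection, none of which the abstract specifies. The function name `cross_image_attention` and the weight matrices `w_q`, `w_k`, `w_v` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_image_attention(feats, w_q, w_k, w_v):
    """Refine per-image descriptors by attending across the batch.

    feats: (B, D) array, one descriptor per image (assumed layout).
    w_q, w_k, w_v: (D, D) projection matrices (hypothetical parameters).
    Returns the refined (B, D) descriptors and the (B, B) attention map.
    """
    q = feats @ w_q
    k = feats @ w_k
    v = feats @ w_v
    # Entry (i, j) scores how strongly image i attends to image j.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    # Residual connection (an assumption): each descriptor keeps its own
    # content and adds a weighted mix of the other images in the batch.
    refined = feats + attn @ v
    return refined, attn

rng = np.random.default_rng(0)
B, D = 4, 8  # toy batch of 4 images with 8-dimensional descriptors
feats = rng.standard_normal((B, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
refined, attn = cross_image_attention(feats, w_q, w_k, w_v)
```

Each row of `attn` sums to 1, so every refined descriptor is its original vector plus a convex combination of the projected descriptors in the batch. That is the sense in which a descriptor becomes aware of the other images.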