PRAM: Place Recognition Anywhere Model for Efficient Visual Localization (2404.07785v1)
Abstract: Humans localize themselves efficiently in known environments by first recognizing landmarks defined on certain objects and their spatial relationships, and then verifying the location by aligning the detailed structures of the recognized objects with those in memory. Inspired by this, we propose the place recognition anywhere model (PRAM), which performs visual localization as efficiently as humans do. PRAM consists of two main components: recognition and registration. First, a self-supervised, map-centric landmark definition strategy makes places in both indoor and outdoor scenes act as unique landmarks. Then, sparse keypoints extracted from images are used as input to a transformer-based deep neural network for landmark recognition; these keypoints enable PRAM to recognize hundreds of landmarks with high time and memory efficiency. The keypoints, together with their recognized landmark labels, are further used for registration between query images and the 3D landmark map. Unlike previous hierarchical methods, PRAM discards global and local descriptors, reducing storage by over 90%. Because PRAM replaces global reference search with recognition and exhaustive matching with landmark-wise verification, it runs 2.4 times faster than prior state-of-the-art approaches. Moreover, PRAM opens new directions for visual localization, including multi-modality localization, map-centric feature learning, and hierarchical scene coordinate regression.
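The two-stage pipeline described above (recognize landmarks from sparse keypoints, then register only against the recognized landmarks) can be sketched roughly as follows. This is an illustrative toy sketch, not the paper's implementation: `recognize_landmarks` stands in for the transformer-based recognizer via a nearest-center stub, and the function names, landmark centers, and map layout are all hypothetical.

```python
# Hypothetical sketch of PRAM's recognition-then-registration pipeline.
# All names and data here are illustrative, not from the paper's code.
import numpy as np

def recognize_landmarks(keypoints, centers):
    """Stub recognizer: assign each 2D keypoint the ID of the nearest
    landmark center (stands in for the transformer-based recognition)."""
    d = np.linalg.norm(keypoints[:, None, :] - centers[None, :, :], axis=-1)
    return d.argmin(axis=1)

def register(keypoints, labels, landmark_map):
    """Landmark-wise verification: build 2D-3D correspondences only from
    landmarks the recognizer fired on, avoiding exhaustive matching.
    A real system would then run a pose solver (e.g. PnP + RANSAC);
    here we just return the correspondence set."""
    corr = []
    for kp, lab in zip(keypoints, labels):
        pts3d = landmark_map.get(int(lab))
        if pts3d is not None:            # skip unrecognized landmarks
            corr.append((kp, pts3d[0]))  # simplistic 1-to-1 association
    return corr

# Toy example: two landmark centers and a tiny 3D map keyed by landmark ID.
centers = np.array([[10.0, 10.0], [100.0, 100.0]])
landmark_map = {0: np.array([[0.0, 0.0, 1.0]]),
                1: np.array([[5.0, 5.0, 2.0]])}
kps = np.array([[12.0, 9.0], [98.0, 103.0]])

labels = recognize_landmarks(kps, centers)
corr = register(kps, labels, landmark_map)
print(labels.tolist(), len(corr))  # → [0, 1] 2
```

The key efficiency claim in the abstract maps onto this structure: recognition replaces a global descriptor search over the whole database, and registration only touches the 3D points of the recognized landmarks rather than matching against everything.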