PRAM: Place Recognition Anywhere Model for Efficient Visual Localization (2404.07785v1)

Published 11 Apr 2024 in cs.CV and cs.RO

Abstract: Humans localize themselves efficiently in known environments by first recognizing landmarks defined on certain objects and their spatial relationships, and then verifying the location by aligning detailed structures of recognized objects with those in the memory. Inspired by this, we propose the place recognition anywhere model (PRAM) to perform visual localization as efficiently as humans do. PRAM consists of two main components - recognition and registration. In detail, first of all, a self-supervised map-centric landmark definition strategy is adopted, making places in either indoor or outdoor scenes act as unique landmarks. Then, sparse keypoints extracted from images, are utilized as the input to a transformer-based deep neural network for landmark recognition; these keypoints enable PRAM to recognize hundreds of landmarks with high time and memory efficiency. Keypoints along with recognized landmark labels are further used for registration between query images and the 3D landmark map. Different from previous hierarchical methods, PRAM discards global and local descriptors, and reduces over 90% storage. Since PRAM utilizes recognition and landmark-wise verification to replace global reference search and exhaustive matching respectively, it runs 2.4 times faster than prior state-of-the-art approaches. Moreover, PRAM opens new directions for visual localization including multi-modality localization, map-centric feature learning, and hierarchical scene coordinate regression.


Summary

  • The paper introduces PRAM, a model that localizes query images through a two-stage landmark recognition and registration pipeline, running 2.4× faster than prior state-of-the-art hierarchical methods and cutting storage by over 90%.
  • The paper employs a self-supervised, map-centric approach that defines 3D landmarks on sparse keypoints, eliminating manual labeling and reducing redundant computations.
  • The paper validates PRAM's high accuracy and scalability across multiple datasets, showcasing its versatility in diverse indoor and outdoor environments.

PRAM: Transforming Visual Localization through the Place Recognition Anywhere Model

Introduction

Visual localization is pivotal to applications such as augmented/virtual reality (AR/VR), autonomous driving, and robotics. Established approaches fall into Absolute Pose Regression (APR), Scene Coordinate Regression (SCR), and Hierarchical Methods (HM), each with notable successes, yet all three trade time and memory efficiency against accuracy, especially in large-scale scenes. Drawing inspiration from how humans first recognize landmarks and then verify their location against them, the Place Recognition Anywhere Model (PRAM) introduces a new paradigm that achieves efficient and accurate visual localization across varied environments.

Landmark Recognition and Registration

PRAM rests on a two-stage approach: landmark recognition followed by registration. A map-centric strategy defines landmarks directly on 3D points rather than on semantic objects, so unique landmarks can be identified in both indoor and outdoor scenes without manual labeling; landmark generation is fully self-supervised. For recognition, PRAM feeds sparse keypoints extracted from the query image into a transformer-based neural network. Operating on sparse keypoints rather than dense pixels sharply reduces the time and memory footprint while retaining high recognition accuracy. Recognition narrows the query down to a coarse location, and landmark-wise verification then recovers a precise pose, running 2.4 times faster and requiring over 90% less storage than existing hierarchical approaches.
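
To make the self-supervised, map-centric landmark definition concrete, here is a minimal sketch of one way such landmarks could be derived from a sparse SfM reconstruction: cluster the 3D map points into regions that act as landmarks, then propagate each point's cluster id to the 2D keypoints that observe it. The clustering choice (k-means), the data layout, and the function names are illustrative assumptions rather than the authors' exact procedure.

```python
# Hypothetical sketch of self-supervised, map-centric landmark definition.
import numpy as np
from sklearn.cluster import KMeans


def define_landmarks(points3d, num_landmarks=512, seed=0):
    """Cluster sparse SfM points so each cluster acts as a self-supervised landmark.

    points3d: (P, 3) array of 3D map points; returns one landmark id per point.
    """
    km = KMeans(n_clusters=num_landmarks, n_init=10, random_state=seed)
    point_labels = km.fit_predict(points3d)     # (P,) landmark id for every 3D point
    return point_labels, km.cluster_centers_


def label_keypoints(observations, point_labels):
    """Propagate 3D landmark ids to the 2D keypoints that observe them.

    observations: iterable of (image_id, keypoint_index, point3d_index) triplets
    taken from the SfM reconstruction. Returns {image_id: {keypoint_index: landmark_id}}.
    """
    per_image = {}
    for image_id, kpt_idx, p3d_idx in observations:
        per_image.setdefault(image_id, {})[kpt_idx] = int(point_labels[p3d_idx])
    return per_image
```

Because the labels come from the map itself, the recognition network can be trained on these keypoint-to-landmark assignments without any manual annotation.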

Advantages and Contributions

PRAM's methodology introduces several advantages:

  • Efficiency in Large-Scale Scenes: By transforming global reference search into landmark recognition, PRAM demonstrates superior time and memory efficiency.
  • Reduction in Redundant Computations: The model strategically filters potential outliers and performs semantic-aware registration, significantly cutting down unnecessary computations (a minimal sketch of this landmark-wise step follows this list).
  • Flexibility and Extensibility: The framework accommodates multi-modality data, laying groundwork for advancements in visual localization like map-centric feature learning and sparse scene coordinate regression.
  • Significant Memory Savings: PRAM achieves substantial reductions in storage requirements by eliminating the need for storing extensive global and local descriptors.
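
The sketch below illustrates how recognition can replace global retrieval and how landmark-wise verification can replace exhaustive matching, which is where the efficiency and storage advantages above come from. It is a hedged reconstruction: the class and function names (LandmarkRecognizer, landmark_map, correspondences), the network dimensions, and the use of OpenCV's PnP with RANSAC are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of PRAM-style recognition-then-registration inference.
import numpy as np
import cv2
import torch
import torch.nn as nn


class LandmarkRecognizer(nn.Module):
    """Transformer encoder that assigns a landmark label to each sparse keypoint."""

    def __init__(self, desc_dim=128, num_landmarks=512, depth=4, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=desc_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(desc_dim, num_landmarks)

    def forward(self, kpt_feats):                    # kpt_feats: (B, N, desc_dim)
        x = self.encoder(kpt_feats)                  # keypoints attend to each other
        return self.head(x)                          # per-keypoint landmark logits (B, N, L)


def localize(kpts2d, kpt_feats, recognizer, landmark_map, K):
    """Coarse recognition, then landmark-wise 2D-3D registration.

    kpts2d:       (N, 2) keypoint locations in the query image (numpy)
    kpt_feats:    (N, D) keypoint features (torch tensor)
    landmark_map: hypothetical {landmark_id: object with a correspondences() method}
                  holding only that landmark's 3D points (no global descriptors)
    K:            (3, 3) camera intrinsics
    """
    with torch.no_grad():
        logits = recognizer(kpt_feats.unsqueeze(0))[0]          # (N, num_landmarks)
    labels = logits.argmax(dim=-1).cpu().numpy()                # coarse place per keypoint

    pts2d, pts3d = [], []
    for lm in np.unique(labels):
        if lm not in landmark_map:                              # vote for an unknown landmark
            continue                                            # is treated as an outlier
        sel = labels == lm
        p2d, p3d = landmark_map[lm].correspondences(kpts2d[sel])  # landmark-wise matching only
        pts2d.append(p2d)
        pts3d.append(p3d)

    if not pts2d:
        return False, None, None
    pts2d = np.concatenate(pts2d).astype(np.float64)
    pts3d = np.concatenate(pts3d).astype(np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, pts2d, K, None)
    return ok, rvec, tvec
```

In a real system the per-landmark 2D-3D correspondence step would itself be learned or geometric; the point of the sketch is that matching is confined to the recognized landmarks, so neither global descriptors nor exhaustive local matching are required.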

Implications and Future Directions

The PRAM framework not only sets a new benchmark for efficiency and accuracy in visual localization but also inspires several future research directions. Enhanced landmark definition strategies, exploration into adaptive landmark generation, and integration of multi-modal inputs for improved recognition accuracy are some avenues that hold promise. Furthermore, PRAM's approach to map-centric feature learning and its potential in facilitating large-scale scene coordinate regression present exciting opportunities for the broader AI and computer vision communities to explore.

Experimentation and Results

Evaluated on standard benchmarks including 7Scenes, 12Scenes, Cambridge Landmarks, and Aachen Day-Night, PRAM delivers strong results. Its markedly lower runtime and storage footprint, achieved while retaining accuracy, make it a compelling alternative to prior hierarchical pipelines for visual localization.

Conclusion

In summary, PRAM recasts visual localization as landmark recognition followed by landmark-wise registration, yielding a model that is both efficient and accurate across scales and settings. By addressing the efficiency and scalability limitations of previous methods, it lays a foundation the research community can extend toward multi-modality localization, map-centric feature learning, and hierarchical scene coordinate regression.