Leveraging Neural Radiance Field in Descriptor Synthesis for Keypoints Scene Coordinate Regression (2403.10297v2)
Abstract: Classical structure-based visual localization methods offer high accuracy but face trade-offs in storage, speed, and privacy. A recent keypoint scene coordinate regression (KSCR) method, D2S, addresses these issues by using graph attention networks to enhance keypoint relationships and a simple multilayer perceptron (MLP) to predict the keypoints' 3D coordinates. The camera pose is then estimated with PnP+RANSAC from the resulting 2D-3D correspondences. While KSCR achieves competitive results, rivaling state-of-the-art image-retrieval methods like HLoc across multiple benchmarks, its performance degrades when training samples are limited, because the deep learning model relies on extensive data. This paper addresses that limitation by introducing a pipeline for keypoint descriptor synthesis using Neural Radiance Field (NeRF). By generating novel poses and feeding them into a trained NeRF model to render new views, our approach enhances KSCR's generalization in data-scarce environments. The proposed system can improve localization accuracy by up to 50% while requiring only a fraction of the time for data synthesis. Furthermore, its modular design allows for the integration of multiple NeRFs, offering a versatile and efficient solution for visual localization. The implementation is publicly available at: https://github.com/ais-lab/DescriptorSynthesis4Feat2Map.
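The abstract does not say how the novel poses are sampled, so the following is only a minimal sketch of one plausible strategy: jittering each training camera pose with a small random translation and a small rotation about the vertical axis. The function names (`perturb_pose`, `synthesize_poses`), the pose representation (a 3x3 rotation as nested lists plus a 3-vector position), and the default bounds are all assumptions made for illustration, not the authors' implementation. Rendering the views with a trained NeRF and extracting descriptors are left out.

```python
import math
import random


def rotation_about_y(angle_rad):
    """3x3 rotation matrix for a rotation about the y (vertical) axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def perturb_pose(rotation, position, max_shift=0.2, max_angle_deg=10.0,
                 rng=random):
    """Return one jittered copy of a camera pose.

    The bounds (metres and degrees) are hypothetical defaults chosen only
    for this sketch.
    """
    angle = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    new_rotation = matmul3(rotation_about_y(angle), rotation)
    new_position = [p + rng.uniform(-max_shift, max_shift) for p in position]
    return new_rotation, new_position


def synthesize_poses(train_poses, per_pose=3, seed=0, **kwargs):
    """Generate `per_pose` novel poses around every training pose."""
    rng = random.Random(seed)  # seeded so the sampling is reproducible
    novel = []
    for rotation, position in train_poses:
        for _ in range(per_pose):
            novel.append(perturb_pose(rotation, position, rng=rng, **kwargs))
    return novel
```

In a pipeline like the one described, each returned pose would be passed to the trained NeRF to render a new view, from which keypoints and descriptors are extracted as extra training samples for the regressor.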