
Leveraging Neural Radiance Field in Descriptor Synthesis for Keypoints Scene Coordinate Regression (2403.10297v2)

Published 15 Mar 2024 in cs.CV

Abstract: Classical structural-based visual localization methods offer high accuracy but face trade-offs in terms of storage, speed, and privacy. A recent innovation, the keypoint scene coordinate regression (KSCR) method D2S, addresses these issues by leveraging graph attention networks to enhance keypoint relationships and predict their 3D coordinates using a simple multilayer perceptron (MLP). Camera pose is then determined via PnP+RANSAC, using the established 2D-3D correspondences. While KSCR achieves competitive results, rivaling state-of-the-art image-retrieval methods like HLoc across multiple benchmarks, its performance is hindered when data samples are limited, owing to the deep learning model's reliance on extensive data. This paper proposes a solution to this challenge by introducing a pipeline for keypoint descriptor synthesis using Neural Radiance Fields (NeRF). By generating novel poses and feeding them into a trained NeRF model to create new views, our approach enhances KSCR's generalization capabilities in data-scarce environments. The proposed system can improve localization accuracy by up to 50% while requiring only a fraction of the time for data synthesis. Furthermore, its modular design allows for the integration of multiple NeRFs, offering a versatile and efficient solution for visual localization. The implementation is publicly available at: https://github.com/ais-lab/DescriptorSynthesis4Feat2Map.


Summary

  • The paper presents a novel pipeline that uses Neural Radiance Fields for synthesizing keypoint descriptors in data-sparse scenarios.
  • The approach integrates NeRF with scene coordinate regression techniques, demonstrating up to a 50% improvement in localization accuracy.
  • Experiments on 7Scenes and 12Scenes validate the method's performance and potential for scalable visual localization.

Leveraging Neural Radiance Field in Descriptor Synthesis for Keypoints Scene Coordinate Regression

Introduction

Visual localization plays a pivotal role in fields such as robotics, augmented reality, and computer vision. It entails determining the camera's position and orientation relative to a scene from a given image. Classical methods, despite their accuracy, grapple with challenges related to scalability, privacy, and the need for substantial storage. Learning-based methods and scene coordinate regression (SCR) offer promising alternatives that alleviate some of these issues. The D2S method integrates a graph neural network (GNN) with a multilayer perceptron (MLP) to predict 3D scene coordinates for detected keypoints, from which the camera pose is recovered via PnP+RANSAC; however, it struggles when training data is limited. This paper introduces a pipeline that uses Neural Radiance Fields (NeRF) to synthesize keypoint descriptors and thereby strengthen D2S in data-deficient scenarios, improving localization accuracy by up to 50% under constrained data conditions.
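As context for the pose-recovery step, the snippet below is a minimal sketch, not the authors' implementation, of estimating a camera pose from 2D-3D correspondences with OpenCV's PnP+RANSAC solver. The keypoint arrays, intrinsics, and thresholds are illustrative placeholders.

```python
import numpy as np
import cv2

# Illustrative inputs: N detected keypoints (pixel coordinates) and the 3D
# scene coordinates predicted for them by the regression network.
keypoints_2d = np.random.rand(100, 2).astype(np.float64) * [640, 480]
scene_coords_3d = np.random.rand(100, 3).astype(np.float64)

# Pinhole intrinsics (fx, fy, cx, cy) -- placeholder values.
K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]])

# Robustly estimate the camera pose from the 2D-3D correspondences.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    scene_coords_3d, keypoints_2d, K, distCoeffs=None,
    reprojectionError=3.0, iterationsCount=1000)

if ok:
    R, _ = cv2.Rodrigues(rvec)           # rotation: world -> camera
    camera_center = -R.T @ tvec.ravel()  # camera position in the world frame
    print("inliers:", 0 if inliers is None else len(inliers))
```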

Related Works

Prior work in visual localization ranges from classical structure-based methods to learning-based and SCR strategies. Classical approaches are noted for their accuracy but fall short on storage and privacy. Learning-based methods, though efficient in storage and privacy, often lag in accuracy. SCR approaches aim to combine the strengths of both by predicting accurate 3D coordinates before camera pose estimation. D2S represents a step forward in SCR, modeling keypoint relationships with a GNN and regressing scene coordinates with an MLP before pose estimation. However, its dependence on large volumes of training data remains a hindrance.

Methodology

The proposed descriptor synthesis pipeline integrates seamlessly with D2S and employs NeRF to generate novel views from sparse datasets. The method comprises several steps:

  1. NeRF Training: An advanced NeRF model, Nerfacto, learns an implicit scene representation from the limited dataset, synthesizing high-quality images conditioned on camera poses.
  2. Camera Pose Synthesis: Using spherical linear interpolation between camera poses from the training dataset, the pipeline generates novel camera poses for view generation (see the sketch after this list).
  3. Novel View Synthesis: Novel views are rendered with the trained NeRF model, conditioned on the newly generated camera poses.
  4. Descriptor Matching and Joint Training: Features are extracted from the synthesized views and matched with existing ones to enrich the training data for KSCR, improving D2S's pose estimation capabilities.
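For step 2, the exact interpolation parameters are not detailed in this summary; the snippet below is a minimal sketch of slerp-based pose synthesis between two training poses, assuming 4x4 camera-to-world matrices as a placeholder format. For step 1, Nerfacto is typically trained through the nerfstudio toolkit (e.g., "ns-train nerfacto --data <scene>"), and the poses generated below would then be rendered into synthetic views (step 3).

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_poses(pose_a, pose_b, num_new=5):
    """Generate novel camera poses between two training poses.

    pose_a, pose_b: 4x4 camera-to-world matrices (illustrative format).
    Rotations are interpolated with slerp, translations linearly.
    """
    rots = Rotation.from_matrix(np.stack([pose_a[:3, :3], pose_b[:3, :3]]))
    slerp = Slerp([0.0, 1.0], rots)

    ts = np.linspace(0.0, 1.0, num_new + 2)[1:-1]  # interior samples only
    new_rots = slerp(ts).as_matrix()
    new_trans = (1 - ts)[:, None] * pose_a[:3, 3] + ts[:, None] * pose_b[:3, 3]

    poses = np.tile(np.eye(4), (len(ts), 1, 1))
    poses[:, :3, :3] = new_rots
    poses[:, :3, 3] = new_trans
    return poses  # feed these to the trained NeRF to render novel views

# Example: three novel poses between two placeholder camera poses.
A, B = np.eye(4), np.eye(4)
B[:3, 3] = [0.2, 0.0, 0.1]
novel_poses = interpolate_poses(A, B, num_new=3)
```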

Experiments

The experiments conducted on the 7Scenes and 12Scenes datasets demonstrate significant improvements when utilizing the proposed pipeline. By employing the Nerfacto model for scene representation and LightGlue for efficient feature matching, the method not only mitigates the data scarcity issue but also boosts the performance of D2S in visual localization tasks. The approach's efficacy is highlighted by its superior performance over existing SCR methods, and even learning-based approaches, in scenarios with limited training data.
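The matching step between real and NeRF-synthesized views could look like the sketch below, which follows the Python interface of the public LightGlue repository (an assumption about the exact API; image paths are placeholders).

```python
# Match descriptors between a real training view and a NeRF-synthesized view
# using SuperPoint + LightGlue (https://github.com/cvg/LightGlue).
import torch
from lightglue import LightGlue, SuperPoint
from lightglue.utils import load_image, rbd

device = "cuda" if torch.cuda.is_available() else "cpu"
extractor = SuperPoint(max_num_keypoints=2048).eval().to(device)
matcher = LightGlue(features="superpoint").eval().to(device)

real_img = load_image("scene/train/frame_000001.png").to(device)    # placeholder path
synth_img = load_image("scene/synth/render_000001.png").to(device)  # placeholder path

feats_real = extractor.extract(real_img)
feats_synth = extractor.extract(synth_img)
matches01 = matcher({"image0": feats_real, "image1": feats_synth})

# Drop the batch dimension and read out matched keypoint pairs.
feats_real, feats_synth, matches01 = [rbd(x) for x in (feats_real, feats_synth, matches01)]
matches = matches01["matches"]                       # (K, 2) index pairs
pts_real = feats_real["keypoints"][matches[:, 0]]    # keypoints in the real view
pts_synth = feats_synth["keypoints"][matches[:, 1]]  # keypoints in the synthesized view
```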

Conclusions

The integration of NeRF into keypoint descriptor synthesis offers a substantial advance for SCR methods such as D2S, particularly in data-sparse scenarios. The pipeline strengthens the generalization ability of D2S and opens paths for further work on efficient, scalable visual localization suitable for real-world applications. As neural rendering continues to evolve, there is ample opportunity to optimize and extend this pipeline to overcome current limitations, particularly in dynamic and large-scale environments.