
Learning to Produce Semi-dense Correspondences for Visual Localization (2402.08359v2)

Published 13 Feb 2024 in cs.CV

Abstract: This study addresses the challenge of performing visual localization in demanding conditions such as night-time scenarios, adverse weather, and seasonal changes. While many prior studies have focused on improving image-matching performance to facilitate reliable dense keypoint matching between images, existing methods often heavily rely on predefined feature points on a reconstructed 3D model. Consequently, they tend to overlook unobserved keypoints during the matching process. Therefore, dense keypoint matches are not fully exploited, leading to a notable reduction in accuracy, particularly in noisy scenes. To tackle this issue, we propose a novel localization method that extracts reliable semi-dense 2D-3D matching points based on dense keypoint matches. This approach involves regressing semi-dense 2D keypoints into 3D scene coordinates using a point inference network. The network utilizes both geometric and visual cues to effectively infer 3D coordinates for unobserved keypoints from the observed ones. The abundance of matching information significantly enhances the accuracy of camera pose estimation, even in scenarios involving noisy or sparse 3D models. Comprehensive evaluations demonstrate that the proposed method outperforms other methods in challenging scenes and achieves competitive results in large-scale visual localization benchmarks. The code will be available.


Summary

  • The paper introduces a novel method that directly converts semi-dense 2D-2D matches into robust 2D-3D correspondences, enhancing pose estimation in challenging environments.
  • It employs a Point Inference Network to leverage geometric and visual cues, accurately predicting 3D scene coordinates from detected 2D keypoints.
  • It integrates a Confidence-based Point Aggregation module to reduce outliers, consistently outperforming existing methods in noisy and sparse conditions.

Enhancing Visual Localization in Challenging Conditions with DeViLoc

Introduction to DeViLoc

The quest for robust visual localization has produced a variety of strategies for accurately determining a camera's position and orientation within a scene. Among these, structure-based methods have shown promising results but often struggle with noisy and sparse 3D point clouds, which can significantly degrade their performance. Addressing these limitations, a novel framework, DeViLoc (semi-Dense Visual Localization), offers a way to generate reliable semi-dense 2D-3D correspondences even in challenging environments characterized by sparse or noisy 3D models.

Key Contributions of DeViLoc

DeViLoc introduces distinct innovations to the field of visual localization:

  • Direct Conversion of Semi-dense Matches: Unlike traditional approaches that rely solely on sparse feature matches, DeViLoc predicts semi-dense 2D-3D correspondences by directly converting the 2D-2D matches produced by detector-free image matching.
  • Point Inference Network (PIN): The framework employs a specialized network designed to transform detected 2D keypoints into corresponding 3D scene coordinates by leveraging both geometric and visual cues. This approach optimizes the use of available 3D information, enhancing the accuracy and reliability of inferred 3D points.
  • Confidence-based Point Aggregation (CPA): To further refine the generated matches, DeViLoc incorporates a CPA module, which aggregates 2D-3D matches from multiple views based on confidence levels. This process effectively reduces outliers, leading to more precise camera pose estimations.
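To make the PIN idea concrete, here is a minimal numpy sketch of the underlying task: given semi-dense 2D keypoints, only some of which are tied to 3D points in the reconstructed model, fill in 3D coordinates for the unobserved ones from the observed ones. The inverse-distance interpolation below is a hypothetical stand-in for the learned network, which conditions on both geometric and visual cues rather than a fixed kernel; the function name and signature are illustrative, not from the paper.

```python
import numpy as np

def infer_unobserved_3d(kp2d, obs_mask, obs_3d, k=3):
    """Toy stand-in for the Point Inference Network: estimate 3D
    coordinates for unobserved 2D keypoints by inverse-distance
    weighting over the k nearest observed keypoints in image space.

    kp2d:     (N, 2) 2D keypoints from semi-dense matching
    obs_mask: (N,)   True where a 3D point is known from the model
    obs_3d:   (M, 3) 3D coordinates of the observed keypoints
    """
    out = np.zeros((len(kp2d), 3))
    out[obs_mask] = obs_3d
    obs_xy = kp2d[obs_mask]
    for i in np.flatnonzero(~obs_mask):
        d = np.linalg.norm(obs_xy - kp2d[i], axis=1)
        nn = np.argsort(d)[:k]               # k nearest observed keypoints
        w = 1.0 / (d[nn] + 1e-8)             # inverse-distance weights
        out[i] = (w[:, None] * obs_3d[nn]).sum(0) / w.sum()
    return out
```

In DeViLoc itself, a learned network replaces this fixed kernel, and the resulting dense 2D-3D correspondences feed a standard PnP + RANSAC pose solver.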

Performance and Evaluation

Thorough evaluations show that DeViLoc surpasses competing methods on various benchmarks, particularly in challenging conditions such as night-time scenarios, adverse weather, and drastic seasonal changes. Its robust performance stems from its handling of noisy or sparse 3D inputs, a common limitation of current localization techniques.

Key findings from the conducted experiments include:

  • Stable Performance Across Scenes: DeViLoc consistently achieves competitive results across a range of indoor and outdoor scenes, demonstrating its versatility and reliability.
  • Superiority in Handling Noisy Inputs: The framework's resilience against noisy and sparse 3D models is a significant advancement, making it applicable in a broader spectrum of real-world scenarios.

Future Perspective and Limitations

While DeViLoc represents a significant step forward, room for improvement remains. Its runtime, which grows with the number of reference images processed per query, currently constrains its scalability. Future work could optimize this computational cost to broaden its applicability.

Moreover, adapting DeViLoc to more extensive datasets and varying conditions could provide deeper insights into its robustness and versatility. Integrating adaptive mechanisms to adjust confidence thresholds dynamically based on scene characteristics may also yield improvements in match filtering and pose estimation accuracy.
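One way to realize the dynamic-threshold idea is a per-scene quantile rule: instead of a single global confidence cutoff, derive the cutoff from the confidence distribution of the current query's matches. The sketch below is purely hypothetical and is not the paper's CPA module; `keep_quantile` and `floor` are illustrative parameters.

```python
import numpy as np

def filter_matches_adaptive(conf, keep_quantile=0.6, floor=0.2):
    """Hypothetical adaptive filter for 2D-3D match confidences.
    Keeps matches whose confidence exceeds the larger of a per-scene
    quantile cutoff and an absolute floor: the quantile adapts the
    threshold to how confident this scene's matches are overall,
    while the floor rejects clearly unreliable matches everywhere.
    Returns a boolean mask over the matches."""
    thr = max(np.quantile(conf, 1.0 - keep_quantile), floor)
    return conf >= thr
```

Under this rule, an easy daytime query with many high-confidence matches gets a stricter cutoff than a night-time query, where discarding too aggressively would starve the pose solver of correspondences.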

Conclusion

DeViLoc's introduction constitutes a substantial advancement in the field of visual localization, offering a compelling solution to some of the most pressing challenges faced by contemporary methods. Its ability to generate reliable semi-dense 2D-3D correspondences places it at the forefront of efforts to enhance localization accuracy in complex environments. As research progresses, DeViLoc's innovative approach could pave the way toward more resilient and versatile localization systems, unlocking new possibilities in robotics, augmented reality, and beyond.
