- The paper introduces VS-Net, a dual-branch framework that uses scene-specific landmarks to enhance 2D-to-3D correspondences for robust visual localization.
- It employs a prototype-based triplet loss with hard negative mining to efficiently train segmentation networks with up to 5000 landmark classes.
- Extensive testing on 7Scenes and Cambridge Landmarks demonstrates lower median camera pose errors compared to state-of-the-art methods.
VS-Net: Voting with Segmentation for Visual Localization
The paper "VS-Net: Voting with Segmentation for Visual Localization" presents a framework for visual localization that addresses the limitations of scene coordinate regression methods, which are widely used in small static environments. The approach establishes 2D-to-3D correspondences through scene-specific landmarks, improving both the robustness and the accuracy of camera pose estimation.
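To ground the pipeline, the final step of such a system is standard: once 2D landmark detections in the query image are matched to the landmarks' known 3D positions, the camera pose is recovered with a PnP solver inside RANSAC. Below is a minimal sketch using OpenCV; the correspondences, point counts, and intrinsics are illustrative placeholders, not values from the paper.

```python
# Pose recovery from 2D-to-3D correspondences via PnP + RANSAC (sketch).
import numpy as np
import cv2

# Hypothetical correspondences: 50 detected landmarks (2D pixel coordinates)
# paired with their known 3D patch centers in the world frame.
pts_2d = (np.random.rand(50, 2) * 640.0).astype(np.float64)
pts_3d = (np.random.rand(50, 3) * 5.0).astype(np.float64)

# Assumed pinhole intrinsics (focal length 525, principal point 320/240);
# lens distortion is ignored here.
K = np.array([[525.0,   0.0, 320.0],
              [  0.0, 525.0, 240.0],
              [  0.0,   0.0,   1.0]])

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts_3d, pts_2d, K, None,
    iterationsCount=1000, reprojectionError=3.0)
if ok:
    R, _ = cv2.Rodrigues(rvec)  # world-to-camera rotation matrix
    print("inliers:", 0 if inliers is None else len(inliers))
```

Because VS-Net supplies only a sparse set of high-precision correspondences, the RANSAC loop has far fewer outliers to reject than it would with dense scene coordinate predictions.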
The framework is built around the Voting with Segmentation Network (VS-Net), which diverges from traditional dense scene coordinate regression by focusing on a smaller set of high-precision scene-specific landmarks. In the landmark generation stage, the target scene's 3D surfaces are segmented into mosaic patches, and the patch centers define the scene-specific landmarks. VS-Net comprises two branches: a segmentation branch assigns each pixel to a landmark patch, and a landmark location voting branch estimates the landmark's precise 2D position within that patch. This dual-branch design favors a small set of reliable correspondences over a dense but noisy one, mitigating the unreliable scene coordinates that can corrupt RANSAC-based pose estimation, particularly in dynamic scenes with moving objects.
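The dual-branch layout can be illustrated with a short PyTorch sketch. This is a simplified stand-in, not the authors' architecture: the backbone is a placeholder, the segmentation branch outputs per-pixel embeddings intended to be matched against per-landmark prototypes (see the loss sketch below), and the voting branch outputs per-pixel unit directions pointing toward the landmark inside each patch.

```python
# Simplified dual-branch network: shared encoder, segmentation head, voting head.
import torch
import torch.nn as nn

class DualBranchSketch(nn.Module):
    def __init__(self, feat: int = 64, emb_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(          # placeholder for a real backbone
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True))
        self.seg_head = nn.Conv2d(feat, emb_dim, 1)  # per-pixel embeddings
        self.vote_head = nn.Conv2d(feat, 2, 1)       # per-pixel 2D vote vectors

    def forward(self, x):
        f = self.encoder(x)
        emb = self.seg_head(f)                            # (B, D, H, W)
        v = self.vote_head(f)
        votes = v / (v.norm(dim=1, keepdim=True) + 1e-8)  # unit directions
        return emb, votes

net = DualBranchSketch()
emb, votes = net(torch.randn(1, 3, 96, 128))  # toy input resolution
```

At inference, the pixels assigned to a patch vote along their predicted directions, and aggregating those votes yields the landmark's 2D location within the patch.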
A key contribution of the paper is the prototype-based triplet loss with hard negative mining, which enables efficient training of segmentation networks with a very large number of classes; some scenes contain up to 5000 landmarks. This loss substantially reduces computation and memory demands compared with conventional per-pixel classification, allowing the system to handle the large label spaces typical of visual localization.
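One plausible formulation of such a loss is sketched below, assuming learnable per-class prototype vectors; for each sampled pixel embedding, the closest wrong-class prototype is mined as the hard negative. The paper's exact formulation (for instance, how prototypes are derived) may differ, and the shapes, margin, and sampling here are illustrative.

```python
# Prototype-based triplet loss with hard negative mining (sketch).
import torch
import torch.nn.functional as F

def prototype_triplet_loss(emb, labels, prototypes, margin=0.5):
    """emb: (N, D) sampled pixel embeddings; labels: (N,) class ids in [0, C);
    prototypes: (C, D) one prototype vector per landmark class."""
    emb = F.normalize(emb, dim=1)
    protos = F.normalize(prototypes, dim=1)
    dists = torch.cdist(emb, protos)                   # (N, C) distances
    pos = dists.gather(1, labels[:, None]).squeeze(1)  # distance to own prototype
    # Hard negative mining: closest prototype belonging to any other class.
    masked = dists.scatter(1, labels[:, None], float("inf"))
    neg = masked.min(dim=1).values
    return F.relu(pos - neg + margin).mean()

# Toy usage: 4096 sampled pixels, 64-D embeddings, 5000 landmark classes.
emb = torch.randn(4096, 64, requires_grad=True)
protos = torch.randn(5000, 64, requires_grad=True)
labels = torch.randint(0, 5000, (4096,))
prototype_triplet_loss(emb, labels, protos).backward()
```

Sampling pixels and comparing them only against prototype vectors avoids materializing a dense H×W×C logit map, which is what keeps memory manageable with thousands of landmark classes.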
VS-Net's performance is compelling: extensive testing on public benchmarks shows it outperforming state-of-the-art methods. On both the 7Scenes and Cambridge Landmarks datasets, VS-Net consistently achieves lower median camera pose errors than several prominent approaches, including traditional feature-based pipelines and scene coordinate methods such as HSCNet and DSAC++.
These results underscore the practical importance of refining correspondence estimation in visual localization systems. Scene-specific landmarks, as a sparse yet highly precise representation of the scene, can significantly enhance the robustness of visual localization even in challenging environments characterized by rapid changes or occlusions.
Theoretically, this research suggests a potential shift toward adaptive, scene-specific features in visual localization tasks. Given the results, future work could explore integrating these landmarks with temporal information or hierarchical spatial structures to further enhance accuracy and robustness in real-world robotic and computer vision applications.
In conclusion, the research contributes a robust framework that not only enhances visual localization in small static scenes but also establishes a foundation for future improvements in adaptable feature-based localization methods. The insights drawn here could be particularly beneficial as the field of AI continues to grapple with dynamic and unstructured real-world environments, driving further innovations in localization techniques.