Learning Camera Localization via Dense Scene Matching
This paper addresses the problem of camera localization: estimating the six degrees of freedom (6-DoF) camera pose from RGB images. Traditional techniques typically relied on detecting and matching interest points between a query image and a pre-constructed 3D scene model, an approach that becomes particularly challenging for large-scale scenes. The recent focus has shifted towards learning-based methods in which Convolutional Neural Networks (CNNs) predict dense 3D coordinates directly from RGB images. However, these approaches often require retraining or adaptation for each new scene.
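To make the classical pipeline concrete, here is a minimal, hypothetical sketch of the interest-point matching step it relies on: nearest-neighbour descriptor matching with Lowe's ratio test. All names and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def match_descriptors(desc_q, desc_m, ratio=0.8):
    """Nearest-neighbour matching with Lowe's ratio test (illustrative helper).

    desc_q: (Nq, C) query-image descriptors; desc_m: (Nm, C) model descriptors.
    Returns a list of (query_idx, model_idx) pairs.
    """
    # Pairwise Euclidean distances between every query/model descriptor pair.
    dists = np.linalg.norm(desc_q[:, None, :] - desc_m[None, :, :], axis=2)
    matches = []
    for i, row in enumerate(dists):
        j1, j2 = np.argsort(row)[:2]
        if row[j1] < ratio * row[j2]:  # keep only unambiguous matches
            matches.append((i, j1))
    return matches
```

Each accepted match pairs a 2D query keypoint with a 3D model point, and it is exactly this sparse-matching stage that scales poorly as scene models grow, motivating the dense, learned alternative below.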
The authors propose a novel method, termed Dense Scene Matching (DSM), which aims to be scene-agnostic. The key innovation lies in constructing a cost volume between a query image and the scene, which is then processed by a CNN to predict dense coordinates. The camera pose is subsequently solved using Perspective-n-Point (PnP) algorithms. A noteworthy extension of the method is to the temporal domain, which yields performance improvements in video-based applications.
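The core idea can be sketched in a few lines of NumPy. The sketch below builds a correlation cost volume between per-pixel query features and scene-point features, then predicts a dense 3D coordinate per pixel via a softmax-weighted average of candidate coordinates; this weighted average is a simple stand-in for the paper's CNN regressor, and all shapes and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, N = 4, 4, 8, 16  # tiny query feature map and scene point set

query_feat = rng.standard_normal((H, W, C))  # per-pixel features of the query image
scene_feat = rng.standard_normal((N, C))     # features of N scene 3D points
scene_xyz = rng.standard_normal((N, 3))      # their 3D coordinates

# Cost volume: correlation between every query pixel and every scene point.
cost = query_feat.reshape(-1, C) @ scene_feat.T  # shape (H*W, N)

# Stand-in for the CNN regressor: softmax over candidates per pixel,
# then a weighted average of the candidate 3D coordinates.
w = np.exp(cost - cost.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)
coords = (w @ scene_xyz).reshape(H, W, 3)  # dense 3D coordinate prediction
```

The resulting per-pixel 2D-3D correspondences are then fed to a PnP solver to recover the 6-DoF pose.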
The performance of the DSM method is evaluated against existing approaches on two benchmarks: the 7-Scenes (indoor) and Cambridge Landmarks (outdoor) datasets. Remarkably, despite being scene-agnostic, the proposed approach achieves accuracy comparable to scene-specific methods like KFNet. DSM also significantly outperforms state-of-the-art scene-agnostic methods such as SANet.
Key Contributions and Results
- Scene-Agnostic Localization: DSM leverages dense scene matching, which facilitates camera localization without the need for scene-specific re-training, proving to be effective even in large-scale or novel environments.
- Cost Volume Construction: The innovative use of a cost volume to record the correlation between the features of query pixels and 3D scene points is the centerpiece of the methodology. This volume is processed to determine dense coordinates, effectively managing the irregularity and variability of real-world scenes.
- Temporal Fusion for Video Localization: By incorporating temporal correlations, the DSM method extends to video sequences, enhancing localization accuracy. This is particularly valuable for fields such as augmented reality, robotics, and SLAM, where real-time adjustments are critical.
- Performance Evaluation: The DSM method is rigorously evaluated, showing substantial improvements over SANet and results comparable to state-of-the-art scene-specific methods in both indoor and outdoor environments.
- Numerical and Architectural Efficiency: DSM not only outperforms many existing models in accuracy, but is also efficient in runtime and memory usage, both critical factors for practical deployment.
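The final step of the pipeline above, turning dense 2D-3D correspondences into a pose via PnP, can be illustrated with a minimal Direct Linear Transform (DLT) solver. This is a deliberately simplified, NumPy-only sketch: it solves a plain least-squares system with no RANSAC or calibration decomposition, whereas a real localization system would use a robust PnP solver. All names and the synthetic camera below are illustrative assumptions.

```python
import numpy as np

def dlt_pnp(pts3d, pts2d):
    """Estimate a 3x4 projection matrix P from 2D-3D correspondences via DLT."""
    A = []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    # P (up to scale) is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)

# Synthetic check: project points with a known camera, then recover it.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P_true = K @ np.hstack([np.eye(3), np.array([[0.0], [0.0], [5.0]])])
pts3d = np.random.default_rng(1).uniform(-1.0, 1.0, (10, 3))
ph = np.hstack([pts3d, np.ones((10, 1))])  # homogeneous 3D points
proj = ph @ P_true.T
pts2d = proj[:, :2] / proj[:, 2:]          # ground-truth pixel coordinates

P_est = dlt_pnp(pts3d, pts2d)
reproj = ph @ P_est.T
err = np.abs(reproj[:, :2] / reproj[:, 2:] - pts2d).max()
```

With noiseless correspondences the reprojection error is near machine precision; with the noisy dense predictions produced in practice, robust estimation (e.g. RANSAC around the PnP solve) is what makes the pose reliable.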
Implications and Future Directions
The scene-agnostic nature of DSM holds significant potential for applications requiring frequent adaptation to new environments, such as robotic navigation or AR systems deployed across diverse locations. The robust temporal fusion for video further suggests applications in dynamic settings where continual localization updates are required.
Looking forward, advancements in the integration with global descriptors or improved hybrid architectures that leverage sparse features may further bolster DSM's localization precision and efficiency. Additionally, real-world deployments could benefit from a deeper exploration of how DSM handles extremely textureless or dynamically changing environments, potentially informing the design of more generalized visual perception systems in AI.