- The paper introduces a novel CNN-based method using a Siamese network with ResNet34 to estimate pairwise relative camera poses, enhancing scalability and generalization.
- It integrates image retrieval with consensus algorithms to derive absolute camera positions without relying on scene-specific training.
- Evaluations on the 7 Scenes benchmark and a new University dataset demonstrate competitive accuracy, with median translation errors below 0.5 m and orientation errors of roughly 10°.
Overview of Camera Relocalization Using CNNs
The paper introduces a novel approach to camera relocalization that leverages convolutional neural networks (CNNs) to compute pairwise relative poses between images. This method addresses a fundamental problem in robotics and computer vision, determining camera pose from visual scene data, which is crucial for applications such as autonomous navigation, augmented reality, and simultaneous localization and mapping (SLAM).
Methodology
This research extends previous methodologies by incorporating CNNs to predict relative camera poses instead of absolute poses. The proposed system consists of two key modules: a Siamese CNN architecture for estimating relative poses between pairs of images, and a localization pipeline that integrates these estimates into absolute camera positions.
The Siamese network is structured as two branches that share weights and use a ResNet34 architecture. Each branch maps an image to a representation used both for image retrieval and for relative pose estimation. Given a query image, the system first retrieves visually similar database images with known poses, using the image descriptors produced by the CNN. Relative poses between the query and each retrieved database image are then predicted, and triangulation combined with a consensus (outlier-rejection) step recovers the absolute pose of the query camera.
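To make the architecture concrete, the following is a minimal PyTorch sketch of a two-branch network with a shared ResNet34 encoder and a small regression head producing a 3D relative translation and a unit quaternion for the relative rotation. The class and method names, layer sizes, and the translation-plus-quaternion output parameterization are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a shared-weight (Siamese) relative pose regressor;
# details (head sizes, output parameterization) are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class RelativePoseNet(nn.Module):
    def __init__(self, descriptor_dim=512):
        super().__init__()
        backbone = models.resnet34(weights=None)
        # Drop the classification layer; keep the global-average-pooled features.
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        # Head over the concatenated pair representation:
        # 3 values for relative translation, 4 for the relative rotation quaternion.
        self.pose_head = nn.Sequential(
            nn.Linear(2 * descriptor_dim, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, 7),
        )

    def embed(self, image):
        # Per-image descriptor, reused for retrieval and for pose regression.
        return self.encoder(image).flatten(start_dim=1)

    def forward(self, query, reference):
        # Both branches share the same encoder weights (Siamese structure).
        f_q, f_r = self.embed(query), self.embed(reference)
        out = self.pose_head(torch.cat([f_q, f_r], dim=1))
        t_rel, q_rel = out[:, :3], out[:, 3:]
        # Normalize the quaternion so it represents a valid rotation.
        q_rel = q_rel / q_rel.norm(dim=1, keepdim=True)
        return t_rel, q_rel
```

At inference time, `embed` would supply the descriptor used for nearest-neighbor retrieval over the database, while `forward` would supply the relative pose for each query-neighbor pair.

If, as is common when the relative translation is predicted only up to scale, each retrieved database camera contributes a ray pointing toward the query camera in world coordinates, the triangulation step reduces to a least-squares ray intersection. The sketch below (NumPy; function and argument names are hypothetical) shows that step without the consensus/outlier filtering mentioned above.

```python
import numpy as np


def triangulate_query_center(centers, directions):
    """Least-squares intersection of rays x = c_i + s_i * d_i.

    centers:    (N, 3) database camera centers in world coordinates.
    directions: (N, 3) unit vectors toward the query camera,
                already rotated into world coordinates.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(centers, directions):
        P = np.eye(3) - np.outer(d, d)  # projector orthogonal to the ray direction
        A += P
        b += P @ c
    return np.linalg.solve(A, b)
```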
Key Contributions and Results
One of the main contributions is the decoupling of camera pose learning from scene-specific coordinate frames by shifting the focus to relative pose estimation. This improves scalability and enables generalization to scenes not seen during training, as demonstrated by the model's favorable performance against recent CNN-based methods across different datasets. Notably, the proposed approach does not require scene-specific network training, which broadens its applicability across diverse environments.
Quantitative evaluations on both the established 7 Scenes benchmark and the newly introduced University dataset show that the approach achieves competitive localization accuracy, with median translation errors typically below 0.5 meters and median orientation errors of roughly 10 degrees. The University dataset, which consists of multiple indoor scenes registered to a unified coordinate frame, further demonstrates the model's ability to handle complex relocalization scenarios that mimic real-world applications.
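As a point of reference, the translation and orientation figures quoted for such benchmarks are usually medians of per-frame errors. A minimal sketch of how those two metrics can be computed from estimated and ground-truth poses (function and variable names here are hypothetical):

```python
import numpy as np


def median_pose_errors(R_est, t_est, R_gt, t_gt):
    """Median translation (m) and orientation (deg) error over a test set.

    R_est, R_gt: (N, 3, 3) rotation matrices; t_est, t_gt: (N, 3) positions.
    """
    trans_err = np.linalg.norm(t_est - t_gt, axis=1)
    # Angle of the residual rotation R_est^T R_gt, recovered from its trace.
    traces = np.einsum('nij,nij->n', R_est, R_gt)
    cos_angle = np.clip((traces - 1.0) / 2.0, -1.0, 1.0)
    rot_err = np.degrees(np.arccos(cos_angle))
    return np.median(trans_err), np.median(rot_err)
```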
Implications and Future Directions
This work has significant implications for building robust camera relocalization systems in large-scale environments where a common coordinate frame can be maintained across different scene segments. The approach also opens possibilities for improving computational efficiency, since inference depends less on storing large, memory-intensive scene models.
Future research could improve end-to-end learning of both relative pose estimation and image similarity, potentially optimizing network architectures to further enhance retrieval and localization accuracy. Moreover, exploring architectures that jointly exploit geometric and photometric information may yield even more reliable pose estimates.
In summary, this paper contributes a scalable, generalizable, and robust system for camera relocalization by focusing on relative pose estimation using deep learning architectures, providing a promising foundation for future advancements in AI-driven localization technologies.