- The paper introduces DSAC*, a framework that leverages dense scene coordinate regression to accurately estimate camera poses from single RGB or RGB-D images.
- It employs differentiable RANSAC and PnP solvers to enable end-to-end optimization of the scene coordinate predictions against pose estimation errors.
- Experimental results on 7Scenes and Cambridge Landmarks demonstrate state-of-the-art performance, underscoring its potential in augmented reality and autonomous navigation.
Visual Camera Re-Localization from RGB and RGB-D Images Using DSAC
The paper presents a robust, flexible, learning-based framework for visual camera re-localization that determines camera position and orientation from a single RGB or RGB-D input image. The work targets applications that demand precise localization, such as augmented reality and autonomous navigation, where satellite-based positioning like GPS is unreliable or unavailable, for example indoors or in urban canyons.
Overview of the Approach
The core methodology is scene coordinate regression: a neural network is trained to predict dense 3D correspondences, termed scene coordinates, that link image pixels to points in a 3D model of the environment. From these correspondences, the camera pose is estimated with a fully differentiable robust fitting approach based on RANSAC, known as Differentiable Sample Consensus (DSAC).
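To make this concrete, below is a minimal PyTorch-style sketch of a scene coordinate regressor. The class name, layer sizes, and 8x subsampling factor are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SceneCoordNet(nn.Module):
    """Illustrative fully convolutional regressor: an RGB image goes in,
    a subsampled grid of 3D scene coordinates comes out, one (X, Y, Z)
    triple per output cell. Layer sizes are placeholders."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(256, 3, 1)  # 1x1 conv: per-cell (X, Y, Z)

    def forward(self, rgb):                  # rgb: (B, 3, H, W)
        return self.head(self.encoder(rgb))  # (B, 3, H/8, W/8)

coords = SceneCoordNet()(torch.randn(1, 3, 480, 640))  # -> (1, 3, 60, 80)
```

Each output cell thus yields one 2D-3D correspondence between an image location and a point in the scene's world coordinate frame.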
Key features of the system include:
- Flexible input requirements: At minimum, the system needs only RGB images and corresponding ground-truth poses for training.
- End-to-end differentiable training: The framework incorporates differentiable RANSAC and PnP solvers, so the scene coordinate predictions can be optimized directly against the pose estimation error (see the sketch following this list).
- Applicability across modalities: When depth information is available, the system exploits it, improving the accuracy of RGB-D re-localization.
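A hedged sketch of the key idea behind differentiable RANSAC follows: instead of hard-selecting the best hypothesis (a non-differentiable argmax), DSAC forms a softmax distribution over hypothesis scores and trains on the expected pose loss. The function below is illustrative; the temperature `alpha` and tensor shapes are assumptions, not the paper's exact formulation.

```python
import torch

def dsac_expected_loss(scores, pose_losses, alpha=0.1):
    """DSAC-style probabilistic hypothesis selection: replace the
    non-differentiable argmax over hypothesis scores with a softmax
    distribution, and train on the expected pose loss under it.
    scores:      (N,) score of each pose hypothesis (e.g. soft inlier counts)
    pose_losses: (N,) pose error of each hypothesis w.r.t. ground truth
    alpha:       softmax temperature (illustrative value)
    """
    probs = torch.softmax(alpha * scores, dim=0)
    return (probs * pose_losses).sum()
```

Because every term is differentiable, gradients of this loss can flow back into the scene coordinate network through both the hypothesis scores and the pose losses.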
Technical Methodology
The methodology involves several steps:
- Scene Coordinate Regression: A convolutional neural network is trained to output dense scene coordinates, minimizing errors against known scene geometry. When no 3D model of the scene is available, the network infers the geometry during training from the known camera poses alone, e.g., by minimizing a reprojection loss.
- Pose Estimation: RANSAC generates multiple camera pose hypotheses and evaluates each by its consensus with the predicted scene coordinates. The hypothesis with the highest soft inlier count (see the sketch after this list) is selected and refined.
- Differentiable Optimization: Because each step is differentiable, the pose loss can be backpropagated through the entire pipeline, refining the network's predictions end to end.
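As a concrete illustration of the soft inlier count referenced above, the sketch below scores one pose hypothesis by relaxing RANSAC's hard inlier test with a sigmoid. The threshold `tau` and sharpness `beta` are assumed hyperparameter values, and the projection helper is a simplified pinhole model, not the paper's exact implementation.

```python
import torch

def reproj_errors(scene_coords, pixels, pose, K):
    """Per-point reprojection error of predicted 3D scene coordinates
    under a candidate pose. pose is a 3x4 world-to-camera [R|t] matrix,
    K the 3x3 camera intrinsics (simplified pinhole model).
    scene_coords: (N, 3), pixels: (N, 2)."""
    cam = scene_coords @ pose[:, :3].T + pose[:, 3]    # world -> camera
    proj = cam @ K.T                                   # camera -> image
    proj = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)  # perspective divide
    return (proj - pixels).norm(dim=1)                 # pixel distance

def soft_inlier_count(errors, tau=10.0, beta=0.5):
    """Differentiable relaxation of the hard inlier test (error < tau):
    a sigmoid turns each error into a soft vote, so the count admits
    gradients. tau (pixels) and beta (sharpness) are illustrative."""
    return torch.sigmoid(beta * (tau - errors)).sum()

# Usage: score a hypothesis given its per-point reprojection errors.
errors = torch.tensor([2.0, 25.0, 7.5, 100.0])
print(soft_inlier_count(errors))  # ~1.76: two points vote strongly
```

Scoring hypotheses this way keeps hypothesis evaluation differentiable, which is what allows the expected-loss training objective sketched earlier to update the scene coordinate network.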
Numerical Results and Comparative Analysis
The DSAC* system demonstrates high accuracy across various public benchmarks. In the indoor 7Scenes and 12Scenes datasets, the system achieves state-of-the-art results, particularly excelling when leveraging depth information. For the Cambridge Landmarks outdoor dataset, the system maintains competitive performance, even when trained without explicit 3D models.
- On 7Scenes, DSAC* reaches 95.6% accuracy for RGB-D inputs and 85.2% for RGB inputs with a 3D model (accuracy on 7Scenes is typically reported as the fraction of test frames localized within 5 cm and 5° of ground truth), outperforming prior versions such as DSAC++.
- On Cambridge Landmarks, the system achieves lower median position errors than competing approaches such as Active Search and PoseNet, illustrating gains in robustness and reliability.
Implications and Future Developments in AI
DSAC* notably broadens the range of feasible applications in both indoor and outdoor environments thanks to its flexible setup requirements. Its fully differentiable prediction pipeline reflects the broader trend toward end-to-end optimizable systems in AI. Future research may further improve efficiency, adapt the system to a wider range of environments, and integrate it with dynamic object tracking. Larger-scale learning paradigms may also enable deployment in city-wide or more complex environments and in real-time operational scenarios.
Closing Remarks
This research significantly advances visual localization by improving the adaptability, precision, and deployment readiness of camera re-localization systems. As automation and augmented-reality use cases grow, solutions such as DSAC* provide both foundational insights and practical advances necessary for continued progress in autonomous systems and intelligent environments.