From Coarse to Fine: Robust Hierarchical Localization at Large Scale (1812.03506v2)

Published 9 Dec 2018 in cs.CV

Abstract: Robust and accurate visual localization is a fundamental capability for numerous applications, such as autonomous driving, mobile robotics, or augmented reality. It remains, however, a challenging task, particularly for large-scale environments and in presence of significant appearance changes. State-of-the-art methods not only struggle with such scenarios, but are often too resource intensive for certain real-time applications. In this paper we propose HF-Net, a hierarchical localization approach based on a monolithic CNN that simultaneously predicts local features and global descriptors for accurate 6-DoF localization. We exploit the coarse-to-fine localization paradigm: we first perform a global retrieval to obtain location hypotheses and only later match local features within those candidate places. This hierarchical approach incurs significant runtime savings and makes our system suitable for real-time operation. By leveraging learned descriptors, our method achieves remarkable localization robustness across large variations of appearance and sets a new state-of-the-art on two challenging benchmarks for large-scale localization.

Summary

The paper introduces HF-Net, a CNN that jointly predicts global descriptors and local features for hierarchical localization.
It employs a coarse-to-fine strategy with global retrieval followed by local matching to accurately determine the 6-DoF camera pose.
Experimental results on datasets such as Aachen Day-Night and RobotCar demonstrate its robustness and efficiency in challenging visual conditions.

From Coarse to Fine: Robust Hierarchical Localization at Large Scale

This paper, authored by Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk, addresses robust and accurate visual localization in large-scale environments, particularly under significant appearance changes. This capability is essential for applications such as autonomous driving, mobile robotics, and augmented reality.

Overview

The paper introduces HF-Net, a monolithic Convolutional Neural Network (CNN) designed to predict both global descriptors and local features simultaneously. The core algorithm follows a hierarchical localization paradigm, beginning with a coarse global image retrieval to identify candidate locations, followed by a fine-grained matching of local features within these candidates to determine the 6-DoF pose of a camera. This hierarchical approach offers significant runtime savings and makes the system suitable for real-time applications.

Methodology

Global Retrieval: The first stage involves global image retrieval using a learned descriptor to generate hypotheses of possible camera locations.
Covisibility Clustering: The retrieved images are then clustered according to the 3D structure they co-observe, which narrows the search space for local feature matching.
Local Feature Matching: Within each cluster, the query's local feature descriptors are matched against the 3D points observed by the clustered images. The 6-DoF pose is then estimated with PnP inside a RANSAC scheme, which also acts as the geometric consistency check.
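The three stages above can be sketched in a few lines of plain Python. This is a minimal illustration on toy data, not the authors' implementation: the function names, the cosine-similarity ranking, the union-find clustering, the nearest-neighbor ratio test, and the 0.9 threshold are all assumptions made for the example. The final PnP + RANSAC pose estimation is omitted, since it needs a geometry library.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_desc, db_descs, k):
    """Stage 1: ids of the k database images most similar to the query."""
    ranked = sorted(
        db_descs,
        key=lambda img: cosine_similarity(query_desc, db_descs[img]),
        reverse=True,
    )
    return ranked[:k]


def covisibility_clusters(image_ids, observed_points):
    """Stage 2: group images that (transitively) share observed 3D points."""
    parent = {img: img for img in image_ids}

    def find(img):
        # Union-find root lookup with path halving.
        while parent[img] != img:
            parent[img] = parent[parent[img]]
            img = parent[img]
        return img

    first_seen = {}  # 3D point id -> first image seen observing it
    for img in image_ids:
        for pt in observed_points[img]:
            if pt in first_seen:
                parent[find(img)] = find(first_seen[pt])
            else:
                first_seen[pt] = img

    clusters = defaultdict(list)
    for img in image_ids:
        clusters[find(img)].append(img)
    return sorted(clusters.values(), key=len, reverse=True)


def match_local_features(query_descs, point_descs, ratio=0.9):
    """Stage 3: nearest-neighbor 2D-3D matching with a ratio test (assumed)."""
    matches = []
    for q_idx, q in enumerate(query_descs):
        dists = sorted((math.dist(q, d), pid) for pid, d in point_descs.items())
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((q_idx, dists[0][1]))
    return matches


# Hypothetical toy data: global descriptors, observed 3D point ids per image,
# and one local descriptor per 3D point.
db_global = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.8, 0.3]}
observed_points = {"a": {1, 2}, "b": {2, 3}, "c": {9}, "d": {7, 8}}
point_descs = {1: [0.0, 0.0], 2: [1.0, 1.0], 3: [5.0, 5.0],
               7: [9.0, 9.0], 8: [3.0, 0.0], 9: [0.0, 3.0]}
query_global = [1.0, 0.05]
query_local = [[0.1, 0.0], [0.5, 0.5]]

candidates = retrieve_top_k(query_global, db_global, k=3)
clusters = covisibility_clusters(candidates, observed_points)
cluster_points = {p: point_descs[p]
                  for img in clusters[0] for p in observed_points[img]}
matches = match_local_features(query_local, cluster_points)
print(candidates, clusters, matches)
```

The resulting 2D-3D matches would then be passed to a PnP solver inside RANSAC. Matching against one cluster at a time, rather than against the whole model, is what keeps the fine stage cheap.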

The hierarchical design aims to balance robustness and computational efficiency. The paper's contributions are a new state of the art on public benchmarks for large-scale localization, the HF-Net architecture for efficient prediction of global and local features, and a demonstration that multitask distillation is effective for meeting runtime constraints.
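To make the idea of multitask distillation concrete, the sketch below shows one way a single student network could be trained to reproduce the outputs of separate teacher models for each task. The loss terms, the learnable log-weights, and all names are assumptions chosen for illustration; the paper should be consulted for the exact formulation it uses.

```python
import math


def l2_squared(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cross_entropy(student_probs, teacher_probs, eps=1e-12):
    """Cross-entropy of the student distribution against the teacher's."""
    return -sum(t * math.log(s + eps)
                for s, t in zip(student_probs, teacher_probs))


def distillation_loss(student, teacher, log_weights):
    """Weighted sum of per-task distillation terms (illustrative only).

    Each task term is scaled by exp(-w); adding w back as a regularizer
    keeps the weights from growing without bound if they are learned.
    """
    w_global, w_local, w_kpts = log_weights
    return (
        math.exp(-w_global) * l2_squared(student["global"], teacher["global"])
        + math.exp(-w_local) * l2_squared(student["local"], teacher["local"])
        + math.exp(-w_kpts) * cross_entropy(student["keypoints"],
                                            teacher["keypoints"])
        + w_global + w_local + w_kpts
    )


# Hypothetical outputs for one training sample.
teacher = {"global": [0.6, 0.8], "local": [1.0, 0.0], "keypoints": [0.5, 0.5]}
student = {"global": [0.6, 0.8], "local": [1.0, 0.0], "keypoints": [0.5, 0.5]}
loss = distillation_loss(student, teacher, log_weights=(0.0, 0.0, 0.0))
```

In this toy case the student already matches both descriptor targets, so only the keypoint cross-entropy term contributes. Because all three predictions come from one network, a single forward pass serves both the coarse and the fine stage.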

Experimental Evaluation

The evaluation covers two aspects:

Local Features: Learned features and descriptors such as SuperPoint, DOAP, and LF-Net are compared against classical methods such as SIFT. The learned features generally outperform the classical ones in accuracy and robustness under significant appearance changes.
Large-scale Localization: The proposed method is evaluated on the challenging Aachen Day-Night, RobotCar, and CMU Seasons datasets. HF-Net achieves superior localization performance, particularly in difficult conditions such as night-time queries.

Implications

Practically, this research has significant applications:

Real-time Localization: The hierarchical framework enables real-time performance crucial for applications in autonomous driving and augmented reality.
Adaptability and Robustness: By leveraging learned features, the proposed method accommodates broad variations in appearance due to changing environmental conditions.

Theoretically, the work opens new avenues for integrating different deep learning-based feature extraction techniques into a cohesive and efficient localization pipeline. The use of multitask distillation shows how the prediction of diverse features can be combined in a single model architecture without sacrificing efficiency.

Future Directions

Future developments might involve extending the hierarchical paradigm to more densely populated environments and improving the model's adaptability to unseen conditions through advances in multitask learning. Another direction is making learned descriptors more efficient, so that performance is maintained without a large increase in computational cost.

In summary, "From Coarse to Fine: Robust Hierarchical Localization at Large Scale" addresses visual localization under varying conditions with a system that is both robust and efficient. By combining learned features with a coarse-to-fine pipeline, HF-Net benefits from deep learning advances in feature extraction without giving up real-time operation, which makes it relevant to autonomous systems and augmented reality.
