Viewpoint Invariant Dense Matching for Visual Geolocalization
The paper introduces GeoWarp, a method aimed at improving visual geolocalization (VG): determining where an image was captured, a task critical for applications such as robot localization in GPS-denied environments and augmented reality. The work primarily addresses the limitations of global image descriptors under significant viewpoint shifts.
Methodology Overview
GeoWarp integrates dense local feature matching with a learned invariance to viewpoint shifts. At its core, it comprises several key components:
- Dense Local Feature Extraction: Rather than relying solely on a global image descriptor, GeoWarp uses dense local features computed over the spatial grid of a CNN feature map. Preserving this spatial detail improves robustness to illumination changes and occlusions.
- Learnable Viewpoint Invariance: To handle viewpoint shifts, GeoWarp incorporates a trainable module that learns viewpoint-invariant representations for recognizing locations. This module, referred to as the warping regression module, estimates homographies that align the two images of a pair before they are compared (see the first sketch after this list).
- Multifaceted Training Losses: The warping module is trained with a mix of self-supervised and weakly supervised losses, avoiding the need for extensive labeled data. The self-supervised loss generates training pairs from a single image via random quadrilateral sampling, producing diverse perspective variations (a simplified sketch of this sampling follows the list). Two weakly supervised losses, a features-wise loss and a consistency loss, further improve robustness to appearance variations and occlusions.
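To make the warping regression idea concrete, here is a minimal PyTorch sketch. The class names (`DenseFeatureExtractor`, `WarpingRegression`), the choice of ResNet-50, and the regression head that simply concatenates the two feature maps are assumptions made for illustration; the paper's module operates on dense feature pairs but its exact architecture and layer sizes differ. The 8 outputs follow a 4-point parameterization: corner offsets that define a homography.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class DenseFeatureExtractor(nn.Module):
    """Dense local features: the backbone's last conv feature map, L2-normalized per location."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.body = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc

    def forward(self, x):                        # x: (B, 3, H, W)
        return F.normalize(self.body(x), dim=1)  # (B, 2048, H/32, W/32)


class WarpingRegression(nn.Module):
    """Predicts 4 corner offsets (8 values) of a homography aligning one image to the other."""
    def __init__(self, channels=2048):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(2 * channels, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, 8),
        )

    def forward(self, feat_a, feat_b):
        x = torch.cat([feat_a, feat_b], dim=1)  # stack the two dense feature maps
        return self.head(x).view(-1, 4, 2)      # (B, 4, 2) predicted corner offsets
```

The predicted corners can then be turned into a full 3x3 homography (e.g. with a DLT solve or a library such as kornia) and used to warp one feature map onto the other before matching.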
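The self-supervised loss needs no labels: it fabricates a viewpoint-shifted pair from a single image by sampling a random quadrilateral and warping the image so that the quadrilateral fills the frame, which yields exact ground-truth correspondences for free. The sketch below is a simplified, single-quadrilateral variant under assumed sampling ranges (the paper's recipe samples pairs of quadrilaterals and differs in detail); the function name is hypothetical and kornia is used only for the homography math.

```python
import torch
import kornia.geometry as KG


def random_quadrilateral_pair(image, max_offset=0.25):
    """Build a synthetic viewpoint-shifted view of `image` with known ground truth.

    image: (B, 3, H, W) tensor. Returns the warped view and the sampled source
    quadrilateral (B, 4, 2) in pixel coordinates (the supervision target).
    """
    b, _, h, w = image.shape
    corners = torch.tensor(
        [[0.0, 0.0], [w - 1.0, 0.0], [w - 1.0, h - 1.0], [0.0, h - 1.0]],
        dtype=image.dtype, device=image.device,
    ).unsqueeze(0).repeat(b, 1, 1)                                  # (B, 4, 2) full-frame corners
    offsets = torch.rand(b, 4, 2, dtype=image.dtype, device=image.device) * 2 - 1
    offsets = offsets * torch.tensor([w, h], dtype=image.dtype, device=image.device) * max_offset
    quad = corners + offsets                                        # randomly perturbed quadrilateral
    quad[..., 0] = quad[..., 0].clamp(0, w - 1)
    quad[..., 1] = quad[..., 1].clamp(0, h - 1)
    # Homography mapping the quadrilateral onto the full frame, and the warped image
    # that shows "the same scene seen from another viewpoint".
    H = KG.get_perspective_transform(quad, corners)                 # (B, 3, 3)
    warped = KG.warp_perspective(image, H, dsize=(h, w))
    return warped, quad


# Training step (sketch): feed (image, warped) to the warping regression module and
# penalize the L2 distance between its predicted corners and `quad`; no labels needed.
```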
Implementation and Evaluation
GeoWarp is implemented as a re-ranking module that plugs into existing VG pipelines. After an initial retrieval based on global descriptors, the top predictions are re-ranked using dense local feature matching on the viewpoint-aligned (warped) images, as sketched below.
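Conceptually, the re-ranking stage looks like the following sketch, where `warp_pair` stands in for the learned alignment. The function names, the top-k default, and the cosine-similarity scoring are assumptions for illustration; the paper's exact scoring differs in detail.

```python
import torch
import torch.nn.functional as F


def rerank(query_feats, candidate_feats, warp_pair, top_k=100):
    """Re-order the global retrieval's top-k candidates by dense matching similarity.

    query_feats:     (C, Hf, Wf) dense features of the query image.
    candidate_feats: list of (C, Hf, Wf) dense features, ordered by global-descriptor score.
    warp_pair:       callable returning the two viewpoint-aligned feature maps for a pair.
    """
    scores = []
    for cand in candidate_feats[:top_k]:
        q_aligned, c_aligned = warp_pair(query_feats, cand)
        # Cosine similarity per grid location, averaged over the aligned grid.
        sim = F.cosine_similarity(q_aligned.flatten(1), c_aligned.flatten(1), dim=0).mean()
        scores.append(sim.item())
    order = torch.argsort(torch.tensor(scores), descending=True)
    return order  # indices into the top-k candidate list, best match first
```

Because the expensive dense matching runs only on the shortlist produced by the global retrieval, the extra cost stays bounded regardless of database size.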
Extensive experiments on benchmark datasets (Pitts30k and R-Tokyo) demonstrate GeoWarp's efficacy. Results show significant improvements across backbone architectures (AlexNet, VGG16, ResNet-50) and aggregation methods (GeM, NetVLAD).
Key Numerical Results:
- For example, on the Pitts30k dataset:
  - AlexNet + GeM: recall@1 (10 m) improved from 50.4% to 61.5%.
  - VGG16 + GeM: recall@1 (50 m) increased from 76.3% to 83.1%.
These enhancements underscore the utility of dense local feature matching endowed with viewpoint invariance. Comparisons with state-of-the-art methods like query expansion with DBA, diffusion, DELG, InLoc, and others further validate GeoWarp's superior performance.
Implications and Future Directions
Practical Implications:
GeoWarp's improvements in robustness to viewpoint shifts can aid in more accurate geolocalization, which is critical for autonomous navigation and augmented reality applications. The ability to integrate with existing VG systems enhances its practicality for deployment in real-world scenarios.
Theoretical Implications:
The proposed viewpoint-invariant dense matching approach extends the utility of local feature matching by dynamically adapting to viewpoint variations. The combination of self-supervised and weakly supervised learning strategies for training vision models opens new avenues for leveraging unlabeled and weakly labeled data.
Future Developments:
Future work could explore:
- Extending GeoWarp's methodology to other related tasks, such as object recognition under varying viewpoints.
- Enhancing the computational efficiency of the warping regression module for real-time applications.
- Integrating multi-modal data, such as combining visual and LiDAR information, to further bolster geolocalization accuracy.
In conclusion, the paper presents a solid contribution to visual geolocalization. The combination of dense local features with viewpoint-invariant transformations provides a robust framework for addressing the challenges posed by diverse and dynamic visual environments.