An Overview of "PlaNet - Photo Geolocation with Convolutional Neural Networks"
Introduction
The paper presents "PlaNet," a convolutional neural network (CNN) model designed to geolocate images from pixel content alone. Geolocation, in this context, means determining where on Earth a photo was taken. The proposed method diverges from traditional image-retrieval approaches by framing the problem as a classification task: the earth's surface is subdivided into tens of thousands of geographic cells, and the network is trained on millions of geotagged images to predict the correct cell. This formulation allows the model to integrate heterogeneous visual cues such as architectural styles, landscapes, and weather patterns.
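As a rough illustration of this classification framing (not the paper's actual pipeline), the sketch below assigns each geotagged photo to the nearest of a few hypothetical cell centers at training time and decodes a predicted class distribution back to coordinates at inference time; the cell list and helper names are invented for the example.

```python
import math

# Hypothetical cell centers (lat, lng) standing in for the geographic partition;
# in PlaNet these come from the adaptive subdivision of the globe.
CELL_CENTERS = [(48.8566, 2.3522), (40.7128, -74.0060), (35.6762, 139.6503)]

def haversine_km(a, b):
    """Great-circle distance between two (lat, lng) points in kilometers."""
    lat1, lng1, lat2, lng2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def cell_label(lat, lng):
    """Training-time target: index of the nearest cell center."""
    return min(range(len(CELL_CENTERS)),
               key=lambda i: haversine_km((lat, lng), CELL_CENTERS[i]))

def decode_prediction(class_probs):
    """Inference-time decoding: map the most probable cell back to coordinates."""
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return CELL_CENTERS[best]

# Example: a photo tagged near Tokyo gets label 2; a confident prediction
# for class 2 decodes back to Tokyo's cell center.
print(cell_label(35.68, 139.70))           # -> 2
print(decode_prediction([0.1, 0.2, 0.7]))  # -> (35.6762, 139.6503)
```

In the paper the label is the cell that contains the photo rather than the nearest center, but the nearest-center simplification keeps the example short.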
Methodology
The core of PlaNet's methodology is an adaptive subdivision of the earth's surface into S2 cells: densely photographed regions receive finer cells, while sparsely photographed regions are covered by coarser ones. This adaptive granularity yields a more balanced class distribution during training. A CNN based on the Inception architecture with batch normalization then maps each input image to a probability distribution over these geographic cells, allowing the model to express confidence and flag uncertainty in its predictions.
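The adaptive partitioning can be sketched as a simple recursive refinement: keep splitting any cell whose photo count exceeds a threshold, and discard cells with too few photos. The toy quadtree below works on plain latitude/longitude rectangles rather than the S2 hierarchy the paper uses, and all thresholds are illustrative.

```python
def adaptive_cells(photos, bounds=(-90.0, 90.0, -180.0, 180.0),
                   max_photos=2, min_photos=1, max_depth=6, depth=0):
    """Toy stand-in for PlaNet's adaptive S2 partition.

    photos: list of (lat, lng) tuples.
    bounds: (lat_min, lat_max, lng_min, lng_max) of the current cell.
    Returns a list of leaf-cell bounds, refined where photos are dense.
    """
    lat_min, lat_max, lng_min, lng_max = bounds
    inside = [(la, lo) for la, lo in photos
              if lat_min <= la < lat_max and lng_min <= lo < lng_max]

    # Drop cells too sparse to form a useful class.
    if len(inside) < min_photos:
        return []
    # Keep the cell if it is sparse enough or already maximally refined.
    if len(inside) <= max_photos or depth >= max_depth:
        return [bounds]

    # Otherwise split into four quadrants and recurse: denser areas end up
    # with finer cells, mirroring the paper's balanced class distribution.
    lat_mid = (lat_min + lat_max) / 2
    lng_mid = (lng_min + lng_max) / 2
    quadrants = [
        (lat_min, lat_mid, lng_min, lng_mid), (lat_min, lat_mid, lng_mid, lng_max),
        (lat_mid, lat_max, lng_min, lng_mid), (lat_mid, lat_max, lng_mid, lng_max),
    ]
    cells = []
    for q in quadrants:
        cells.extend(adaptive_cells(inside, q, max_photos, min_photos,
                                    max_depth, depth + 1))
    return cells

# Dense cluster around Paris plus a lone photo in the Pacific:
photos = [(48.85, 2.35), (48.86, 2.34), (48.87, 2.36), (48.84, 2.33), (-10.0, -150.0)]
for cell in adaptive_cells(photos):
    print(cell)
# The Paris photos end up in a small, deeply refined cell; the lone
# Pacific photo keeps a much coarser cell.
```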
A notable extension of this work incorporates a Long Short-Term Memory (LSTM) architecture to handle sequences of photos, such as albums. By exploiting the temporal coherence of a sequence, the LSTM improves the geolocation of images that individually contain few location cues.
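A minimal sketch of how such a sequence model might look, assuming per-photo feature vectors have already been extracted by the CNN; the layer sizes and cell count below are placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AlbumGeolocator(nn.Module):
    """Illustrative sequence model: an LSTM over per-photo CNN embeddings
    that emits a probability distribution over geographic cells for each
    photo in an album. Dimensions are placeholders, not PlaNet's."""

    def __init__(self, feature_dim=2048, hidden_dim=512, num_cells=20000):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_cells)

    def forward(self, album_features):
        # album_features: (batch, num_photos, feature_dim)
        hidden_states, _ = self.lstm(album_features)   # (batch, num_photos, hidden_dim)
        logits = self.classifier(hidden_states)        # (batch, num_photos, num_cells)
        # Per-photo distribution over cells; ambiguous photos can borrow
        # context from their neighbors through the recurrent state.
        return logits.softmax(dim=-1)

# Example: one album of 5 photos with precomputed 2048-d embeddings.
model = AlbumGeolocator()
album = torch.randn(1, 5, 2048)
cell_probs = model(album)   # shape: (1, 5, 20000)
```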
Results and Performance
The results demonstrate PlaNet's effectiveness: it outperforms prior geolocation methods such as Im2GPS by a significant margin, with the largest gains at street-level accuracy. In a competitive experiment against human players on the GeoGuessr platform, PlaNet also came out ahead, an advantage attributed to its training on far more imagery, and therefore a broader range of geographical cues, than any person could experience firsthand.
The model's usefulness extends to image retrieval, where PlaNet-derived features performed strongly on benchmark datasets such as INRIA Holidays, reinforcing the model's applicability beyond geolocation to general scene recognition and retrieval.
Implications and Future Directions
PlaNet's approach shows that CNN-based geolocation scales to accurate photo localization across diverse regions of the globe, in some settings exceeding human performance. The LSTM extension for photo sequences likewise illustrates how temporal models can strengthen single-image predictions by drawing on contextual continuity.
Theoretically, this work prompts further reflection on the ability of neural networks to learn complex visual and context-based tasks that integrate semantic scene understanding. Practically, applications in fields such as digital forensics, tourism, and location-based services stand to benefit significantly.
In the future, it would be intriguing to explore PlaNet's adaptation to mobile environments, given its compact model size, and to integrate additional context such as timestamps and textual metadata to refine predictions. Hybrid models that combine image content with such non-visual cues could further improve both the precision and reliability of geolocation.