- The paper proposes a novel pipeline for automatically discovering and geotagging stationary objects in street view imagery using two convolutional neural networks for segmentation and depth estimation, combined with an MRF-based triangulation.
- The method achieves high accuracy, with object recall rates above 93% and positioning within 2 meters of ground truth, comparable to the accuracy of single-frequency GPS receivers.
- This technology has significant implications for urban planning, infrastructure monitoring, and autonomous systems by enabling automatic mapping and asset tracking from existing image databases.
Automatic Discovery and Geotagging of Objects from Street View Imagery
This paper presents a novel methodology for the automatic detection and geotagging of stationary objects using street view imagery. The proposed solution employs two state-of-the-art fully convolutional neural networks (FCNNs) within a modular pipeline: one for the segmentation of objects and another for monocular depth estimation. The geotagging is accomplished through a custom Markov Random Field (MRF) model that utilizes triangulation to determine object locations. This technique is particularly adept at handling complex scenes with multiple visually similar objects, such as traffic lights and telegraph poles.
Technical Highlights
The authors introduce a complete image processing pipeline for geopositioning recurring stationary objects identified in street view images. The core elements of this pipeline are:
- Semantic Segmentation: An FCNN tailored for segmenting the objects of interest, providing pixel-level labels which are then used to refine the estimation of object positions.
- Monocular Depth Estimation: A second FCNN estimates depth for each detected object, approximating the object-to-camera distance from a single view.
- MRF-based Triangulation: A geometrically inspired MRF framework that combines multiple viewpoints to improve geolocation accuracy. It fuses the distance information from monocular depth estimation with geometric triangulation, which is essential for disambiguating scenes containing multiple visually similar objects (see the sketch after this list).
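To make the triangulation step concrete, the following is a minimal sketch of depth-aware ray triangulation under simplifying assumptions: a local planar frame in meters and a pairwise intersection score in place of the paper's full MRF energy and optimization. All function names, parameters, and thresholds are illustrative, not the authors' implementation.

```python
# Minimal sketch of depth-aware ray triangulation in a local planar frame (meters).
# This simplifies the paper's MRF formulation: instead of optimizing a full energy
# over view pairings, it scores pairwise ray intersections by their agreement with
# the monocular depth estimates. All names and thresholds are illustrative.
import math
from itertools import combinations

def ray_intersection(p1, b1, p2, b2):
    """Intersect two 2-D rays given origins p=(x, y) and bearings b (radians from north).
    Returns (point, range1, range2) or None if the rays are (near) parallel."""
    d1 = (math.sin(b1), math.cos(b1))
    d2 = (math.sin(b2), math.cos(b2))
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        return None
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (dx * d2[1] - dy * d2[0]) / denom
    t2 = (dx * d1[1] - dy * d1[0]) / denom
    if t1 <= 0 or t2 <= 0:          # the object must lie in front of both cameras
        return None
    return (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1]), t1, t2

def score_pair(det_a, det_b, depth_sigma=3.0):
    """Score a candidate pairing by how well triangulated ranges match monocular depths."""
    hit = ray_intersection(det_a["pos"], det_a["bearing"], det_b["pos"], det_b["bearing"])
    if hit is None:
        return None
    point, r_a, r_b = hit
    residual = math.hypot(r_a - det_a["depth"], r_b - det_b["depth"])
    return point, math.exp(-0.5 * (residual / depth_sigma) ** 2)

# Toy example: two views of the same pole, estimated ~10 m and ~14 m away.
views = [
    {"pos": (0.0, 0.0),  "bearing": math.radians(45.0),  "depth": 10.0},
    {"pos": (15.0, 0.0), "bearing": math.radians(-45.0), "depth": 14.0},
]
for a, b in combinations(views, 2):
    result = score_pair(a, b)
    if result:
        point, weight = result
        print(f"candidate object at {point}, consistency weight {weight:.2f}")
```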
The paper reports high object recall and geolocation accuracy, with estimated positions within 2 meters of the true locations. This is comparable to the accuracy typically achieved by single-frequency GPS receivers.
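A single detection can already yield an approximate geotag by projecting the monocular depth estimate along the camera's viewing direction from its GPS position. The small-offset flat-Earth conversion below is a hedged illustration of that idea; the coordinates, bearing, and depth are made-up values rather than data from the paper.

```python
# Illustrative conversion of a single-view detection (camera GPS, bearing, monocular
# depth) into an object geotag, using a small-offset flat-Earth approximation.
# The values are invented for the example; this is not the paper's code.
import math

EARTH_RADIUS_M = 6_371_000.0

def offset_latlon(lat, lon, bearing_deg, distance_m):
    """Move `distance_m` meters from (lat, lon) along `bearing_deg` (clockwise from north)."""
    bearing = math.radians(bearing_deg)
    d_north = distance_m * math.cos(bearing)
    d_east = distance_m * math.sin(bearing)
    new_lat = lat + math.degrees(d_north / EARTH_RADIUS_M)
    new_lon = lon + math.degrees(d_east / (EARTH_RADIUS_M * math.cos(math.radians(lat))))
    return new_lat, new_lon

# A traffic light estimated 12.5 m away at a bearing of 60 degrees from the camera:
print(offset_latlon(53.3438, -6.2546, 60.0, 12.5))
```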
Experimental Evaluation
The methodology was tested on two object classes: traffic lights and telegraph poles. Segmentation networks were trained and validated for each class. Experiments in urban environments (traffic lights) and rural settings (telegraph poles) independently confirmed the robustness of the approach and its ability to detect objects from street view inputs alone, without auxiliary data sources.
Quantitatively, traffic light detection achieved a recall of 94% and a precision of 92.2%, while telegraph pole discovery reached a recall of 93.5% and a precision of 96%. Spatial accuracy, measured against ground-truth positions, supported the geotagging precision claimed by the authors.
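One simple way to reproduce recall and precision figures of this kind is to match predicted geotags to ground-truth objects within a distance threshold. The greedy matching and 2-meter threshold in the sketch below are assumptions made for illustration and are not necessarily the paper's exact evaluation protocol.

```python
# Illustrative recall/precision computation for geotagged detections: a prediction
# counts as a true positive if it lies within `threshold_m` of an unmatched
# ground-truth object. Greedy matching and the 2 m threshold are assumptions.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def recall_precision(predictions, ground_truth, threshold_m=2.0):
    unmatched = list(ground_truth)
    true_positives = 0
    for lat, lon in predictions:
        best = min(unmatched, key=lambda g: haversine_m(lat, lon, *g), default=None)
        if best is not None and haversine_m(lat, lon, *best) <= threshold_m:
            true_positives += 1
            unmatched.remove(best)
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    precision = true_positives / len(predictions) if predictions else 0.0
    return recall, precision

# Toy example with two ground-truth poles and three predictions:
gt = [(53.3438, -6.2546), (53.3441, -6.2551)]
pred = [(53.34381, -6.25462), (53.3441, -6.2551), (53.3500, -6.2600)]
print(recall_precision(pred, gt))
```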
Implications and Future Directions
The approach highlights the growing capacity of computer vision techniques and neural networks to exploit massive public image databases for practical geospatial tasks. From an application perspective, the benefits are clear in fields such as urban planning, asset monitoring, and autonomous driving, where precise location data is paramount.
Theoretically, the sophisticated triangulation algorithm that leverages monocular depth estimates lays a foundation for further work on joint models that combine depth estimation and object detection in a unified neural framework. Future research could explore coupling depth estimation and segmentation within a single network, potentially improving accuracy and resolving ambiguity in densely packed object settings.
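One way such a joint model could be structured is a shared convolutional encoder feeding separate segmentation and depth heads. The PyTorch-style sketch below is purely illustrative of that direction; the layer counts, channel sizes, and names are arbitrary and do not come from the paper.

```python
# Purely illustrative sketch of a joint segmentation + depth network: a shared
# convolutional encoder with two task-specific heads. Channel sizes and layer
# counts are arbitrary; this is not an architecture proposed in the paper.
import torch
import torch.nn as nn

class JointSegDepthNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(              # shared feature extractor
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.seg_head = nn.Conv2d(64, num_classes, 1)   # per-pixel class logits
        self.depth_head = nn.Conv2d(64, 1, 1)           # per-pixel depth estimate

    def forward(self, x):
        features = self.encoder(x)
        return self.seg_head(features), self.depth_head(features)

# One forward pass on a dummy image batch:
seg_logits, depth = JointSegDepthNet()(torch.randn(1, 3, 128, 128))
print(seg_logits.shape, depth.shape)
```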
By integrating robust triangulation strategies with deep monocular depth estimation, this methodology opens new avenues for precise mapping and spatial awareness, paving the way for advances in geographic information systems and in automated monitoring of urban and rural infrastructure.