Street-Level Geolocalization Using Multimodal Large Language Models and Retrieval-Augmented Generation

Published 1 Sep 2025 in cs.CV and cs.AI | (2509.01341v1)

Abstract: Street-level geolocalization from images is crucial for applications and services such as navigation, location-based recommendations, and urban planning. With the growing popularity of social media data and smartphone cameras, localizing images with traditional computer vision techniques has become increasingly challenging, yet remains highly valuable. This paper introduces a novel approach that integrates open-weight, publicly accessible multimodal LLMs with retrieval-augmented generation. The method constructs a vector database using the SigLIP encoder on two large-scale datasets (EMP-16 and OSV-5M). Each query image is paired with a prompt containing geolocation information for both similar and dissimilar images retrieved from this database, and the pair is then processed by the multimodal LLM. Our approach achieves state-of-the-art performance, with higher accuracy than existing methods on three widely used benchmark datasets (IM2GPS, IM2GPS3k, and YFCC4k). Importantly, our solution requires no expensive fine-tuning or retraining and scales readily to incorporate new data sources. These results suggest that retrieval-augmented multimodal LLMs offer an alternative to traditional methods that rely on training models from scratch, opening new possibilities for more accessible and scalable solutions in GeoAI.
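
The retrieval and prompt-augmentation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors stand in for SigLIP image embeddings, the in-memory list stands in for the vector database, and the function names (`retrieve`, `build_prompt`), the prompt wording, and the choice of `k` are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, database, k=2):
    """Return the k most similar and k least similar entries.

    Assumes the database holds at least 2 * k entries so the two
    groups do not overlap.
    """
    ranked = sorted(
        database,
        key=lambda entry: cosine(query_vec, entry["vec"]),
        reverse=True,
    )
    return ranked[:k], ranked[-k:]

def build_prompt(similar, dissimilar):
    """Assemble a text prompt to accompany the query image (hypothetical wording)."""
    lines = ["Estimate the latitude and longitude of the attached image."]
    lines.append("Visually similar reference images were taken at:")
    lines += [f"- ({e['lat']:.4f}, {e['lon']:.4f})" for e in similar]
    lines.append("Visually dissimilar reference images were taken at:")
    lines += [f"- ({e['lat']:.4f}, {e['lon']:.4f})" for e in dissimilar]
    return "\n".join(lines)

# Toy database: placeholder embeddings with made-up example coordinates.
database = [
    {"vec": [0.9, 0.1, 0.0], "lat": 48.8584, "lon": 2.2945},
    {"vec": [0.8, 0.2, 0.1], "lat": 48.8606, "lon": 2.3376},
    {"vec": [0.0, 0.1, 0.9], "lat": -33.8568, "lon": 151.2153},
    {"vec": [0.1, 0.0, 0.8], "lat": 35.6586, "lon": 139.7454},
]

query = [1.0, 0.1, 0.0]  # placeholder for the query image's embedding
similar, dissimilar = retrieve(query, database, k=2)
prompt = build_prompt(similar, dissimilar)
print(prompt)
```

In a real pipeline the embeddings would come from an image encoder, the sorted scan would be replaced by an approximate nearest-neighbor index, and the resulting prompt would be sent to the multimodal LLM together with the query image. Because new reference images only need to be embedded and added to the database, this design is what allows new data sources to be incorporated without retraining.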