DualGeo: A Dual-View Framework for Worldwide Image Geo-localization

Published 28 Apr 2026 in cs.CV | (2604.25533v1)

Abstract: Worldwide image geo-localization aims to infer the geographic location of an image captured anywhere on Earth, spanning street, city, regional, national, and continental scales. Existing methods rely on visual features that are sensitive to environmental variations (e.g., lighting, season, and weather) and lack effective post-processing to filter outlier candidates, limiting localization accuracy. To address these limitations, we propose DualGeo, a two-stage framework for worldwide image geo-localization. First, it establishes a geo-representational foundation by fusing image and semantic segmentation features via bidirectional cross-attention. The fused features are then aligned with GPS coordinates through dual-view contrastive learning to build a global retrieval database. Second, it performs geo-cognitive refinement by re-ranking retrieved candidates using geographic clustering. It then feeds them into large multimodal models (LMMs) for final coordinate prediction. Experiments on IM2GPS, IM2GPS3k, and YFCC4k show that DualGeo outperforms state-of-the-art methods, improving street-level (<1 km) and city-level (<25 km) localization accuracy by 3.6%-16.58% and 1.29%-8.77%, respectively. Our code and datasets are available at https://github.com/CJ310177/DualGeo.

Authors (7)

Summary

The paper introduces a dual-view contrastive learning framework that fuses RGB cues with semantic segmentation for robust image geo-localization.
It leverages geo-cognitive clustering and LMM-based reasoning to refine coordinate predictions and improve fine-grained localization.
Experimental results show street-level and city-level improvements while highlighting the benefit of integrating invariant structural priors.

Problem Statement and Motivation

The task of worldwide image geo-localization is to estimate the geographic origin of an image captured anywhere on Earth, at coordinate-level resolution, even under diverse and challenging environmental conditions. Classical approaches rely predominantly on RGB visual cues, which are highly sensitive to illumination, weather, and seasonal variation, and this sensitivity impedes robust deployment. Existing retrieval-based and generation-augmented localization systems also lack geo-cognitive post-processing and often neglect the geospatial distribution priors present in candidate coordinate sets.

The DualGeo framework directly addresses these limitations by:

Incorporating semantic segmentation as an invariant structural modality to robustly represent geographic context, and
Introducing spatially-aware post-retrieval candidate refinement, leveraging both clustering and reasoning via large multimodal models (LMMs).

Technical Approach

Dual-View Geo-Representational Foundation

Central to the proposed approach is the joint encoding and contrastive alignment of RGB visual features and semantic segmentation maps against GPS coordinates. The fusion employs a dual-view contrastive learning paradigm, with symmetric loss optimization between both RGB–GPS and SEG–GPS modality pairs, using a multi-task contrastive loss. Semantic segmentation maps, generated via high-capacity models (SegFormer-B5), deliver invariant and context-rich cues, augmenting co-located RGB images.
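
The dual-view objective described above can be sketched as two symmetric contrastive terms, one per modality pair. This is a minimal NumPy illustration, not the paper's training code: the InfoNCE form, the temperature of 0.07, the weights `w_rgb` and `w_seg`, and all function names are assumptions made for the example.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: row i of `a` is the positive for row i of `b`."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    a_to_b = -np.diag(log_softmax(logits, axis=1)).mean()
    b_to_a = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (a_to_b + b_to_a)

def dual_view_loss(rgb, seg, gps, w_rgb=1.0, w_seg=1.0):
    """Weighted sum of the RGB-GPS and SEG-GPS terms (weights are hypothetical)."""
    return (w_rgb * symmetric_info_nce(rgb, gps)
            + w_seg * symmetric_info_nce(seg, gps))
```

Each argument is a batch of embeddings with one row per training sample, so matching rows across `rgb`, `seg`, and `gps` are treated as positives and all other rows in the batch as negatives.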

A dedicated segmentation encoder (ResNet-18 based, with its input channels adapted) and the RGB encoder from prior work [16] are coupled through a bidirectional cross-attention mechanism. This enables detail-preserving, complementary fusion: the RGB branch draws on the spatial structure in the segmentation features to offset visual ambiguity in adverse conditions, while the SEG branch draws on texture cues from the RGB features for improved fine-grained alignment.
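
To make the two attention directions concrete, here is a minimal single-head sketch in NumPy. It omits the learned query, key, and value projections that a real model would have, and the residual connections, mean pooling, and concatenation are assumptions chosen for illustration rather than details confirmed by the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Scaled dot-product attention from `queries` over `context`."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ context

def bidirectional_fusion(rgb_tokens, seg_tokens):
    """Fuse two token sets of shape (num_tokens, dim) into one descriptor."""
    # RGB tokens query the segmentation tokens; the residual keeps RGB detail.
    rgb_enriched = rgb_tokens + cross_attend(rgb_tokens, seg_tokens)
    # Segmentation tokens query the RGB tokens in the opposite direction.
    seg_enriched = seg_tokens + cross_attend(seg_tokens, rgb_tokens)
    # Mean-pool each branch and concatenate (an assumed pooling choice).
    return np.concatenate([rgb_enriched.mean(axis=0),
                           seg_enriched.mean(axis=0)])
```

The two token sets may have different lengths, since each attention call only requires that both share the same feature dimension.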

A global retrieval database (“World Index”) is constructed from the fused features for efficient query-by-image candidate retrieval.
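
Retrieval against such a database can be illustrated with a brute-force cosine-similarity search. This is only a sketch: the function names `build_index` and `retrieve` are hypothetical, and a database at worldwide scale would more plausibly use an approximate nearest-neighbor index, which the summary does not specify.

```python
import numpy as np

def build_index(features, coords):
    """Store L2-normalized features alongside their (lat, lon) labels."""
    feats = np.asarray(features, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return feats, np.asarray(coords, dtype=float)

def retrieve(index, query, k=5):
    """Return coordinates and cosine similarities of the k nearest entries."""
    feats, coords = index
    q = np.asarray(query, dtype=float)
    q = q / np.linalg.norm(q)
    sims = feats @ q                 # cosine similarity to every entry
    top = np.argsort(-sims)[:k]      # indices sorted by descending similarity
    return coords[top], sims[top]
```

The returned coordinates are ordered from most to least similar, which is the candidate list that the later refinement stage operates on.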

Geo-Cognitive Refinement

The second stage focuses on post-processing the retrieved candidates. First, the $k$ nearest retrieved database entries are clustered spatially with DBSCAN under the Haversine metric. True positives are hypothesized to form denser clusters, so the geometric center of the top cluster is taken as the reference position for outlier suppression and candidate re-ranking. The clustering radius ($\varepsilon$) is analyzed for its role in granularity control.
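
The clustering and re-ranking step can be sketched with a small standard-library DBSCAN over Haversine distances. Several choices here are assumptions for illustration: the default `eps_km` and `min_samples` values, treating the largest cluster as the "top" cluster, and averaging latitude and longitude to get its center (which is inaccurate near the antimeridian and the poles).

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def dbscan(points, eps_km, min_samples):
    """Minimal DBSCAN; returns one label per point, with -1 marking noise."""
    n = len(points)
    neighbors = [[j for j in range(n)
                  if haversine_km(points[i], points[j]) <= eps_km]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1                      # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster             # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])      # expand through core points
    return labels

def rerank_by_cluster(candidates, eps_km=50.0, min_samples=2):
    """Sort candidates by distance to the center of the largest cluster."""
    labels = dbscan(candidates, eps_km, min_samples)
    clustered = [label for label in labels if label != -1]
    if not clustered:
        return list(candidates)                 # no cluster: keep retrieval order
    top = max(sorted(set(clustered)), key=clustered.count)
    members = [c for c, label in zip(candidates, labels) if label == top]
    center = (sum(c[0] for c in members) / len(members),
              sum(c[1] for c in members) / len(members))
    return sorted(candidates, key=lambda c: haversine_km(c, center))
```

Candidates far from the dominant cluster end up at the tail of the returned list, so truncating that list is one simple way to suppress outliers.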

(Figure 1)

Figure 1: Performance of different n values on the LMM for the IM2GPS3k dataset.

The refined candidate set is then submitted to an LMM-based module (“Geo-Thinker”). Here, a constructed visual prompt, comprising the query image, the top-$n$ cluster-aware candidates, and $n$ hard negatives, enables the LMM to carry out context-enriched reasoning. Importantly, the LMM acts as an arbiter for global semantic filtering rather than as a regressor, and it produces the final coordinate estimate with enhanced interpretability and robustness.
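
As a rough illustration of the text side of such a prompt, the sketch below formats candidates and negatives and parses a coordinate pair from a reply. Everything in it is hypothetical: the prompt wording, the function names, the reading of hard negatives as coordinates to steer away from, and the expected reply format are assumptions, and the query image would be attached through whatever LMM interface is used.

```python
import re

def build_prompt(candidates, negatives, n):
    """Format top-n candidates and n negatives as text (wording is illustrative)."""
    lines = ["Estimate where the attached photo was taken.",
             "Reference coordinates that are likely close:"]
    lines += [f"  {lat:.4f}, {lon:.4f}" for lat, lon in candidates[:n]]
    lines.append("Coordinates that are likely wrong:")
    lines += [f"  {lat:.4f}, {lon:.4f}" for lat, lon in negatives[:n]]
    lines.append("Answer with one line: <latitude>, <longitude>")
    return "\n".join(lines)

def parse_coordinates(reply):
    """Return the first valid 'lat, lon' pair in the reply, or None."""
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)", reply)
    if not match:
        return None
    lat, lon = float(match.group(1)), float(match.group(2))
    if -90 <= lat <= 90 and -180 <= lon <= 180:
        return (lat, lon)
    return None
```

Validating the parsed range is a cheap guard, since a free-form model reply is not guaranteed to contain a usable coordinate.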

Experimental Results

The DualGeo model is validated on IM2GPS, IM2GPS3k, and YFCC4k, with a unified training backbone based on large-scale, globally distributed, and segmentation-augmented datasets. Multiple evaluation thresholds are considered, consistent with prior art, ranging from street to continental scales.
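
Threshold-based evaluation of this kind can be computed as the fraction of predictions whose great-circle error falls below each distance. The 1 km and 25 km values come from the abstract; the 200, 750, and 2500 km values are the thresholds conventionally used on these benchmarks and are assumed here, as are the function names.

```python
import math

# Street and city thresholds are from the abstract; the rest are assumed
# to follow the usual convention for these benchmarks.
THRESHOLDS_KM = {"street": 1, "city": 25, "region": 200,
                 "country": 750, "continent": 2500}

def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))

def accuracy_at_thresholds(predictions, ground_truth, thresholds=THRESHOLDS_KM):
    """Fraction of predictions with error below each threshold."""
    errors = [haversine_km(p, g) for p, g in zip(predictions, ground_truth)]
    return {name: sum(e < km for e in errors) / len(errors)
            for name, km in thresholds.items()}
```

Because the thresholds are nested, accuracy can only stay equal or rise when moving from the street scale to the continental scale.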

Results indicate that DualGeo achieves superior performance, with street-level (<1 km) and city-level (<25 km) localization improvements over baselines in the range of 3.6%–16.58% and 1.29%–8.77%, respectively. Notably, the method exhibits a robust gain in fine-grained inference, outperforming advanced models such as GeoCLIP, PIGEON, and G3. The method shows some trade-off at very coarse-grained scales, reflecting the optimization emphasis on local structural consistency.

Component Analysis and Ablation

An extensive ablation study demonstrates the additive impact of each architectural and algorithmic enhancement:

Sole RGB-based learning offers the lowest accuracy.
Transitioning to dual-view contrastive learning, adding bidirectional cross-attention, then activating geo-clustered re-ranking—as well as Geo-Thinker—yields monotonic increases across all accuracy thresholds.
The LMM reasoning module demonstrates particular efficacy at meso- to macro-scale levels (country and continent).
Varying the number of top candidates $n$ sent to the LMM (Figure 1) and adjusting the re-ranking clustering radius $\varepsilon$ (Figure 2) offer insight into the mechanism-efficiency tradeoff, with smaller $\varepsilon$ favoring fine-scale accuracy and larger values aiding macro-scale reasoning.

(Figure 2)

Figure 2: Performance of different clustering radii ($\varepsilon$) on the IM2GPS3k dataset.

Implications, Limitations, and Future Directions

From a practical standpoint, this dual-modality, two-stage framework improves location-awareness in adverse or ambiguous imaging conditions, with potential applications in GIS, digital forensics, environmental monitoring, and AR navigation. The demonstrated gains in candidate filtering and coordinate resolution show the benefit of integrating structural semantic priors and spatial candidate reasoning.

Theoretically, the method illustrates how cross-attention between disparate but correlated modalities can regularize and enrich the embedding space for retrieval tasks. The use of LMMs for post-hoc reasoning also suggests new directions for integrating external knowledge and inference engines within vision pipelines.

Several future research directions emerge:

Investigation of more efficient or adaptive clustering mechanisms for geo-candidate refinement, especially under extreme database sparsity.
Exploration of additional invariant modalities (e.g., depth, object attribute layouts) in the representation learning phase.
Extension to real-time and privacy-sensitive deployments, including more lightweight or streaming-compatible architectures.

Conclusion

DualGeo approaches image-based global geo-localization by fusing semantic segmentation with RGB cues and integrating geo-cognitive clustering with multimodal reasoning. The framework’s empirically validated improvements and its interpretability illustrate the effectiveness of context-aware, structure-invariant, and knowledge-augmented localization strategies. These advances could support a more robust class of location inference models that withstand the challenges of real-world geographic diversity.