MegaLoc: One Retrieval to Place Them All
(2502.17237v3)
Published 24 Feb 2025 in cs.CV
Abstract: Retrieving images from the same location as a given query is an important component of multiple computer vision tasks, like Visual Place Recognition, Landmark Retrieval, Visual Localization, 3D reconstruction, and SLAM. However, existing solutions are built to specifically work for one of these tasks, and are known to fail when the requirements slightly change or when they meet out-of-distribution data. In this paper we combine a variety of existing methods, training techniques, and datasets to train a retrieval model, called MegaLoc, that is performant on multiple tasks. We find that MegaLoc (1) achieves state of the art on a large number of Visual Place Recognition datasets, (2) obtains impressive results on common Landmark Retrieval datasets, and (3) sets a new state of the art for Visual Localization on the LaMAR datasets, where we only changed the retrieval method in the existing localization pipeline. The code for MegaLoc is available at https://github.com/gmberton/MegaLoc
The paper "MegaLoc: One Retrieval to Place Them All" (Berton et al., 24 Feb 2025) addresses the challenge that image retrieval methods used in computer vision tasks like Visual Place Recognition (VPR), Landmark Retrieval (LR), and Visual Localization (VL) are typically task-specific and struggle with diverse data distributions or different definitions of what constitutes the "same place". VPR usually considers images within 25 meters, LR focuses on the same landmark regardless of distance, and VL requires very close poses for 3D reconstruction. Existing pipelines often rely on older retrieval methods. The authors propose to overcome this fragmentation by training a single, general-purpose image retrieval model called MegaLoc, designed to perform well across these diverse tasks and domains.
The core idea behind MegaLoc is not architectural novelty, but rather the strategic fusion of data from multiple datasets and the application of established training techniques. The authors combine five diverse datasets: GSV-Cities [2022.gsvcities], Mapillary Street-Level Sequences (MSLS) [2020.msls], MegaScenes [2024.Megascenes], ScanNet [2017.scannet], and San Francisco eXtra Large (SF-XL) [2022.cosPlace]. These datasets offer a variety of outdoor and indoor scenes, different camera perspectives (street-level, hand-held, car-mounted), and challenging conditions like night-time, occlusions, and long-term changes.
During training, the model processes six sub-batches in each iteration, one from each dataset (with two sub-batches from SF-XL covering different perspectives). To handle the varied formats and requirements of these datasets, specific sampling techniques are employed (the geometric overlap criteria for MegaScenes and ScanNet are sketched in code after the list):
SF-XL: Utilizes the EigenPlaces [2023.EigenPlaces] sampling method to ensure diverse viewpoints within a place while avoiding visual overlap between different places.
GSV-Cities: Uses direct sampling as the dataset is already structured into non-overlapping classes [2022.gsvcities].
MSLS: Employs the CliqueMining [2024.cliqueM] technique to specifically mine hard negatives, finding visually similar places that are geographically distinct.
MegaScenes: Samples images from 3D reconstructions ensuring that images within a sampled set have significant visual overlap (defined as sharing at least 1% of 3D points).
ScanNet: Selects image quadruplets within a scene that have visual overlap (pose difference < 10 meters and < 30 degrees) while ensuring no visual overlap between different quadruplets.
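A minimal sketch of the geometric overlap checks described for MegaScenes and ScanNet is shown below. The helper names, the normalization of the 1% shared-point criterion by the smaller point set, and the 4x4 camera-to-world pose convention are assumptions for illustration, not the authors' implementation.

```python
# Illustrative overlap checks (hypothetical helpers, not the authors' code).
import numpy as np

def megascenes_overlap(points_a: set, points_b: set, min_frac: float = 0.01) -> bool:
    """Treat two MegaScenes images as overlapping if they share at least 1% of
    their 3D points; normalizing by the smaller point set is an assumption."""
    if not points_a or not points_b:
        return False
    shared = len(points_a & points_b)
    return shared / min(len(points_a), len(points_b)) >= min_frac

def scannet_overlap(pose_a: np.ndarray, pose_b: np.ndarray,
                    max_dist_m: float = 10.0, max_angle_deg: float = 30.0) -> bool:
    """Treat two ScanNet frames as overlapping if their 4x4 camera-to-world poses
    differ by less than 10 meters in translation and 30 degrees in rotation."""
    trans_err = np.linalg.norm(pose_a[:3, 3] - pose_b[:3, 3])
    r_rel = pose_a[:3, :3].T @ pose_b[:3, :3]
    cos_angle = np.clip((np.trace(r_rel) - 1.0) / 2.0, -1.0, 1.0)
    angle_deg = np.degrees(np.arccos(cos_angle))
    return trans_err < max_dist_m and angle_deg < max_angle_deg
```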
The model is trained using a multi-similarity loss [2019.multi_similarity_loss] calculated independently for each of the six sub-batches, and the total loss is the sum of these individual losses. The architecture consists of a DINO-v2-base backbone [2023.dinov2] followed by a SALAD aggregation layer [2024.SALAD], a linear projection to 8448 dimensions, and L2 normalization. Images are resized to 224x224 for training and 322x322 for inference. RandAugment [2020.RandAugment] is used for data augmentation, and AdamW [2018.AdamW] is the optimizer. Training runs for 40,000 iterations. A notable implementation detail is the memory-efficient GPU training achieved by calling backward() after computing the loss for each sub-batch individually, which frees the computational graph and significantly reduces VRAM requirements (from ~300GB to ~60GB) compared to building a single large graph.
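A minimal PyTorch-style sketch of this training step is shown below. The RetrievalModel stub (a strided convolution standing in for the DINO-v2 backbone and SALAD aggregation), the train_step function, and the sub-batch format are illustrative assumptions rather than the authors' code; loss_fn stands for any metric-learning loss taking (embeddings, labels), such as a multi-similarity loss. The key point is that each sub-batch's graph is freed by its own backward() call, while gradients accumulate so the update still corresponds to the sum of the six losses.

```python
# Sketch of the memory-efficient training step: one backward() per sub-batch,
# so only a single sub-batch's computational graph is alive at any time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalModel(nn.Module):
    """Placeholder standing in for DINO-v2-base -> SALAD -> linear(8448) -> L2 norm."""
    def __init__(self, backbone_dim: int = 768, out_dim: int = 8448):
        super().__init__()
        # A strided conv acts as a toy patch-embedding backbone here; the real
        # model uses a DINO-v2-base ViT followed by SALAD aggregation.
        self.backbone = nn.Conv2d(3, backbone_dim, kernel_size=14, stride=14)
        self.proj = nn.Linear(backbone_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x).mean(dim=(2, 3))        # global pooling stand-in
        return F.normalize(self.proj(feats), dim=-1)     # L2-normalized descriptor

def train_step(model, optimizer, sub_batches, loss_fn):
    """sub_batches: six (images, place_labels) pairs, one per dataset source."""
    optimizer.zero_grad()
    total_loss = 0.0
    for images, labels in sub_batches:
        descriptors = model(images)
        loss = loss_fn(descriptors, labels)
        # Backward per sub-batch: gradients accumulate in .grad while this
        # sub-batch's graph is freed immediately, cutting peak VRAM.
        loss.backward()
        total_loss += loss.item()
    optimizer.step()                                     # one update for the summed loss
    return total_loss
```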
The paper presents experimental results demonstrating MegaLoc's performance across VPR, Visual Localization, and Landmark Retrieval tasks:
Visual Place Recognition: Evaluated on a wide range of VPR datasets (Baidu [2017.Baidu_dataset], Eynsham [2022.benchmark, 2009.eynsham], MSLS val [2020.msls], Pitts250k [2018.netvlad, 2013.cvpr_pitts], Pitts30k [2018.netvlad, 2013.cvpr_pitts], SF-XL v1/v2/night/occlusion [2022.cosPlace, 2023.local_features_benchmark], Tokyo 24/7 [2018.tokyo247]). MegaLoc achieves state-of-the-art or highly competitive Recall@1 and Recall@10 results across the board, particularly excelling on the indoor-only Baidu dataset, where it significantly outperforms other models like SALAD [2024.SALAD] and CliqueMining [2024.cliqueM]. A sketch of the Recall@K metric used on these benchmarks follows this list.
Visual Localization: Tested on the LaMAR benchmark datasets [2022.lamar] by replacing the default retrieval method in the benchmark's pipeline. MegaLoc demonstrates impressive performance across different locations (CAB, HGE, LIN) and query types (Phone, HoloLens), achieving competitive or better results at the strict pose accuracy thresholds (e.g., (1°, 0.1m) and (5°, 1.0m)) compared to other methods. This highlights its practical applicability in 3D vision pipelines like Hierarchical Localization [2019.hloc]. A generic sketch of this pose-accuracy metric also follows the list.
Landmark Retrieval: Evaluated on the Revisited Oxford5k and Paris6k datasets [2018.roxford_rparis]. MegaLoc shows a substantial performance improvement over previous VPR-focused models, which were optimized for closer retrievals (within 25m). This demonstrates MegaLoc's ability to handle the larger spatial distances characteristic of landmark retrieval tasks.
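For the VPR results above, Recall@K is typically computed by counting a query as correct if any of its top-K retrieved database images lies within the 25m threshold of the query's ground-truth position. The sketch below follows that convention; the recall_at_k name and the assumption of L2-normalized descriptors and planar metric coordinates are illustrative, not the benchmarks' official evaluation code.

```python
# Illustrative Recall@K with a 25 m positive threshold (not the official eval code).
import numpy as np

def recall_at_k(query_desc, db_desc, query_xy, db_xy, ks=(1, 10), thresh_m=25.0):
    """query_desc, db_desc: L2-normalized descriptors; query_xy, db_xy: positions in meters."""
    sims = query_desc @ db_desc.T                        # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :max(ks)]        # top-K database indices per query
    # Geographic distance from each query to each retrieved candidate.
    dists = np.linalg.norm(query_xy[:, None, :] - db_xy[topk], axis=-1)
    hits = dists < thresh_m
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```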
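For the Visual Localization results, a threshold pair such as (1°, 0.1m) means an estimated pose counts as correct only if both its rotation and translation errors fall below that pair. The generic sketch below illustrates this; the pose_recall name and the 4x4 pose convention are assumptions, and this is not the LaMAR evaluation code.

```python
# Generic (rotation, translation) pose-accuracy recall (not the LaMAR eval code).
import numpy as np

def pose_recall(est_poses, gt_poses, thresholds=((1.0, 0.1), (5.0, 1.0))):
    """est_poses, gt_poses: iterables of 4x4 camera poses.
    thresholds: (max_rotation_deg, max_translation_m) pairs."""
    rot_errs, trans_errs = [], []
    for est, gt in zip(est_poses, gt_poses):
        trans_errs.append(np.linalg.norm(est[:3, 3] - gt[:3, 3]))
        r_rel = est[:3, :3].T @ gt[:3, :3]
        cos_a = np.clip((np.trace(r_rel) - 1.0) / 2.0, -1.0, 1.0)
        rot_errs.append(np.degrees(np.arccos(cos_a)))
    rot_errs, trans_errs = np.asarray(rot_errs), np.asarray(trans_errs)
    return {t: float(((rot_errs < t[0]) & (trans_errs < t[1])).mean()) for t in thresholds}
```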
The authors analyze failure cases, categorizing them into inherently difficult scenarios, cases potentially solvable with post-processing (like re-ranking with local features), issues stemming from incorrect GPS labels in datasets, and instances where correct predictions fall just outside the strict 25m VPR threshold but are useful in real-world applications. They also note that poor database coverage in datasets like MSLS (e.g., images only facing one direction on a street) can hinder performance, though this is less of an issue in well-covered real-world scenarios.
In conclusion, the paper argues that while image retrieval for localization is nearing a point of maturity on specific datasets, MegaLoc bridges a gap by providing a single model capable of performing well across the diverse requirements of VPR, VL, and LR. This is achieved by training on a broad collection of data using effective sampling and training strategies. However, the paper also identifies limitations: MegaLoc might be suboptimal for datasets dominated by purely forward-facing views (like MSLS), challenging natural environments (where AnyLoc [2023.AnyLoc] might be preferred), and resource-constrained embedded systems due to its large model size (228M parameters) compared to more lightweight options (e.g., ResNet-18 based CosPlace [2022.cosPlace] with 11M parameters). The code for MegaLoc is publicly available, facilitating its adoption in various applications.