Map-Relative Pose Regression for Visual Re-Localization (2404.09884v1)

Published 15 Apr 2024 in cs.CV and cs.LG

Abstract: Pose regression networks predict the camera pose of a query image relative to a known environment. Within this family of methods, absolute pose regression (APR) has recently shown promising accuracy in the range of a few centimeters in position error. APR networks encode the scene geometry implicitly in their weights. To achieve high accuracy, they require vast amounts of training data that, realistically, can only be created using novel view synthesis in a days-long process. This process has to be repeated for each new scene again and again. We present a new approach to pose regression, map-relative pose regression (marepo), that satisfies the data hunger of the pose regression network in a scene-agnostic fashion. We condition the pose regressor on a scene-specific map representation such that its pose predictions are relative to the scene map. This allows us to train the pose regressor across hundreds of scenes to learn the generic relation between a scene-specific map representation and the camera pose. Our map-relative pose regressor can be applied to new map representations immediately or after mere minutes of fine-tuning for the highest accuracy. Our approach outperforms previous pose regression methods by far on two public datasets, indoor and outdoor. Code is available: https://nianticlabs.github.io/marepo

Summary

  • The paper introduces map-relative pose regression (marepo), which conditions a pose regressor on scene-specific maps to achieve accurate camera pose estimation without extensive per-scene training data.
  • The approach integrates camera intrinsics into a transformer-based regressor via dynamic positional encoding, improving generalization across scenes.
  • Experiments on 7-Scenes and Wayspots show significant accuracy and efficiency gains over prior APR methods, with accuracy competitive with correspondence-based approaches.

Map-Relative Pose Regression for Visual Re-Localization

Introduction

The paper "Map-Relative Pose Regression for Visual Re-Localization" introduces a novel approach to camera pose estimation by leveraging map-relative pose regression, which significantly improves over traditional absolute pose regression (APR) methods. This new methodology allows for high accuracy pose predictions by conditioning the regressor on a scene-specific map representation, enabling it to be trained across various scenes and applied to new ones with minimal fine-tuning requirements.

Background and Motivation

Visual relocalization remains a challenge in computer vision, where methods typically fall into correspondence-based or pose regression categories. The former leverages image-to-scene correspondences for accuracy but requires extensive mapping. Pose regression approaches use neural networks for efficiency but traditionally lack the accuracy of correspondence-based methods. Absolute pose regression, in particular, suffers from needing vast training data per scene, making it challenging to scale. The proposed map-relative pose regression approach addresses these limitations by integrating scene-specific maps into the pose regression framework, achieving high accuracy with reduced training times.

Methodology

Architecture Overview

The architecture consists of a scene-specific geometry prediction module $\mathcal{G}$ that associates image pixels with 3D scene coordinates, and a scene-agnostic map-relative pose regressor $\mathcal{M}$ that predicts camera poses. The approach conditions pose predictions on scene-specific maps, enabling precise pose estimates without requiring large scene-specific training datasets (Figure 1).

Figure 1: Illustration of the network. A scene-specific geometry prediction module $\mathcal{G}_S$ processes a query image to predict a scene coordinate map $\hat{H}$.

Dynamic Positional Encoding

The model uses a dynamic positional encoding mechanism that integrates camera intrinsics into the transformer-based regressor. This is achieved via camera-aware 2D positional embeddings and 3D positional embeddings, which help the network generalize across scenes and improve pose estimation significantly.

Loss Function and Fine-Tuning

The architecture is trained with an L1 regression loss on rotation and translation. A brief scene-specific fine-tuning stage can further optimize the pre-trained map-relative regressor, yielding additional accuracy gains with only minutes of extra training.

Experimental Evaluation

The proposed method was tested against numerous state-of-the-art pose regression and correspondence-based methods on the 7-Scenes and Wayspots datasets (Figure 2).

Figure 2: Camera pose estimation performance vs. mapping time. The figure shows the median translation error of several pose regression relocalization methods on the 7-Scenes dataset.

Results on 7-Scenes

The approach achieved significant accuracy improvements over existing APR methods, requiring only minutes of scene-specific training. It demonstrated performance competitive with state-of-the-art structure-from-motion methods while maintaining superior efficiency.

Results on Wayspots

For challenging outdoor scenes, the method outperformed traditional APR approaches and exhibited accuracy comparable to advanced geometry-based methods, demonstrating its robustness and scalability across diverse environments.

Conclusion

Map-relative pose regression presents a scalable, efficient alternative to traditional absolute and relative pose regression methods, leveraging scene-specific geometry to achieve high-accuracy, real-time camera pose estimation. Its integration of 3D geometric knowledge and ability to generalize across scenes effectively addresses the limitations of prior approaches, offering significant potential for deployment in dynamic and large-scale applications.
