- The paper presents a novel architecture that separates point and line feature regression to improve camera relocalization accuracy.
- It employs distinct regression branches and adaptive pruning to efficiently filter robust descriptors from keypoints and line segments.
- Experiments on 7-Scenes and Indoor-6 datasets show enhanced relocalization performance and significantly improved FPS compared to prior methods.
The paper introduces an enhanced 3D point-line regression method for camera relocalization, addressing limitations of prior approaches that either rely on computationally expensive feature matching or overfit when a single network encodes both points and lines. The central thesis is that learning point and line features independently, each with its own focus, improves accuracy. To this end, the method introduces a novel architecture that processes each feature type in its own branch before fusing the results for localization.
Key components and concepts in the paper include:
- Problem Statement: The paper defines the camera relocalization problem given a point-based Structure-from-Motion (SfM) model $\mathcal{S}^p \leftarrow \{\mathbf{P}_k \in \mathbb{R}^3 \mid k = 1, 2, \dots, M\}$ and a line-based SfM model $\mathcal{S}^l \leftarrow \{\mathbf{L}_v \in \mathbb{R}^6 \mid v = 1, 2, \dots, N\}$, both created from the same set of reference images $\{I_i\}_n$. $\mathcal{P}_i^p = \{\mathbf{d}_{i,j}^p \leftarrow \{k, \text{None}\} \mid j = 1, 2, \dots, M_i\}$ denotes the set of point descriptors extracted from image $I_i$, where $k$ is the index of the corresponding 3D point $\mathbf{P}_k$. Similarly, $\mathcal{P}_i^l = \{\mathbf{d}_{i,j}^l \leftarrow \{v, \text{None}\} \mid j = 1, 2, \dots, N_i\}$ denotes the set of line descriptors from the same image. The approach introduces an adaptive regressor $F(\cdot)$ that takes $\mathcal{P}_i^p$ and $\mathcal{P}_i^l$ as inputs and learns to selectively output robust 3D coordinates, which are used to estimate the six-degrees-of-freedom (6-DOF) camera pose $\mathbf{T} \in \mathbb{R}^{4 \times 4}$ for any new query image $I$ from the same environment.
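For context on how the regressed coordinates feed into pose estimation, below is a minimal sketch of recovering $\mathbf{T}$ from predicted 2D-3D point correspondences with PnP inside RANSAC. It uses OpenCV's `solvePnPRansac` and handles points only; the paper's actual solver, which would also exploit line correspondences, is not detailed here, so treat this as an illustrative assumption rather than the authors' pipeline.

```python
import cv2
import numpy as np

def estimate_pose(points_2d, points_3d, K):
    """Recover a 6-DOF world-to-camera pose from regressed 2D-3D point
    correspondences via PnP + RANSAC (points only; a full point-line
    solver would also consume the regressed 3D segments)."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64),   # (M, 3) regressed 3D coordinates
        points_2d.astype(np.float64),   # (M, 2) keypoint locations in the query image
        K, None,                        # (3, 3) intrinsics, no distortion assumed
        reprojectionError=3.0,
        iterationsCount=1000,
    )
    if not ok:
        return None, None
    R, _ = cv2.Rodrigues(rvec)          # rotation vector -> 3x3 rotation matrix
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T, inliers
```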
- Front-End Processing: The architecture utilizes a pre-trained SuperPoint feature extractor for keypoint and keyline descriptors, avoiding the necessity for a separate line descriptor extractor.
- Separate Regression Branches: Addressing the imbalance between point and line features, the method employs distinct regression streams for each to facilitate focused learning.
- Point Regressor Branch:
- An early pruning layer filters the extracted descriptors, computing a pruning probability for each descriptor as:
$\alpha_{j}^{p} = \text{Sigmoid}\big(\phi^{p}(\mathbf{d}_{j}^{p})\big) \in [0,1]$
- $\alpha_{j}^{p}$ is the pruning probability for descriptor $j$
- $\phi^{p}$ is a Multi-Layer Perceptron (MLP) whose parameters are shared across all point descriptors
- $\mathbf{d}_{j}^{p}$ is the point descriptor for keypoint $j$
- Descriptors with $\alpha_{j}^{p} \le \delta^{p}$ are pruned, retaining only robust descriptors with $\alpha_{j}^{p} > \delta^{p}$ for subsequent refinement via a multi-graph attention network.
- A multi-self-attention network refines the reliable descriptors, updating each as follows:
${}^{(m+1)}\mathbf{d}_{j} = {}^{(m)}\mathbf{d}_{j} + \phi_{l}\Big(\Big[{}^{(m)}\mathbf{d}_{j} \,\|\, a_{m}\big({}^{(m)}\mathbf{d}_{j}, \mathcal{E}\big)\Big]\Big)$
- ${}^{(m)}\mathbf{d}_{j}$ is the intermediate descriptor for element $j$ at layer $m$
- $\mathcal{E}$ is the set of reliable descriptors at layer $m$
- $a_{m}(\cdot)$ is the self-attention mechanism
- $\phi_{l}$ is an MLP
- A shared MLP maps descriptors to 3D coordinates.
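As a concrete illustration of the point branch described above, here is a minimal PyTorch sketch. The 256-dimensional descriptors, the pruning threshold `delta_p`, the use of standard multi-head self-attention, and the layer sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class PointBranch(nn.Module):
    """Point regressor sketch: early pruning, self-attention refinement,
    then a shared MLP mapping each retained descriptor to 3D."""
    def __init__(self, dim=256, num_layers=3, delta_p=0.5):
        super().__init__()
        self.prune_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=4, batch_first=True) for _ in range(num_layers))
        self.mix = nn.ModuleList(
            nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_layers))
        self.to_xyz = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))
        self.delta_p = delta_p

    def forward(self, desc):                                     # desc: (1, M, dim)
        alpha = torch.sigmoid(self.prune_mlp(desc)).squeeze(-1)  # alpha_j^p, shape (1, M)
        keep = alpha > self.delta_p                              # early pruning (alpha > delta_p)
        d = desc[keep].unsqueeze(0)                              # reliable descriptor set E
        for attn, mix in zip(self.attn, self.mix):
            msg, _ = attn(d, d, d)                               # a_m(d_j, E)
            d = d + mix(torch.cat([d, msg], dim=-1))             # d <- d + phi_l([d || msg])
        return self.to_xyz(d), alpha, keep                       # 3D coords for retained points
```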
- Line Regressor Branch:
- Adapts a line descriptor encoder that derives low-dimensional line descriptors from the point-based descriptors extracted at keypoints.
- A single transformer model encodes $C$ descriptors sampled along a line segment $\mathbf{l}_j \in \mathbb{R}^4$ into a descriptor $\mathbf{d}_j^l \in \mathbb{R}^{256}$, using a simplified transformer encoder that omits positional encodings.
- A single self-attention layer refines line descriptors.
- A pruning layer removes unreliable lines, with the reliability probability for line $j$ computed as:
$\alpha_{j}^{l} = \text{Sigmoid}\big(\phi^{l}({}^{(1)}\mathbf{d}_{j}^{l})\big) \in [0,1]$
- $\alpha_{j}^{l}$ is the reliability probability for line $j$
- $\phi^{l}$ is an MLP
- Line descriptors with $\alpha_{j}^{l} > \delta^{l}$ are retained.
- A linear mapping transforms line descriptors to 3D segment coordinates.
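A matching sketch of the line branch follows, under the same caveats: the pooling of the $C$ sampled descriptors into one line descriptor (mean pooling here), the head counts, and the threshold `delta_l` are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class LineBranch(nn.Module):
    """Line regressor sketch: encode C point descriptors sampled along each
    segment into one 256-d line descriptor with a positional-encoding-free
    transformer, refine with one self-attention layer, prune, and map each
    retained line to a 3D segment (two endpoints, 6 values)."""
    def __init__(self, dim=256, delta_l=0.5):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=1)  # no positional encoding added
        self.refine = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.prune_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.to_segment = nn.Linear(dim, 6)                            # linear map to 3D endpoints
        self.delta_l = delta_l

    def forward(self, sampled_desc):                          # (L, C, dim): C samples per line
        tokens = self.encoder(sampled_desc)                   # simplified transformer encoder
        d = tokens.mean(dim=1).unsqueeze(0)                   # pool samples -> (1, L, dim); pooling is an assumption
        msg, _ = self.refine(d, d, d)                         # single self-attention refinement
        d = d + msg
        alpha = torch.sigmoid(self.prune_mlp(d)).squeeze(-1)  # alpha_j^l reliability, (1, L)
        keep = alpha > self.delta_l                           # retain lines with alpha > delta_l
        return self.to_segment(d[keep]), alpha, keep
```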
- Loss Function:
- The predicted 3D point $\hat{\mathbf{P}}_{j}$ and line $\hat{\mathbf{L}}_{j}$ are used to optimize the model against their ground truths $\mathbf{P}_{j}$ and $\mathbf{L}_{j}$ from the SfM models, with the loss computed for each image as:
$\mathcal{L}_{m} = \sum_{j} \lVert \mathbf{P}_{j}-\hat{\mathbf{P}}_{j} \rVert_{2} + \sum_{j} \lVert \mathbf{L}_{j}-\hat{\mathbf{L}}_{j} \rVert_{2}$
- $\mathcal{L}_{m}$ is the mapping loss
- $\mathbf{P}_{j}$ and $\mathbf{L}_{j}$ are the ground-truth 3D point and line segment
- $\hat{\mathbf{P}}_{j}$ and $\hat{\mathbf{L}}_{j}$ are the predicted 3D point and line segment
- Simultaneous optimization of pruning probability prediction is achieved using binary cross entropy (BCE) loss for both points and lines as:
$\mathcal{L}_{\text{BCE}} = \sum_{j=1}^{M} \mathcal{L}_{\text{BCE}}^{p}(\hat{\alpha}_{j}^{p}, \alpha_{j}^{p}) + \sum_{j=1}^{N} \mathcal{L}_{\text{BCE}}^{l}(\hat{\alpha}_{j}^{l}, \alpha_{j}^{l})$
- $\mathcal{L}_{\text{BCE}}$ is the binary cross entropy loss
- $\mathcal{L}_{\text{BCE}}^{p}$ is the binary cross entropy loss for points
- $\mathcal{L}_{\text{BCE}}^{l}$ is the binary cross entropy loss for lines
- $\hat{\alpha}_{j}^{p}$ and $\hat{\alpha}_{j}^{l}$ are the predicted pruning probabilities
- $\alpha_{j}^{p}$ and $\alpha_{j}^{l}$ are the ground-truth pruning probabilities
- Reprojection of predicted 3D points and lines onto the image plane using available camera poses is used to further optimize the model:
$\mathcal{L}_{\pi} = \sum_{j} \big\lVert \pi(\mathbf{T},\hat{\mathbf{P}}_{j})-\mathbf{u}_{j}^{p}\big\rVert_{2} + \sum_{j} \psi\big(\pi(\mathbf{T},\hat{\mathbf{L}}_{j}), \mathbf{u}_{j}^{l}\big)$
- $\mathcal{L}_{\pi}$ is the reprojection loss
- $\mathbf{T}$ is the ground-truth pose
- $\pi(\cdot)$ is the reprojection function
- $\mathbf{u}_{j}^{p} \in \mathbb{R}^{2}$ and $\mathbf{u}_{j}^{l} \in \mathbb{R}^{4}$ are the 2D positions of the point and of the line endpoints in the image
- $\psi(\cdot)$ computes the distance between the reprojected 3D line endpoints and their ground-truth 2D line coordinates $\mathbf{u}_{j}^{l}$
- A robust projection error term, adopted from prior work, is incorporated to mitigate non-convexity:
$\mathcal{L}_{\pi}^{\text{robust}} =
\begin{cases}
0, & \text{if } t < \theta \\
\tau(t) \tanh\Big(\dfrac{\mathcal{L}_{\pi}}{\tau(t)}\Big), & \text{otherwise}
\end{cases}$
- $\mathcal{L}_{\pi}^{\text{robust}}$ is the robust reprojection loss
- $\theta$ is a threshold
- $\tau(t)$ dynamically rescales $\tanh(\cdot)$
- The overall loss function is expressed as:
$\mathcal{L} = \delta_{m}\mathcal{L}_{m} + \delta_{\text{BCE}}\mathcal{L}_{\text{BCE}} + \delta_{\pi}\mathcal{L}_{\pi}^{\text{robust}}$
- $\mathcal{L}$ is the overall loss function
- $\delta_{m}$, $\delta_{\text{BCE}}$, and $\delta_{\pi}$ are hyperparameter coefficients
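To tie the loss terms together, here is a minimal PyTorch sketch of the training objective. The pinhole form of $\pi(\cdot)$, the omission of the line term $\psi(\cdot)$, the interpretation of $t$ as relative training progress, and the linear decay of $\tau(t)$ are assumptions for illustration; the $\delta$ weights are placeholders rather than the paper's values.

```python
import torch
import torch.nn.functional as F

def project(T, P, K):
    """pi(T, P): pinhole projection of 3D points P (N, 3) with pose T (4, 4) and intrinsics K (3, 3)."""
    Pc = T[:3, :3] @ P.T + T[:3, 3:4]   # world -> camera frame
    uv = (K @ Pc).T
    return uv[:, :2] / uv[:, 2:3]

def mapping_loss(P_hat, P_gt, L_hat, L_gt):
    """L_m: L2 distances to the SfM ground-truth 3D points and 3D line segments."""
    return (P_gt - P_hat).norm(dim=-1).sum() + (L_gt - L_hat).norm(dim=-1).sum()

def pruning_loss(alpha_p_hat, alpha_p, alpha_l_hat, alpha_l):
    """L_BCE: binary cross entropy on point and line pruning probabilities."""
    return (F.binary_cross_entropy(alpha_p_hat, alpha_p, reduction='sum')
            + F.binary_cross_entropy(alpha_l_hat, alpha_l, reduction='sum'))

def reprojection_loss(T, P_hat, u_p, K):
    """Point part of L_pi; the line term psi(.) is omitted in this sketch."""
    return (project(T, P_hat, K) - u_p).norm(dim=-1).sum()

def robust_reprojection_loss(L_pi, t, theta=0.2, tau_max=50.0, tau_min=1.0):
    """Robust wrapper: zero before progress threshold theta, then a
    tanh-rescaled error with tau(t) decaying over training (assumed linear)."""
    if t < theta:
        return L_pi.new_zeros(())
    tau = tau_min + (tau_max - tau_min) * (1.0 - t)
    return tau * torch.tanh(L_pi / tau)

def total_loss(L_m, L_bce, L_pi_robust, delta_m=1.0, delta_bce=1.0, delta_pi=1.0):
    """Overall objective: weighted sum of mapping, pruning, and robust reprojection terms."""
    return delta_m * L_m + delta_bce * L_bce + delta_pi * L_pi_robust
```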
The method was implemented in PyTorch, with specific network configurations for the point and line branches. Experiments were conducted on the 7-Scenes and Indoor-6 datasets. The 7-Scenes dataset, consisting of small-scale indoor environments with rich point and line textures, was used to compare the proposed method against PtLine, Limap, and PL2Map. The Indoor-6 dataset, captured under varying conditions, served to evaluate the method's adaptability.
Results on 7-Scenes show improved localization performance across all scenes compared to PL2Map. On the Indoor-6 dataset, the method achieved the best re-localization accuracy among regression-based methods. Ablation studies validated the design choices. The method also performs well on lines detected by Line Segment Detector (LSD), improving computational efficiency. Specifically, the re-localization error increases only slightly when using LSD segments, while the average frames per second (FPS) improves significantly from 8.1 to 16.6.
The paper concludes by stating that the new architecture and training approach improves camera localization accuracy by focusing the network's attention on key features during training.