Improved 3D Point-Line Mapping Regression for Camera Relocalization (2502.20814v1)

Published 28 Feb 2025 in cs.CV

Abstract: In this paper, we present a new approach for improving 3D point and line mapping regression for camera re-localization. Previous methods typically rely on feature matching (FM) with stored descriptors or use a single network to encode both points and lines. While FM-based methods perform well in large-scale environments, they become computationally expensive with a growing number of mapping points and lines. Conversely, approaches that learn to encode mapping features within a single network reduce memory footprint but are prone to overfitting, as they may capture unnecessary correlations between points and lines. We propose that these features should be learned independently, each with a distinct focus, to achieve optimal accuracy. To this end, we introduce a new architecture that learns to prioritize each feature independently before combining them for localization. Experimental results demonstrate that our approach significantly enhances the 3D map point and line regression performance for camera re-localization. The implementation of our method will be publicly available at: https://github.com/ais-lab/pl2map/.

Summary

  • The paper presents a novel architecture that separates point and line feature regression to improve camera relocalization accuracy.
  • It employs distinct regression branches and adaptive pruning to efficiently filter robust descriptors from keypoints and line segments.
  • Experiments on 7-Scenes and Indoor-6 datasets show enhanced relocalization performance and significantly improved FPS compared to prior methods.

The paper introduces an enhanced 3D point-line regression method for camera relocalization, addressing limitations in prior approaches that either rely on computationally expensive feature matching or struggle with overfitting due to single-network encoding of both points and lines. The central thesis posits that independently learning point and line features, each with distinct foci, optimizes accuracy. The method introduces a novel architecture prioritizing each feature independently before fusion for localization.
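
To make the separation concrete, here is a minimal PyTorch-style sketch of the two-branch idea (not the authors' implementation; class names, tensor shapes, and the pose-solver step are illustrative assumptions):

```python
import torch.nn as nn

class PointLineRegressor(nn.Module):
    """Illustrative wrapper: each branch learns its features independently,
    and their outputs are only combined at pose-estimation time."""
    def __init__(self, point_branch: nn.Module, line_branch: nn.Module):
        super().__init__()
        self.point_branch = point_branch  # regresses 3D points from point descriptors
        self.line_branch = line_branch    # regresses 3D line segments from line descriptors

    def forward(self, point_desc, line_desc):
        # point_desc: (M, D) keypoint descriptors; line_desc: (N, C, D) descriptors sampled along lines
        points_3d, alpha_p, keep_p = self.point_branch(point_desc)
        lines_3d, alpha_l, keep_l = self.line_branch(line_desc)
        # The resulting 2D-3D point and line correspondences would then be passed
        # to a standard PnP(+line) solver with RANSAC to recover the 6-DoF pose.
        return points_3d, keep_p, lines_3d, keep_l
```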

Key components and concepts in the paper include:

  • Problem Statement: The paper defines the camera relocalization problem given a point-based Structure-from-Motion (SfM) model $\mathcal{S}^{p} \leftarrow \{\mathbf{P}_k \in \mathbb{R}^{3} \mid k = 1, 2, \dots, M\}$ and a line-based SfM model $\mathcal{S}^{l} \leftarrow \{\mathbf{L}_v \in \mathbb{R}^{6} \mid v = 1, 2, \dots, N\}$, both created from the same set of reference images $\{\mathbf{I}_{i}\}^{n}$. $\mathcal{P}_i^{p} = \{\mathbf{d}_{i,j}^{p} \leftarrow \{k, \text{None}\} \mid j = 1, 2, \dots, M_{i}\}$ denotes the set of point descriptors extracted from image $\mathbf{I}_i$, where $k$ is the index of the corresponding 3D point $\mathbf{P}_k$ (or None if there is no match). Similarly, $\mathcal{P}_i^{l} = \{\mathbf{d}_{i,j}^{l} \leftarrow \{v, \text{None}\} \mid j = 1, 2, \dots, N_{i}\}$ denotes the set of line descriptors from the same image. The approach introduces an adaptive regressor $\mathcal{F}(\cdot)$ that takes $\mathcal{P}_i^{p}$ and $\mathcal{P}_i^{l}$ as inputs and learns to selectively output robust 3D coordinates for estimating the six-degrees-of-freedom (6-DoF) camera pose $\mathbf{T} \in \mathbb{R}^{4\times4}$ of any new query image $\mathbf{I}$ from the same environment.
  • Front-End Processing: The architecture uses a pre-trained SuperPoint feature extractor to obtain keypoints and their descriptors; line descriptors are later derived from these point descriptors, avoiding the need for a separate line descriptor extractor.
  • Separate Regression Branches: Addressing the imbalance between point and line features, the method employs distinct regression streams for each to facilitate focused learning.
  • Point Regressor Branch:
    • An early pruning layer filters extracted descriptors, using the following equation to compute the pruning probability for each descriptor:

      $\alpha_{j}^{p} = \text{Sigmoid}\bigl(\phi^{p}(\mathbf{d}_{j}^{p})\bigr) \in [0,1]$

      • $\alpha_{j}^{p}$ is the pruning probability for descriptor $j$
      • $\phi^{p}$ is a Multi-Layer Perceptron (MLP) whose parameters are shared across all point descriptors
      • $\mathbf{d}_{j}^{p}$ is the point descriptor for keypoint $j$
    • Descriptors with $\alpha_{j}^{p} \le \delta^{p}$ are pruned; only robust descriptors with $\alpha_{j}^{p} > \delta^{p}$ are retained for subsequent refinement.
    • A multi-layer self-attention network refines the retained descriptors, updating each as follows:

      $\prescript{(m+1)}{}{\mathbf{d}_{j}} = \prescript{(m)}{}{\mathbf{d}_{j}} + \phi_{l}\Big(\big[\prescript{(m)}{}{\mathbf{d}_{j}} \,\|\, a_{m}\big(\prescript{(m)}{}{\mathbf{d}_{j}}, \mathcal{E}\big)\big]\Big)$

      • $\prescript{(m)}{}{\mathbf{d}_{j}}$ is the intermediate descriptor for element jj in layer mm
      • $\mathcal{E}$ is the set of reliable descriptors in layer $m$
      • $a_{m}(\cdot)$ is the self-attention mechanism
      • $\phi_{l}$ is an MLP
    • A shared MLP maps the refined descriptors to 3D coordinates (a sketch of this branch is given after this list).
  • Line Regressor Branch:
    • A line descriptor encoder is adapted to derive low-dimensional line descriptors from the point-based descriptors extracted at keypoints.
    • A single transformer model encodes $C$ descriptors sampled along a line segment $\mathbf{l}_{j} \in \mathbb{R}^{4}$ into one descriptor $\mathbf{d}^{l}_{j} \in \mathbb{R}^{256}$, using a simplified transformer encoder that omits positional encoding.
    • A single self-attention layer refines line descriptors.
    • A pruning layer removes unreliable lines, with the reliability probability for line $j$ calculated as:

      $\alpha_{j}^{l} = \text{Sigmoid}\bigl(\phi^{l}(\prescript{(1)}{}{\mathbf{d}_{j}^{l}})\bigr) \in [0,1]$

      • $\alpha_{j}^{l}$ is the reliability probability for line $j$
      • $\phi^{l}$ is an MLP
    • Line descriptors with $\alpha_{j}^{l} > \delta^{l}$ are retained.
    • A linear mapping transforms the retained line descriptors into 3D segment coordinates (see the line-branch sketch after this list).
  • Loss Function:
    • The predicted 3D point $\hat{\mathbf{P}}_{j}$ and line $\hat{\mathbf{L}}_{j}$ are used to optimize the model against their ground truths $\mathbf{P}_{j}$ and $\mathbf{L}_{j}$ from the SfM models, with the mapping loss computed for each image as:

      $\mathcal{L}_{m} = \sum_{j=1} \lVert \mathbf{P}_{j}-\hat{\mathbf{P}}_{j} \rVert_{2} + \sum_{j=1} \lVert \mathbf{L}_{j}-\hat{\mathbf{L}}_{j} \rVert_{2}$

      • $\mathcal{L}_{m}$ is the mapping loss function
      • $\mathbf{P}_{j}$ and $\mathbf{L}_{j}$ are the ground-truth 3D point and line segment
      • $\hat{\mathbf{P}}_{j}$ and $\hat{\mathbf{L}}_{j}$ are the predicted 3D point and line segment
    • Simultaneous optimization of pruning probability prediction is achieved using binary cross entropy (BCE) loss for both points and lines as:

      $\mathcal{L}_{\text{BCE}} = \sum_{j=1}^{M} \mathcal{L}_{\text{BCE}}^{p}(\hat{\alpha}_{j}^{p}, \alpha_{j}^{p}) + \sum_{j=1}^{N} \mathcal{L}_{\text{BCE}}^{l}(\hat{\alpha}^{l}_{j}, \alpha^{l}_{j})$

      • $\mathcal{L}_{\text{BCE}}$ is the total binary cross-entropy loss
      • $\mathcal{L}_{\text{BCE}}^{p}$ is the binary cross-entropy loss for points
      • $\mathcal{L}_{\text{BCE}}^{l}$ is the binary cross-entropy loss for lines
      • $\hat{\alpha}_{j}^{p}$ and $\hat{\alpha}^{l}_{j}$ are the predicted pruning probabilities
      • $\alpha_{j}^{p}$ and $\alpha^{l}_{j}$ are the ground-truth pruning probabilities
    • Reprojection of predicted 3D points and lines onto the image plane using available camera poses is used to further optimize the model:

      $\mathcal{L}_{\pi} = \sum_{j=1} \big\lVert \pi(\mathbf{T},\hat{\mathbf{P}}_{j})-\mathbf{u}_{j}^{p}\big\rVert_{2} + \sum_{j=1} \psi\big(\pi(\mathbf{T},\hat{\mathbf{L}}_{j}), \mathbf{u}_{j}^{l}\big)$

      • $\mathcal{L}_{\pi}$ is the reprojection loss
      • $\mathbf{T}$ is the ground-truth pose
      • $\pi(\cdot)$ is the reprojection function
      • $\mathbf{u}_{j}^{p} \in \mathbb{R}^{2}$ and $\mathbf{u}_{j}^{l} \in \mathbb{R}^{4}$ are the 2D positions of the point and of the line endpoints in the image
      • $\psi(\cdot)$ computes the distance between the reprojected 3D line endpoints and their ground-truth 2D line coordinates $\mathbf{u}_{j}^{l}$
    • A robust reprojection error term, adopted from prior work, is incorporated to mitigate non-convexity:

      $\mathcal{L}_{\pi}^{robust} = \begin{cases} 0, & \text{if } t < \theta \\ \tau(t)\tanh\Big(\frac{\mathcal{L}_{\pi}}{\tau(t)}\Big), & \text{otherwise} \end{cases}$

      • $\mathcal{L}_{\pi}^{robust}$ is the robust reprojection loss
      • $\theta$ is a threshold
      • $\tau(t)$ dynamically rescales $\tanh(\cdot)$
    • The overall loss function (see the loss sketch after this list) is expressed as:

      $\mathcal{L} = \delta_{m}\mathcal{L}_{m} + \delta_{\text{BCE}}\mathcal{L}_{\text{BCE}} + \delta_{\pi}\mathcal{L}_{\pi}^{robust}$

      • $\mathcal{L}$ is the overall loss function
      • the $\delta$ values are hyperparameter coefficients
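
As referenced in the point-branch item above, the following is a hedged PyTorch sketch of the point regressor (early pruning, self-attention refinement, and the shared 3D mapping MLP). Layer widths, the number of attention layers, the use of `nn.MultiheadAttention`, and the threshold $\delta^{p}$ are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class PointBranch(nn.Module):
    """Sketch of the point regressor: prune -> self-attention refinement -> 3D MLP."""
    def __init__(self, dim=256, num_layers=3, delta_p=0.5):
        super().__init__()
        # phi^p: shared MLP predicting the pruning logit for each descriptor
        self.prune_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        # a_m and phi_l for each refinement layer m
        self.attn = nn.ModuleList(nn.MultiheadAttention(dim, 4, batch_first=True)
                                  for _ in range(num_layers))
        self.update = nn.ModuleList(nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                                  nn.Linear(dim, dim))
                                    for _ in range(num_layers))
        # shared MLP mapping refined descriptors to 3D coordinates
        self.to_3d = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))
        self.delta_p = delta_p  # pruning threshold (assumed value)

    def forward(self, d):                                        # d: (M, dim) point descriptors
        alpha = torch.sigmoid(self.prune_mlp(d)).squeeze(-1)     # pruning probabilities alpha^p
        keep = alpha > self.delta_p                               # early pruning
        e = d[keep].unsqueeze(0)                                  # reliable set E, shape (1, M', dim)
        for attn, mlp in zip(self.attn, self.update):
            a, _ = attn(e, e, e)                                  # self-attention a_m(d, E)
            e = e + mlp(torch.cat([e, a], dim=-1))                # residual update with [d || a]
        points_3d = self.to_3d(e.squeeze(0))                      # (M', 3) regressed coordinates
        return points_3d, alpha, keep
```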
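
A corresponding sketch of the line regressor branch, assuming mean pooling over the $C$ sampled descriptors and standard PyTorch transformer modules (both assumptions; the paper only specifies a simplified single-layer encoder without positional encoding, a single self-attention refinement layer, pruning, and a linear mapping to $\mathbb{R}^{6}$):

```python
import torch
import torch.nn as nn

class LineBranch(nn.Module):
    """Sketch of the line regressor: encode sampled descriptors -> refine -> prune -> R^6."""
    def __init__(self, dim=256, num_heads=4, delta_l=0.5):
        super().__init__()
        # simplified transformer encoder; positional encoding is deliberately omitted
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.line_encoder = nn.TransformerEncoder(enc_layer, num_layers=1)
        # single self-attention layer refining the per-line descriptors
        self.refine = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # phi^l: MLP predicting the reliability probability of each line
        self.prune_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.to_segment = nn.Linear(dim, 6)      # linear mapping to the two 3D endpoints
        self.delta_l = delta_l                   # line pruning threshold (assumed value)

    def forward(self, sampled_desc):             # (N, C, dim): C point descriptors per line
        encoded = self.line_encoder(sampled_desc)                 # per-line encoding
        d_l = encoded.mean(dim=1, keepdim=True).transpose(0, 1)   # (1, N, dim) line descriptors (mean pooling assumed)
        refined, _ = self.refine(d_l, d_l, d_l)                   # single self-attention layer
        refined = refined.squeeze(0)                              # (N, dim)
        alpha = torch.sigmoid(self.prune_mlp(refined)).squeeze(-1)  # reliability alpha^l
        keep = alpha > self.delta_l                               # keep reliable lines only
        segments_3d = self.to_segment(refined[keep])              # (N', 6) 3D segment coordinates
        return segments_3d, alpha, keep
```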
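
Finally, a sketch of the loss terms described above. The reprojection term $\mathcal{L}_{\pi}$ is assumed to be computed elsewhere (it requires the camera model), $\tau$ is passed in as an already-evaluated scalar, and the $\delta$ weights shown are placeholders rather than the paper's values:

```python
import torch
import torch.nn.functional as F

def mapping_loss(P_pred, P_gt, L_pred, L_gt):
    """L_m: L2 distances between predicted and SfM ground-truth points / line segments."""
    return (P_pred - P_gt).norm(dim=-1).sum() + (L_pred - L_gt).norm(dim=-1).sum()

def pruning_loss(alpha_p_pred, alpha_p_gt, alpha_l_pred, alpha_l_gt):
    """L_BCE: binary cross-entropy on the predicted point and line pruning probabilities."""
    return (F.binary_cross_entropy(alpha_p_pred, alpha_p_gt, reduction='sum')
            + F.binary_cross_entropy(alpha_l_pred, alpha_l_gt, reduction='sum'))

def robust_reprojection_loss(reproj_loss, t, theta, tau):
    """L_pi^robust: zero while t < theta, otherwise a tanh-rescaled reprojection loss.
    reproj_loss is the tensor L_pi; tau is the dynamic rescaling factor tau(t)."""
    if t < theta:
        return reproj_loss.new_zeros(())
    return tau * torch.tanh(reproj_loss / tau)

def total_loss(L_m, L_bce, L_pi_robust, delta_m=1.0, delta_bce=1.0, delta_pi=1.0):
    """Overall objective: L = delta_m*L_m + delta_BCE*L_BCE + delta_pi*L_pi^robust."""
    return delta_m * L_m + delta_bce * L_bce + delta_pi * L_pi_robust
```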

The method was implemented in PyTorch, with specific network configurations for the point and line branches. Experiments were conducted on the 7-Scenes and Indoor-6 datasets. The 7-Scenes dataset, comprising small-scale indoor scenes with rich point and line textures, was used to compare the proposed method against PtLine, Limap, and PL2Map. The Indoor-6 dataset, captured under varying conditions, served to evaluate the method's adaptability.

Results on 7-Scenes show improved localization performance across all scenes compared to PL2Map. On the Indoor-6 dataset, the method achieved the best re-localization accuracy among regression-based methods. Ablation studies validated the design choices. The method also performs well on lines detected by Line Segment Detector (LSD), improving computational efficiency. Specifically, the re-localization error increases only slightly when using LSD segments, while the average frames per second (FPS) improves significantly from 8.1 to 16.6.

The paper concludes by stating that the new architecture and training approach improves camera localization accuracy by focusing the network's attention on key features during training.
