- The paper presents a novel architecture that separates point and line feature regression to improve camera relocalization accuracy.
- It employs distinct regression branches and adaptive pruning to efficiently filter robust descriptors from keypoints and line segments.
- Experiments on 7-Scenes and Indoor-6 datasets show enhanced relocalization performance and significantly improved FPS compared to prior methods.
The paper introduces an enhanced 3D point-line regression method for camera relocalization, addressing limitations of prior approaches that either rely on computationally expensive feature matching or overfit when a single network encodes both points and lines. The central thesis is that learning point and line features independently, each with its own focus, improves accuracy. To this end, the method introduces a novel architecture that processes each feature type in its own branch before fusing the results for localization.
Key components and concepts in the paper include:
- Problem Statement: The paper defines the camera relocalization problem given a point-based Structure-from-Motion (SfM) model $\mathcal{S}^p \leftarrow \{\mathbf{P}_k \in \mathbb{R}^3 \mid k = 1, 2, \dots, M\}$ and a line-based SfM model $\mathcal{S}^l \leftarrow \{\mathbf{L}_v \in \mathbb{R}^6 \mid v = 1, 2, \dots, N\}$, both created from the same set of reference images $\{I_i\}_n$. $\mathcal{P}_i^p = \{\mathbf{d}_{i,j}^p \leftarrow \{k, \text{None}\} \mid j = 1, 2, \dots, M_i\}$ denotes the set of point descriptors extracted from image $I_i$, where $k$ is the index of the corresponding 3D point $\mathbf{P}_k$. Similarly, $\mathcal{P}_i^l = \{\mathbf{d}_{i,j}^l \leftarrow \{v, \text{None}\} \mid j = 1, 2, \dots, N_i\}$ denotes the set of line descriptors from the same image. The approach introduces an adaptive regressor $F(\cdot)$ that takes $\mathcal{P}_i^p$ and $\mathcal{P}_i^l$ as inputs and learns to selectively output robust 3D coordinates, which are used to estimate the six-degrees-of-freedom (6-DOF) camera pose $\mathbf{T} \in \mathbb{R}^{4 \times 4}$ for any new query image $I$ from the same environment.
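For context on how the regressed coordinates feed into pose estimation, below is a minimal sketch of recovering $\mathbf{T}$ from predicted 2D-3D point correspondences with PnP inside RANSAC. It uses OpenCV's `solvePnPRansac` and handles points only; the paper's actual solver, which would also exploit line correspondences, is not detailed here, so treat this as an illustrative assumption rather than the authors' pipeline.

```python
import cv2
import numpy as np

def estimate_pose(points_2d, points_3d, K):
    """Recover a 6-DOF world-to-camera pose from regressed 2D-3D point
    correspondences via PnP + RANSAC (points only; a full point-line
    solver would also consume the regressed 3D segments)."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64),   # (M, 3) regressed 3D coordinates
        points_2d.astype(np.float64),   # (M, 2) keypoint locations in the query image
        K, None,                        # (3, 3) intrinsics, no distortion assumed
        reprojectionError=3.0,
        iterationsCount=1000,
    )
    if not ok:
        return None, None
    R, _ = cv2.Rodrigues(rvec)          # rotation vector -> 3x3 rotation matrix
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T, inliers
```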
- Front-End Processing: The architecture utilizes a pre-trained SuperPoint feature extractor for keypoint and keyline descriptors, avoiding the necessity for a separate line descriptor extractor.
- Separate Regression Branches: Addressing the imbalance between point and line features, the method employs distinct regression streams for each to facilitate focused learning.
- Point Regressor Branch:
- An early pruning layer filters the extracted descriptors, computing a pruning probability for each descriptor as:
$\alpha_{j}^{p} = \text{Sigmoid}\big(\phi^{p}(\mathbf{d}_{j}^{p})\big) \in [0,1]$
- $\alpha_{j}^{p}$ is the pruning probability for descriptor $j$
- $\phi^{p}$ is a Multi-Layer Perceptron (MLP) whose parameters are shared across all point descriptors
- $\mathbf{d}_{j}^{p}$ is the point descriptor for keypoint $j$
- Descriptors with $\alpha_{j}^{p} \le \delta^{p}$ are pruned, retaining only robust descriptors with $\alpha_{j}^{p} > \delta^{p}$ for subsequent refinement via a multi-graph attention network.
- A multi-self-attention network refines the reliable descriptors, updating each as follows:
${}^{(m+1)}\mathbf{d}_{j} = {}^{(m)}\mathbf{d}_{j} + \phi_{l}\Big(\Big[{}^{(m)}\mathbf{d}_{j} \,\|\, a_{m}\big({}^{(m)}\mathbf{d}_{j}, \mathcal{E}\big)\Big]\Big)$
- ${}^{(m)}\mathbf{d}_{j}$ is the intermediate descriptor for element $j$ at layer $m$
- $\mathcal{E}$ is the set of reliable descriptors at layer $m$
- $a_{m}(\cdot)$ is the self-attention mechanism
- $\phi_{l}$ is an MLP
- A shared MLP maps descriptors to 3D coordinates.
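As a concrete illustration of the point branch described above, here is a minimal PyTorch sketch. The 256-dimensional descriptors, the pruning threshold `delta_p`, the use of standard multi-head self-attention, and the layer sizes are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class PointBranch(nn.Module):
    """Point regressor sketch: early pruning, self-attention refinement,
    then a shared MLP mapping each retained descriptor to 3D."""
    def __init__(self, dim=256, num_layers=3, delta_p=0.5):
        super().__init__()
        self.prune_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=4, batch_first=True) for _ in range(num_layers))
        self.mix = nn.ModuleList(
            nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_layers))
        self.to_xyz = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))
        self.delta_p = delta_p

    def forward(self, desc):                                     # desc: (1, M, dim)
        alpha = torch.sigmoid(self.prune_mlp(desc)).squeeze(-1)  # alpha_j^p, shape (1, M)
        keep = alpha > self.delta_p                              # early pruning (alpha > delta_p)
        d = desc[keep].unsqueeze(0)                              # reliable descriptor set E
        for attn, mix in zip(self.attn, self.mix):
            msg, _ = attn(d, d, d)                               # a_m(d_j, E)
            d = d + mix(torch.cat([d, msg], dim=-1))             # d <- d + phi_l([d || msg])
        return self.to_xyz(d), alpha, keep                       # 3D coords for retained points
```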
- Line Regressor Branch:
- Adapts a line descriptor encoder that derives low-dimensional line descriptors from the point-based descriptors extracted at keypoints.
- A single transformer model encodes $C$ descriptors sampled along a line segment $\mathbf{l}_j \in \mathbb{R}^4$ into a descriptor $\mathbf{d}_j^l \in \mathbb{R}^{256}$, using a simplified transformer encoder that omits positional encodings.
- A single self-attention layer refines line descriptors.
- A pruning layer removes unreliable lines, with the reliability probability for line $j$ computed as:
$\alpha_{j}^{l} = \text{Sigmoid}\big(\phi^{l}({}^{(1)}\mathbf{d}_{j}^{l})\big) \in [0,1]$
- $\alpha_{j}^{l}$ is the reliability probability for line $j$
- $\phi^{l}$ is an MLP
- Line descriptors with $\alpha_{j}^{l} > \delta^{l}$ are retained.
- A linear mapping transforms line descriptors to 3D segment coordinates.
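A matching sketch of the line branch follows, under the same caveats: the pooling of the $C$ sampled descriptors into one line descriptor (mean pooling here), the head counts, and the threshold `delta_l` are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class LineBranch(nn.Module):
    """Line regressor sketch: encode C point descriptors sampled along each
    segment into one 256-d line descriptor with a positional-encoding-free
    transformer, refine with one self-attention layer, prune, and map each
    retained line to a 3D segment (two endpoints, 6 values)."""
    def __init__(self, dim=256, delta_l=0.5):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=1)  # no positional encoding added
        self.refine = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.prune_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.to_segment = nn.Linear(dim, 6)                            # linear map to 3D endpoints
        self.delta_l = delta_l

    def forward(self, sampled_desc):                          # (L, C, dim): C samples per line
        tokens = self.encoder(sampled_desc)                   # simplified transformer encoder
        d = tokens.mean(dim=1).unsqueeze(0)                   # pool samples -> (1, L, dim); pooling is an assumption
        msg, _ = self.refine(d, d, d)                         # single self-attention refinement
        d = d + msg
        alpha = torch.sigmoid(self.prune_mlp(d)).squeeze(-1)  # alpha_j^l reliability, (1, L)
        keep = alpha > self.delta_l                           # retain lines with alpha > delta_l
        return self.to_segment(d[keep]), alpha, keep
```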
- Loss Function:
- The predicted 3D point $\hat{\mathbf{P}}_{j}$ and line $\hat{\mathbf{L}}_{j}$ are used to optimize the model against their ground truths $\mathbf{P}_{j}$ and $\mathbf{L}_{j}$ from the SfM models, with the loss computed for each image as:
$\mathcal{L}_{m} = \sum_{j} \lVert \mathbf{P}_{j}-\hat{\mathbf{P}}_{j} \rVert_{2} + \sum_{j} \lVert \mathbf{L}_{j}-\hat{\mathbf{L}}_{j} \rVert_{2}$
- $\mathcal{L}_{m}$ is the mapping loss
- $\mathbf{P}_{j}$ and $\mathbf{L}_{j}$ are the ground-truth 3D point and line segment
- $\hat{\mathbf{P}}_{j}$ and $\hat{\mathbf{L}}_{j}$ are the predicted 3D point and line segment
- Simultaneous optimization of pruning probability prediction is achieved using binary cross entropy (BCE) loss for both points and lines as:
$\mathcal{L}_{\text{BCE}} = \sum_{j=1}^{M} \mathcal{L}_{\text{BCE}}^{p}(\hat{\alpha}_{j}^{p}, \alpha_{j}^{p}) + \sum_{j=1}^{N} \mathcal{L}_{\text{BCE}}^{l}(\hat{\alpha}_{j}^{l}, \alpha_{j}^{l})$
- $\mathcal{L}_{\text{BCE}}$ is the binary cross entropy loss
- $\mathcal{L}_{\text{BCE}}^{p}$ is the binary cross entropy loss for points
- $\mathcal{L}_{\text{BCE}}^{l}$ is the binary cross entropy loss for lines
- $\hat{\alpha}_{j}^{p}$ and $\hat{\alpha}_{j}^{l}$ are the predicted pruning probabilities
- $\alpha_{j}^{p}$ and $\alpha_{j}^{l}$ are the ground-truth pruning probabilities
- Reprojection of predicted 3D points and lines onto the image plane using available camera poses is used to further optimize the model:
$\mathcal{L}_{\pi} = \sum_{j} \big\lVert \pi(\mathbf{T},\hat{\mathbf{P}}_{j})-\mathbf{u}_{j}^{p}\big\rVert_{2} + \sum_{j} \psi\big(\pi(\mathbf{T},\hat{\mathbf{L}}_{j}), \mathbf{u}_{j}^{l}\big)$
- $\mathcal{L}_{\pi}$ is the reprojection loss
- $\mathbf{T}$ is the ground-truth pose
- $\pi(\cdot)$ is the reprojection function
- $\mathbf{u}_{j}^{p} \in \mathbb{R}^{2}$ and $\mathbf{u}_{j}^{l} \in \mathbb{R}^{4}$ are the 2D positions of the point and of the line endpoints in the image
- $\psi(\cdot)$ computes the distance between the reprojected 3D line endpoints and their ground-truth 2D line coordinates $\mathbf{u}_{j}^{l}$
- A robust projection error term, adopted from prior work, is incorporated to mitigate non-convexity:
$\mathcal{L}_{\pi}^{\text{robust}} =
\begin{cases}
0, & \text{if } t < \theta \\
\tau(t) \tanh\Big(\dfrac{\mathcal{L}_{\pi}}{\tau(t)}\Big), & \text{otherwise}
\end{cases}$
- $\mathcal{L}_{\pi}^{\text{robust}}$ is the robust reprojection loss
- $\theta$ is a threshold
- $\tau(t)$ dynamically rescales $\tanh(\cdot)$
- The overall loss function is expressed as:
$\mathcal{L} = \delta_{m}\mathcal{L}_{m} + \delta_{\text{BCE}}\mathcal{L}_{\text{BCE}} + \delta_{\pi}\mathcal{L}_{\pi}^{\text{robust}}$
- $\mathcal{L}$ is the overall loss function
- $\delta_{m}$, $\delta_{\text{BCE}}$, and $\delta_{\pi}$ are hyperparameter coefficients
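To tie the loss terms together, here is a minimal PyTorch sketch of the training objective. The pinhole form of $\pi(\cdot)$, the omission of the line term $\psi(\cdot)$, the interpretation of $t$ as relative training progress, and the linear decay of $\tau(t)$ are assumptions for illustration; the $\delta$ weights are placeholders rather than the paper's values.

```python
import torch
import torch.nn.functional as F

def project(T, P, K):
    """pi(T, P): pinhole projection of 3D points P (N, 3) with pose T (4, 4) and intrinsics K (3, 3)."""
    Pc = T[:3, :3] @ P.T + T[:3, 3:4]   # world -> camera frame
    uv = (K @ Pc).T
    return uv[:, :2] / uv[:, 2:3]

def mapping_loss(P_hat, P_gt, L_hat, L_gt):
    """L_m: L2 distances to the SfM ground-truth 3D points and 3D line segments."""
    return (P_gt - P_hat).norm(dim=-1).sum() + (L_gt - L_hat).norm(dim=-1).sum()

def pruning_loss(alpha_p_hat, alpha_p, alpha_l_hat, alpha_l):
    """L_BCE: binary cross entropy on point and line pruning probabilities."""
    return (F.binary_cross_entropy(alpha_p_hat, alpha_p, reduction='sum')
            + F.binary_cross_entropy(alpha_l_hat, alpha_l, reduction='sum'))

def reprojection_loss(T, P_hat, u_p, K):
    """Point part of L_pi; the line term psi(.) is omitted in this sketch."""
    return (project(T, P_hat, K) - u_p).norm(dim=-1).sum()

def robust_reprojection_loss(L_pi, t, theta=0.2, tau_max=50.0, tau_min=1.0):
    """Robust wrapper: zero before progress threshold theta, then a
    tanh-rescaled error with tau(t) decaying over training (assumed linear)."""
    if t < theta:
        return L_pi.new_zeros(())
    tau = tau_min + (tau_max - tau_min) * (1.0 - t)
    return tau * torch.tanh(L_pi / tau)

def total_loss(L_m, L_bce, L_pi_robust, delta_m=1.0, delta_bce=1.0, delta_pi=1.0):
    """Overall objective: weighted sum of mapping, pruning, and robust reprojection terms."""
    return delta_m * L_m + delta_bce * L_bce + delta_pi * L_pi_robust
```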
The method was implemented in PyTorch, with specific network configurations for the point and line branches. Experiments were conducted on the 7-Scenes and Indoor-6 datasets. The 7-Scenes dataset, consisting of small-scale indoor environments with rich point and line textures, was used to compare the proposed method against PtLine, Limap, and PL2Map. The Indoor-6 dataset, captured under varying conditions, served to evaluate the method's adaptability.
Results on 7-Scenes show improved localization performance across all scenes compared to PL2Map. On the Indoor-6 dataset, the method achieved the best re-localization accuracy among regression-based methods. Ablation studies validated the design choices. The method also performs well on lines detected by Line Segment Detector (LSD), improving computational efficiency. Specifically, the re-localization error increases only slightly when using LSD segments, while the average frames per second (FPS) improves significantly from 8.1 to 16.6.
The paper concludes by stating that the new architecture and training approach improves camera localization accuracy by focusing the network's attention on key features during training.