LiftFeat: 3D Geometry-Aware Local Feature Matching (2505.03422v1)

Published 6 May 2025 in cs.CV and cs.RO

Abstract: Robust and efficient local feature matching plays a crucial role in applications such as SLAM and visual localization for robotics. Despite great progress, it is still very challenging to extract robust and discriminative visual features in scenarios with drastic lighting changes, low texture areas, or repetitive patterns. In this paper, we propose a new lightweight network called LiftFeat, which lifts the robustness of raw descriptor by aggregating 3D geometric feature. Specifically, we first adopt a pre-trained monocular depth estimation model to generate pseudo surface normal label, supervising the extraction of 3D geometric feature in terms of predicted surface normal. We then design a 3D geometry-aware feature lifting module to fuse surface normal feature with raw 2D descriptor feature. Integrating such 3D geometric feature enhances the discriminative ability of 2D feature description in extreme conditions. Extensive experimental results on relative pose estimation, homography estimation, and visual localization tasks, demonstrate that our LiftFeat outperforms some lightweight state-of-the-art methods. Code will be released at: https://github.com/lyp-deeplearning/LiftFeat.

Summary

  • The paper introduces LiftFeat, a lightweight neural network that fuses 2D descriptors with 3D surface normals to robustly match local features under challenging visual conditions.
  • It employs a multi-task architecture with dedicated heads for keypoint detection, descriptor extraction, and normal estimation, trained using pseudo-surface normals from a depth model.
  • Experimental results show significant improvements in pose, homography, and visual localization tasks, offering an efficient solution for real-world robotics applications.

This paper introduces LiftFeat, a lightweight neural network designed to improve local feature matching by incorporating 3D geometric information, specifically surface normals. The core problem addressed is the unreliability of 2D visual cues in challenging scenarios like drastic lighting changes, low-texture areas, or scenes with repetitive patterns, which can lead to incorrect feature matches. LiftFeat aims to enhance the discriminative ability of 2D descriptors by fusing them with 3D geometric features, making them more robust in these extreme conditions.

The practical application of this research lies in robotics, particularly for tasks like Simultaneous Localization and Mapping (SLAM) and visual localization, where robust and efficient feature matching is critical, often on computationally constrained platforms.

Implementation Details

Network Architecture:

LiftFeat employs a shared feature encoding module and multiple task-specific heads for keypoint detection, descriptor extraction, and surface normal estimation.

  1. Feature Encoding: An input image $I \in \mathbb{R}^{W \times H \times 3}$ passes through 5 convolutional blocks with max-pooling. The feature map depths increase (4, 8, 16, 32, 64). A fusion block then combines features from Block3, Block4, and Block5 using $1 \times 1$ convolutions and bilinear interpolation, resulting in a fused feature map of size $\frac{W}{8} \times \frac{H}{8} \times 64$.
  2. Multi-task Head:
    • Keypoint Head: A $1 \times 1$ convolution generates a keypoint map of size $\frac{H}{8} \times \frac{W}{8} \times (64+1)$. Channel-wise softmax yields the score distribution.
    • Descriptor Head: Bilinear interpolation and $L_2$-normalization produce a descriptor map of size $W \times H \times 64$.
    • Normal Head: Bilinear interpolation produces a 3-channel normal map at the original image resolution (a PyTorch sketch of the full architecture follows this list).
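
To make the data flow concrete, below is a minimal PyTorch sketch of the shared encoder and the three heads described above. The class name LiftFeatSketch, kernel sizes, activations, and the exact fusion arithmetic are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiftFeatSketch(nn.Module):
    """Sketch of the encoder and multi-task heads described above; block
    internals and layer counts are assumptions made for illustration."""

    def __init__(self):
        super().__init__()
        chans = [3, 4, 8, 16, 32, 64]               # channel depths from the text
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),                    # each block halves the resolution
            )
            for i in range(5)
        ])
        # 1x1 convolutions project Block3-Block5 features to 64 channels for fusion
        self.proj3 = nn.Conv2d(16, 64, 1)
        self.proj4 = nn.Conv2d(32, 64, 1)
        self.proj5 = nn.Conv2d(64, 64, 1)
        self.kpt_head = nn.Conv2d(64, 64 + 1, 1)    # 64 position bins + 1 "no keypoint" bin
        self.desc_head = nn.Conv2d(64, 64, 1)
        self.normal_head = nn.Conv2d(64, 3, 1)

    def forward(self, img):                          # img: (B, 3, H, W), H and W divisible by 32
        feats, x = [], img
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        f3, f4, f5 = feats[2], feats[3], feats[4]    # 1/8, 1/16, 1/32 resolution
        size = f3.shape[-2:]                         # fuse everything at 1/8 resolution
        fused = (self.proj3(f3)
                 + F.interpolate(self.proj4(f4), size=size, mode="bilinear", align_corners=False)
                 + F.interpolate(self.proj5(f5), size=size, mode="bilinear", align_corners=False))
        full = img.shape[-2:]
        kpt_logits = self.kpt_head(fused)            # (B, 65, H/8, W/8); softmax over channels
        desc = F.normalize(F.interpolate(self.desc_head(fused), size=full,
                                         mode="bilinear", align_corners=False), dim=1)
        normals = F.normalize(F.interpolate(self.normal_head(fused), size=full,
                                            mode="bilinear", align_corners=False), dim=1)
        return kpt_logits, desc, normals             # dense descriptor and normal maps at (B, C, H, W)
```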

3D Geometric Knowledge Supervision:

To train the surface normal estimation head without requiring manually annotated 3D data, LiftFeat uses pseudo surface normal labels.

  1. A pre-trained monocular depth estimation model, Depth Anything v2, predicts a depth map $Z_I$ from the input image $I$.
  2. Surface normals $\mathbf{n}_P$ are calculated from this depth map. For a pixel $P(u, v)$, the depth gradients $\frac{\partial Z_I}{\partial u}$ and $\frac{\partial Z_I}{\partial v}$ are estimated using finite differences:

    $$\frac{\partial Z_I}{\partial u} \approx Z_I(u+1, v) - Z_I(u-1, v)$$

    $$\frac{\partial Z_I}{\partial v} \approx Z_I(u, v+1) - Z_I(u, v-1)$$

  3. The normalized surface normal vector is then:

    $$\mathbf{n}_P = \frac{\left(-\frac{\partial Z_I}{\partial u},\, -\frac{\partial Z_I}{\partial v},\, 1\right)}{\left\| \left( -\frac{\partial Z_I}{\partial u},\, -\frac{\partial Z_I}{\partial v},\, 1 \right) \right\|}$$

    This provides a scale- and translation-invariant 3D cue (a NumPy sketch of this computation follows the list).
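
To make the pseudo-label generation concrete, here is a small NumPy sketch of the normal computation above. The helper name pseudo_normals_from_depth and the use of np.gradient for the central differences are illustrative choices; the depth map itself is assumed to come from the pre-trained monocular model.

```python
import numpy as np

def pseudo_normals_from_depth(depth: np.ndarray) -> np.ndarray:
    """Pseudo surface-normal labels from an (H, W) depth map, following the
    finite-difference equations above. The depth map would come from a
    monocular model such as Depth Anything v2 (not shown here)."""
    # np.gradient returns central differences divided by 2; rescale to match
    # the un-divided form Z(u+1, v) - Z(u-1, v) used in the equations above.
    dz_dv, dz_du = np.gradient(depth)
    dz_du, dz_dv = 2.0 * dz_du, 2.0 * dz_dv
    # Per-pixel vector (-dZ/du, -dZ/dv, 1), normalized to unit length.
    normals = np.stack([-dz_du, -dz_dv, np.ones_like(depth)], axis=-1)
    return normals / np.linalg.norm(normals, axis=-1, keepdims=True)

# Toy check: a tilted plane Z = 0.5 * u yields a constant normal of about (-0.71, 0, 0.71).
depth = 0.5 * np.tile(np.arange(64, dtype=np.float64), (48, 1))
print(pseudo_normals_from_depth(depth)[24, 32])
```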

3D Geometry-aware Feature Lifting (3D-GFL) Module:

This module fuses the 2D descriptors with the 3D normal features at detected keypoint locations.

  1. For $N$ keypoints $p \in \mathbb{R}^{N \times 2}$ (obtained via Non-Maximum Suppression, NMS), corresponding descriptors $d \in \mathbb{R}^{N \times 64}$ and normal vectors $n \in \mathbb{R}^{N \times 3}$ are sampled.
  2. The dimensions of descriptors and normals are aligned using separate Multi-Layer Perceptrons (MLPs), and then summed.
  3. Positional Encoding (PE) is applied to integrate keypoint location information:

    $$\mathbf{m}_i = \mathrm{PE}(p_i) \odot \left(\mathrm{MLP}_{2D}(\mathbf{d}_i) + \mathrm{MLP}_{3D}(\mathbf{n}_i)\right)$$

    where $\mathbf{m}_i$ is the mixed feature for keypoint $i$.

  4. Stacked self-attention layers (using linear transformers for efficiency) are applied to these mixed features $\mathbf{m}_i$ to allow interaction and aggregation, producing the final lifted descriptors $d^l \in \mathbb{R}^{N \times 64}$. The self-attention mechanism is defined as:

    $$m_i^{n+1} = (m_i^{n}W_{m_i}^q) \odot \sum_{j \in P} \operatorname{Softmax}(m_j^{n}W_{m_j}^k) \odot (m_j^{n}W_{m_j}^v)$$

    Three self-attention layers are used (a PyTorch sketch of this fusion follows the list).
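
The following PyTorch sketch illustrates the fusion just described: MLP alignment, positional encoding applied by elementwise product, and stacked self-attention. The MLP widths, the learned positional encoding, and the use of standard softmax attention in place of the paper's linear transformer are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryAwareFeatureLifting(nn.Module):
    """Sketch of the 3D-GFL fusion described above; layer widths and the
    attention variant are assumptions made for illustration."""

    def __init__(self, dim: int = 64, num_layers: int = 3):
        super().__init__()
        self.mlp_2d = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.mlp_3d = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.pos_enc = nn.Sequential(nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.attn = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            for _ in range(num_layers)
        ])

    def forward(self, kpts, desc, normals):
        # kpts: (B, N, 2), desc: (B, N, 64), normals: (B, N, 3), sampled at keypoint locations
        m = self.pos_enc(kpts) * (self.mlp_2d(desc) + self.mlp_3d(normals))  # mixed features m_i
        for layer in self.attn:                        # three stacked self-attention layers
            m = m + layer(m, m, m, need_weights=False)[0]
        return F.normalize(m, dim=-1)                  # lifted descriptors d^l, (B, N, 64)
```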

An overview of the LiftFeat architecture is shown in Figure 2 of the paper.

Input Image --> Feature Encoding --> Multi-task Heads --> Output Maps
                                      |
                                      +-- Keypoint Map
                                      |
                                      +-- Descriptor Map
                                      |
                                      +-- Normal Map (supervised by pseudo-normals)

Keypoints, Descriptors, Normals --> 3D-GFL Module --> Lifted Descriptors
(Sampled at keypoint locations)       |
                                      +-- MLP alignment
                                      +-- Positional Encoding
                                      +-- Self-Attention Layers

Lifted Descriptors --> Feature Matching

Network Training:

The network is trained end-to-end using a composite loss function on paired images $(I_A, I_B)$.

  1. Keypoint Loss ($L_{keypoint}$): Uses ALIKE detector output as ground truth. Negative Log-Likelihood (NLL) loss is applied to the keypoint logits map.
  2. Normal Loss ($L_{normal}$): Cosine similarity between predicted normals $\mathbf{n}_{\text{pred}}$ and pseudo ground-truth normals $\mathbf{n}_{\text{gt}}$:

    $$L_{\text{normal}} = 1 - \frac{\mathbf{n}_{\text{pred}} \cdot \mathbf{n}_{\text{gt}}}{\|\mathbf{n}_{\text{pred}}\| \, \|\mathbf{n}_{\text{gt}}\|}$$

  3. Descriptor Loss ($L_{desc}$): Based on SuperGlue, it minimizes the negative log-likelihood of the predicted matching score matrix $S$ with respect to the ground-truth matching matrix $M_{\text{gt}}$:

    $$L_{\text{desc}} = -\sum_{i,j} M_{\text{gt}}(i,j) \log S(i,j)$$

  4. Total Loss ($L_{total}$): A weighted sum:

    $$L_{\text{total}} = L_{\text{keypoint}} + \alpha_1 L_{\text{normal}} + \alpha_2 L_{\text{desc}}$$

    Empirically, $\alpha_1 = 2$ and $\alpha_2 = 1$ (a sketch of the composite loss follows the list).
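
The sketch below assembles the three terms into the weighted total loss. The tensor layouts for the keypoint labels (per-cell class indices derived from ALIKE detections) and the ground-truth matches (index pairs into the score matrix) are assumptions about how the supervision might be stored, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def liftfeat_loss(kpt_logits, kpt_labels, n_pred, n_gt, score_matrix, gt_matches,
                  alpha1=2.0, alpha2=1.0):
    """Sketch of the composite training loss described above."""
    # Keypoint loss: NLL over the 65-way per-cell distribution (labels from ALIKE).
    log_p = F.log_softmax(kpt_logits.flatten(2).transpose(1, 2), dim=-1)   # (B, cells, 65)
    l_kpt = F.nll_loss(log_p.reshape(-1, 65), kpt_labels.reshape(-1))

    # Normal loss: 1 - cosine similarity against the pseudo ground-truth normals.
    l_normal = (1.0 - F.cosine_similarity(n_pred, n_gt, dim=-1)).mean()

    # Descriptor loss: NLL of the ground-truth matches under the matching score matrix S.
    log_s = torch.log(score_matrix.clamp_min(1e-8))
    l_desc = -log_s[gt_matches[:, 0], gt_matches[:, 1]].mean()

    return l_kpt + alpha1 * l_normal + alpha2 * l_desc
```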

Training was done on a mixed dataset of MegaDepth and synthetic COCO, with an input image size of 800×600, the Adam optimizer, an initial learning rate of 1e-4, and a batch size of 16. For fine-tuning the feature aggregation module, 1024 matching point pairs were sampled.

Experimental Evaluation and Results

LiftFeat was evaluated on three tasks: relative pose estimation, homography estimation, and visual localization. It was compared against lightweight methods such as ORB, SuperPoint, ALIKE (Tiny), SiLK (VGG backbone), and XFeat. For all methods, the top 4096 keypoints were used with mutual nearest neighbor (MNN) search.
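
For reference, mutual nearest neighbor matching can be expressed in a few lines; this sketch assumes L2-normalized descriptors so that the dot product acts as a cosine similarity.

```python
import torch

def mutual_nearest_neighbor(desc_a, desc_b):
    """Mutual-nearest-neighbor (MNN) matching: keep pairs (i, j) whose
    descriptors are each other's nearest neighbor."""
    sim = desc_a @ desc_b.t()                 # (N_A, N_B) similarity between descriptor sets
    nn_ab = sim.argmax(dim=1)                 # best candidate in B for each descriptor in A
    nn_ba = sim.argmax(dim=0)                 # best candidate in A for each descriptor in B
    idx_a = torch.arange(desc_a.shape[0], device=desc_a.device)
    mutual = nn_ba[nn_ab] == idx_a            # keep pairs where A -> B and B -> A agree
    return torch.stack([idx_a[mutual], nn_ab[mutual]], dim=1)   # (M, 2) matched index pairs
```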

  1. Relative Pose Estimation:
    • Datasets: MegaDepth-1500 (outdoor), ScanNet (indoor).
    • Metrics: AUC of translation and rotation errors at $5^\circ$, $10^\circ$, and $20^\circ$ thresholds.
    • Results: LiftFeat outperformed XFeat and SuperPoint, demonstrating significant improvements in AUC scores across thresholds on both datasets (Table I). For example, on MegaDepth-1500, LiftFeat achieved AUC@5° of 44.7, compared to XFeat's 42.6. On ScanNet, LiftFeat achieved AUC@5° of 18.5, compared to XFeat's 16.7.
  2. Homography Estimation:
    • Dataset: HPatches (planar sequences with illumination and viewpoint changes).
    • Metrics: Mean Homography Accuracy (MHA) at pixel error thresholds of 3, 5, 7.
    • Results: LiftFeat generally outperformed other methods, especially under large viewpoint changes where 3D information helps mitigate appearance distortions (Table II). For viewpoint changes at 7px threshold, LiftFeat achieved 87.5 MHA, compared to XFeat's 86.1.
  3. Visual Localization:
    • Dataset: Aachen Day-Night v1.1 (challenging illumination changes).
    • Metrics: Pose recall at (0.25m/2°, 0.5m/5°, 5m/10°) error thresholds.
    • Results: LiftFeat outperformed ALIKE and XFeat in both day and night scenarios. Notably, in nighttime at (0.25m/2°) threshold, it improved recall to 82.1% from SuperPoint's 77.6% (Table III), suggesting 3D cues are particularly beneficial in low-light conditions.

Ablation Study:

Conducted on the Aachen Day-Night (night subset) to evaluate the impact of the normal head and the 3D-GFL module (Table IV).

  • Baseline (keypoint + raw description): 78.9% recall at (0.25m, 2°).
  • + Normal Head (implicit 3D learning): 79.4% recall.
  • + 3D-GFL (explicit feature fusion): 82.1% recall.

This shows that both components contribute to the performance improvement, with explicit fusion via 3D-GFL being more impactful.

Runtime Analysis:

Compared resource requirements on an Intel i7-10700 CPU and Nvidia Xavier NX GPU for VGA input (Table V).

  • Params: LiftFeat (0.85M) is more lightweight than SuperPoint (1.30M) but larger than XFeat (0.66M).
  • FLOPs: LiftFeat (4.96G) is significantly less than SuperPoint (19.85G) but more than XFeat (1.33G).
  • Runtime (GPU): LiftFeat (7.4 ms) is faster than SuperPoint (36 ms) and slightly slower than XFeat (5.6 ms). LiftFeat offers a good trade-off between accuracy and speed, being significantly faster and more accurate than SuperPoint, and more accurate than XFeat with a modest increase in inference time.

Contributions and Conclusion

The main contributions are:

  1. Proposing LiftFeat, a lightweight network that introduces 3D geometry (surface normals) for local feature matching.
  2. Designing a 3D Geometry-aware Feature Lifting (3D-GFL) module to fuse 2D descriptors with 3D normal features, enhancing discriminability in challenging scenes.
  3. Demonstrating state-of-the-art performance on multiple benchmarks while maintaining efficiency suitable for edge devices.

The paper concludes that integrating 3D geometric features via learned surface normals significantly enhances the robustness and discriminative power of 2D local features, particularly in extreme visual conditions. The use of a pre-trained depth model for pseudo-label generation avoids costly manual annotation for 3D supervision. LiftFeat offers a practical solution for improving feature matching in real-world robotics applications. The qualitative results (Figure 3) visually confirm LiftFeat's improved matching in low-texture, repetitive pattern, and lighting variation scenarios.
