- The paper introduces EDM, an Efficient Deep feature Matching network that improves speed and accuracy in detector-free matching through a novel pipeline.
- EDM utilizes a Correlation Injection Module (CIM) for hierarchical feature correlation and a lightweight bidirectional Axis-Based Regression Head (ABRHead) for subpixel match estimation.
- Experimental results show EDM achieves state-of-the-art performance and high efficiency on benchmarks like MegaDepth, ScanNet, and HPatches for tasks including pose and homography estimation.
The paper introduces an Efficient Deep feature Matching network (EDM) that enhances the speed and accuracy of feature matching by improving each stage of the conventional detector-free matching pipeline. The key components of EDM include a deep but narrow CNN backbone, a Correlation Injection Module (CIM), and a bidirectional axis-based regression head. The results show that EDM achieves state-of-the-art performance with high efficiency, offering valuable practical guidance for real-world applications.
The contributions of this work are:
- A new detector-free matcher is presented that improves efficiency while maintaining accuracy through paradigm redesign.
- The CIM models deep feature correlations with high-level context information and integrates global and local features through hierarchical correlation injection.
- A lightweight bidirectional axis-based regression head is introduced for implicit subpixel-level match estimation.
- Selection strategies are presented to improve both coarse and fine stage accuracy.
The method begins with a ResNet-18 backbone that extracts multi-level feature maps FdA and FdB down to 1/32 scale, balancing efficiency with the capture of high-level contextual information.
The CIM aggregates multi-scale features before coarse matching, using stacked Transformers and Injection Layers (ILs). Deep feature maps are transformed by alternating self-attention and cross-attention L times, capturing feature correlations between images. The 2D rotary positional embedding (RoPE) captures relative spatial information in self-attention layers. Query-Key Normalized Attention (QKNA) replaces vanilla attention to enhance correlation modeling. The QKNA is defined as:
$\mathrm{QKNormAtt}(Q,K,V)=\mathrm{softmax}(s\cdot \hat{Q}\hat{K}^{T})V$
where:
- Q is the query
- K is the key
- V is the value
- s is a manual scale factor
- $\hat{Q}$ and $\hat{K}$ are obtained by applying L2 normalization along the head dimension.
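The QKNA formula above can be sketched in NumPy as follows; this is a minimal single-head illustration (function name, epsilon, and shapes are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def qk_norm_attention(Q, K, V, s=20.0):
    """Query-Key Normalized Attention (QKNA) sketch.

    Q, K, V: (n, d) arrays for a single head. Q and K are L2-normalized
    along the head dimension, and a manual scale factor s replaces the
    usual 1/sqrt(d) scaling before the softmax.
    """
    Qh = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + 1e-8)
    Kh = K / (np.linalg.norm(K, axis=-1, keepdims=True) + 1e-8)
    logits = s * (Qh @ Kh.T)                      # entries bounded in [-s, s]
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over keys
    return attn @ V
```

Because the normalized dot products are bounded by 1, the scale s directly controls the sharpness of the attention distribution, which is the motivation for replacing vanilla attention here.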
Two cascaded ILs upsample features to 1/8 scale, taking backbone local features and deep features with global correlations as inputs. Local features pass through a convolution and batch normalization layer (CB). Low-resolution deep features are fed into a CBA block (convolution, batch normalization, and sigmoid activation function) to generate weights determining local feature detail retention. The output is upsampled and injected into the local features via element-wise product. Global features pass through another CB block and bilinear interpolation upsampling, then are element-wise added to the injected features. Finally, a 3×3 depthwise convolution (DW) alleviates upsampling aliasing.
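The gating-and-injection flow of one IL can be sketched as below. This is a simplified NumPy illustration: the CB/CBA convolution blocks are replaced by identities, bilinear upsampling by nearest-neighbor, and the DW convolution is omitted, so only the sigmoid gate, element-wise injection, and global-feature addition from the description above remain:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) map (bilinear in the paper)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def injection_layer(local_feat, deep_feat, global_feat):
    """Sketch of one Injection Layer (IL).

    local_feat:  (C, 2H, 2W) backbone local features
    deep_feat:   (C, H, W)   low-res deep features -> sigmoid gate weights
    global_feat: (C, H, W)   deep features with global correlations
    """
    gate = 1.0 / (1.0 + np.exp(-deep_feat))   # CBA output: sigmoid weights
    injected = local_feat * upsample2x(gate)  # element-wise product injection
    return injected + upsample2x(global_feat) # add upsampled global features
```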
Coarse-level matches are established from the coarse feature maps FcA and FcB after correlation injection. Each pixel represents an 8×8 grid region in the original images. The feature maps FcA and FcB are flattened to 1-D vectors $\tilde{F}_{c}^{A}$ and $\tilde{F}_{c}^{B}$, and a similarity matrix S is built using the inner product:
$\mathcal{S}(i,j) = \frac{1}{\tau}\left \langle \tilde{F}_{c}^{A}(i), \tilde{F}_{c}^{B}(j)\right \rangle$
where:
- S is the similarity matrix
- F~cA(i) is the flattened feature vector for image A at index i
- F~cB(j) is the flattened feature vector for image B at index j
- τ is the temperature parameter.
The matching probability matrix Pc is obtained by a dual-softmax operator:
$\mathcal{P}_{c} = \mathrm{softmax}\left(\mathcal{S}(i,\cdot)\right)_{j} \cdot \mathrm{softmax}\left(\mathcal{S}(\cdot,j)\right)_{i}$
This can be efficiently implemented by:
$\mathcal{P}_{c} = \frac{\mathcal{Z}}{\left \| \mathcal{Z}(i,\cdot) \right \|_{1}}\cdot \frac{\mathcal{Z}}{\left \| \mathcal{Z}(\cdot,j) \right \|_{1}}$
where:
- $\mathcal{Z}=e^{\mathcal{S}}$ (element-wise exponential)
Maximum values are taken from each row of Pc, and the Top-K scoring candidates whose probabilities exceed a threshold θc are selected as coarse matches.
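The dual-softmax matching and Top-K selection described above can be sketched as follows; function name and default hyperparameter values are illustrative assumptions:

```python
import numpy as np

def coarse_matches(Fa, Fb, tau=0.1, top_k=500, theta_c=0.2):
    """Dual-softmax coarse matching with Top-K selection (sketch).

    Fa: (N_A, C) flattened coarse features of image A
    Fb: (N_B, C) flattened coarse features of image B
    Returns (i, j, score) tuples for the selected coarse matches.
    """
    S = (Fa @ Fb.T) / tau                       # similarity matrix
    Za = np.exp(S - S.max(axis=1, keepdims=True))
    Pa = Za / Za.sum(axis=1, keepdims=True)     # softmax over rows
    Zb = np.exp(S - S.max(axis=0, keepdims=True))
    Pb = Zb / Zb.sum(axis=0, keepdims=True)     # softmax over columns
    Pc = Pa * Pb                                # dual-softmax probabilities
    j = Pc.argmax(axis=1)                       # row-wise maxima
    scores = Pc[np.arange(len(j)), j]
    order = np.argsort(-scores)[:top_k]         # Top-K scoring rows
    keep = order[scores[order] > theta_c]       # threshold theta_c
    return [(int(i), int(j[i]), float(scores[i])) for i in keep]
```

Note that the two softmaxes share the matrix Z = e^S, which is what the efficient implementation above exploits.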
For fine-level matching, offsets are regressed directly from latent features, avoiding explicit pixel-level keypoint localization. Backbone features FfA, FfB and coarse-level features FcA, FcB are summed element-wise and used as inputs. Fine-level corresponding features are extracted using the coarse matching indices and flattened to 1-D vectors FA, FB. The central pixels of the grids are taken as keypoints PA, PB with descriptors FA, FB. A bidirectional refinement strategy produces two sets of fine matches: FA, FB are concatenated as query features FqA, FqB, and the reference features FrB, FrA are concatenated in the reverse order. They pass through query and reference encoders, each a lightweight Multi-Layer Perceptron (MLP). The corresponding features are then concatenated and merged through another MLP.
A lightweight Axis-Based Regression Head (ABRHead) with Soft Coordinate Classification (SCC) is used. The merged feature passes through linear layers that reduce the output dimension to N+1. The N-D tensor passes through a soft-argmax to predict a location parameter μ, while the remaining 1-D tensor passes through a sigmoid to predict a scale parameter σ. The outputs μ and σ shift and scale the distribution generated by a normalizing flow model. The predicted μ is equivalent to the normalized offset Δ.
The prediction confidence is obtained by:
Pf = 1 − (σx + σy)/2
where:
- σx is the σ on X-axis
- σy is the σ on Y-axis
For each bidirectional matching pair, the more confident one is kept if above the fine-level threshold θf.
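Putting SCC and the confidence rule together, a minimal NumPy sketch for one axis pair (the bin-center parameterization and function names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def abr_head(logits_x, logits_y, sigma_x, sigma_y):
    """Sketch of ABRHead's soft coordinate classification (SCC).

    logits_*: (N,) per-axis coordinate-bin logits; soft-argmax over the
    N bins yields a normalized location mu in [0, 1].
    sigma_*:  scale parameters, assumed already passed through a sigmoid.
    Returns ((mu_x, mu_y), confidence).
    """
    def soft_argmax(logits):
        p = np.exp(logits - logits.max())
        p /= p.sum()                              # softmax over bins
        bins = (np.arange(len(p)) + 0.5) / len(p) # normalized bin centers
        return float(p @ bins)                    # expected coordinate

    mu = (soft_argmax(logits_x), soft_argmax(logits_y))
    conf = 1.0 - (sigma_x + sigma_y) / 2.0        # Pf = 1 - (sx + sy)/2
    return mu, conf
```

In a bidirectional setup, this head would be applied to both match directions and the pair with the higher confidence kept, provided it exceeds θf.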
The coarse-level ground truth matches Mc are generated by warping grid centroids from input image IA to IB using relative camera poses and depth maps at 1/8 scale. The matching probability matrix Pc is supervised by minimizing the focal loss:
$\mathcal{L}_{c} = - \frac{1}{\left | \mathcal{M}_{c}\right |}\sum_{\left \langle i,j\right \rangle \in \mathcal{M}_{c}} \alpha \left ( 1- \mathcal{P}_{c}\left \langle i,j\right \rangle\right )^{\gamma}\log \mathcal{P}_{c}\left \langle i,j\right \rangle$
where:
- Lc is the coarse-level loss
- Mc is the ground truth matches
- α is the weighting factor
- γ is the focusing parameter
The residual log-likelihood estimation (RLE) loss is employed to improve offset regression performance:
$\mathcal{L}_{f}=-\log G_{\phi}(\hat{x})-\log Q_{\phi}(\hat{x})+\log\sigma$
where:
- Gϕ(x^) is the distribution learned by the normalizing flow model ϕ
- Qϕ(x^) is a simple Laplace distribution
- σ is the prediction scale parameter
The Laplace distribution loss term $Q_{\phi}(\hat{x})$ is defined as:
$\mathcal{Q}_{\phi}\left ( \hat{x}\right )= \sum_{\mathcal{M}_{f}}\frac{1}{\sigma}e^{-\frac{\left | \mu^{gt} -\mu \right |}{2\sigma}}$
where:
- μgt is the corresponding ground truth offsets.
- $\mathcal{M}_{f}$ is the set of ground truth fine-level matches
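The negative log of the Laplace term per match can be sketched as below, following the density form stated above (the function name is an illustrative assumption, and the flow-model term of the RLE loss is omitted):

```python
import numpy as np

def laplace_rle_term(mu, mu_gt, sigma):
    """Mean negative log of the Laplace term Q_phi over matches (sketch).

    Uses the density form given above, (1/sigma) * exp(-|mu_gt - mu| / (2*sigma)),
    so -log Q = log(sigma) + |mu_gt - mu| / (2*sigma) per match.
    """
    mu, mu_gt, sigma = map(np.asarray, (mu, mu_gt, sigma))
    return float(np.mean(np.log(sigma) + np.abs(mu_gt - mu) / (2.0 * sigma)))
```

The log σ term penalizes over-wide predicted distributions, while the residual term penalizes offset error scaled by the predicted uncertainty.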
The total loss is:
L=λcLc+λfLf
where:
- λc is coarse-level loss weight
- λf is fine-level loss weight
The backbone feature widths from 1/2 scale to 1/32 scale are [32, 64, 128, 256, 256]. L is set to 2 in the CIM. The coordinate bins number N in ABRHead is 16. The attention scale factor s is set to 20. The training utilizes the AdamW optimizer with an initial learning rate of 2e-3 and a batch size of 32.
Experiments were performed for relative pose estimation, homography estimation, and visual localization. The relative pose error is defined as the maximum of angular errors in rotation and translation. Results on the MegaDepth and ScanNet datasets show that EDM demonstrates superior performance compared with sparse and semi-dense methods. For homography estimation, the method was evaluated on the HPatches dataset. The mean reprojection error was computed for the four corners, and the AUC values were reported under 3, 5, and 10-pixel thresholds, where EDM notably outperforms other methods under all thresholds. For visual localization, assessments were done on the InLoc dataset and Aachen v1.1 dataset, within the open-sourced localization pipeline HLoc, where EDM performs comparably to sparse and semi-dense methods, demonstrating robust generalization in visual localization.
The authors also visualized the outcomes of self- and cross-attention separately, and observed that in the context of self-attention, the larger response points are more dispersed across different semantic regions, while in cross-attention, the significant response points are more concentrated in proximity to the potential matching points.
Ablation studies were performed, and the results indicate that adopting QKNA can improve evaluation metrics, especially for AUC@5∘, and that setting L = 2 achieves an optimal balance between performance and efficiency. The soft coordinate classification (SCC) simplifies fine matching local offset regression, and compared to supervising regression results with L1 or L2 loss, the RLE loss significantly enhances regression accuracy without additional inference overhead.
A limitation of the work is that since the feature extraction network uses deeper layers, the efficiency improvement of EDM gradually decreases as the image resolution increases.