- The paper introduces EDM, an Efficient Deep feature Matching network that improves speed and accuracy in detector-free matching through a novel pipeline.
- EDM utilizes a Correlation Injection Module (CIM) for hierarchical feature correlation and a lightweight bidirectional Axis-Based Regression Head (ABRHead) for subpixel match estimation.
- Experimental results show EDM achieves state-of-the-art performance and high efficiency on benchmarks like MegaDepth, ScanNet, and HPatches for tasks including pose and homography estimation.
The paper introduces an Efficient Deep feature Matching network (EDM) that enhances the speed and accuracy of feature matching by improving each stage of the conventional detector-free matching pipeline. The key components of EDM include a deep but narrow CNN backbone, a Correlation Injection Module (CIM), and a bidirectional axis-based regression head. The results show that EDM achieves state-of-the-art performance with high efficiency, offering valuable practical guidance for real-world applications.
The contributions of this work are:
- A new detector-free matcher is presented that improves efficiency while maintaining accuracy through paradigm redesign.
- The CIM models deep feature correlations with high-level context information and integrates global and local features through hierarchical correlation injection.
- A lightweight bidirectional axis-based regression head is introduced for implicit subpixel-level match estimation.
- Selection strategies are presented to improve both coarse and fine stage accuracy.
The method begins with a ResNet-18 backbone that extracts multi-level feature maps FdA and FdB down to 1/32 scale, balancing efficiency with the capture of high-level contextual information.
The CIM aggregates multi-scale features before coarse matching, using stacked Transformers and Injection Layers (ILs). Deep feature maps are transformed by alternating self-attention and cross-attention L times, capturing feature correlations between images. The 2D rotary positional embedding (RoPE) captures relative spatial information in self-attention layers. Query-Key Normalized Attention (QKNA) replaces vanilla attention to enhance correlation modeling. The QKNA is defined as:
$\mathrm{QKNormAtt}(Q,K,V)=\mathrm{softmax}(s\cdot \hat{Q}\hat{K}^{T})V$
where:
- Q is the query
- K is the key
- V is the value
- s is a manual scale factor
- $\hat{Q}$ and $\hat{K}$ are obtained by applying L2 normalization along the head dimension.
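The QKNA formula above can be sketched in NumPy as follows; this is a minimal single-head illustration (function name, epsilon, and shapes are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def qk_norm_attention(Q, K, V, s=20.0):
    """Query-Key Normalized Attention (QKNA) sketch.

    Q, K, V: (n, d) arrays for a single head. Q and K are L2-normalized
    along the head dimension, and a manual scale factor s replaces the
    usual 1/sqrt(d) scaling before the softmax.
    """
    Qh = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + 1e-8)
    Kh = K / (np.linalg.norm(K, axis=-1, keepdims=True) + 1e-8)
    logits = s * (Qh @ Kh.T)                      # entries bounded in [-s, s]
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over keys
    return attn @ V
```

Because the normalized dot products are bounded by 1, the scale s directly controls the sharpness of the attention distribution, which is the motivation for replacing vanilla attention here.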
Two cascaded ILs upsample features to 1/8 scale, taking backbone local features and deep features with global correlations as inputs. Local features pass through a convolution and batch normalization layer (CB). Low-resolution deep features are fed into a CBA block (convolution, batch normalization, and sigmoid activation function) to generate weights determining local feature detail retention. The output is upsampled and injected into the local features via element-wise product. Global features pass through another CB block and bilinear interpolation upsampling, then are element-wise added to the injected features. Finally, a 3×3 depthwise convolution (DW) alleviates upsampling aliasing.
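The gating-and-injection flow of one IL can be sketched as below. This is a simplified NumPy illustration: the CB/CBA convolution blocks are replaced by identities, bilinear upsampling by nearest-neighbor, and the DW convolution is omitted, so only the sigmoid gate, element-wise injection, and global-feature addition from the description above remain:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) map (bilinear in the paper)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def injection_layer(local_feat, deep_feat, global_feat):
    """Sketch of one Injection Layer (IL).

    local_feat:  (C, 2H, 2W) backbone local features
    deep_feat:   (C, H, W)   low-res deep features -> sigmoid gate weights
    global_feat: (C, H, W)   deep features with global correlations
    """
    gate = 1.0 / (1.0 + np.exp(-deep_feat))   # CBA output: sigmoid weights
    injected = local_feat * upsample2x(gate)  # element-wise product injection
    return injected + upsample2x(global_feat) # add upsampled global features
```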
Coarse-level matches are established from the coarse feature maps FcA and FcB after correlation injection. Each pixel represents an 8×8 grid region in the original images. The feature maps FcA and FcB are flattened to 1-D vectors $\tilde{F}_{c}^{A}$ and $\tilde{F}_{c}^{B}$, and a similarity matrix S is built using the inner product:
$\mathcal{S}(i,j) = \frac{1}{\tau}\left \langle \tilde{F}_{c}^{A}(i), \tilde{F}_{c}^{B}(j)\right \rangle$
where:
- S is the similarity matrix
- F~cA(i) is the flattened feature vector for image A at index i
- F~cB(j) is the flattened feature vector for image B at index j
- τ is the temperature parameter.
The matching probability matrix Pc is obtained by a dual-softmax operator:
$\mathcal{P}_{c} = \mathrm{softmax}\left(\mathcal{S}(i,\cdot)\right)_{j} \cdot \mathrm{softmax}\left(\mathcal{S}(\cdot,j)\right)_{i}$
This can be efficiently implemented by:
$\mathcal{P}_{c} = \frac{\mathcal{Z}}{\left \| \mathcal{Z}(i,\cdot) \right \|_{1}}\cdot \frac{\mathcal{Z}}{\left \| \mathcal{Z}(\cdot,j) \right \|_{1}}$
where:
- $\mathcal{Z}=e^{\mathcal{S}}$ (element-wise exponential)
Maximum values are taken from each row of Pc, and the Top-K scoring candidates whose probabilities exceed a threshold θc are selected as coarse matches.
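The dual-softmax matching and Top-K selection described above can be sketched as follows; function name and default hyperparameter values are illustrative assumptions:

```python
import numpy as np

def coarse_matches(Fa, Fb, tau=0.1, top_k=500, theta_c=0.2):
    """Dual-softmax coarse matching with Top-K selection (sketch).

    Fa: (N_A, C) flattened coarse features of image A
    Fb: (N_B, C) flattened coarse features of image B
    Returns (i, j, score) tuples for the selected coarse matches.
    """
    S = (Fa @ Fb.T) / tau                       # similarity matrix
    Za = np.exp(S - S.max(axis=1, keepdims=True))
    Pa = Za / Za.sum(axis=1, keepdims=True)     # softmax over rows
    Zb = np.exp(S - S.max(axis=0, keepdims=True))
    Pb = Zb / Zb.sum(axis=0, keepdims=True)     # softmax over columns
    Pc = Pa * Pb                                # dual-softmax probabilities
    j = Pc.argmax(axis=1)                       # row-wise maxima
    scores = Pc[np.arange(len(j)), j]
    order = np.argsort(-scores)[:top_k]         # Top-K scoring rows
    keep = order[scores[order] > theta_c]       # threshold theta_c
    return [(int(i), int(j[i]), float(scores[i])) for i in keep]
```

Note that the two softmaxes share the matrix Z = e^S, which is what the efficient implementation above exploits.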
For fine-level matching, offsets are regressed directly from latent features, avoiding explicit pixel-level keypoint localization. Backbone features FfA, FfB and coarse-level features FcA, FcB are summed element-wise and used as inputs. Fine-level corresponding features are extracted using the coarse matching indices and flattened to 1-D vectors FA, FB. The central pixels of the grids are taken as keypoints PA, PB with descriptors FA, FB. A bidirectional refinement strategy produces two sets of fine matches: FA, FB are concatenated as query features FqA, FqB, and the reference features FrB, FrA are concatenated in the reverse order. They pass through query and reference encoders, each a lightweight Multi-Layer Perceptron (MLP). The corresponding features are then concatenated and merged through another MLP.
A lightweight Axis-Based Regression Head (ABRHead) with Soft Coordinate Classification (SCC) is used. The merged feature passes through linear layers that reduce the output dimension to N+1. The N-D tensor passes through a soft-argmax to predict a location parameter μ, while the remaining 1-D tensor passes through a sigmoid to predict a scale parameter σ. The outputs μ and σ shift and scale the distribution generated by a normalizing flow model. The predicted μ is equivalent to the normalized offset Δ.
The prediction confidence is obtained by:
Pf = 1 − (σx + σy)/2
where:
- σx is the σ on X-axis
- σy is the σ on Y-axis
For each bidirectional matching pair, the more confident one is kept if above the fine-level threshold θf.
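Putting SCC and the confidence rule together, a minimal NumPy sketch for one axis pair (the bin-center parameterization and function names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def abr_head(logits_x, logits_y, sigma_x, sigma_y):
    """Sketch of ABRHead's soft coordinate classification (SCC).

    logits_*: (N,) per-axis coordinate-bin logits; soft-argmax over the
    N bins yields a normalized location mu in [0, 1].
    sigma_*:  scale parameters, assumed already passed through a sigmoid.
    Returns ((mu_x, mu_y), confidence).
    """
    def soft_argmax(logits):
        p = np.exp(logits - logits.max())
        p /= p.sum()                              # softmax over bins
        bins = (np.arange(len(p)) + 0.5) / len(p) # normalized bin centers
        return float(p @ bins)                    # expected coordinate

    mu = (soft_argmax(logits_x), soft_argmax(logits_y))
    conf = 1.0 - (sigma_x + sigma_y) / 2.0        # Pf = 1 - (sx + sy)/2
    return mu, conf
```

In a bidirectional setup, this head would be applied to both match directions and the pair with the higher confidence kept, provided it exceeds θf.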
The coarse-level ground truth matches Mc are generated by warping grid centroids from input image IA to IB using relative camera poses and depth maps at 1/8 scale. The matching probability matrix Pc is supervised by minimizing the focal loss:
$\mathcal{L}_{c} = - \frac{1}{\left | \mathcal{M}_{c}\right |}\sum_{\left \langle i,j\right \rangle \in \mathcal{M}_{c}} \alpha \left ( 1- \mathcal{P}_{c}\left \langle i,j\right \rangle\right )^{\gamma}\log \mathcal{P}_{c}\left \langle i,j\right \rangle$
where:
- Lc is the coarse-level loss
- Mc is the ground truth matches
- α is the weighting factor
- γ is the focusing parameter
The residual log-likelihood estimation (RLE) loss is employed to improve offset regression performance:
$\mathcal{L}_{f}=-\log G_{\phi}(\hat{x})-\log Q_{\phi}(\hat{x})+\log\sigma$
where:
- Gϕ(x^) is the distribution learned by the normalizing flow model ϕ
- Qϕ(x^) is a simple Laplace distribution
- σ is the prediction scale parameter
The Laplace distribution loss term $Q_{\phi}(\hat{x})$ is defined as:
$\mathcal{Q}_{\phi}\left ( \hat{x}\right )= \sum_{\mathcal{M}_{f}}\frac{1}{\sigma}e^{-\frac{\left | \mu^{gt} -\mu \right |}{2\sigma}}$
where:
- μgt is the corresponding ground truth offsets.
- $\mathcal{M}_{f}$ is the set of ground truth fine-level matches
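The negative log of the Laplace term per match can be sketched as below, following the density form stated above (the function name is an illustrative assumption, and the flow-model term of the RLE loss is omitted):

```python
import numpy as np

def laplace_rle_term(mu, mu_gt, sigma):
    """Mean negative log of the Laplace term Q_phi over matches (sketch).

    Uses the density form given above, (1/sigma) * exp(-|mu_gt - mu| / (2*sigma)),
    so -log Q = log(sigma) + |mu_gt - mu| / (2*sigma) per match.
    """
    mu, mu_gt, sigma = map(np.asarray, (mu, mu_gt, sigma))
    return float(np.mean(np.log(sigma) + np.abs(mu_gt - mu) / (2.0 * sigma)))
```

The log σ term penalizes over-wide predicted distributions, while the residual term penalizes offset error scaled by the predicted uncertainty.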
The total loss is:
L=λcLc+λfLf
where:
- λc is coarse-level loss weight
- λf is fine-level loss weight
The backbone feature widths from 1/2 scale to 1/32 scale are [32, 64, 128, 256, 256]. L is set to 2 in the CIM. The coordinate bins number N in ABRHead is 16. The attention scale factor s is set to 20. The training utilizes the AdamW optimizer with an initial learning rate of 2e-3 and a batch size of 32.
Experiments were performed for relative pose estimation, homography estimation, and visual localization. The relative pose error is defined as the maximum of angular errors in rotation and translation. Results on the MegaDepth and ScanNet datasets show that EDM demonstrates superior performance compared with sparse and semi-dense methods. For homography estimation, the method was evaluated on the HPatches dataset. The mean reprojection error was computed for the four corners, and the AUC values were reported under 3, 5, and 10-pixel thresholds, where EDM notably outperforms other methods under all thresholds. For visual localization, assessments were done on the InLoc dataset and Aachen v1.1 dataset, within the open-sourced localization pipeline HLoc, where EDM performs comparably to sparse and semi-dense methods, demonstrating robust generalization in visual localization.
The authors also visualized the outcomes of self- and cross-attention separately, and observed that in the context of self-attention, the larger response points are more dispersed across different semantic regions, while in cross-attention, the significant response points are more concentrated in proximity to the potential matching points.
Ablation studies were performed, and the results indicate that adopting QKNA can improve evaluation metrics, especially for AUC@5∘, and that setting L = 2 achieves an optimal balance between performance and efficiency. The soft coordinate classification (SCC) simplifies fine matching local offset regression, and compared to supervising regression results with L1 or L2 loss, the RLE loss significantly enhances regression accuracy without additional inference overhead.
A limitation of the work is that since the feature extraction network uses deeper layers, the efficiency improvement of EDM gradually decreases as the image resolution increases.