Descriptor-Free Extensions: FPC-Net

Updated 19 March 2026

The paper presents a novel descriptor-free matching approach that replaces explicit descriptors with a consistency-based training objective for keypoint detection.
FPC-Net leverages a lightweight MobileNetV3-Small backbone and a feature pyramid network to extract multiscale, semantically aligned keypoint features.
Performance metrics indicate competitive repeatability and homography estimation with an 8 ms runtime, making it well-suited for real-time applications in SLAM and visual localization.

Descriptor-free extensions, exemplified by FPC-Net, eliminate explicit feature descriptors from keypoint extraction and matching. Traditionally, correspondence between interest points across images is established via descriptors, i.e., vectors computed at each detected keypoint for appearance-based matching. FPC-Net instead uses a single-stage keypoint detection network with feature pyramids and a consistency-based training objective to match keypoints implicitly, removing descriptor storage altogether while retaining competitive performance (Grigore et al., 14 Jul 2025).

1. Network Architecture and Feature Pyramid Construction

FPC-Net uses MobileNetV3-Small as a lightweight convolutional backbone, processing RGB input images $I \in \mathbb{R}^{H\times W\times 3}$ with $H=480$ and $W=640$. Four intermediate feature maps $C^l$ are extracted from the backbone at layers $l \in \{1, 2, 4, 12\}$:

$C^1 \in \mathbb{R}^{120 \times 160 \times 16}$
$C^2 \in \mathbb{R}^{60 \times 80 \times 24}$
$C^4 \in \mathbb{R}^{30 \times 40 \times 40}$
$C^{12} \in \mathbb{R}^{15 \times 20 \times 576}$
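
The listed shapes are consistent with cumulative strides of 4, 8, 16, and 32 relative to the input. The short check below reproduces them; the stride values are inferred from the shapes above rather than stated in the source, and `TAPS` and `feature_shapes` are illustrative names.

```python
# Assumed (stride, channels) per tapped layer, inferred from the listed shapes.
TAPS = {1: (4, 16), 2: (8, 24), 4: (16, 40), 12: (32, 576)}


def feature_shapes(height=480, width=640):
    """Return the (H, W, C) shape of each tapped feature map."""
    return {
        layer: (height // stride, width // stride, channels)
        for layer, (stride, channels) in TAPS.items()
    }


shapes = feature_shapes()
for layer, shape in shapes.items():
    print(f"C^{layer}: {shape}")
```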

These multi-scale features are processed by a Feature Pyramid Network (FPN). Each $C^l$ is projected to a 128-channel embedding $P^l_{init}$ via a 1×1 convolution, and the embeddings are fused by top-down upsampling (bicubic interpolation), yielding multiscale, semantically aligned features:

$P^{12} = P^{12}_{init}$
$P^{l} = P^{l}_{init} + \text{Upsample}(P^{l'}, \text{scale}=2)$ for $l \in \{4, 2, 1\}$, where $l'$ is the next coarser tapped layer ($l' = 12, 4, 2$ respectively)
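
The top-down fusion can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the 1×1 convolutions use random weights, nearest-neighbour upsampling stands in for the bicubic interpolation described above, and all function names are hypothetical.

```python
import numpy as np


def conv1x1(x, weight):
    """1x1 convolution on an (H, W, C_in) map: a per-pixel matrix multiply."""
    return x @ weight  # weight has shape (C_in, C_out)


def upsample2x(x):
    """Nearest-neighbour 2x upsampling (stand-in for bicubic interpolation)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def fpn_top_down(features, weights, levels=(12, 4, 2, 1)):
    """Fuse lateral projections from the coarsest level to the finest."""
    pyramid = {}
    coarser = None
    for level in levels:
        lateral = conv1x1(features[level], weights[level])
        pyramid[level] = lateral if coarser is None else lateral + upsample2x(coarser)
        coarser = pyramid[level]
    return pyramid


rng = np.random.default_rng(0)
shapes = {1: (120, 160, 16), 2: (60, 80, 24), 4: (30, 40, 40), 12: (15, 20, 576)}
features = {l: rng.standard_normal(s).astype(np.float32) for l, s in shapes.items()}
weights = {l: 0.01 * rng.standard_normal((s[2], 128)).astype(np.float32)
           for l, s in shapes.items()}
pyramid = fpn_top_down(features, weights)
print({level: p.shape for level, p in pyramid.items()})
```

Each level doubles the spatial resolution of the one before it, so the 2x upsampled coarser map always matches the lateral projection it is added to.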

A final 1×1 convolution and batch normalization on $P^1$ yield a single-channel heatmap of keypoint logits, $p$, which is then upsampled to the original resolution to produce $\hat{Y} = \sigma(p) \in [0,1]^{H\times W}$, representing the normalized keypoint confidence.

2. Descriptor-Free Implicit Matching and Training Objective

FPC-Net dispenses with explicit keypoint descriptors by aligning keypoint heatmap peaks across transformed image pairs using a consistency-based loss. For an RGB image $I$ and its warped counterpart $I' = \text{Warp}(I, H)$ (with homography $H$), network outputs $p$ and $p'$ are supervised to produce aligned heatmap peaks. Pseudo-ground-truth masks $m$ and $m'$ are generated using LightGlue matches smoothed by a Gaussian.

The training objective comprises:

Sigmoid focal loss $L_d$: Applied to $(p, m)$ and $(p', m')$ to encourage detector sharpness.
Consistency loss $L_c$: Enforces peak correspondence, including a regression term $L_{C_{reg}} = \text{Huber}(\sigma(p \circ H), m') + \text{Huber}(\sigma(p' \circ H^{-1}), m)$ and a KL divergence term $L_{C_{clf}} = \text{KL}[S(p \circ H) \| S(m')] + \text{KL}[S(p' \circ H^{-1}) \| S(m)]$, where $\circ H$ denotes warping, $\sigma$ is the sigmoid function, and $S$ is spatial softmax.
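
The loss terms above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the paper's code: the focal-loss parameters `alpha=0.25` and `gamma=2.0` and the Huber `delta=1.0` are common defaults not given in the source, the warped logits are assumed to be resampled already, and the unweighted sum of the regression and KL terms is a guess at how they are combined.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def spatial_log_softmax(x):
    """Log-softmax taken over all pixels of a 2D map."""
    flat = x.ravel() - x.max()
    return flat - np.log(np.exp(flat).sum())


def sigmoid_focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Mean sigmoid focal loss; alpha and gamma are assumed defaults."""
    prob = sigmoid(logits)
    bce = -(target * np.log(prob + 1e-12) + (1 - target) * np.log(1 - prob + 1e-12))
    p_t = prob * target + (1 - prob) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return float(np.mean(alpha_t * (1 - p_t) ** gamma * bce))


def huber(pred, target, delta=1.0):
    """Mean Huber loss: quadratic below delta, linear above it."""
    err = np.abs(pred - target)
    quad = np.minimum(err, delta)
    return float(np.mean(0.5 * quad ** 2 + delta * (err - quad)))


def kl_spatial(p_logits, q_scores):
    """KL[S(p) || S(q)], with S the spatial softmax."""
    log_p = spatial_log_softmax(p_logits)
    log_q = spatial_log_softmax(q_scores)
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))


def consistency_loss(p_warped, m_prime, p_prime_warped, m):
    """p_warped is p warped by H; p_prime_warped is p' warped by H^-1."""
    reg = huber(sigmoid(p_warped), m_prime) + huber(sigmoid(p_prime_warped), m)
    clf = kl_spatial(p_warped, m_prime) + kl_spatial(p_prime_warped, m)
    return reg + clf  # unweighted sum: an assumption, weights are not in the source
```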

At inference, the $N$ strongest peaks $\{x_i\}$ are extracted from $\hat{Y}$ (after quantile thresholding and non-maximum suppression). Image-to-image correspondence is performed by nearest-neighbor search in spatial coordinates, exploiting geometric consistency established during training.
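
A minimal NumPy sketch of this inference path is shown below. The quantile level, the suppression radius, the mutual-nearest-neighbour check, and the distance gate `max_dist` are illustrative assumptions; the source only states that peaks are thresholded, suppressed, and matched by spatial nearest-neighbour search.

```python
import numpy as np


def extract_keypoints(heatmap, n=500, quantile=0.99, radius=2):
    """Return up to n (row, col) peaks above a quantile threshold after NMS."""
    heatmap = np.asarray(heatmap, dtype=float)
    h, w = heatmap.shape
    thresh = np.quantile(heatmap, quantile)
    padded = np.pad(heatmap, radius, mode="constant", constant_values=-np.inf)
    # Max filter over a (2*radius+1)^2 window, built from shifted views.
    local_max = np.full_like(heatmap, -np.inf)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            local_max = np.maximum(local_max, padded[dy:dy + h, dx:dx + w])
    keep = (heatmap >= local_max) & (heatmap > thresh)
    rows, cols = np.nonzero(keep)
    order = np.argsort(-heatmap[rows, cols])[:n]  # strongest peaks first
    return np.stack([rows[order], cols[order]], axis=1)


def match_nearest(kps_a, kps_b, max_dist=8.0):
    """Mutual nearest neighbours in pixel coordinates, gated by max_dist."""
    dist = np.linalg.norm(kps_a[:, None, :] - kps_b[None, :, :], axis=2)
    nn_ab = dist.argmin(axis=1)
    nn_ba = dist.argmin(axis=0)
    idx_a = np.arange(len(kps_a))
    keep = (nn_ba[nn_ab] == idx_a) & (dist[idx_a, nn_ab] <= max_dist)
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)


heatmap = np.zeros((32, 32))
heatmap[5, 7], heatmap[20, 12], heatmap[21, 12] = 0.9, 0.8, 0.5
keypoints = extract_keypoints(heatmap, n=10)
matches = match_nearest(keypoints, keypoints + np.array([1, 0]))
```

Because matching relies only on pixel coordinates, a scheme like this presumes the two views are close enough, or already roughly aligned, for spatial proximity to be meaningful.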

3. Training Methodology and Data Augmentation

FPC-Net is trained on natural images from the MS-COCO dataset, used without labels. Supervision is provided in two phases:

Phase 1: Supervision with pseudo-ground-truth masks from a SuperPoint teacher network, using only the focal loss.
Phase 2: Supervision via smoothed keypoint masks derived from LightGlue matches under random homographies, with consistency and focal losses combined.

A diverse set of augmentations is deployed using the Albumentations library, including photometric (glass blur, motion blur, defocus, Gaussian noise, brightness/contrast) and geometric (perspective, affine, shift-scale-rotate, piecewise-affine) transformations.

The optimizer is Adam ($\text{lr}=10^{-3}$, $\beta_1=0.9$, $\beta_2=0.999$), with a batch size of 8, running on a single NVIDIA V100 GPU. The training schedule consists of 10 epochs (phase 1) and 6 epochs (phase 2).

4. Computational Efficiency and Memory Analysis

FPC-Net is highly efficient relative to conventional descriptor-based detectors. The table below lists runtime and descriptor memory costs per image pair:

Method	Runtime (ms)	Descriptor Size (MB)
FPC-Net	8	0
SuperPoint	200	614
BRISK	78	153
SIFT	40	307.2
ORB	20	76.8

The total parameter count is approximately 2.6M (<10 MB model size), with the feature map footprint at inference dominated by the pyramid features ($\approx10$ MB).
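
These figures can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes 32-bit floats and counts only the four fused 128-channel pyramid maps; neither assumption is stated in the source, so the result is an order-of-magnitude check rather than a reproduction of the quoted numbers.

```python
BYTES_FP32 = 4   # assumed storage precision
MIB = 2 ** 20

# 2.6M parameters stored as float32.
param_mib = 2.6e6 * BYTES_FP32 / MIB

# Four pyramid levels, each with 128 channels, at the resolutions listed earlier.
pyramid_hw = [(120, 160), (60, 80), (30, 40), (15, 20)]
pyramid_elems = sum(h * w * 128 for h, w in pyramid_hw)
pyramid_mib = pyramid_elems * BYTES_FP32 / MIB

print(f"parameters: {param_mib:.2f} MiB")
print(f"pyramid features: {pyramid_mib:.2f} MiB")
```

Under these assumptions the parameters come to just under 10 MiB and the pyramid maps to roughly 12 MiB, the same order as the figures quoted above; the exact footprint depends on precision and on which intermediate buffers are kept.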

5. Performance Evaluation

Key evaluations include repeatability, homography estimation, and pose estimation:

5.1 Keypoint Repeatability on HPatches

Method	$\epsilon=1$	$\epsilon=3$	$\epsilon=8$
FPC-Net	0.46	0.59	0.67
SuperPoint	0.31	0.53	0.65
Shi	0.27	0.44	0.59
Harris	0.45	0.59	0.68
FAST	0.31	0.55	0.74
SIFT	0.27	0.46	0.70
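
For orientation, repeatability at threshold $\epsilon$ is commonly computed as the fraction of keypoints that, once mapped through the ground-truth homography, land within $\epsilon$ pixels of a keypoint detected in the other image. The sketch below follows that common symmetric definition; the paper's exact protocol (for example, filtering points outside the shared view) is not given here and may differ, and the function names are illustrative.

```python
import numpy as np


def warp_points(points_xy, H):
    """Apply a 3x3 homography to an (N, 2) array of (x, y) points."""
    homog = np.hstack([points_xy, np.ones((len(points_xy), 1))])
    warped = homog @ H.T
    return warped[:, :2] / warped[:, 2:3]


def repeatability(kps_a, kps_b, H_ab, eps):
    """Symmetric repeatability; visibility filtering is omitted for brevity."""
    if len(kps_a) == 0 or len(kps_b) == 0:
        return 0.0
    a_in_b = warp_points(kps_a, H_ab)
    b_in_a = warp_points(kps_b, np.linalg.inv(H_ab))
    d_ab = np.linalg.norm(a_in_b[:, None] - kps_b[None], axis=2).min(axis=1)
    d_ba = np.linalg.norm(b_in_a[:, None] - kps_a[None], axis=2).min(axis=1)
    repeated = (d_ab <= eps).sum() + (d_ba <= eps).sum()
    return float(repeated / (len(kps_a) + len(kps_b)))


H_ab = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])  # shift x by 2
kps_a = np.array([[10.0, 10.0], [30.0, 40.0]])
kps_b = np.array([[12.0, 10.0], [100.0, 100.0]])
score = repeatability(kps_a, kps_b, H_ab, eps=1.0)
```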

5.2 Homography Estimation Accuracy on HPatches

Method	$\epsilon=1$	$\epsilon=3$	$\epsilon=8$
FPC-Net	0.54	0.74	0.84
SuperPoint	0.36	0.75	0.93
BRISK	0.31	0.64	0.78
SIFT	0.44	0.78	0.89
ORB	0.17	0.43	0.58
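
Homography estimation accuracy at threshold $\epsilon$ is typically reported as the share of image pairs whose mean corner error is at most $\epsilon$ pixels, where the corner error compares the four image corners warped by the estimated and the ground-truth homography. The sketch below assumes that convention, which the source does not spell out; `corner_error` and `homography_accuracy` are illustrative names.

```python
import numpy as np


def warp_points(points_xy, H):
    """Apply a 3x3 homography to an (N, 2) array of (x, y) points."""
    homog = np.hstack([points_xy, np.ones((len(points_xy), 1))])
    warped = homog @ H.T
    return warped[:, :2] / warped[:, 2:3]


def corner_error(H_est, H_gt, height=480, width=640):
    """Mean distance between image corners warped by H_est and by H_gt."""
    corners = np.array(
        [[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]],
        dtype=float,
    )
    diff = warp_points(corners, H_est) - warp_points(corners, H_gt)
    return float(np.mean(np.linalg.norm(diff, axis=1)))


def homography_accuracy(errors, eps):
    """Fraction of image pairs whose corner error is at most eps pixels."""
    return float(np.mean(np.asarray(errors) <= eps))


H_gt = np.eye(3)
H_est = np.array([[1.0, 0.0, 3.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])  # 3 px shift
error = corner_error(H_est, H_gt)
accuracy = homography_accuracy([0.5, 2.0, 3.0, 10.0], eps=3.0)
```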

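In the tables above, FPC-Net outperforms SuperPoint in keypoint repeatability at all three thresholds and in homography estimation at $\epsilon=1$, while trailing it at $\epsilon=3$ and $\epsilon=8$. It also matches or outperforms SIFT in pose estimation for small correspondence set sizes, as measured on KITTI and EuRoC.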

6. Applications, Limitations, and Future Directions

FPC-Net is particularly suited to large-scale visual localization and SLAM for resource-constrained platforms (such as drones and mobile devices), real-time robotics visual odometry where descriptor storage or transmission is prohibitively expensive, and augmented reality systems requiring low-latency keypoint matching over networks.

Key strengths include near state-of-the-art repeatability and competitive homography estimation without any descriptor storage, real-time execution (8 ms runtime), and a small model (2.6M parameters). The main trade-off is homography estimation accuracy at the largest pixel threshold ($\epsilon=8$), where FPC-Net (0.84) trails descriptor-based methods such as SuperPoint (0.93) and SIFT (0.89). In addition, implicit matching via spatial proximity is susceptible to ambiguities in scenes with strongly repetitive structure or extreme viewpoint changes.

Future directions proposed include integration of lightweight verification steps (e.g., learned cross-attention) to improve robustness, extension to dense matching for geometric primitives beyond points (e.g., lines, planes) via multi-channel heatmaps, and exploration of end-to-end training for correspondence estimation without the RANSAC post-processing step (Grigore et al., 14 Jul 2025).


References (1)

Grigore et al. (2025). FPC-Net: Revisiting SuperPoint with Descriptor-Free Keypoint Detection via Feature Pyramids and Consistency-Based Implicit Matching.
