Descriptor-Free Extensions: FPC-Net
- The paper presents a novel descriptor-free matching approach that replaces explicit descriptors with a consistency-based training objective for keypoint detection.
- FPC-Net leverages a lightweight MobileNetV3-Small backbone and a feature pyramid network to extract multiscale, semantically aligned keypoint features.
- Performance metrics indicate competitive repeatability and homography estimation with an 8 ms runtime, making it well-suited for real-time applications in SLAM and visual localization.
Descriptor-free extensions, exemplified by FPC-Net, represent a paradigm shift in geometric computer vision by eliminating explicit feature descriptors during keypoint extraction and matching. Traditionally, correspondence between interest points across images is established via descriptors, i.e., vectors computed at each detected keypoint for appearance-based matching. FPC-Net instead uses a single-stage keypoint detection network with feature pyramids and a consistency-based training objective to match keypoints implicitly, which removes descriptor storage entirely while retaining competitive performance (Grigore et al., 14 Jul 2025).
1. Network Architecture and Feature Pyramid Construction
FPC-Net uses MobileNetV3-Small as a lightweight convolutional backbone that processes RGB input images. Four intermediate feature maps are extracted at different backbone layers.
These multi-scale features are processed by a Feature Pyramid Network (FPN). Each feature map is projected to a 128-channel embedding via a 1×1 convolution, and the projections are fused using top-down upsampling (bicubic interpolation), yielding multiscale, semantically aligned features.
A final 1×1 convolution followed by batch normalization on the fused features yields a single-channel heatmap of keypoint logits, which is then upsampled to the original resolution to produce the normalized keypoint confidence map.
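The top-down fusion described above can be written compactly. The notation is assumed for illustration and is not taken from the paper: $C_i$ denotes the four backbone feature maps (coarsest at $i = 4$), $L_i$ the 128-channel lateral projections, and $P_i$ the fused pyramid levels. Element-wise addition as the fusion operator follows the standard FPN formulation and is likewise an assumption here.

```latex
\begin{aligned}
L_i &= \mathrm{Conv}_{1\times 1}(C_i), & i &= 1,\dots,4 \\
P_4 &= L_4 \\
P_i &= L_i + \mathrm{Up}_{\mathrm{bicubic}}(P_{i+1}), & i &= 3,2,1
\end{aligned}
```

Under this reading, each $P_i$ combines the spatial detail of its own level with the semantics propagated down from coarser levels.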
2. Descriptor-Free Implicit Matching and Training Objective
FPC-Net dispenses with explicit keypoint descriptors by aligning keypoint heatmap peaks across transformed image pairs using a consistency-based loss. For an RGB image and its counterpart warped by a homography, the two network outputs are supervised to produce aligned heatmap peaks. Pseudo-ground-truth masks for both images are generated from LightGlue matches smoothed by a Gaussian.
The training objective comprises:
- Sigmoid focal loss: applied to both network outputs to encourage detector sharpness.
- Consistency loss: enforces peak correspondence between the two heatmaps through a regression term and a KL divergence term, defined using a warping operator, the sigmoid function, and a spatial softmax.
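The two loss components can be sketched in plain Python on flattened heatmaps. This is an illustration under stated assumptions, not the paper's implementation: `alpha=0.25` and `gamma=2.0` are common focal-loss defaults rather than values from the source, the regression term is taken to be a mean squared error between sigmoid outputs, the two terms are summed with equal weight, and the second heatmap is assumed to have been warped into the first image's frame already.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Mean focal loss over flattened logits and targets in [0, 1]."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        # Binary cross-entropy with logits, in a stable form.
        bce = max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
        p_t = p * t + (1.0 - p) * (1.0 - t)          # probability of the true class
        alpha_t = alpha * t + (1.0 - alpha) * (1.0 - t)
        total += alpha_t * (1.0 - p_t) ** gamma * bce  # down-weight easy pixels
    return total / len(logits)

def spatial_softmax(logits):
    # Normalize a flattened heatmap into a distribution over pixels.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(logits_a, logits_b_warped):
    """Assumed form: MSE between sigmoid maps plus KL between spatial softmaxes."""
    pa = [sigmoid(z) for z in logits_a]
    pb = [sigmoid(z) for z in logits_b_warped]
    regression = sum((x - y) ** 2 for x, y in zip(pa, pb)) / len(pa)
    kl = kl_divergence(spatial_softmax(logits_a), spatial_softmax(logits_b_warped))
    return regression + kl
```

With identical heatmaps the consistency term vanishes, and it grows as the peaks of the two maps drift apart, which is the behavior the training objective relies on.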
At inference, the strongest peaks are extracted from the confidence map (after quantile thresholding and non-maximum suppression). Image-to-image correspondence is then established by nearest-neighbor search in spatial coordinates, exploiting the geometric consistency established during training.
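A minimal sketch of this inference path, assuming heatmaps given as nested lists of confidences. The function names, the default quantile, the NMS radius, and the `max_dist` gate on matches are all hypothetical choices for illustration. The sketch compares raw pixel coordinates directly; how the two views are brought into a comparable frame is not specified here.

```python
import math

def extract_peaks(heatmap, quantile=0.95, nms_radius=1, top_k=100):
    """Return up to top_k (row, col, score) peaks, strongest first."""
    h, w = len(heatmap), len(heatmap[0])
    values = sorted(v for row in heatmap for v in row)
    # Quantile threshold: keep only scores at or above the chosen quantile.
    threshold = values[min(len(values) - 1, int(quantile * len(values)))]
    peaks = []
    for r in range(h):
        for c in range(w):
            v = heatmap[r][c]
            if v < threshold:
                continue
            # Non-maximum suppression: v must be the maximum of its window.
            # Ties on a flat plateau all survive in this simple version.
            window = [heatmap[rr][cc]
                      for rr in range(max(0, r - nms_radius), min(h, r + nms_radius + 1))
                      for cc in range(max(0, c - nms_radius), min(w, c + nms_radius + 1))]
            if v >= max(window):
                peaks.append((r, c, v))
    peaks.sort(key=lambda p: -p[2])
    return peaks[:top_k]

def match_nearest(points_a, points_b, max_dist=3.0):
    """Match each peak in A to its spatially nearest peak in B within max_dist."""
    matches = []
    for i, (ra, ca, _) in enumerate(points_a):
        best_j, best_d = None, max_dist
        for j, (rb, cb, _) in enumerate(points_b):
            d = math.hypot(ra - rb, ca - cb)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            matches.append((i, best_j))
    return matches
```

Note that matching uses coordinates only: no per-keypoint vector is stored or compared, which is where the memory savings come from.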
3. Training Methodology and Data Augmentation
FPC-Net is trained on natural images from the MS-COCO dataset, used without their annotations. Supervision is provided in two phases:
- Phase 1: Supervision with pseudo-ground-truth masks from a SuperPoint teacher network, using only the focal loss.
- Phase 2: Supervision via smoothed keypoint masks derived from LightGlue matches under random homographies, with consistency and focal losses combined.
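The smoothed keypoint masks used as targets can be illustrated as follows. This is a sketch under assumptions: keypoints are integer `(row, col)` pixel positions, `sigma=1.5` is a hypothetical value, and overlapping Gaussians are combined with a maximum so that the mask stays in [0, 1]. The source does not state how the smoothing is actually parameterized.

```python
import math

def gaussian_keypoint_mask(height, width, keypoints, sigma=1.5):
    """Render (row, col) keypoints as a soft target mask with values in [0, 1]."""
    mask = [[0.0] * width for _ in range(height)]
    radius = int(math.ceil(3 * sigma))  # truncate each Gaussian at 3 sigma
    for kr, kc in keypoints:
        for r in range(max(0, kr - radius), min(height, kr + radius + 1)):
            for c in range(max(0, kc - radius), min(width, kc + radius + 1)):
                g = math.exp(-((r - kr) ** 2 + (c - kc) ** 2) / (2 * sigma ** 2))
                # Keep the strongest response where Gaussians overlap.
                mask[r][c] = max(mask[r][c], g)
    return mask
```

A soft target of this kind tolerates small localization errors in the teacher's keypoints, since pixels adjacent to a keypoint still receive a positive label.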
A diverse set of augmentations is deployed using the Albumentations library, including photometric (glass blur, motion blur, defocus, Gaussian noise, brightness/contrast) and geometric (perspective, affine, shift-scale-rotate, piecewise-affine) transformations.
The optimizer is Adam, with a batch size of 8, and training runs on a single NVIDIA V100 GPU. The schedule consists of 10 epochs for phase 1 and 6 epochs for phase 2.
4. Computational Efficiency and Memory Analysis
FPC-Net is highly efficient relative to conventional descriptor-based detectors. The table below lists runtime and descriptor memory costs per image pair:
| Method | Runtime (ms) | Descriptor Size (MB) |
|---|---|---|
| FPC-Net | 8 | 0 |
| SuperPoint | 200 | 614 |
| BRISK | 78 | 153 |
| SIFT | 40 | 307.2 |
| ORB | 20 | 76.8 |
The total parameter count is approximately 2.6M (<10 MB model size), and the feature-map footprint at inference is dominated by the pyramid features.
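The source of the memory gap can be made concrete with a short calculation. The keypoint count, descriptor dimension, and element size below are hypothetical inputs chosen for illustration; they are not the settings behind the table above.

```python
def descriptor_bytes(num_keypoints, dim, bytes_per_element):
    """Storage for one descriptor vector per keypoint."""
    return num_keypoints * dim * bytes_per_element

def coordinate_bytes(num_keypoints, bytes_per_coord=4):
    """Storage for keypoint positions only: two coordinates per keypoint."""
    return num_keypoints * 2 * bytes_per_coord

# Hypothetical example: 1000 keypoints with 256-dimensional float32 descriptors.
with_descriptors = descriptor_bytes(1000, 256, 4)
coordinates_only = coordinate_bytes(1000)
print(with_descriptors, coordinates_only, with_descriptors // coordinates_only)
```

Descriptor storage grows linearly with descriptor dimension, whereas a descriptor-free method only needs the keypoint coordinates themselves.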
5. Performance Evaluation
Key evaluations include repeatability, homography estimation, and pose estimation:
5.1 Keypoint Repeatability on HPatches
| Method | Smallest threshold | Middle threshold | Largest threshold |
|---|---|---|---|
| FPC-Net | 0.46 | 0.59 | 0.67 |
| SuperPoint | 0.31 | 0.53 | 0.65 |
| Shi | 0.27 | 0.44 | 0.59 |
| Harris | 0.45 | 0.59 | 0.68 |
| FAST | 0.31 | 0.55 | 0.74 |
| SIFT | 0.27 | 0.46 | 0.70 |
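For reference, repeatability at a pixel threshold can be computed along the following lines. The exact protocol behind the table is not given here, so this sketch assumes a common symmetric definition: keypoints from the first image are warped by the ground-truth homography, and a keypoint counts as repeated if a keypoint from the other image lies within `eps` pixels. Restricting to the region visible in both images is omitted for brevity.

```python
import math

def apply_homography(H, point):
    """Map an (x, y) point through a 3x3 homography given as nested lists."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def repeatability(points_a, points_b, H_ab, eps):
    """Symmetric fraction of keypoints with a counterpart within eps pixels."""
    if not points_a or not points_b:
        return 0.0
    warped_a = [apply_homography(H_ab, p) for p in points_a]

    def covered(sources, targets):
        # Count sources whose nearest target is within the threshold.
        return sum(1 for s in sources
                   if min(math.dist(s, t) for t in targets) <= eps)

    hits = covered(warped_a, points_b) + covered(points_b, warped_a)
    return hits / (len(points_a) + len(points_b))
```

Larger values of `eps` accept looser localization, which is why every method's score rises from left to right in the table.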
5.2 Homography Estimation Accuracy on HPatches
| Method | Smallest threshold | Middle threshold | Largest threshold |
|---|---|---|---|
| FPC-Net | 0.54 | 0.74 | 0.84 |
| SuperPoint | 0.36 | 0.75 | 0.93 |
| BRISK | 0.31 | 0.64 | 0.78 |
| SIFT | 0.44 | 0.78 | 0.89 |
| ORB | 0.17 | 0.43 | 0.58 |
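Homography estimation accuracy on HPatches is commonly scored with a corner-error criterion, sketched below. Whether the table uses exactly this definition is an assumption: the four image corners are mapped by the estimated and the ground-truth homography, the mean distance between the two sets of mapped corners is the error, and accuracy is the fraction of image pairs whose error is at most `eps` pixels.

```python
import math

def apply_homography(H, point):
    """Map an (x, y) point through a 3x3 homography given as nested lists."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def mean_corner_error(H_est, H_gt, width, height):
    """Average distance between image corners mapped by H_est and by H_gt."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    return sum(math.dist(apply_homography(H_est, c), apply_homography(H_gt, c))
               for c in corners) / len(corners)

def homography_accuracy(errors, eps):
    """Fraction of image pairs whose corner error is within eps pixels."""
    return sum(1 for e in errors if e <= eps) / len(errors)
```

In practice the estimated homography would come from a robust fit (for example RANSAC) on the matched keypoints, so this metric measures detection and matching quality together.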
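On HPatches, FPC-Net exceeds SuperPoint in keypoint repeatability at all three thresholds. In homography estimation it leads at the smallest threshold and trails SuperPoint at the two larger ones. FPC-Net also matches or outperforms SIFT in pose estimation for small correspondence set sizes, as measured on KITTI and EuRoC.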
6. Applications, Limitations, and Future Directions
FPC-Net is particularly suited to large-scale visual localization and SLAM for resource-constrained platforms (such as drones and mobile devices), real-time robotics visual odometry where descriptor storage or transmission is prohibitively expensive, and augmented reality systems requiring low-latency keypoint matching over networks.
The main strengths are near state-of-the-art repeatability and homography estimation without any descriptor storage, real-time execution (8 ms runtime), and small model size (2.6M parameters). The trade-off is that accuracy at large pixel thresholds is slightly lower than that of descriptor-based methods (e.g., SuperPoint). In addition, implicit matching via spatial proximity is susceptible to ambiguities in scenes with strong repetitive structure or extreme viewpoint changes.
Future directions proposed include integration of lightweight verification steps (e.g., learned cross-attention) to improve robustness, extension to dense matching for geometric primitives beyond points (e.g., lines, planes) via multi-channel heatmaps, and exploration of end-to-end training for correspondence estimation without the RANSAC post-processing step (Grigore et al., 14 Jul 2025).