
Descriptor-Free Extensions: FPC-Net

Updated 19 March 2026
  • The paper presents a novel descriptor-free matching approach that replaces explicit descriptors with a consistency-based training objective for keypoint detection.
  • FPC-Net leverages a lightweight MobileNetV3-Small backbone and a feature pyramid network to extract multiscale, semantically aligned keypoint features.
  • Performance metrics indicate competitive repeatability and homography estimation with an 8 ms runtime, making it well-suited for real-time applications in SLAM and visual localization.

Descriptor-free extensions, exemplified by FPC-Net, eliminate explicit feature descriptors from keypoint extraction and matching in geometric computer vision. Traditionally, correspondence between interest points across images is established via descriptors, which are vectors computed at each detected keypoint for appearance-based matching. FPC-Net instead uses a single-stage keypoint detection network with feature pyramids and a consistency-based training objective to match implicitly, without descriptors, which removes descriptor storage while keeping performance competitive (Grigore et al., 14 Jul 2025).

1. Network Architecture and Feature Pyramid Construction

FPC-Net utilizes MobileNetV3-Small as a lightweight convolutional backbone, processing RGB input images $I \in \mathbb{R}^{H\times W\times 3}$, with $H=480$, $W=640$. Four intermediate features, $C^l$, are extracted from the backbone at layers $l \in \{1, 2, 4, 12\}$:

  • $C^1 \in \mathbb{R}^{120 \times 160 \times 16}$
  • $C^2 \in \mathbb{R}^{60 \times 80 \times 24}$
  • $C^4 \in \mathbb{R}^{30 \times 40 \times 40}$
  • $C^{12} \in \mathbb{R}^{15 \times 20 \times 576}$
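
The listed shapes are consistent with each level halving the resolution of the previous one. A minimal sketch of that arithmetic follows; the strides of 4, 8, 16 and 32 are inferred from the shapes above rather than stated in the source, and `LEVELS` and `feature_shapes` are hypothetical names:

```python
# Assumed (stride, channels) per backbone layer, inferred from the listed shapes.
LEVELS = {1: (4, 16), 2: (8, 24), 4: (16, 40), 12: (32, 576)}


def feature_shapes(height=480, width=640):
    """Return {layer: (H_l, W_l, C_l)} for an input of size height x width."""
    return {
        layer: (height // stride, width // stride, channels)
        for layer, (stride, channels) in LEVELS.items()
    }


shapes = feature_shapes()
for layer, shape in shapes.items():
    print(f"C^{layer}: {shape}")
```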

These multi-scale features are processed by a Feature Pyramid Network (FPN). Each $C^l$ is projected to a 128-channel embedding via a $1\times 1$ convolution ($P^l_{init}$), and the projections are fused using top-down upsampling (bicubic interpolation), yielding multiscale, semantically aligned features:

  • $P^{12} = P^{12}_{init}$
  • $P^l = P^l_{init} + \text{Upsample}(P^{l+1}, \text{scale}=2)$ for $l \in \{4, 2, 1\}$, where $P^{l+1}$ is read here as the next coarser pyramid level
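
The top-down recursion can be sketched on toy data. This is an illustration under simplifying assumptions, not the paper's implementation: maps are single-channel nested lists instead of 128-channel tensors, nearest-neighbour upsampling stands in for bicubic interpolation, and `upsample2x`, `add` and `top_down` are hypothetical helper names.

```python
def upsample2x(grid):
    """Nearest-neighbour 2x upsampling of a 2D list (stand-in for bicubic)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def add(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down(p_init):
    """Fuse projected maps ordered coarse to fine: P = P_init + Upsample(coarser P)."""
    fused = [p_init[0]]  # the coarsest level is passed through unchanged
    for level in p_init[1:]:
        fused.append(add(level, upsample2x(fused[-1])))
    return fused


# Toy single-channel maps: 1x1 (coarsest), 2x2, 4x4 (finest).
p_init = [
    [[1.0]],
    [[0.0, 1.0], [2.0, 3.0]],
    [[0.0] * 4 for _ in range(4)],
]
fused = top_down(p_init)
print(fused[1])
print(fused[2])
```

Each finer level receives the already fused coarser map, so coarse context propagates all the way down to the finest resolution.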

A final $1\times 1$ convolution and batch normalization on $P^1$ yields a single-channel heatmap of keypoint logits, $p$, which is then upsampled to the original resolution to produce $\hat{Y} = \sigma(p) \in [0,1]^{H\times W}$, representing the normalized keypoint confidence.

2. Descriptor-Free Implicit Matching and Training Objective

FPC-Net dispenses with explicit keypoint descriptors by aligning keypoint heatmap peaks across transformed image pairs using a consistency-based loss. For an RGB image $I$ and its warped counterpart $I' = \text{Warp}(I, H)$ (with homography $H$), network outputs $p$ and $p'$ are supervised to produce aligned heatmap peaks. Pseudo-ground-truth masks $m$ and $m'$ are generated using LightGlue matches smoothed by a Gaussian.
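
To make the geometric relation concrete, the sketch below maps a single keypoint location through a homography and back through its inverse. It only illustrates the point mapping that relates peaks in $I$ and $I'$, not the image warping used in training; the matrix values and the helper names `apply_homography` and `invert_3x3` are hypothetical.

```python
def apply_homography(H, point):
    """Map (x, y) through a 3x3 homography given as nested lists."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)  # divide by the homogeneous coordinate


def invert_3x3(M):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("homography is singular")
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[value / det for value in row] for row in adj]


# Illustrative homography: mild shear, scale, translation and perspective.
H = [[1.0, 0.1, 5.0], [0.0, 1.2, -3.0], [0.001, 0.0, 1.0]]
keypoint = (100.0, 50.0)
warped = apply_homography(H, keypoint)
restored = apply_homography(invert_3x3(H), warped)
print(warped, restored)
```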

The training objective comprises:

  • Sigmoid focal loss $L_d$: Applied to $(p, m)$ and $(p', m')$ to encourage detector sharpness.
  • Consistency loss $L_c$: Enforces peak correspondence, including a regression term $L_{C_{reg}} = \text{Huber}(\sigma(p \circ H), m') + \text{Huber}(\sigma(p' \circ H^{-1}), m)$ and a KL divergence term $L_{C_{clf}} = \text{KL}[S(p \circ H) \,\|\, S(m')] + \text{KL}[S(p' \circ H^{-1}) \,\|\, S(m)]$, where $\circ H$ denotes warping, $\sigma$ is the sigmoid function, and $S$ is spatial softmax.
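
The loss terms above can be sketched on flattened toy heatmaps. Several choices here are assumptions, since the source does not give them: the Huber threshold `delta=1.0`, the focal parameters `alpha=0.25` and `gamma=2.0`, the unweighted sum of the regression and KL terms, and the premise that the logits passed in have already been warped. All function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def spatial_softmax(values):
    """Softmax over all spatial positions of a flattened map."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def huber(pred, target, delta=1.0):
    """Mean Huber loss; delta=1.0 is an assumed default."""
    total = 0.0
    for p, t in zip(pred, target):
        r = abs(p - t)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(pred)


def kl(p, q, eps=1e-12):
    """KL[p || q] for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def focal(logits, mask, alpha=0.25, gamma=2.0):
    """Mean sigmoid focal loss against a soft mask; alpha and gamma are assumed."""
    total = 0.0
    for z, m in zip(logits, mask):
        p = sigmoid(z)
        ce = -(m * math.log(p + 1e-12) + (1 - m) * math.log(1 - p + 1e-12))
        p_t = p * m + (1 - p) * (1 - m)
        alpha_t = alpha * m + (1 - alpha) * (1 - m)
        total += alpha_t * (1 - p_t) ** gamma * ce
    return total / len(logits)


def consistency(warped_logits, mask_other):
    """One direction of L_c: Huber(sigmoid(p), m') + KL[S(p) || S(m')], p already warped."""
    reg = huber([sigmoid(z) for z in warped_logits], mask_other)
    clf = kl(spatial_softmax(warped_logits), spatial_softmax(mask_other))
    return reg + clf  # unweighted sum is an assumption


mask = [1.0, 0.0, 0.0, 0.0]
aligned = [4.0, -4.0, -4.0, -4.0]   # peak at the masked location
shifted = [-4.0, 4.0, -4.0, -4.0]   # peak one position off
print(consistency(aligned, mask), consistency(shifted, mask))
print(focal(aligned, mask), focal(shifted, mask))
```

In this toy case both losses are smaller when the warped peak coincides with the mask, which is the behavior the training objective relies on.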

At inference, the $N$ strongest peaks $\{x_i\}$ are extracted from $\hat{Y}$ (after quantile thresholding and non-maximum suppression). Image-to-image correspondence is performed by nearest-neighbor search in spatial coordinates, exploiting geometric consistency established during training.
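
The inference steps can be sketched as follows. This is a plain-Python illustration under assumptions: the quantile, suppression radius, `n` and `max_dist` values are placeholders the source does not specify, the brute-force search is chosen for clarity rather than speed, and `extract_peaks` and `match_nearest` are hypothetical names.

```python
import math


def extract_peaks(heatmap, n=500, quantile=0.9, radius=1):
    """Return up to n (x, y) peaks after quantile thresholding and window NMS."""
    h, w = len(heatmap), len(heatmap[0])
    flat = sorted(v for row in heatmap for v in row)
    thresh = flat[min(len(flat) - 1, int(quantile * len(flat)))]
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y][x]
            if v < thresh:
                continue
            window = [
                heatmap[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            if v >= max(window):  # local maximum (ties on a plateau all pass)
                peaks.append((v, x, y))
    peaks.sort(reverse=True)      # strongest first
    return [(x, y) for _, x, y in peaks[:n]]


def match_nearest(points_a, points_b, max_dist=float("inf")):
    """Pair each point in A with its spatially nearest point in B (closer than max_dist)."""
    matches = []
    for i, (xa, ya) in enumerate(points_a):
        best_j, best_d = None, max_dist
        for j, (xb, yb) in enumerate(points_b):
            d = math.hypot(xa - xb, ya - yb)
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None:
            matches.append((i, best_j))
    return matches


heat = [[0.0] * 5 for _ in range(5)]
heat[1][1] = 0.9   # strong peak at (x=1, y=1)
heat[1][2] = 0.5   # neighbour, suppressed by NMS
heat[3][3] = 0.8   # second peak at (x=3, y=3)
peaks = extract_peaks(heat, n=10)
matches = match_nearest(peaks, [(3, 4), (1, 2)])
print(peaks, matches)
```

No descriptor is compared at any point: the pairing depends only on pixel coordinates, which is why it is sensitive to repetitive structure.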

3. Training Methodology and Data Augmentation

FPC-Net is trained on images from the MS-COCO dataset, used here as unlabeled natural images. The supervision signal is provided in two phases:

  • Phase 1: Supervision with pseudo-ground-truth masks from a SuperPoint teacher network, using only the focal loss.
  • Phase 2: Supervision via smoothed keypoint masks derived from LightGlue matches under random homographies, with consistency and focal losses combined.

A diverse set of augmentations is deployed using the Albumentations library, including photometric (glass blur, motion blur, defocus, Gaussian noise, brightness/contrast) and geometric (perspective, affine, shift-scale-rotate, piecewise-affine) transformations.

The optimizer is Adam ($\text{lr}=10^{-3}$, $\beta_1=0.9$, $\beta_2=0.999$), with a batch size of 8 on a single NVIDIA V100 GPU. The training schedule consists of 10 epochs (phase 1) and 6 epochs (phase 2).
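
For reference, the role of these hyperparameters in a single Adam update can be sketched on a scalar parameter. The value `eps=1e-8` is a common default and an assumption here, and the quadratic objective is purely illustrative; `adam_step` is a hypothetical name.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step index."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Illustrative objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
history = []
for t in range(1, 4):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t)
    history.append(w)
print(history)
```

After bias correction the first step has magnitude close to `lr` regardless of the gradient scale, which is why the learning rate largely sets the early step size.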

4. Computational Efficiency and Memory Analysis

FPC-Net is highly efficient relative to conventional descriptor-based detectors. The table below lists memory and runtime costs per image pair:

| Method | Runtime (ms) | Descriptor Size (MB) |
|---|---|---|
| FPC-Net | 8 | 0 |
| SuperPoint | 200 | 614 |
| BRISK | 78 | 153 |
| SIFT | 40 | 307.2 |
| ORB | 20 | 76.8 |

The total parameter count is approximately 2.6M (<10 MB model size), and the feature-map footprint at inference is dominated by the pyramid features ($\approx 10$ MB).

5. Performance Evaluation

Key evaluations include repeatability, homography estimation, and pose estimation:

5.1 Keypoint Repeatability on HPatches

| Method | $\epsilon=1$ | $\epsilon=3$ | $\epsilon=8$ |
|---|---|---|---|
| FPC-Net | 0.46 | 0.59 | 0.67 |
| SuperPoint | 0.31 | 0.53 | 0.65 |
| Shi | 0.27 | 0.44 | 0.59 |
| Harris | 0.45 | 0.59 | 0.68 |
| FAST | 0.31 | 0.55 | 0.74 |
| SIFT | 0.27 | 0.46 | 0.70 |
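
The source does not spell out how repeatability is computed. One common convention, assumed here, counts a keypoint as repeated when a detection in the other image lies within $\epsilon$ pixels after projection into a shared frame, and averages symmetrically over both images. The sketch below follows that convention; `repeatability` is a hypothetical name and the coordinates are made up for illustration.

```python
import math


def repeatability(points_a_in_b, points_b, eps):
    """Symmetric fraction of keypoints with a counterpart within eps pixels.

    points_a_in_b: keypoints of image A already projected into image B's frame.
    """
    if not points_a_in_b or not points_b:
        return 0.0

    def covered(src, dst):
        # Number of src points whose nearest dst point is within eps.
        return sum(
            1
            for (x, y) in src
            if min(math.hypot(x - u, y - v) for (u, v) in dst) <= eps
        )

    hits = covered(points_a_in_b, points_b) + covered(points_b, points_a_in_b)
    return hits / (len(points_a_in_b) + len(points_b))


a_in_b = [(10.0, 10.0), (20.0, 20.0), (50.0, 50.0)]
b = [(10.5, 10.0), (22.0, 20.0), (80.0, 80.0)]
scores = {eps: repeatability(a_in_b, b, eps) for eps in (1, 3, 8)}
print(scores)
```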

5.2 Homography Estimation Accuracy on HPatches

| Method | $\epsilon=1$ | $\epsilon=3$ | $\epsilon=8$ |
|---|---|---|---|
| FPC-Net | 0.54 | 0.74 | 0.84 |
| SuperPoint | 0.36 | 0.75 | 0.93 |
| BRISK | 0.31 | 0.64 | 0.78 |
| SIFT | 0.44 | 0.78 | 0.89 |
| ORB | 0.17 | 0.43 | 0.58 |
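
The source does not define the accuracy criterion behind these numbers. A widely used convention, assumed here, counts an estimated homography as correct when the mean displacement of the four image corners, relative to the ground-truth homography, is at most $\epsilon$ pixels. The sketch below implements that convention; `warp`, `corner_error` and `accuracy` are hypothetical names and the matrices are illustrative.

```python
import math


def warp(H, x, y):
    """Map (x, y) through a 3x3 homography given as nested lists."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return u, v


def corner_error(H_est, H_gt, width=640, height=480):
    """Mean distance between image corners warped by H_est and by H_gt."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    total = 0.0
    for x, y in corners:
        ue, ve = warp(H_est, x, y)
        ug, vg = warp(H_gt, x, y)
        total += math.hypot(ue - ug, ve - vg)
    return total / len(corners)


def accuracy(pairs, eps):
    """Fraction of (H_est, H_gt) pairs whose corner error is at most eps."""
    return sum(1 for est, gt in pairs if corner_error(est, gt) <= eps) / len(pairs)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
off_by_2 = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]    # 2 px shift
off_by_half = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # 0.5 px shift
pairs = [(off_by_2, identity), (off_by_half, identity)]
results = {eps: accuracy(pairs, eps) for eps in (1, 3, 8)}
print(results)
```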

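In the tables above, FPC-Net exceeds SuperPoint in keypoint repeatability at all three thresholds, while in homography estimation it leads at $\epsilon=1$ and trails SuperPoint at $\epsilon=3$ and $\epsilon=8$. It also matches or outperforms SIFT in pose estimation for small correspondence set sizes, as measured on KITTI and EuRoC.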

6. Applications, Limitations, and Future Directions

FPC-Net is particularly suited to large-scale visual localization and SLAM for resource-constrained platforms (such as drones and mobile devices), real-time robotics visual odometry where descriptor storage or transmission is prohibitively expensive, and augmented reality systems requiring low-latency keypoint matching over networks.

FPC-Net offers near state-of-the-art repeatability and homography estimation without any descriptor storage, real-time execution (8 ms runtime), and a small model size (2.6M parameters). However, its homography estimation accuracy at the largest pixel threshold ($\epsilon=8$) is lower than that of descriptor-based methods such as SuperPoint (0.84 versus 0.93). In addition, implicit matching via spatial proximity is susceptible to ambiguities in scenes with strong repetitive structure or extreme viewpoint changes.

Future directions proposed include integration of lightweight verification steps (e.g., learned cross-attention) to improve robustness, extension to dense matching for geometric primitives beyond points (e.g., lines, planes) via multi-channel heatmaps, and exploration of end-to-end training for correspondence estimation without the RANSAC post-processing step (Grigore et al., 14 Jul 2025).
