
SuperPoint Network: Joint Detection & Description

Updated 24 January 2026
  • SuperPoint Network is a fully-convolutional model that performs joint interest point detection and description in a single forward pass.
  • It employs Homographic Adaptation to aggregate pseudo-labels across random homographies, enhancing robustness in multi-view geometry tasks.
  • The architecture features a shared VGG-style encoder with dedicated detector and descriptor heads, achieving state-of-the-art performance on benchmarks like HPatches.

SuperPoint is a fully-convolutional neural architecture for joint interest point detection and description, operating on full-size images and trained in a self-supervised manner. It is designed to address challenges in multiple-view geometry tasks by producing dense, repeatable interest points and associated descriptors in a single forward pass. The system's distinctive contribution is the introduction of Homographic Adaptation, a multi-homography aggregation procedure enabling robust cross-domain adaptation, particularly from synthetic to real images. When trained on generic visual data such as MS-COCO, SuperPoint yields a richer and more repeatable set of keypoints compared to both its pre-adapted form and traditional corner detectors, and achieves state-of-the-art homography estimation performance on benchmarks such as HPatches (DeTone et al., 2017).

1. Network Architecture

The SuperPoint architecture consists of a VGG-style shared encoder followed by two heads: an interest point detector and a descriptor extractor.

  • Shared Encoder:
    • Two 3×3 conv layers (64 ch), BatchNorm, ReLU
    • 2×2 max-pool (→ H/2 × W/2)
    • Two 3×3 conv layers (64 ch), BN, ReLU
    • 2×2 max-pool (→ H/4 × W/4)
    • Two 3×3 conv layers (128 ch), BN, ReLU
    • 2×2 max-pool (→ H/8 × W/8)
    • Two 3×3 conv layers (128 ch), BN, ReLU
    • Final output is the bottleneck tensor B of shape H_c × W_c × 128, with H_c = H/8 and W_c = W/8.
  • Interest-Point Head (Detector):
    • 3×3 conv (256 ch), BN, ReLU
    • 1×1 conv (65 ch) producing logits X ∈ R^{H_c × W_c × 65}
    • Softmax over the 65 channels (64 “cell” position classes plus 1 “no-point” class)
    • Discard the “no-point” bin; reshape the remaining 64 channels with a pixel shuffle into an H × W detector probability heatmap P_d(u, v) ∈ [0, 1].
  • Descriptor Head:
    • 3×3 conv (256 ch), BN, ReLU
    • 1×1 conv (D ch), with D = 256 in experiments, producing D ∈ R^{H_c × W_c × D}
    • Bicubic upsampling to H × W × D
    • L2-normalization of each D-dimensional vector
    • Final output is a dense descriptor field d(u, v) ∈ R^D.
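The detector head's post-processing (65-way softmax, dropping the "no-point" bin, pixel shuffle) can be sketched in NumPy. This is a shape-level illustration, not the official implementation; `X` is assumed to be the raw logit tensor from the 1×1 conv:

```python
import numpy as np

def logits_to_heatmap(X):
    """Convert detector logits of shape (Hc, Wc, 65) into a full-resolution
    probability heatmap of shape (8*Hc, 8*Wc)."""
    # Numerically stable softmax over the 65 channels
    Z = X - X.max(axis=-1, keepdims=True)
    P = np.exp(Z)
    P /= P.sum(axis=-1, keepdims=True)
    # Drop the 65th "no-point" bin, keeping the 64 cell-position channels
    P = P[..., :64]
    Hc, Wc, _ = P.shape
    # Pixel shuffle: each of the 64 channels is one pixel of an 8x8 cell
    return P.reshape(Hc, Wc, 8, 8).transpose(0, 2, 1, 3).reshape(Hc * 8, Wc * 8)
```

For uniform logits every output pixel receives probability 1/65, since the discarded bin keeps its share of the softmax mass.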

2. Self-Supervised Training via Homographic Adaptation

SuperPoint is trained in a self-supervised paradigm using Homographic Adaptation, which leverages random planar homographies to construct pseudo-ground truth for keypoint locations.

  • Covariant Detector Principle:

The desired equivariance of the detector f_θ with respect to a homography H is formalized as:

f_\theta(I) = \mathcal{H}^{-1}\left(f_\theta(\mathcal{H}(I))\right)

This ensures that the detector's predictions transform consistently under geometric warps.
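A toy numerical check of this property, using a pointwise-squaring response as a stand-in for f_θ and a pure translation (the simplest homography) as the warp; both stand-ins are chosen purely for illustration:

```python
import numpy as np

def toy_detector(I):
    """Pointwise response standing in for f_theta (illustration only)."""
    return I ** 2

rng = np.random.default_rng(0)
I = rng.random((8, 8))

# "Warp" with a translation, detect, then unwarp with the inverse translation
warped_then_detect = toy_detector(np.roll(I, shift=(2, 3), axis=(0, 1)))
detect_then_unwarp = np.roll(warped_then_detect, shift=(-2, -3), axis=(0, 1))

# Equivariance: f(I) == H^{-1}(f(H(I)))
assert np.allclose(detect_then_unwarp, toy_detector(I))
```

A real learned detector satisfies this only approximately, which is precisely why the averaging in Homographic Adaptation helps.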

  • Empirical Aggregation:

Given N_h random homographies H_1, …, H_{N_h}, the adapted detector is:

\widehat{F}(I) = \frac{1}{N_h}\sum_{i=1}^{N_h}\mathcal{H}_i^{-1}\left(f_\theta(\mathcal{H}_i(I))\right)

This is implemented by warping the input image under each H_i, running the detector, warping the response back under H_i^{-1}, and averaging the results to build a robust pseudo-label.

P_avg ← 0
for i = 1 … N_h do
  H_i ← sample_random_homography()
  I_i ← warp_image(I, H_i)
  P_i ← f_theta.detect(I_i)
  P_i ← warp_image(P_i, H_i^{-1})
  P_avg ← P_avg + P_i
end
P_avg ← P_avg / N_h

  • Sampling of Homographies:

Each H is a composition of translation (±t_max), scale (s ~ N(1, σ_s)), in-plane rotation (θ ~ N(0, σ_θ)), and symmetric perspective warp, all sampled from truncated Gaussians to avoid degenerate cases.
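A minimal sketch of such a sampler in NumPy. The default magnitudes, clipping bounds, and composition order here are illustrative assumptions, not the paper's exact values (clipping a Gaussian draw approximates the truncation):

```python
import numpy as np

def sample_random_homography(rng, t_max=0.2, sigma_s=0.1,
                             sigma_theta=0.15, sigma_p=0.05):
    """Sample a random 3x3 homography as translation * rotation * scale *
    perspective (composition order is an assumption for this sketch)."""
    # Translation, uniform in [-t_max, t_max]
    tx, ty = rng.uniform(-t_max, t_max, size=2)
    T = np.array([[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]])
    # Scale: Gaussian around 1, clipped to avoid degenerate shrink/blow-up
    s = np.clip(rng.normal(1.0, sigma_s), 0.7, 1.3)
    S = np.diag([s, s, 1.0])
    # In-plane rotation, clipped Gaussian angle (radians)
    th = np.clip(rng.normal(0.0, sigma_theta), -0.5, 0.5)
    R = np.array([[np.cos(th), -np.sin(th), 0.0],
                  [np.sin(th),  np.cos(th), 0.0],
                  [0.0, 0.0, 1.0]])
    # Symmetric perspective warp via the bottom row
    px, py = np.clip(rng.normal(0.0, sigma_p, size=2), -0.1, 0.1)
    P = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [px, py, 1.0]])
    return T @ R @ S @ P
```

The clipping guarantees the sampled matrix stays well-conditioned (non-zero determinant), which is what "avoiding degenerate cases" requires.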

3. Loss Functions

Training is conducted on image pairs (I, I′) related by a known homography, with both detector and descriptor jointly optimized.

  • Detector (Point) Loss:

Per-cell cross-entropy over 65 classes using ground-truth labels Y:

\mathcal{L}_p(\mathcal{X}, Y) = \frac{1}{H_c W_c} \sum_{h=1}^{H_c}\sum_{w=1}^{W_c} \left[-\log \frac{\exp(X_{h,w,y_{h,w}})}{\sum_{k=1}^{65}\exp(X_{h,w,k})}\right]
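The per-cell cross-entropy can be written directly in NumPy. This is a sketch; `X` holds the raw logits and `Y` the per-cell class index (0–64, including the no-point bin):

```python
import numpy as np

def detector_loss(X, Y):
    """Mean per-cell cross-entropy.
    X: (Hc, Wc, 65) logits; Y: (Hc, Wc) integer labels in [0, 64]."""
    # Stable log-softmax over the 65 classes
    Z = X - X.max(axis=-1, keepdims=True)
    log_p = Z - np.log(np.exp(Z).sum(axis=-1, keepdims=True))
    Hc, Wc = Y.shape
    # Pick the log-probability of the true class in every cell, then average
    picked = log_p[np.arange(Hc)[:, None], np.arange(Wc)[None, :], Y]
    return -picked.mean()
```

As a sanity check, uniform (all-zero) logits give a loss of exactly log 65 regardless of the labels.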

  • Descriptor Loss:

For descriptor cells d_hw ∈ D and d′_h′w′ ∈ D′ from the image pair, define the correspondence indicator:

s_{hwh'w'} = \begin{cases} 1, & \|\widehat{\mathcal{H}}\,\mathbf{p}_{hw}-\mathbf{p}_{h'w'}\|\leq 8 \\ 0, & \text{otherwise} \end{cases}

and use a hinge loss:

l_d(\mathbf{d}, \mathbf{d}'; s) = \lambda_d\,s\,\max(0,\, m_p - \mathbf{d}^\top\mathbf{d}') + (1-s)\,\max(0,\, \mathbf{d}^\top\mathbf{d}' - m_n)

\mathcal{L}_d(\mathcal{D}, \mathcal{D}', S) = \frac{1}{(H_c W_c)^2} \sum_{h,w}\sum_{h',w'} l_d(\mathbf{d}_{hw}, \mathbf{d}'_{h'w'}; s_{hwh'w'})
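A NumPy sketch of the pairwise hinge loss. The dense 4-D correspondence tensor `S` is assumed to have been built from the homography as in the indicator above; the shapes are illustrative:

```python
import numpy as np

def descriptor_loss(D1, D2, S, lam_d=250.0, m_p=1.0, m_n=0.2):
    """Hinge descriptor loss over all cell pairs.
    D1, D2: (Hc, Wc, d) L2-normalized descriptor grids.
    S: (Hc, Wc, Hc, Wc) binary correspondence tensor."""
    # All pairwise dot products between cells of the two grids
    dot = np.einsum('hwd,ijd->hwij', D1, D2)
    # Pull corresponding pairs above m_p, push non-corresponding below m_n
    pos = lam_d * S * np.maximum(0.0, m_p - dot)
    neg = (1.0 - S) * np.maximum(0.0, dot - m_n)
    return (pos + neg).mean()
```

With identical grids, an identity correspondence tensor, and the negative margin relaxed to m_n = 1, both hinge terms vanish and the loss is zero, which is a quick way to check the implementation.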

  • Total Loss:

\mathcal{L} = \mathcal{L}_p(\mathcal{X}, Y) + \mathcal{L}_p(\mathcal{X}', Y') + \lambda\,\mathcal{L}_d(\mathcal{D}, \mathcal{D}', S)

4. Interest-Point and Descriptor Extraction

SuperPoint outputs a dense probability map for interest points and a semi-dense descriptor field.

  • Keypoint Selection:
    • Threshold the heatmap at τ = 0.5
    • Apply 2D non-maximum suppression (NMS) with radius r (4 or 8 pixels) to enforce spatial separation
    • Select the top K points by confidence (e.g., 300 for repeatability, 1000 for homography estimation)
  • Descriptor Extraction:

For each detected keypoint (u, v), the D-dimensional descriptor is sampled from the upsampled descriptor map via bicubic interpolation; all descriptors are L2-normalized.

  • Matching Strategy:

Matching is performed with nearest-neighbor in Euclidean space, optionally employing a ratio test or mutual nearest neighbor check.
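The selection and matching steps above can be sketched as follows: a greedy NMS over the heatmap and a mutual-nearest-neighbor matcher, written for clarity rather than speed (this is not the reference implementation):

```python
import numpy as np

def nms_keypoints(heatmap, conf_thresh=0.5, radius=4, top_k=1000):
    """Greedy 2D NMS: keep peaks in confidence order, suppressing any
    later candidate within `radius` (Chebyshev distance) of a kept one."""
    ys, xs = np.where(heatmap > conf_thresh)
    scores = heatmap[ys, xs]
    order = np.argsort(-scores)                      # high confidence first
    ys, xs, scores = ys[order], xs[order], scores[order]
    keep = []
    suppressed = np.zeros(len(scores), dtype=bool)
    for i in range(len(scores)):
        if suppressed[i]:
            continue
        keep.append(i)
        if len(keep) == top_k:
            break
        d = np.maximum(np.abs(ys - ys[i]), np.abs(xs - xs[i]))
        suppressed |= d <= radius                    # suppress neighbors
    keep = np.array(keep, dtype=int)
    return np.stack([ys[keep], xs[keep]], axis=1), scores[keep]

def mutual_nn_match(desc1, desc2):
    """Mutual nearest-neighbor matching for L2-normalized descriptors
    (for unit vectors, max cosine similarity == min Euclidean distance)."""
    sim = desc1 @ desc2.T
    nn12 = sim.argmax(axis=1)
    nn21 = sim.argmax(axis=0)
    mutual = nn21[nn12] == np.arange(len(desc1))
    return np.stack([np.where(mutual)[0], nn12[mutual]], axis=1)
```

A ratio test could be layered on top of `mutual_nn_match` by comparing the best and second-best similarities per query.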

5. Training Procedure and Data

Training comprises synthetic pre-training, Homographic Adaptation, and joint detection-description optimization.

  • Synthetic Pre-Training (MagicPoint):

Training on rendered shapes with known corner positions for 200k iterations, with batch size 32, Adam optimization (lr = 0.001, β₁ = 0.9, β₂ = 0.999), and random homography data augmentation.

  • Homographic Adaptation on MS-COCO:

Uses 80,000 COCO2014 images, resized to 240×320 grayscale. For each image, N_h = 100 random homographies are applied, and detector heatmaps are averaged to build robust pseudo-labels. A second adaptation round refines this estimation.

  • Joint SuperPoint Training:

Pairs each COCO image with a mild random homography to generate (I, I′) training pairs. Hyperparameters: D = 256, λ_d = 250, m_p = 1, m_n = 0.2, λ = 10⁻⁴. Batch size 32, Adam optimizer (lr = 0.001), with standard vision augmentations (Gaussian noise, motion blur, photometric changes).

6. Evaluation and Comparative Results

Performance is primarily assessed on the HPatches dataset for repeatability and homography estimation.

HPatches Repeatability (240×320 resolution, 300 points, ε = 3 pixels)

Detector NMS=4 NMS=8
SuperPoint 0.652 0.631
MagicPoint 0.575 0.507
FAST 0.575 0.472
Harris 0.620 0.533
Shi-Tomasi 0.606 0.511
Random 0.101 0.103

Analogous improvements are observed under viewpoint variation.

HPatches Homography Estimation (480×640, 1000 points)

A homography is counted correct if the corner-transfer error is at most ε pixels.

Method ε=1 ε=3 ε=5
SuperPoint 0.310 0.684 0.829
LIFT 0.284 0.598 0.717
SIFT 0.424 0.676 0.759
ORB 0.150 0.395 0.538

Breakdown at ε = 3:

  • Repeatability: SuperPoint 0.581, LIFT 0.449, SIFT 0.495, ORB 0.641
  • Mean Localization Error (px): SuperPoint 1.158, LIFT 1.102, SIFT 0.833, ORB 1.157
  • Nearest Neighbor mAP (desc.): SuperPoint 0.821, LIFT 0.664, SIFT 0.694, ORB 0.735
  • Matching Score: SuperPoint 0.470, LIFT 0.315, SIFT 0.313, ORB 0.266

Qualitative results demonstrate dense and robust correspondence under illumination changes; failure modes arise on extreme in-plane rotations outside the sampled training distribution.

7. Significance and Observed Limitations

SuperPoint demonstrates that fully-convolutional, self-supervised architectures can jointly learn interest point detection and dense description from unlabeled natural images, attaining strong geometric invariance without manual annotation. The Homographic Adaptation approach enables effective transfer from synthetic shapes to real scenes through unsupervised pseudo-label aggregation. The observed limitations include reduced performance under extreme in-plane rotations exceeding the geometric conditions encountered during training, which suggests the potential need for more extensive augmentation regimes or alternative geometric regularization approaches for increased robustness.

References

  1. DeTone, D., Malisiewicz, T., & Rabinovich, A. (2017). SuperPoint: Self-Supervised Interest Point Detection and Description. arXiv:1712.07629.
