
SuperPoint Network: Joint Detection & Description

Updated 24 January 2026
  • SuperPoint Network is a fully-convolutional model that performs joint interest point detection and description in a single forward pass.
  • It employs Homographic Adaptation to aggregate pseudo-labels across random homographies, enhancing robustness in multi-view geometry tasks.
  • The architecture features a shared VGG-style encoder with dedicated detector and descriptor heads, achieving state-of-the-art performance on benchmarks like HPatches.

SuperPoint is a fully-convolutional neural architecture for joint interest point detection and description, operating on full-size images and trained in a self-supervised manner. It is designed to address challenges in multiple-view geometry tasks by producing dense, repeatable interest points and associated descriptors in a single forward pass. The system's distinctive contribution is the introduction of Homographic Adaptation, a multi-homography aggregation procedure enabling robust cross-domain adaptation, particularly from synthetic to real images. When trained on generic visual data such as MS-COCO, SuperPoint yields a richer and more repeatable set of keypoints compared to both its pre-adapted form and traditional corner detectors, and achieves state-of-the-art homography estimation performance on benchmarks such as HPatches (DeTone et al., 2017).

1. Network Architecture

The SuperPoint architecture consists of a VGG-style shared encoder followed by two heads: an interest point detector and a descriptor extractor.

  • Shared Encoder:
    • Two 3×3 conv layers (64 ch), BatchNorm, ReLU
    • 2×2 max-pool (→ H/2 × W/2)
    • Two 3×3 conv layers (64 ch), BN, ReLU
    • 2×2 max-pool (→ H/4 × W/4)
    • Two 3×3 conv layers (128 ch), BN, ReLU
    • 2×2 max-pool (→ H/8 × W/8)
    • Two 3×3 conv layers (128 ch), BN, ReLU
    • Final output is the bottleneck tensor B of shape H_c × W_c × 128, with H_c = H/8 and W_c = W/8.
  • Interest-Point Head (Detector):
    • 3×3 conv (256 ch), BN, ReLU
    • 1×1 conv (65 ch) producing logits X ∈ R^{H_c × W_c × 65}
    • Softmax over the 65 channels (64 “cell” position classes plus 1 “no-point” class)
    • Discard the “no-point” bin; reshape the remaining 64 channels with a pixel shuffle into an H × W detector probability heatmap P_d(u, v) ∈ [0, 1].
  • Descriptor Head:
    • 3×3 conv (256 ch), BN, ReLU
    • 1×1 conv (D ch), with D = 256 in experiments, producing D ∈ R^{H_c × W_c × D}
    • Bicubic upsampling to H × W × D
    • L2-normalization of each D-dimensional vector
    • Final output is a dense descriptor field d(u, v) ∈ R^D.
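The detector head's post-processing (65-way softmax, dropping the "no-point" bin, pixel shuffle) can be sketched in NumPy. This is a shape-level illustration, not the official implementation; `X` is assumed to be the raw logit tensor from the 1×1 conv:

```python
import numpy as np

def logits_to_heatmap(X):
    """Convert detector logits of shape (Hc, Wc, 65) into a full-resolution
    probability heatmap of shape (8*Hc, 8*Wc)."""
    # Numerically stable softmax over the 65 channels
    Z = X - X.max(axis=-1, keepdims=True)
    P = np.exp(Z)
    P /= P.sum(axis=-1, keepdims=True)
    # Drop the 65th "no-point" bin, keeping the 64 cell-position channels
    P = P[..., :64]
    Hc, Wc, _ = P.shape
    # Pixel shuffle: each of the 64 channels is one pixel of an 8x8 cell
    return P.reshape(Hc, Wc, 8, 8).transpose(0, 2, 1, 3).reshape(Hc * 8, Wc * 8)
```

For uniform logits every output pixel receives probability 1/65, since the discarded bin keeps its share of the softmax mass.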

2. Self-Supervised Training via Homographic Adaptation

SuperPoint is trained in a self-supervised paradigm using Homographic Adaptation, which leverages random planar homographies to construct pseudo-ground truth for keypoint locations.

  • Covariant Detector Principle:

The desired equivariance of the detector f_θ with respect to a homography H is formalized as:

f_\theta(I) = \mathcal{H}^{-1}\left(f_\theta(\mathcal{H}(I))\right)

This ensures that the detector's predictions transform consistently under geometric warps.
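A toy numerical check of this property, using a pointwise-squaring response as a stand-in for f_θ and a pure translation (the simplest homography) as the warp; both stand-ins are chosen purely for illustration:

```python
import numpy as np

def toy_detector(I):
    """Pointwise response standing in for f_theta (illustration only)."""
    return I ** 2

rng = np.random.default_rng(0)
I = rng.random((8, 8))

# "Warp" with a translation, detect, then unwarp with the inverse translation
warped_then_detect = toy_detector(np.roll(I, shift=(2, 3), axis=(0, 1)))
detect_then_unwarp = np.roll(warped_then_detect, shift=(-2, -3), axis=(0, 1))

# Equivariance: f(I) == H^{-1}(f(H(I)))
assert np.allclose(detect_then_unwarp, toy_detector(I))
```

A real learned detector satisfies this only approximately, which is precisely why the averaging in Homographic Adaptation helps.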

  • Empirical Aggregation:

Given N_h random homographies H_1, …, H_{N_h}, the adapted detector is:

\widehat{F}(I) = \frac{1}{N_h}\sum_{i=1}^{N_h}\mathcal{H}_i^{-1}\left(f_\theta(\mathcal{H}_i(I))\right)

This is implemented by warping the input image under each H_i, running the detector, warping the response back under H_i^{-1}, and averaging the results to build a robust pseudo-label.

P_avg ← 0
for i = 1 … N_h do
  H_i ← sample_random_homography()
  I_i ← warp_image(I, H_i)
  P_i ← f_theta.detect(I_i)
  P_i ← warp_image(P_i, H_i^{-1})
  P_avg ← P_avg + P_i
end
P_avg ← P_avg / N_h

  • Sampling of Homographies:

Each H is a composition of translation (±t_max), scale (s ~ N(1, σ_s)), in-plane rotation (θ ~ N(0, σ_θ)), and symmetric perspective warp, all sampled from truncated Gaussians to avoid degenerate cases.
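A minimal sketch of such a sampler in NumPy. The default magnitudes, clipping bounds, and composition order here are illustrative assumptions, not the paper's exact values (clipping a Gaussian draw approximates the truncation):

```python
import numpy as np

def sample_random_homography(rng, t_max=0.2, sigma_s=0.1,
                             sigma_theta=0.15, sigma_p=0.05):
    """Sample a random 3x3 homography as translation * rotation * scale *
    perspective (composition order is an assumption for this sketch)."""
    # Translation, uniform in [-t_max, t_max]
    tx, ty = rng.uniform(-t_max, t_max, size=2)
    T = np.array([[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]])
    # Scale: Gaussian around 1, clipped to avoid degenerate shrink/blow-up
    s = np.clip(rng.normal(1.0, sigma_s), 0.7, 1.3)
    S = np.diag([s, s, 1.0])
    # In-plane rotation, clipped Gaussian angle (radians)
    th = np.clip(rng.normal(0.0, sigma_theta), -0.5, 0.5)
    R = np.array([[np.cos(th), -np.sin(th), 0.0],
                  [np.sin(th),  np.cos(th), 0.0],
                  [0.0, 0.0, 1.0]])
    # Symmetric perspective warp via the bottom row
    px, py = np.clip(rng.normal(0.0, sigma_p, size=2), -0.1, 0.1)
    P = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [px, py, 1.0]])
    return T @ R @ S @ P
```

The clipping guarantees the sampled matrix stays well-conditioned (non-zero determinant), which is what "avoiding degenerate cases" requires.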

3. Loss Functions

Training is conducted on image pairs (I, I′) related by a known homography, with both detector and descriptor jointly optimized.

  • Detector (Point) Loss:

Per-cell cross-entropy over 65 classes using ground-truth labels Y:

\mathcal{L}_p(\mathcal{X}, Y) = \frac{1}{H_c W_c} \sum_{h=1}^{H_c}\sum_{w=1}^{W_c} \left[-\log \frac{\exp(X_{h,w,y_{h,w}})}{\sum_{k=1}^{65}\exp(X_{h,w,k})}\right]
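The per-cell cross-entropy can be written directly in NumPy. This is a sketch; `X` holds the raw logits and `Y` the per-cell class index (0–64, including the no-point bin):

```python
import numpy as np

def detector_loss(X, Y):
    """Mean per-cell cross-entropy.
    X: (Hc, Wc, 65) logits; Y: (Hc, Wc) integer labels in [0, 64]."""
    # Stable log-softmax over the 65 classes
    Z = X - X.max(axis=-1, keepdims=True)
    log_p = Z - np.log(np.exp(Z).sum(axis=-1, keepdims=True))
    Hc, Wc = Y.shape
    # Pick the log-probability of the true class in every cell, then average
    picked = log_p[np.arange(Hc)[:, None], np.arange(Wc)[None, :], Y]
    return -picked.mean()
```

As a sanity check, uniform (all-zero) logits give a loss of exactly log 65 regardless of the labels.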

  • Descriptor Loss:

For descriptor cells d_hw ∈ D and d′_h′w′ ∈ D′ from the image pair, define the correspondence indicator:

s_{hwh'w'} = \begin{cases} 1, & \|\widehat{\mathcal{H}}\,\mathbf{p}_{hw}-\mathbf{p}_{h'w'}\|\leq 8 \\ 0, & \text{otherwise} \end{cases}

and use a hinge loss:

l_d(\mathbf{d}, \mathbf{d}'; s) = \lambda_d\,s\,\max(0,\, m_p - \mathbf{d}^\top\mathbf{d}') + (1-s)\,\max(0,\, \mathbf{d}^\top\mathbf{d}' - m_n)

\mathcal{L}_d(\mathcal{D}, \mathcal{D}', S) = \frac{1}{(H_c W_c)^2} \sum_{h,w}\sum_{h',w'} l_d(\mathbf{d}_{hw}, \mathbf{d}'_{h'w'}; s_{hwh'w'})
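A NumPy sketch of the pairwise hinge loss. The dense 4-D correspondence tensor `S` is assumed to have been built from the homography as in the indicator above; the shapes are illustrative:

```python
import numpy as np

def descriptor_loss(D1, D2, S, lam_d=250.0, m_p=1.0, m_n=0.2):
    """Hinge descriptor loss over all cell pairs.
    D1, D2: (Hc, Wc, d) L2-normalized descriptor grids.
    S: (Hc, Wc, Hc, Wc) binary correspondence tensor."""
    # All pairwise dot products between cells of the two grids
    dot = np.einsum('hwd,ijd->hwij', D1, D2)
    # Pull corresponding pairs above m_p, push non-corresponding below m_n
    pos = lam_d * S * np.maximum(0.0, m_p - dot)
    neg = (1.0 - S) * np.maximum(0.0, dot - m_n)
    return (pos + neg).mean()
```

With identical grids, an identity correspondence tensor, and the negative margin relaxed to m_n = 1, both hinge terms vanish and the loss is zero, which is a quick way to check the implementation.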

  • Total Loss:

\mathcal{L} = \mathcal{L}_p(\mathcal{X}, Y) + \mathcal{L}_p(\mathcal{X}', Y') + \lambda\,\mathcal{L}_d(\mathcal{D}, \mathcal{D}', S)

4. Interest-Point and Descriptor Extraction

SuperPoint outputs a dense probability map for interest points and a semi-dense descriptor field.

  • Keypoint Selection:
    • Threshold the heatmap at τ = 0.5
    • Apply 2D non-maximum suppression (NMS) with radius r (4 or 8 pixels) to enforce spatial separation
    • Select the top K points by confidence (e.g., 300 for repeatability, 1000 for homography estimation)
  • Descriptor Extraction:

For each detected keypoint (u, v), the D-dimensional descriptor is sampled from the upsampled descriptor map via bicubic interpolation; all descriptors are L2-normalized.

  • Matching Strategy:

Matching is performed with nearest-neighbor in Euclidean space, optionally employing a ratio test or mutual nearest neighbor check.
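The selection and matching steps above can be sketched as follows: a greedy NMS over the heatmap and a mutual-nearest-neighbor matcher, written for clarity rather than speed (this is not the reference implementation):

```python
import numpy as np

def nms_keypoints(heatmap, conf_thresh=0.5, radius=4, top_k=1000):
    """Greedy 2D NMS: keep peaks in confidence order, suppressing any
    later candidate within `radius` (Chebyshev distance) of a kept one."""
    ys, xs = np.where(heatmap > conf_thresh)
    scores = heatmap[ys, xs]
    order = np.argsort(-scores)                      # high confidence first
    ys, xs, scores = ys[order], xs[order], scores[order]
    keep = []
    suppressed = np.zeros(len(scores), dtype=bool)
    for i in range(len(scores)):
        if suppressed[i]:
            continue
        keep.append(i)
        if len(keep) == top_k:
            break
        d = np.maximum(np.abs(ys - ys[i]), np.abs(xs - xs[i]))
        suppressed |= d <= radius                    # suppress neighbors
    keep = np.array(keep, dtype=int)
    return np.stack([ys[keep], xs[keep]], axis=1), scores[keep]

def mutual_nn_match(desc1, desc2):
    """Mutual nearest-neighbor matching for L2-normalized descriptors
    (for unit vectors, max cosine similarity == min Euclidean distance)."""
    sim = desc1 @ desc2.T
    nn12 = sim.argmax(axis=1)
    nn21 = sim.argmax(axis=0)
    mutual = nn21[nn12] == np.arange(len(desc1))
    return np.stack([np.where(mutual)[0], nn12[mutual]], axis=1)
```

A ratio test could be layered on top of `mutual_nn_match` by comparing the best and second-best similarities per query.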

5. Training Procedure and Data

Training comprises synthetic pre-training, Homographic Adaptation, and joint detection-description optimization.

  • Synthetic Pre-Training (MagicPoint):

Training on rendered shapes with known corner positions for 200k iterations, with batch size 32, Adam optimization (lr = 0.001, β₁ = 0.9, β₂ = 0.999), and random homography data augmentation.

  • Homographic Adaptation on MS-COCO:

Uses 80,000 COCO2014 images, resized to 240×320 grayscale. For each image, N_h = 100 random homographies are applied, and detector heatmaps are averaged to build robust pseudo-labels. A second adaptation round refines this estimation.

  • Joint SuperPoint Training:

Pairs each COCO image with a mild random homography to generate (I, I′) training pairs. Hyperparameters: D = 256, λ_d = 250, m_p = 1, m_n = 0.2, λ = 10⁻⁴. Batch size 32, Adam optimizer (lr = 0.001), with standard vision augmentations (Gaussian noise, motion blur, photometric changes).

6. Evaluation and Comparative Results

Performance is primarily assessed on the HPatches dataset for repeatability and homography estimation.

HPatches Repeatability (240×320 resolution, 300 points, ε = 3 pixels)

Detector NMS=4 NMS=8
SuperPoint 0.652 0.631
MagicPoint 0.575 0.507
FAST 0.575 0.472
Harris 0.620 0.533
Shi-Tomasi 0.606 0.511
Random 0.101 0.103

Analogous improvements are observed under viewpoint variation.

HPatches Homography Estimation (480×640, 1000 points)

A homography is counted correct if the corner-transfer error is at most ε pixels.

Method ε=1 ε=3 ε=5
SuperPoint 0.310 0.684 0.829
LIFT 0.284 0.598 0.717
SIFT 0.424 0.676 0.759
ORB 0.150 0.395 0.538

Breakdown at ε = 3:

  • Repeatability: SuperPoint 0.581, LIFT 0.449, SIFT 0.495, ORB 0.641
  • Mean Localization Error (px): SuperPoint 1.158, LIFT 1.102, SIFT 0.833, ORB 1.157
  • Nearest Neighbor mAP (desc.): SuperPoint 0.821, LIFT 0.664, SIFT 0.694, ORB 0.735
  • Matching Score: SuperPoint 0.470, LIFT 0.315, SIFT 0.313, ORB 0.266

Qualitative results demonstrate dense and robust correspondence under illumination changes; failure modes arise on extreme in-plane rotations outside the sampled training distribution.

7. Significance and Observed Limitations

SuperPoint demonstrates that fully-convolutional, self-supervised architectures can jointly learn interest point detection and dense description from unlabeled natural images, attaining strong geometric invariance without manual annotation. The Homographic Adaptation approach enables effective transfer from synthetic shapes to real scenes through unsupervised pseudo-label aggregation. The observed limitations include reduced performance under extreme in-plane rotations exceeding the geometric conditions encountered during training, which suggests the potential need for more extensive augmentation regimes or alternative geometric regularization approaches for increased robustness.

References

  1. DeTone, D., Malisiewicz, T., & Rabinovich, A. (2017). SuperPoint: Self-Supervised Interest Point Detection and Description. arXiv:1712.07629.
