SuperPoint Network: Joint Detection & Description
- SuperPoint Network is a fully-convolutional model that performs joint interest point detection and description in a single forward pass.
- It employs Homographic Adaptation to aggregate pseudo-labels across random homographies, enhancing robustness in multi-view geometry tasks.
- The architecture features a shared VGG-style encoder with dedicated detector and descriptor heads, achieving state-of-the-art performance on benchmarks like HPatches.
SuperPoint is a fully-convolutional neural architecture for joint interest point detection and description, operating on full-size images and trained in a self-supervised manner. It is designed to address challenges in multiple-view geometry tasks by producing dense, repeatable interest points and associated descriptors in a single forward pass. The system's distinctive contribution is the introduction of Homographic Adaptation, a multi-homography aggregation procedure enabling robust cross-domain adaptation, particularly from synthetic to real images. When trained on generic visual data such as MS-COCO, SuperPoint yields a richer and more repeatable set of keypoints compared to both its pre-adapted form and traditional corner detectors, and achieves state-of-the-art homography estimation performance on benchmarks such as HPatches (DeTone et al., 2017).
1. Network Architecture
The SuperPoint architecture consists of a VGG-style shared encoder followed by two heads: an interest point detector and a descriptor extractor.
- Shared Encoder:
- Two 3×3 conv (64 ch), BatchNorm, ReLU
- 2×2 MaxPool (stride 2)
- Two 3×3 conv (64 ch), BN, ReLU
- 2×2 MaxPool (stride 2)
- Two 3×3 conv (128 ch), BN, ReLU
- 2×2 MaxPool (stride 2)
- Two 3×3 conv (128 ch), BN, ReLU
- Final output is the bottleneck tensor $\mathcal{B} \in \mathbb{R}^{H_c \times W_c \times 128}$ with $H_c = H/8$, $W_c = W/8$.
- Interest-Point Head (Detector):
- 3×3 conv (256 ch), BN, ReLU
- 1×1 conv (65 ch) to produce $\mathcal{X} \in \mathbb{R}^{H_c \times W_c \times 65}$
- Softmax over 65 channels (64 “cell” classes, 1 “no-point” class)
- Discard the “no-point” bin; reshape the remaining $H_c \times W_c \times 64$ map with pixel shuffle to output a detector probability heatmap $X \in [0,1]^{H \times W}$.
- Descriptor Head:
- 3×3 conv (256 ch), BN, ReLU
- 1×1 conv ($D$ ch), with $D = 256$ in experiments, producing $\mathcal{D} \in \mathbb{R}^{H_c \times W_c \times D}$
- Bicubic upsampling to $H \times W$
- L2-normalization of each $D$-vector
- Final output is a dense descriptor field in $\mathbb{R}^{H \times W \times D}$ of unit-norm descriptors.
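As a concrete sketch, the detector head's post-processing (per-cell softmax, dustbin removal, pixel shuffle) can be written in a few lines of numpy; `detector_postprocess` is an illustrative name, not the authors' code:

```python
import numpy as np

def detector_postprocess(logits):
    """Turn raw detector logits (Hc, Wc, 65) into a full-resolution
    probability heatmap of shape (8*Hc, 8*Wc).

    The 65 channels are softmax-normalised per cell; channel 64 is the
    "no interest point" dustbin and is discarded; the remaining 64
    channels are pixel-shuffled back into an 8x8 block per cell.
    """
    hc, wc, c = logits.shape
    assert c == 65
    # per-cell softmax over the 65 classes (numerically stabilised)
    e = np.exp(logits - logits.max(axis=2, keepdims=True))
    probs = e / e.sum(axis=2, keepdims=True)
    # drop the dustbin; channel k maps to offset (k // 8, k % 8) in its cell
    cells = probs[:, :, :64].reshape(hc, wc, 8, 8)
    # pixel shuffle: (Hc, Wc, 8, 8) -> (Hc*8, Wc*8)
    return cells.transpose(0, 2, 1, 3).reshape(hc * 8, wc * 8)
```

Because the softmax runs before the dustbin is dropped, cells with no interest point can push all 64 retained probabilities toward zero rather than forcing a spurious peak.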
2. Self-Supervised Training via Homographic Adaptation
SuperPoint is trained in a self-supervised paradigm using Homographic Adaptation, which leverages random planar homographies to construct pseudo-ground truth for keypoint locations.
- Covariant Detector Principle:
The desired equivariance of the detector $f_\theta$ with respect to a homography $\mathcal{H}$ is formalized as:
$$f_\theta(\mathcal{H}(I)) = \mathcal{H}\big(f_\theta(I)\big)$$
This ensures that the detector's predictions transform consistently under geometric warps.
- Empirical Aggregation:
Given $N_h$ random homographies $\{\mathcal{H}_i\}_{i=1}^{N_h}$, the adapted detector response is:
$$\hat{F}(I; f_\theta) = \frac{1}{N_h} \sum_{i=1}^{N_h} \mathcal{H}_i^{-1}\big(f_\theta(\mathcal{H}_i(I))\big)$$
This is implemented by warping the input under each $\mathcal{H}_i$, running the detector, un-warping each response with $\mathcal{H}_i^{-1}$, and averaging the results to build a robust pseudo-label.
```
P_avg ← 0
for i = 1…N_h do
    H_i   ← sample_random_homography()
    I_i   ← warp_image(I, H_i)
    P_i   ← f_theta.detect(I_i)
    P_i′  ← warp_image(P_i, H_i^{-1})
    P_avg ← P_avg + P_i′
end
P_avg ← P_avg / N_h
```
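This loop can be made runnable in numpy; `warp_image` here is a minimal nearest-neighbour inverse warp and `detect` an arbitrary callable, both placeholders rather than the paper's implementation:

```python
import numpy as np

def warp_image(img, H):
    """Inverse-warp a 2-D array by homography H (nearest-neighbour)."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = np.linalg.inv(H) @ pts            # map output pixels back to source
    src = src[:2] / src[2]                  # dehomogenise
    sx, sy = np.rint(src).astype(int)
    ok = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros(h * w, dtype=img.dtype)
    out[ok] = img[sy[ok], sx[ok]]
    return out.reshape(h, w)

def homographic_adaptation(img, detect, homographies):
    """Average un-warped detector heatmaps over a set of homographies."""
    p_avg = None
    for H in homographies:
        p_i = detect(warp_image(img, H))            # detect in warped image
        p_i = warp_image(p_i, np.linalg.inv(H))     # un-warp the response
        p_avg = p_i if p_avg is None else p_avg + p_i
    return p_avg / len(homographies)
```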
- Sampling of Homographies:
Each $\mathcal{H}_i$ is a composition of translation, scale, in-plane rotation, and symmetric perspective distortion, with parameters sampled from truncated normal distributions to avoid degenerate cases.
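A sketch of such a sampler, composing the four transform matrices; the parameter ranges below are illustrative defaults, not the paper's exact truncation bounds:

```python
import numpy as np

def sample_homography(rng, max_t=0.1, max_s=0.1, max_angle=0.2, max_p=1e-3):
    """Compose translation, scale, in-plane rotation and a mild perspective
    warp into a single 3x3 homography. Parameters are drawn from clipped
    normals so the warp stays non-degenerate."""
    t = np.clip(rng.normal(0, max_t / 2, 2), -max_t, max_t)       # translation
    s = 1 + np.clip(rng.normal(0, max_s / 2), -max_s, max_s)      # scale
    a = np.clip(rng.normal(0, max_angle / 2), -max_angle, max_angle)  # rotation
    p = np.clip(rng.normal(0, max_p / 2, 2), -max_p, max_p)       # perspective
    T = np.array([[1, 0, t[0]], [0, 1, t[1]], [0, 0, 1]])
    S = np.diag([s, s, 1.0])
    R = np.array([[np.cos(a), -np.sin(a), 0],
                  [np.sin(a),  np.cos(a), 0],
                  [0, 0, 1]])
    P = np.array([[1, 0, 0], [0, 1, 0], [p[0], p[1], 1]])
    return T @ S @ R @ P
```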
3. Loss Functions
Training is conducted on image pairs related by a known homography, with both detector and descriptor jointly optimized.
- Detector (Point) Loss:
Per-cell cross-entropy over 65 classes using ground-truth cell labels $Y$:
$$\mathcal{L}_p(\mathcal{X}, Y) = \frac{1}{H_c W_c} \sum_{h=1}^{H_c}\sum_{w=1}^{W_c} l_p(\mathbf{x}_{hw}; y_{hw}), \qquad l_p(\mathbf{x}_{hw}; y) = -\log\!\left(\frac{\exp(\mathbf{x}_{hwy})}{\sum_{k=1}^{65}\exp(\mathbf{x}_{hwk})}\right)$$
- Descriptor Loss:
For cells $(h, w)$ and $(h', w')$ from the two images of a pair related by homography $\mathcal{H}$, define the correspondence indicator
$$s_{hwh'w'} = \begin{cases} 1, & \text{if } \left\lVert \widehat{\mathcal{H}\mathbf{p}_{hw}} - \mathbf{p}_{h'w'} \right\rVert \le 8 \\ 0, & \text{otherwise} \end{cases}$$
where $\mathbf{p}_{hw}$ is the center pixel of cell $(h, w)$ and $\widehat{\mathcal{H}\mathbf{p}_{hw}}$ its warp into the second image, and use a hinge loss with positive margin $m_p$ and negative margin $m_n$:
$$\mathcal{L}_d(\mathcal{D}, \mathcal{D}'; S) = \frac{1}{(H_c W_c)^2} \sum_{h,w} \sum_{h',w'} l_d(\mathbf{d}_{hw}, \mathbf{d}'_{h'w'}; s_{hwh'w'}),$$
$$l_d(\mathbf{d}, \mathbf{d}'; s) = \lambda_d \, s \, \max(0,\, m_p - \mathbf{d}^{\top}\mathbf{d}') + (1 - s)\max(0,\, \mathbf{d}^{\top}\mathbf{d}' - m_n)$$
- Total Loss:
$$\mathcal{L}(\mathcal{X}, \mathcal{X}', \mathcal{D}, \mathcal{D}'; Y, Y', S) = \mathcal{L}_p(\mathcal{X}, Y) + \mathcal{L}_p(\mathcal{X}', Y') + \lambda\,\mathcal{L}_d(\mathcal{D}, \mathcal{D}'; S)$$
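The descriptor hinge term can be sketched in numpy over flattened cell grids; the function name and matrix layout are illustrative choices, not the authors' code:

```python
import numpy as np

def descriptor_hinge_loss(d, d_prime, s, m_p=1.0, m_n=0.2, lambda_d=250.0):
    """Hinge descriptor loss averaged over all cell pairs.

    d, d_prime : (Hc*Wc, D) L2-normalised descriptors from the two images
    s          : (Hc*Wc, Hc*Wc) binary correspondence matrix
    """
    sim = d @ d_prime.T                   # all pairwise dot products d^T d'
    pos = np.maximum(0.0, m_p - sim)      # pull corresponding pairs together
    neg = np.maximum(0.0, sim - m_n)      # push non-corresponding pairs apart
    return np.mean(lambda_d * s * pos + (1 - s) * neg)
```

With perfectly matching unit descriptors and a correct correspondence matrix, every hinge term is inactive and the loss is zero, which is the intended fixed point of the objective.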
4. Interest-Point and Descriptor Extraction
SuperPoint outputs a dense probability map for interest points and a semi-dense descriptor field.
- Keypoint Selection:
- Threshold at a fixed confidence (0.015 in the released implementation)
- Apply 2D NMS with a radius $r$ (4 or 8 pixels) to enforce spatial separation
- Select the top $N$ points by confidence (e.g., $N = 300$ for repeatability, $N = 1000$ for homography estimation)
- Descriptor Extraction:
For each detected keypoint $(x, y)$, the $D$-dimensional descriptor is sampled from the upsampled descriptor map via bicubic interpolation; all descriptors are L2-normalized.
- Matching Strategy:
Matching is performed with nearest-neighbor in Euclidean space, optionally employing a ratio test or mutual nearest neighbor check.
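A minimal greedy version of the keypoint-selection steps above (the 0.015 threshold follows the released SuperPoint code; the NMS here is a simple illustrative routine, not necessarily the authors' exact one):

```python
import numpy as np

def select_keypoints(heatmap, threshold=0.015, nms_radius=4, top_k=300):
    """Threshold, greedy NMS, and top-k selection on a detector heatmap.

    Returns a list of (y, x) coordinates in descending confidence order.
    """
    scores = heatmap.copy()
    scores[scores < threshold] = 0.0
    h, w = scores.shape
    keypoints = []
    while len(keypoints) < top_k:
        y, x = np.unravel_index(np.argmax(scores), scores.shape)
        if scores[y, x] == 0.0:
            break                          # nothing above threshold remains
        keypoints.append((y, x))
        # suppress a (2r+1) x (2r+1) window around the selected point
        y0, y1 = max(0, y - nms_radius), min(h, y + nms_radius + 1)
        x0, x1 = max(0, x - nms_radius), min(w, x + nms_radius + 1)
        scores[y0:y1, x0:x1] = 0.0
    return keypoints
```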
5. Training Procedure and Data
Training comprises synthetic pre-training, Homographic Adaptation, and joint detection-description optimization.
- Synthetic Pre-Training (MagicPoint):
Training on rendered shapes with known corner positions for 200k iterations, with batch size 32, Adam optimization (lr $= 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$) and random homography data augmentation.
- Homographic Adaptation on MS-COCO:
Uses 80,000 COCO2014 images, resized to $240 \times 320$ grayscale. For each image, $N_h = 100$ random homographies are applied, and detector heatmaps are averaged to build robust pseudo-labels. A second adaptation round refines this estimation.
- Joint SuperPoint Training:
Pairs each COCO image with a mild random homography to generate training pairs. Hyperparameters: $\lambda_d = 250$, $m_p = 1$, $m_n = 0.2$, $\lambda = 10^{-4}$. Batch size 32, Adam optimizer (lr $= 10^{-3}$), with standard vision augmentations (Gaussian noise, motion blur, photometric changes).
6. Evaluation and Comparative Results
Performance is primarily assessed on the HPatches dataset for repeatability and homography estimation.
HPatches Repeatability, illumination scenes (240×320 resolution, 300 points, correctness threshold $\varepsilon = 3$ pixels)
| Detector | NMS=4 | NMS=8 |
|---|---|---|
| SuperPoint | 0.652 | 0.631 |
| MagicPoint | 0.575 | 0.507 |
| FAST | 0.575 | 0.472 |
| Harris | 0.620 | 0.533 |
| Shi-Tomasi | 0.606 | 0.511 |
| Random | 0.101 | 0.103 |
Analogous improvements are observed under viewpoint variation.
HPatches Homography Estimation (480×640, 1000 points)
Correct if the mean corner-transfer error is $\le \varepsilon$ pixels.
| Method | $\varepsilon = 1$ | $\varepsilon = 3$ | $\varepsilon = 5$ |
|---|---|---|---|
| SuperPoint | 0.310 | 0.684 | 0.829 |
| LIFT | 0.284 | 0.598 | 0.717 |
| SIFT | 0.424 | 0.676 | 0.759 |
| ORB | 0.150 | 0.395 | 0.538 |
Breakdown at $\varepsilon = 3$:
- Repeatability: SuperPoint 0.581, LIFT 0.449, SIFT 0.495, ORB 0.641
- Mean Localization Error (px): SuperPoint 1.158, LIFT 1.102, SIFT 0.833, ORB 1.157
- Nearest Neighbor mAP (desc.): SuperPoint 0.821, LIFT 0.664, SIFT 0.694, ORB 0.735
- Matching Score: SuperPoint 0.470, LIFT 0.315, SIFT 0.313, ORB 0.266
Qualitative results demonstrate dense and robust correspondence under illumination changes; failure modes arise on extreme in-plane rotations outside the sampled training distribution.
7. Significance and Observed Limitations
SuperPoint demonstrates that fully-convolutional, self-supervised architectures can jointly learn interest point detection and dense description from unlabeled natural images, attaining strong geometric invariance without manual annotation. The Homographic Adaptation approach enables effective transfer from synthetic shapes to real scenes through unsupervised pseudo-label aggregation. The observed limitations include reduced performance under extreme in-plane rotations exceeding the geometric conditions encountered during training, which suggests the potential need for more extensive augmentation regimes or alternative geometric regularization approaches for increased robustness.