SuperPoint-E: Endoscopic SfM Features

Updated 3 March 2026
  • The paper introduces domain-specific Tracking Adaptation that improves keypoint detection and descriptor learning by leveraging COLMAP-based 3D supervision.
  • It uses the original SuperPoint architecture with no structural changes while optimizing loss functions to overcome challenges like specular highlights in endoscopic videos.
  • Empirical results show up to 4x denser 3D point recovery and enhanced spatial uniformity, significantly boosting reconstruction quality in medical imaging.

SuperPoint-E is a local feature extraction method designed to enhance Structure-from-Motion (SfM) reconstruction performance in endoscopic video. Building on the unmodified SuperPoint architecture, SuperPoint-E introduces domain-specific supervision via a Tracking Adaptation strategy, yielding denser detections, improved precision, discriminative descriptors, and robustness to specular highlights. The approach leverages COLMAP-based 3D reconstructions to supervise both keypoint detection and descriptor learning, directly addressing the challenges unique to medical endoscopy imaging (Barbed et al., 4 Feb 2026, Barbed et al., 2022).

1. Network Architecture and Variants

SuperPoint-E employs the original fully-convolutional VGG-style encoder of SuperPoint:

  • Encoder: Eight convolutional layers (3×3 kernels, stride 1, ReLU, channels {64, 64, 64, 64, 128, 128, 128, 128}), with 2×2 max-pooling after every two convolutions, yielding downsampling by a factor of 8.
  • Detection Head: A 3×3 convolutional layer (256 channels, ReLU), then a 1×1 convolution producing an (H/8)×(W/8)×65 tensor (64 spatial bins + 1 “no-point” dustbin channel). Softmax is applied over the 65 channels at each spatial location.
  • Descriptor Head: A 3×3 convolution (256 channels, ReLU), followed by a 1×1 convolution (256 channels), with spatial L2 normalization.
  • Input/Inference Hyperparameters: 256×256 grayscale input, detection threshold 0.0005, NMS radius 4 px, max 10,000 keypoints per image.
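The detection head’s output can be decoded into full-resolution keypoints with the standard SuperPoint procedure (channel-wise softmax, dropping the dustbin, unfolding the 64 bins into 8×8 pixel cells, thresholding, NMS). The sketch below is a minimal NumPy illustration of that decoding using the inference hyperparameters stated above; it is not the authors' code.

```python
import numpy as np

def decode_detections(logits, threshold=0.0005, nms_radius=4):
    """Decode an (H/8, W/8, 65) detection tensor into keypoints.

    Softmax over the 65 channels, drop the "no-point" dustbin, unfold
    the 64 bins into 8x8 pixel cells, then threshold and apply NMS.
    """
    hc, wc, _ = logits.shape
    # channel-wise softmax at each cell
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    prob = e / e.sum(axis=-1, keepdims=True)
    prob = prob[..., :64]                      # drop the dustbin channel
    # (Hc, Wc, 64) -> (Hc, Wc, 8, 8) -> (Hc*8, Wc*8) heatmap
    heat = prob.reshape(hc, wc, 8, 8).transpose(0, 2, 1, 3).reshape(hc * 8, wc * 8)
    # greedy NMS: keep the strongest response, suppress a square around it
    keypoints = []
    h = heat.copy()
    h[h < threshold] = 0
    while h.max() > 0:
        y, x = np.unravel_index(h.argmax(), h.shape)
        keypoints.append((x, y, float(heat[y, x])))
        y0, y1 = max(0, y - nms_radius), min(h.shape[0], y + nms_radius + 1)
        x0, x1 = max(0, x - nms_radius), min(h.shape[1], x + nms_radius + 1)
        h[y0:y1, x0:x1] = 0
    return keypoints
```

With the stated inference settings (256×256 grayscale input), `logits` has shape (32, 32, 65) and the heatmap is 256×256.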

No architectural changes are made relative to the original SuperPoint; all changes lie in domain-adapted supervision and loss functions. The design aligns with findings that feature selection and description, rather than network topology, are the limiting factors in endoscopic scenarios (Barbed et al., 4 Feb 2026, Barbed et al., 2022).

2. Tracking Adaptation Supervision Strategy

SuperPoint-E’s training regime fundamentally differs from the original SuperPoint and prior specularity-robust adaptations. The “Tracking Adaptation” approach consists of:

  • Ground-Truth Track Extraction: Run standard COLMAP SfM (feature extractor + guided matcher → mapper) on short colonoscopy sequences. For each reconstructed 3D point, its projection is labeled “green” if COLMAP detected it in a frame, “blue” otherwise.
  • Reliable Track Construction: Maximal contiguous “green” subsequences, tolerating “blue” frames when they are sandwiched between “green” ones, define reliable multi-frame tracks. Only these are used as correspondences.
  • Loss Functions:
    • Detection Loss L_det: Cross-entropy between the softmaxed detection probabilities P_n and the ground-truth detection mask Y_n at each spatial location, as in SuperPoint:

    L_{\mathrm{det}}(X_n, Y_n) = -\sum_{x,y} \left[ Y_n(x,y)\log P_n(x,y) + (1 - Y_n(x,y))\log(1 - P_n(x,y)) \right]

    • Descriptor Loss L_track: Batch-hard triplet loss using track-wise positive and negative pairs,

    l_{\mathrm{trip}}(d, d^+; d^-) = \begin{cases} \max(0,\, m_p - \langle d, d^+ \rangle) & \text{(positive pair)} \\ \max(0,\, \langle d, d^- \rangle - m_n) & \text{(negative pair)} \end{cases}

    with m_p = 1.0 and m_n = 0.2, ensuring descriptors are consistent along tracks and separable between tracks.
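The per-pair hinge terms of this triplet loss can be sketched in NumPy as follows. The negative-mining rule here (hardest similarity per anchor, excluding its match) is a common batch-hard simplification and an assumption about the exact mining scheme, not a statement of the authors' implementation.

```python
import numpy as np

def track_triplet_loss(desc_a, desc_b, matches, m_p=1.0, m_n=0.2):
    """Batch-hard triplet loss over track correspondences.

    desc_a, desc_b : (N, D) L2-normalized descriptors from two frames.
    matches        : list of (i, j) index pairs lying on the same track.
    The positive term pulls matched similarity toward m_p; the negative
    term pushes the hardest non-matching similarity below m_n.
    """
    sim = desc_a @ desc_b.T                   # cosine similarities, (N, N)
    loss = 0.0
    for i, j in matches:
        pos = max(0.0, m_p - sim[i, j])       # positive (same-track) term
        # hardest negative for anchor i: best similarity excluding its match
        neg_sims = np.delete(sim[i], j)
        neg = max(0.0, neg_sims.max() - m_n)  # negative (cross-track) term
        loss += pos + neg
    return loss / max(len(matches), 1)
```

Perfectly matched, mutually orthogonal descriptors give zero loss; a wrong match incurs both the positive and negative hinge penalties.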

The overall loss is:

L_{\mathrm{SPE}}(X_1..X_N, D_1..D_N) = \sum_n L_{\mathrm{det}}(X_n, Y_n) + \lambda \sum_{1 \leq a < b \leq N} L_{\mathrm{track}}(D_a, D_b; T_{a,b})

with λ = 1.

This strategy ensures that the detector and descriptor are jointly optimized to maximize track survivability and discriminative power over challenging, real-world endoscopy sequences (Barbed et al., 4 Feb 2026).
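The reliable-track construction from the green/blue labels can be sketched as below. Since the exact rule for how long a “blue” gap may be is not restated here, the gap bound `max_gap` is an explicit assumption of this sketch.

```python
def reliable_tracks(labels, max_gap=1):
    """Extract reliable track spans from per-frame visibility labels.

    labels : sequence of 'g' (COLMAP detected the 3D point in the frame)
             or 'b' (point projects into the frame but was not detected).
    Returns (start, end) inclusive index spans: maximal runs of green
    frames in which blue frames are kept only when sandwiched by greens.
    The tolerated gap length `max_gap` is an assumption, not from the paper.
    """
    greens = [i for i, lab in enumerate(labels) if lab == 'g']
    if not greens:
        return []
    tracks, start, prev = [], greens[0], greens[0]
    for i in greens[1:]:
        if i - prev - 1 > max_gap:        # blue gap too long: close the track
            tracks.append((start, prev))
            start = i
        prev = i
    tracks.append((start, prev))
    return tracks
```

For example, the label string `"ggbggbbg"` with `max_gap=1` yields one track spanning frames 0–4 (the single blue at frame 2 is sandwiched) and a second starting at frame 7.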

3. Training Procedure and Data

  • Training Data: 65 EndoMapper subsequences (4–7 s, 202–342 frames each, 16,663 total frames), with 11,259 frames reconstructed by COLMAP serving as supervision. Supervision is augmented with a second set of tracks obtained from SuperPoint+SuperGlue matches.

  • Data Augmentation: Random brightness (±50), contrast (α ∼ Uniform[0.5, 1.5]), additive speckle noise (p ∼ Uniform[0, 0.0035]), Gaussian noise (σ ∼ Uniform[0, 10]), elliptical vignetting, and motion blur.

  • Training Hyperparameters:

    • Input normalization: [0,1]
    • Batch: 4 frames with shared tracks
    • Optimizer: Adam, lr = 1×10⁻⁵
    • Iterations: 400,000 batches (1.6M images)
    • Loss weights: λ = 1, m_p = 1.0, m_n = 0.2
    • Detector labels: Gaussian-blurred mask (σ = 0.2 px)
  • Runtime Performance:
    • Feature extraction: ~20 ms/image (GPU, 256×256)
    • Brute-force matching: ~290 ms/pair for 10,000×10,000 descriptors (CPU)
    • Guided matching (RANSAC): ~1.28 s/pair (CPU)
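The photometric augmentations listed above can be sketched as follows. The sampling ranges follow the text; the composition order, the clipping to [0, 255], and the speckle implementation (random pixels set to black or white) are assumptions, and vignetting and motion blur are omitted for brevity.

```python
import numpy as np

def augment(img, rng):
    """Photometric augmentation for a grayscale uint8 image.

    Brightness +/-50, contrast alpha ~ U[0.5, 1.5], speckle noise with
    per-pixel probability p ~ U[0, 0.0035], Gaussian noise sigma ~ U[0, 10].
    (Elliptical vignetting and motion blur are omitted in this sketch.)
    """
    out = img.astype(np.float32)
    out += rng.uniform(-50, 50)                          # brightness shift
    alpha = rng.uniform(0.5, 1.5)                        # contrast scaling
    out = alpha * (out - 128.0) + 128.0
    p = rng.uniform(0, 0.0035)                           # speckle noise
    mask = rng.random(out.shape) < p
    out[mask] = rng.choice([0.0, 255.0], size=mask.sum())
    sigma = rng.uniform(0, 10)                           # Gaussian noise
    out += rng.normal(0, sigma, size=out.shape)
    return np.clip(out, 0, 255).astype(np.uint8)
```

Normalization to [0, 1] (the stated input range) would be applied after augmentation, before the network.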

4. Quantitative and Qualitative Evaluation

Extensive evaluation on EM-Test (7 subsequences), EM-Full (long videos), and domain-shift datasets show:

  • SfM Reconstruction Quality (EM-Test):
| Detector+Matcher | Precision % | Images Reconst. % | 3D pts (K) | Track length (imgs) | MAE (px) | Spread % | Specular % |
|---|---|---|---|---|---|---|---|
| SIFT+GM (COLMAP) | 46.1 | 87.1 | 10 | 9.12 | 1.31 | 43.9 | 28.6 |
| SP+BF | 40.6 | 96.0 | 22 | 7.05 | 1.55 | 72.3 | 15.6 |
| SP-E+BF | 60.5 | 99.5 | 76 | 10.78 | 1.78 | 85.2 | 6.2 |
| SP+GM | 57.7 | 100 | 49 | 5.02 | 1.44 | 91.7 | 11.3 |
| SP-E+GM | 63.2 | 100 | 77 | 11.28 | 1.79 | 86.3 | 6.7 |

SP-E recovers ~4x more 3D points than SIFT, with higher spatial uniformity (spread) and far fewer features on specular highlights (specular %: 6.2–6.7 vs. 28.6 for SIFT).

  • Domain Shift: On duodenum (EM-Gastro) and lung phantom (C3VD-Test), SP-E trained on colonoscopy still surpasses SIFT, e.g., 53.1% vs 40.9% precision and 99.8% vs 80.2% images reconstructed, respectively.
  • Qualitative:
    • Point clouds show markedly denser coverage (up to 4x denser) and reduced “holes.”
    • Features are robust to specularities (<7% of points on specular regions) and illumination changes.
    • Track lengths and pose stability are higher under rapid light variation.

5. Comparison to Prior Specularity-Robust SuperPoint Variants

Alternative adaptations focus on suppression of specular features via direct pixel masking. E-SuperPoint (Barbed et al., 2022) introduces a specularity loss:

\mathcal{L}_s(\mathcal{X}, I) = \frac{\sum_{h,w} m(I)_{hw} P_{hw}}{\epsilon + \sum_{h,w} m(I)_{hw}}

with mask m(I) derived from intensity thresholding, dilation, and blurring, heavily penalizing keypoint probability in glare regions. This term is added to the standard SuperPoint cross-entropy and descriptor loss:

\mathcal{L}_{ESP} = \mathcal{L}_{SP} + \lambda_s \left[ \mathcal{L}_s(\mathcal{X}, I) + \mathcal{L}_s(\mathcal{X}', I') \right]

Quantitatively, E-SuperPoint achieves 4500.9 features/image (vs. 1333.7 for vanilla SuperPoint and 2350.2 for SIFT), with >98% of features outside glare regions and higher RANSAC inlier counts and pose accuracy (Barbed et al., 2022). By construction, its loss almost eliminates specular-region features.
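A minimal sketch of this specularity loss is shown below, computing the mean keypoint probability inside a soft glare mask. The intensity threshold, dilation radius, and blur width are illustrative assumptions; the paper's exact mask parameters are not restated here.

```python
import numpy as np
from scipy import ndimage

def specularity_loss(prob, img, thresh=220, dilate_iter=2, blur_sigma=2.0, eps=1e-6):
    """Mean keypoint probability inside the (soft) specular mask m(I).

    prob : (H, W) keypoint probability map P.
    img  : (H, W) grayscale image with values in [0, 255].
    Mask: intensity threshold -> binary dilation -> Gaussian blur;
    the specific parameter values here are assumptions for illustration.
    """
    mask = img > thresh                                  # glare by intensity
    mask = ndimage.binary_dilation(mask, iterations=dilate_iter)
    mask = ndimage.gaussian_filter(mask.astype(np.float32), blur_sigma)
    # normalized by mask area, so the loss is ~1 when all glare pixels
    # carry high keypoint probability and 0 when none do
    return float((mask * prob).sum() / (eps + mask.sum()))
```

Because the loss is normalized by the mask area, it is insensitive to how large the glare region is and only measures how strongly the detector fires inside it.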

SuperPoint-E’s tracking-based supervision, by contrast, does not use explicit masking; instead it leverages multi-frame consistency to focus detection and description learning on features that survive long SfM tracks. Empirical results indicate that this increases both reconstruction coverage and track persistence under the more stringent requirements of structure-from-motion (Barbed et al., 4 Feb 2026).

6. Practical Considerations in Endoscopic SfM

  • Throughput: With 20 ms/image detection and 290 ms/pair matching (10,000 descriptors), SuperPoint-E is deployable at 2–5 Hz on modern hardware using a temporally local matching window.
  • Resource Requirements: RTX 2080 GPU or better (8GB), quad-core CPU ≥3GHz, and 16GB RAM are recommended for real-time or near real-time operation.
  • Integration: Pipeline can incorporate SuperPoint-E for feature extraction with BF matching or occasional guided matching. Sliding window SfM is practical.
  • Coverage and Density Gains: Compared to SIFT, SuperPoint-E yields 3–4x denser reconstructions and approximately 2x more images reconstructed on long colonoscopy sequences.
  • Limitations: Extreme low-light or occluded scenes remain challenging. While the window of reliable 3D tracking is roughly doubled, total failure is not eliminated under the most adverse conditions.
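The brute-force matching step can be sketched as a mutual nearest-neighbor search over L2-normalized descriptors; the mutual-consistency check is common practice and is an assumption here, not a detail taken from the paper.

```python
import numpy as np

def mutual_nn_match(desc_a, desc_b):
    """Brute-force mutual nearest-neighbor descriptor matching.

    desc_a : (Na, D) and desc_b : (Nb, D), rows L2-normalized.
    Returns (i, j) pairs where j is i's nearest neighbor in desc_b
    and i is j's nearest neighbor in desc_a.
    """
    sim = desc_a @ desc_b.T                  # (Na, Nb) cosine similarities
    nn_ab = sim.argmax(axis=1)               # best match in b for each a
    nn_ba = sim.argmax(axis=0)               # best match in a for each b
    return [(i, int(j)) for i, j in enumerate(nn_ab) if nn_ba[j] == i]
```

With 10,000 descriptors per image, the similarity matrix has 10⁸ entries, which is consistent with the ~290 ms/pair CPU matching time reported above; a temporally local matching window keeps the number of pairs tractable.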

7. Significance and Prospects

SuperPoint-E demonstrates that reusing a generic local feature backbone with endoscopy-specific, track-level supervision can more than double both 3D point density and spatiotemporal coverage in SfM, compared to state-of-the-art classical (SIFT) and learning-based (vanilla SuperPoint, specularity-suppressed variants) feature methods. The approach provides an effective template for adaptation of point and descriptor networks to environments in which feature survivability, rather than instantaneous saliency, determines downstream reconstruction quality (Barbed et al., 4 Feb 2026, Barbed et al., 2022). A plausible implication is that track-consistency-oriented supervision can generalize to other domains exhibiting dense temporal overlap and nontrivial nuisance signals (e.g., underwater, surgical, or low-light industrial sequences).
