SuperPoint-E Endoscopy Adaptation
- SuperPoint-E is a domain-adapted local feature extraction framework that integrates specularity masking and tracking supervision to enhance 3D reconstruction from endoscopic imagery.
- It employs a VGG-style encoder with dual detector/descriptor heads, producing smoothed keypoint maps and robust 256-dimensional descriptors tailored for endoscopic challenges.
- Evaluations on benchmarks like EndoMapper and Hyper-Kvasir show that SuperPoint-E achieves higher detection precision, denser 3D point clouds, and improved spatial spread compared to classical methods.
SuperPoint-E (Endoscopy Adaptation) refers to a set of domain-adapted local feature extraction frameworks designed to enhance 3D reconstruction and structure-from-motion (SfM) from endoscopic imagery. The approach builds on the original SuperPoint architecture and introduces endoscopy-specific adaptations, including specularity masking and, more recently, a tracking-based supervision paradigm. The primary aim is to cope with the characteristic properties of endoscopic data (repetitive mucosal textures, transient specular highlights, and frequent image deformation) and thereby obtain denser, more stable, and less artifact-prone 3D reconstructions than both classical methods (such as SIFT) and generic deep learning baselines (Barbed et al., 2022, Barbed et al., 4 Feb 2026).
1. Model Architecture and Domain-Specific Adaptations
Both “E-SP” (Barbed et al., 2022) and the later SuperPoint-E (Barbed et al., 4 Feb 2026) retain the core structure of SuperPoint: a fully convolutional encoder followed by two heads, a detector and a descriptor. The encoder is a VGG-style stack (eight 3×3 convolutional layers with ReLU activations and max-pooling). The detector head yields an H/8×W/8×65 tensor: 64 channels, one per pixel position within each 8×8 image cell, plus one “dustbin” channel for cells without a keypoint. From this tensor a smoothed keypoint probability map is produced. The descriptor head emits a per-cell 256-dimensional descriptor map, which is upsampled and L₂-normalized.
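As an illustration of how a 65-channel detector output can be turned into a full-resolution keypoint map, the following is a minimal NumPy sketch. It assumes a channel-wise softmax, removal of the dustbin channel, and a row-major layout of the 64 channels within each 8×8 cell. The function names `decode_detector_logits` and `l2_normalize` are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def decode_detector_logits(logits: np.ndarray, cell: int = 8) -> np.ndarray:
    """Convert an (Hc, Wc, cell*cell + 1) tensor into an (Hc*cell, Wc*cell) probability map."""
    hc, wc, channels = logits.shape
    assert channels == cell * cell + 1, "expected 64 cell channels plus one dustbin"
    # Numerically stable softmax over the channel axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Drop the dustbin (assumed to be the last channel).
    probs = probs[..., :-1]
    # Assumed layout: channel k maps to (row k // cell, column k % cell) inside the cell.
    probs = probs.reshape(hc, wc, cell, cell)
    return probs.transpose(0, 2, 1, 3).reshape(hc * cell, wc * cell)

def l2_normalize(descriptors: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each descriptor (last axis) to unit L2 norm."""
    norms = np.linalg.norm(descriptors, axis=-1, keepdims=True)
    return descriptors / np.maximum(norms, eps)
```

For a 256×256 input, `logits` would have shape (32, 32, 65) and the decoded map would again be 256×256.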
In E-SP, domain adaptation is accomplished via a novel specularity mask. High-intensity image regions (those where intensity >0.7, after normalization) are identified, dilated morphologically, and blurred to form a spatial mask that suppresses keypoint responses within or near specular zones. This penalty is integrated directly into training and inference, resulting in features that avoid unstable specular blob locations and concentrate on persistent mucosal textures.
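The masking steps described above (threshold, dilate, blur) can be sketched with plain NumPy as follows. Only the 0.7 threshold comes from the text. The square structuring element, the box blur, both radii, and the multiplicative suppression in `suppress_keypoints` are assumptions chosen for illustration, and all function names are hypothetical.

```python
import numpy as np

def _dilate(binary: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a (2r+1)x(2r+1) square, as a max over shifted copies."""
    h, w = binary.shape
    padded = np.pad(binary, radius, mode="constant")
    out = np.zeros_like(binary)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out = np.maximum(out, padded[dy:dy + h, dx:dx + w])
    return out

def _box_blur(image: np.ndarray, radius: int) -> np.ndarray:
    """Mean filter with a (2r+1)x(2r+1) window and edge replication."""
    h, w = image.shape
    padded = np.pad(image, radius, mode="edge")
    acc = np.zeros((h, w), dtype=np.float64)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / float((2 * radius + 1) ** 2)

def specularity_mask(gray: np.ndarray, threshold: float = 0.7,
                     dilate_radius: int = 2, blur_radius: int = 2) -> np.ndarray:
    """Soft mask in [0, 1]: close to 1 in and near bright regions, 0 far away."""
    bright = (gray > threshold).astype(np.float64)  # gray is normalized to [0, 1]
    return _box_blur(_dilate(bright, dilate_radius), blur_radius)

def suppress_keypoints(heatmap: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Assumed suppression rule: scale keypoint responses by (1 - mask)."""
    return heatmap * (1.0 - mask)
```

The dilation widens each bright blob so that its rim is covered, and the blur turns the hard edge into a gradual falloff.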
SuperPoint-E (Barbed et al., 4 Feb 2026) extends beyond such heuristic masking by fundamentally altering the supervision signal for adaptation, as detailed below.
2. Tracking Adaptation Supervision
A central methodological advance in SuperPoint-E is the shift from homography-based pseudo-labeling to "Tracking Adaptation." Rather than warping individual images through synthetic homographies, the method derives supervision from robust multi-view tracks extracted from real endoscopic video using an external SfM module (COLMAP). The process is as follows:
- Short (4–7 s) subsequences amenable to SfM are identified.
- COLMAP is run, yielding camera poses, sparse 3D point clouds, and visibility tracks recording when each 3D structure point was observed in the video.
- For supervision, 3D points are reprojected into all their visible frames, marking these 2D locations as “reliable” (i.e., confident and consistent across views).
- Mini-batch sampling is constrained so that every training pair shares at least one reliable 3D track. The resultant correspondences drive both detector (keypoint heatmap) and descriptor (cross-frame track) loss terms.
This approach tightly couples the feature learning process to the actual geometric and temporal constraints of the multiview data, encoding appearance changes, motion blur, and partial occlusions directly into the adaptation procedure (Barbed et al., 4 Feb 2026).
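The reprojection and pair-selection steps listed above can be sketched as follows. The sketch assumes a pinhole camera with intrinsics `K` and world-to-camera poses `(R, t)` such that x_cam = R x_world + t, with no lens distortion. The data layout (`visibility` as a dict from frame id to point indices) and the function names are hypothetical simplifications of what an SfM tool exports.

```python
import numpy as np

def project_points(points_w, R, t, K):
    """Project Nx3 world points; returns Nx2 pixel coordinates and N depths."""
    cam = points_w @ R.T + t          # world -> camera
    uvw = cam @ K.T                   # camera -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]

def reliable_keypoints(points_w, visibility, poses, K, image_size):
    """For each frame, map track id -> reprojected 2D location of its 3D point."""
    h, w = image_size
    labels = {}
    for frame_id, point_ids in visibility.items():
        R, t = poses[frame_id]
        ids = np.asarray(point_ids, dtype=int)
        uv, depth = project_points(points_w[ids], R, t, K)
        # Keep points in front of the camera and inside the image bounds.
        ok = (depth > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
             & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        labels[frame_id] = {int(i): p for i, p in zip(ids[ok], uv[ok])}
    return labels

def shares_track(labels, frame_a, frame_b):
    """True if the two frames observe at least one common reliable 3D point."""
    return len(labels[frame_a].keys() & labels[frame_b].keys()) > 0
```

A batch sampler would then accept a candidate pair only if `shares_track` returns `True`, and the shared track ids give the cross-frame correspondences for the descriptor loss.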
3. Training Objectives and Optimization
The loss for SuperPoint-E is a composite of a detection objective and a descriptor tracking objective. For a training batch of images:
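- Detection loss: computed between the predicted keypoint heatmap and a target map that encodes Gaussian-blurred ground truth at each reliable reprojection (σ=0.2 px).
- Descriptor (tracking) loss: a triplet loss applied to descriptors of cross-frame track correspondences, governed by three hyperparameters.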
The final loss is a sum over all batch frames and track pairs. No additional homography-consistency or regularization terms are used, marking a divergence from standard SuperPoint adaptation protocols.
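The two ingredients named above, a Gaussian-blurred target map and a triplet loss, can be illustrated generically. This is a sketch of the standard forms and not the paper's exact formulation: the hinge-style triplet loss and the `margin` value are placeholders, and only σ=0.2 px comes from the text. Function names are hypothetical.

```python
import numpy as np

def gaussian_target(points_uv, height, width, sigma=0.2):
    """Target map with a Gaussian bump (peak value 1) at each (u, v) location."""
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    target = np.zeros((height, width))
    for u, v in points_uv:
        bump = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))
        target = np.maximum(target, bump)  # overlapping bumps do not exceed 1
    return target

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge triplet loss: max(0, d(a, p) - d(a, n) + margin).

    anchor/positive are descriptors of the same track in two frames,
    negative is a descriptor of a different track. margin is a placeholder.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())
```

With σ=0.2 px the bump is essentially confined to the single pixel nearest the reprojection, so the target stays close to a one-hot map.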
In E-SP, specularity masking is added as a per-image loss term, combined with the remaining losses via a weight (Barbed et al., 2022).
4. Training Procedures and Datasets
E-SP is fine-tuned from a synthetic-pretrained SuperPoint, using pseudo-labels (100 random homographies) on endoscopy frames together with the masking penalty. Training uses 125,000 frames sampled from 11 colonoscopy procedures at 256×256 resolution, with validation and testing on the EndoMapper and Hyper-Kvasir datasets.
SuperPoint-E is trained using direct correspondences mined from COLMAP reconstructions. The EM-Train set contains 11,259 reconstructed frames filtered from 16,663. Validation is performed on EM-Test (seven held-out subsequences), EM-Full (complete colonoscopy videos), EM-Gastro (gastroscopy), and C3VD-Test (phantom, ground-truth pose).
Data augmentations include contrast and brightness scaling, additive speckle and Gaussian noise, elliptical shading, and motion blur. Batch size is 4; training proceeds for 400,000 steps with Adam and learning rate 1e−5 (Barbed et al., 4 Feb 2026).
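A few of the photometric augmentations listed above can be sketched as follows. All ranges and probabilities are hypothetical, since the text does not give them. Speckle is modeled here as randomly forcing a small fraction of pixels to black or white, which is an assumption about what "additive speckle" means. Elliptical shading and motion blur are omitted for brevity.

```python
import numpy as np

def augment(image, rng, contrast=(0.8, 1.2), brightness=(-0.1, 0.1),
            noise_std=0.02, speckle_prob=0.005):
    """Apply contrast/brightness scaling, Gaussian noise, and speckle to a [0, 1] image."""
    c = rng.uniform(*contrast)
    b = rng.uniform(*brightness)
    out = (image - 0.5) * c + 0.5 + b                             # contrast about mid-gray, then brightness
    out = out + rng.normal(0.0, noise_std, size=image.shape)     # additive Gaussian noise
    u = rng.uniform(size=image.shape)
    out = np.where(u < speckle_prob, 0.0, out)                   # dark speckles
    out = np.where(u > 1.0 - speckle_prob, 1.0, out)             # bright speckles
    return np.clip(out, 0.0, 1.0)

# Training settings stated in the text, collected for reference.
TRAIN_CONFIG = {"batch_size": 4, "steps": 400_000, "optimizer": "Adam", "lr": 1e-5}
```

Passing an explicit `rng` (for example `np.random.default_rng(0)`) keeps the augmentation reproducible.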
5. Evaluation Framework and Metrics
Evaluation is based on full SfM pipelines:
- Detection and description: SIFT with guided matching (the COLMAP default), base SuperPoint, and SuperPoint-E are each plugged into COLMAP’s matching, triangulation, and bundle adjustment phases.
- Matching: Both brute-force (L₂, no ratio test) and guided matching strategies are assessed.
- RANSAC filtering is used for geometric outlier removal (confidence 0.9999, inlier threshold 3 px).
- For EndoMapper, pseudo-GT camera poses are computed by COLMAP, enabling re-projection-based inlier and pose error evaluation.
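The brute-force strategy mentioned above (nearest neighbor in L₂ distance, no ratio test) can be sketched in NumPy. The optional `cross_check` flag is an addition for illustration and is not stated in the text. The function name is hypothetical.

```python
import numpy as np

def brute_force_match(desc_a, desc_b, cross_check=False):
    """Match each row of desc_a to its nearest row of desc_b in L2 distance.

    Returns an (M, 2) array of index pairs and the M corresponding distances.
    No ratio test is applied.
    """
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
    sq = (desc_a ** 2).sum(axis=1)[:, None] + (desc_b ** 2).sum(axis=1)[None, :] \
         - 2.0 * desc_a @ desc_b.T
    sq = np.maximum(sq, 0.0)               # guard against tiny negative values
    idx_a = np.arange(len(desc_a))
    nn_ab = sq.argmin(axis=1)
    keep = np.ones(len(desc_a), dtype=bool)
    if cross_check:
        nn_ba = sq.argmin(axis=0)
        keep = nn_ba[nn_ab] == idx_a       # keep mutual nearest neighbors only
    matches = np.stack([idx_a[keep], nn_ab[keep]], axis=1)
    return matches, np.sqrt(sq[idx_a[keep], nn_ab[keep]])
```

The resulting putative matches would then go through geometric verification, for example RANSAC with the confidence and inlier threshold given above.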
Key reported metrics:
- Precision: % of 2D detections surviving to the final 3D model.
- 3D points: Model size (thousands of points).
- Track length: Mean re-observation window per 3D point.
- Spread: % of grid cells (16×16) containing at least one inlier.
- Rotation/Absolute Trajectory Error: Using ground-truth or pseudo-GT poses.
- Specular contamination: % of reconstructed 2D points on high-intensity pixels.
- Coverage: Fraction of video frames included in any reconstructed submap.
- Submap statistics: Number and size of contiguous reconstructed segments.
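Three of the metrics above reduce to simple counting and can be sketched as follows. The sketch reads "16×16" as a grid of 16×16 cells over the image (it could also mean cells of 16×16 pixels), and it reuses 0.7 as the high-intensity threshold for specular contamination. Both choices are assumptions, and the function names are hypothetical.

```python
import numpy as np

def precision_pct(num_detections: int, num_in_model: int) -> float:
    """Percentage of 2D detections that survive into the final 3D model."""
    return 100.0 * num_in_model / num_detections if num_detections else 0.0

def spread_pct(points_uv, height, width, grid=16) -> float:
    """Percentage of grid x grid cells containing at least one inlier point."""
    pts = np.asarray(points_uv, dtype=float).reshape(-1, 2)
    if len(pts) == 0:
        return 0.0
    cols = np.clip((pts[:, 0] / width * grid).astype(int), 0, grid - 1)
    rows = np.clip((pts[:, 1] / height * grid).astype(int), 0, grid - 1)
    occupied = np.zeros((grid, grid), dtype=bool)
    occupied[rows, cols] = True
    return 100.0 * float(occupied.mean())

def specular_contamination_pct(points_uv, gray, threshold=0.7) -> float:
    """Percentage of reconstructed 2D points lying on high-intensity pixels."""
    pts = np.asarray(points_uv, dtype=float).reshape(-1, 2)
    if len(pts) == 0:
        return 0.0
    h, w = gray.shape
    cols = np.clip(np.round(pts[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.round(pts[:, 1]).astype(int), 0, h - 1)
    return 100.0 * float((gray[rows, cols] > threshold).mean())
```

Points are given as (u, v) pixel coordinates, with u along the image width and v along the height.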
6. Quantitative Results and Analysis
On EndoMapper and EM-Test benchmarks, SuperPoint-E demonstrates significant improvements over classical and deep baselines. Representative results (averaged over 7 EM-Test subsequences):
| Method | Precision | 3D points | Track len. | Spread | Specular %↓ |
|---|---|---|---|---|---|
| SIFT+GM | 46.1% | 10k | 9.12 | 43.9% | 28.6% |
| SP+BF | 40.6% | 22k | 7.05 | 72.3% | 15.6% |
| SP-E+BF | 60.5% | 76k | 10.78 | 85.2% | 6.2% |
| SP+GM | 57.7% | 49k | 5.02 | 91.7% | 11.3% |
| SP-E+GM | 63.2% | 77k | 11.28 | 86.3% | 6.7% |
SuperPoint-E achieves the highest detection precision, the largest 3D point clouds, and the longest tracks. Its spread exceeds 85% (SP+GM is higher at 91.7%, but with less than half the track length), while specular contamination falls below 7%. Notably, with SuperPoint-E brute-force matching comes close to guided matching (60.5% vs. 63.2% precision, 76k vs. 77k points), indicating highly discriminative descriptors.
For full-colonoscopy coverage (EM-Full), SP-E+GM reconstructs 33.2% of frames (vs. 15.1% for SIFT+GM), more than doubling coverage (about 2.2×) and producing larger submaps. On gastroscopy data, SuperPoint-E generalizes without additional retraining, boosting both precision (53.1% vs 40.9%) and spatial spread (60.2% vs 36.7%).
Absolute trajectory accuracy on the phantom data (C3VD-Test) is essentially unchanged relative to SIFT (ATE ≈4.6 mm), but the resulting point clouds are far denser and more robust.
Ablation studies and qualitative visualizations confirm that domain-adapted models avoid over-representation of specular highlights and yield more uniformly distributed, geometrically stable keypoints.
7. Implementation, Efficiency, and Limitations
Export is supported for interoperability with COLMAP via plain-text keypoint + descriptor files. Feature matching is executed using OpenCV’s BFMatcher; downstream triangulation and bundle adjustment are handled natively in COLMAP.
Efficiency is competitive: detection on GPU (256×256) takes 20–25 ms/image, while brute-force descriptor matching is fast (<300 ms/pair for SuperPoint-E). The model size is compact (11 MB in PyTorch, for E-SP), with GPU RAM ≈1 GB for batch size 2.
Despite these strengths, current limitations remain. Reliance on offline SfM for supervision constrains training efficiency and domain coverage. Large non-rigid deformations and rapid tissue motion can still degrade geometric matching. There is no explicit scale or rotation invariance beyond what is provided by learned homographies or track sampling. Rigid RANSAC techniques are not robust to deformations exceeding ~30°, and matching remains sensitive to dramatic appearance changes (Barbed et al., 2022, Barbed et al., 4 Feb 2026).
Future work is anticipated in several directions: (i) closing the adaptation loop with real-time SLAM or incremental SfM; (ii) incorporating deformation-aware losses or photometric specularity modeling; (iii) extending the architecture to multi-scale feature maps and integrating descriptor-matching networks (e.g., a specularity-robust SuperGlue variant); and (iv) adapting the framework for end-to-end Neural Radiance Fields targeting dense surface recovery (Barbed et al., 4 Feb 2026).
SuperPoint-E demonstrates that local feature networks, when supervised via real 3D track correspondences from endoscopic SfM, yield marked advances in keypoint repeatability, spatial distribution, and 3D reconstruction coverage on challenging endoscopic datasets, with broad potential for further domain-specific improvements (Barbed et al., 2022, Barbed et al., 4 Feb 2026).