SP-VINS: Advances in Visual-Inertial SLAM
- SP-VINS is a family of visual-inertial navigation and SLAM systems that integrate diverse sensors such as stereo cameras, monocular vision, sonar, and depth sensing for robust mapping.
- The systems employ advanced estimation strategies including implicit mapping, deep-feature extraction, and sliding-window optimization to achieve sub-centimeter accuracy and efficient drift correction.
- Real-time loop closure, online extrinsic calibration, and multi-modal integration enable reliable performance across indoor, outdoor, and underwater environments.
SP-VINS refers to a family of visual-inertial navigation and SLAM systems whose specific capabilities, architectures, and sensor integration strategies vary by context and research group. The term encompasses at least three distinct systems, each addressing unique robotics challenges: a filter-based stereo VINS leveraging implicit environmental mapping (Du et al., 24 Nov 2025); a deep-feature visual-inertial SLAM specialized for challenging imaging conditions (Luo et al., 31 Jul 2024) (also styled "SuperVINS"); and a tightly-coupled underwater SLAM system fusing sonar, vision, IMU, and depth sensing (Rahman et al., 2018). Common threads are keyframe management, robust sensor fusion, and an emphasis on real-time operation. The following surveys the major architectures and their technical features as reported in the referenced literature.
1. Architectural Taxonomy and Core Innovations
Three principal SP-VINS implementations are recognized in the literature:
| Variant | Sensor Modalities | Estimation Framework | Map Representation |
|---|---|---|---|
| SP-VINS (2025) | Stereo Cameras, IMU | Filter-based (DST-EKF) | Implicit keyframe+2D pts |
| SuperVINS (2024) | Monocular Camera, IMU | Sliding window BA (VINS) | 3D landmarks + pose-graph |
| SP-VINS/SVIn2 (2018) | Sonar, Stereo/Mono Vision, IMU, Depth | Nonlinear BA (OKVIS-inspired) | Keyframes + local 3D map |
- The 2025 system introduces an implicit environmental map for long-term filtering without explicit 3D landmarks (Du et al., 24 Nov 2025).
- SuperVINS pivots to deep-learned keypoint and matching pipelines, using SuperPoint/LightGlue for superior resilience in degraded visual conditions, and incorporates bundle-adjustment for state optimization (Luo et al., 31 Jul 2024).
- The underwater system coordinates additional sonar and depth modalities using a tightly-coupled, keyframe-based, nonlinear optimization with loop-closing (Rahman et al., 2018).
2. Filter-Based Stereo VINS with Implicit Map (Du et al., 24 Nov 2025)
SP-VINS (2025) addresses limitations of classic filter-based VINS—namely, long-term drift from limited mapping—by discarding explicit 3D landmark management. Instead, it introduces an implicit environmental map: a database of keyframes, each storing its pose and tracked 2D keypoints with descriptors.
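The following minimal sketch illustrates this map organization; the class and field names are ours for illustration, not the authors' implementation.

```python
# Sketch of an implicit environmental map: keyframes store only a pose and
# tracked 2D keypoints + descriptors; no explicit 3D landmarks are kept.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Keyframe:
    pose: np.ndarray         # 4x4 SE(3) world-from-body transform
    keypoints: np.ndarray    # (N, 2) pixel coordinates of tracked features
    descriptors: np.ndarray  # (N, D) feature descriptors for matching

@dataclass
class ImplicitMap:
    keyframes: list = field(default_factory=list)

    def add(self, kf: Keyframe) -> int:
        """Insert a keyframe; its pose also lives in the EKF state for refinement."""
        self.keyframes.append(kf)
        return len(self.keyframes) - 1
```

Because keyframe poses remain in the filter state, loop-closure updates can refine the map without ever triangulating explicit 3D landmarks.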
State Representation
The full filter state at instant $k$ stacks an active block and a map block (notation ours, reconstructing the paper's definitions):
$$
\mathbf{x}_k = \begin{bmatrix} \mathbf{x}_A \\ \mathbf{x}_M \end{bmatrix},
$$
where $\mathbf{x}_A$ includes:
- IMU biases and the active IMU state,
- a sliding window of pose clones,
- camera-IMU extrinsics (quaternions and translations for both the left and right cameras),
and $\mathbf{x}_M$ stores the poses of all map keyframes.
Measurement and Map Update
Visual processing fuses two complementary residuals for each stereo feature:
- Landmark reprojection residuals compare normalized projection via current state to observed 2D image coordinates.
- Ray-constraint residuals encode depth consistency, leveraging either multi-view covisibility or stereo geometry.
These are jointly stacked and linearized into a unified measurement model for each EKF update, so that all pose, extrinsic, and keyframe parameters are corrected simultaneously (a numerical sketch follows).
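A minimal sketch of that joint update, assuming the front end has already produced stacked residuals and Jacobians (`H_reproj`, `H_ray`, and the noise level are placeholder names, not the paper's):

```python
# Reprojection and ray-constraint residuals are stacked into one linearized
# model r ≈ H·δx + n, and a single Kalman update corrects poses, extrinsics,
# and keyframe states simultaneously.
import numpy as np

def ekf_update(x, P, r_reproj, H_reproj, r_ray, H_ray, sigma=1.0):
    r = np.concatenate([r_reproj, r_ray])   # stacked residual vector
    H = np.vstack([H_reproj, H_ray])        # stacked measurement Jacobian
    R = (sigma ** 2) * np.eye(len(r))       # isotropic measurement noise
    S = H @ P @ H.T + R                     # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x + K @ r                       # additive correction (quaternion
                                            # retraction is glossed over here)
    P_new = (np.eye(len(x)) - K @ H) @ P    # covariance update
    return x_new, P_new
```

A real implementation applies the correction via error-state composition on the manifold rather than plain addition.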
Loop Closure
Loop detection and constraint formation are conducted at runtime via DBoW2 (over an offline-trained vocabulary) plus geometric verification (RANSAC-based fundamental-matrix and PnP tests; a verification sketch follows). On each detected loop, 2D–2D or 3D–2D reprojection residuals between current and historic keyframes are injected as EKF updates, correcting drift efficiently without global bundle adjustment.
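An illustrative verification step using OpenCV's PnP + RANSAC; the inputs, thresholds, and inlier count are assumptions for the sketch, not the paper's values:

```python
# Geometric verification of a loop candidate: `pts3d` (structure associated
# with the historic keyframe) and `pts2d` (current-frame matches) are assumed
# to come from descriptor matching; K is the camera intrinsic matrix.
import cv2
import numpy as np

def verify_loop(pts3d, pts2d, K, min_inliers=25):
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K, None,
        reprojectionError=3.0, iterationsCount=100,
        flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok or inliers is None or len(inliers) < min_inliers:
        return None                  # reject spurious loop candidate
    return rvec, tvec, inliers       # accepted relative-pose constraint
```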
Online Extrinsic Calibration
Camera-IMU extrinsics are explicitly included in the filter state. Both reprojection and ray constraints contribute measurement Jacobians with respect to these extrinsics, ensuring online calibration is performed during normal operation.
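In one standard formulation (notation ours), a world point $\mathbf{p}_W$ observed by a camera with IMU pose $(\mathbf{R}_{WI}, \mathbf{p}_{WI})$ and camera-IMU extrinsics $(\mathbf{R}_{IC}, \mathbf{p}_{IC})$ yields the residual

$$
\mathbf{p}_C = \mathbf{R}_{IC}^{\top}\!\left(\mathbf{R}_{WI}^{\top}(\mathbf{p}_W - \mathbf{p}_{WI}) - \mathbf{p}_{IC}\right),\qquad
\mathbf{r} = \mathbf{z} - \pi(\mathbf{p}_C),\quad
\pi(\mathbf{p}) = \frac{1}{p_z}\begin{bmatrix} p_x \\ p_y \end{bmatrix},
$$

so the chain rule produces nonzero Jacobians of $\mathbf{r}$ with respect to $(\mathbf{R}_{IC}, \mathbf{p}_{IC})$, which is what keeps the extrinsics observable during normal operation.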
Efficiency and Benchmarks
SP-VINS achieves high accuracy (sub-centimeter level on EuRoC and TUM-VI with loop closure enabled), with loop-closure costs as low as ∼1 ms per keyframe. Average CPU usage stays below 140% of a single high-end core, significantly outperforming optimization-centric baselines in efficiency.
3. Deep Learning-Augmented SLAM: SuperVINS (Luo et al., 31 Jul 2024)
SuperVINS introduces a deep learning pipeline on top of the VINS-Fusion core, aiming for maximal resilience in visual-inertial SLAM under conditions such as low light or severe motion blur.
Front-End: Feature Extraction and Matching
- SuperPoint is applied to every incoming image for keypoint detection and $256$-dimensional local descriptor computation.
- LightGlue, an attention-based feature matcher, computes soft-assignment correspondences between frame pairs, enforcing mutual nearest neighbor constraints and per-point matchability.
- RANSAC with a strict pixel-level inlier threshold filters outliers for robust geometric consistency (see the front-end sketch after this list).
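A sketch of this front end; `run_superpoint` and `run_lightglue` are hypothetical wrappers around the exported networks (assumed to return NumPy arrays), and only the `cv2` call is a real API:

```python
# Deep front end: SuperPoint keypoints + LightGlue matches, then RANSAC on
# the fundamental matrix to enforce geometric consistency.
import cv2
import numpy as np

def match_frames(img0, img1):
    kpts0, desc0 = run_superpoint(img0)   # hypothetical: (N,2) kpts, (N,256) desc
    kpts1, desc1 = run_superpoint(img1)
    pairs = run_lightglue(desc0, desc1)   # hypothetical: (M,2) mutual-NN indices
    p0 = kpts0[pairs[:, 0]].astype(np.float64)
    p1 = kpts1[pairs[:, 1]].astype(np.float64)
    # Strict RANSAC threshold (1 px here, illustrative) rejects outliers.
    F, mask = cv2.findFundamentalMat(p0, p1, cv2.FM_RANSAC, 1.0, 0.999)
    keep = mask.ravel() == 1
    return p0[keep], p1[keep]
```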
Back-End: Sliding-Window Bundle Adjustment
SuperVINS inherits the VINS-Fusion state and cost:
- Sliding window BA optimizes over camera poses, 3D landmarks, IMU biases.
- Cost terms: visual reprojection residuals, IMU preintegration residuals, and a prior/regularization term (written out below).
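Written out in the VINS-Mono/VINS-Fusion convention (the published formulation of the inherited back end, not specific to SuperVINS), the sliding-window objective is

$$
\min_{\mathcal{X}} \left\{ \left\|\mathbf{r}_p - \mathbf{H}_p\,\mathcal{X}\right\|^2
+ \sum_{k\in\mathcal{B}} \left\|\mathbf{r}_{\mathcal{B}}\!\left(\hat{\mathbf{z}}_{b_{k+1}}^{b_k}, \mathcal{X}\right)\right\|_{\mathbf{P}_{b_{k+1}}^{b_k}}^{2}
+ \sum_{(l,j)\in\mathcal{C}} \rho\!\left(\left\|\mathbf{r}_{\mathcal{C}}\!\left(\hat{\mathbf{z}}_{l}^{c_j}, \mathcal{X}\right)\right\|_{\mathbf{P}_{l}^{c_j}}^{2}\right) \right\},
$$

with a marginalization prior, IMU preintegration residuals over consecutive frames $\mathcal{B}$, and Huber-robustified ($\rho$) visual reprojection residuals over feature observations $\mathcal{C}$.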
Loop Closure
Descriptors from SuperPoint are aggregated into a DBoW3 visual vocabulary. Candidate loops are retrieved via bag-of-words scoring, with geometric verification (PnP + RANSAC), and loop edges are added to the pose-graph for global correction.
Implementation and Performance
- GPU inference (via ONNXRuntime) for SuperPoint and LightGlue enables per-frame front-end processing under 50 ms on mid-tier hardware (a minimal inference sketch follows this list).
- Real-time localization with a sliding window of 10 keyframes.
- Robustness: on EuRoC, SuperVINS regains tracking in challenging scenarios (e.g., V202) and outperforms VINS-Fusion on a majority of sequences (see Table 1 in (Luo et al., 31 Jul 2024) for ATE and RPE metrics).
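A minimal ONNXRuntime inference sketch; the model filename, input layout, and output ordering are assumptions about a particular export, not fixed by the paper:

```python
# Run an exported SuperPoint model on GPU, falling back to CPU if needed.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(
    "superpoint.onnx",  # hypothetical export path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

def infer(gray_f32):                 # (H, W) float32 grayscale image in [0, 1]
    inp = gray_f32[None, None]       # add batch and channel dimensions
    outputs = sess.run(None, {sess.get_inputs()[0].name: inp})
    return outputs                   # e.g., scores / keypoints / descriptors
```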
Ablation Findings
The combination of deep features, attention-based matching, and strict RANSAC filtering yields the greatest benefit; omitting any component substantially increases drift or the risk of tracking loss.
4. Multi-Modal SLAM for Underwater Environments (Rahman et al., 2018)
SP-VINS/SVIn2 targets GPS-denied underwater operation by tightly fusing sonar, vision, IMU, and depth sensing.
Sensor Integration
- Sonar: 100 Hz planar scans produce environmental structure cues resilient to turbidity, providing scale anchoring via alignment with visual landmarks.
- Pressure-based depth: sparse but absolute readings anchor the z-axis.
- Visual: ORB keypoints from stereo or monocular cameras, with online CLAHE preprocessing to overcome poor contrast and color cast (see the sketch after this list).
- IMU: 100 Hz measurements, preintegrated over the current keyframe window for bundle adjustment.
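The CLAHE preprocessing step can be reproduced directly with OpenCV; the clip limit and tile size below are common defaults, not the paper's values:

```python
# Contrast-limited adaptive histogram equalization on the lightness channel,
# a typical remedy for low-contrast, color-cast underwater imagery.
import cv2

def enhance_underwater(bgr):
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)   # work on lightness only
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
```

Equalizing only the lightness channel avoids amplifying the color cast that per-channel equalization of raw RGB would introduce.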
State Estimation and Initialization
A two-stage initialization first aligns vision to pressure-based depth to recover metric scale, then aligns vision and IMU to estimate velocity, orientation, and gravity (one standard formulation of the first stage is sketched below). All residuals (reprojection, IMU preintegration, sonar, depth) are fused in a single large nonlinear optimization framework.
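One standard least-squares formulation of the vision-to-depth stage, with $z_i^{\text{vis}}$ the up-to-scale visual depth estimates and $d_i$ the pressure-sensor depths (notation ours):

$$
s^\star = \arg\min_s \sum_i \left(s\, z_i^{\text{vis}} - d_i\right)^2
= \frac{\sum_i z_i^{\text{vis}}\, d_i}{\sum_i \left(z_i^{\text{vis}}\right)^2}.
$$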
Loop Closing and Relocalization
Uses DBoW2 on keyframe ORB descriptors; candidate loops are validated geometrically and integrated as pose-graph constraints, then optimized to correct for accumulated drift.
Empirical Results
Standard and field benchmarks (EuRoC, plus lake and cave datasets) demonstrate robust, drift-corrected operation under conditions that challenge visual-only and even visual-inertial baselines, with relocalization errors below 0.3 m over loops longer than 50 m. SP-VINS is especially reliable where alternative methods diverge or become inoperable.
5. Loop Closure Strategies and Map Representation
Across variants, loop closure is managed via a keyframe bag-of-words database (SP-VINS, SuperVINS, SVIn2), with corrections applied either as direct filter updates or through global pose-graph optimization; the map representations differ substantially:
- The implicit environmental map (SP-VINS 2025) holds only keyframe poses and 2D feature sets, never reconstructing explicit 3D landmarks.
- Deep-feature systems (SuperVINS) leverage learned descriptors, enabling high recall and precision in low-information regimes.
- Underwater SP-VINS (SVIn2) relies on image enhancement (CLAHE) and multi-modal geometric constraints to retain feature stability despite radiometric and turbidity challenges.
All systems perform geometric loop verification (fundamental or PnP RANSAC), rejecting spurious loops prior to map correction.
6. Experimental Benchmarks and Comparative Results
| Dataset | Benchmark Metric | SP-VINS (2025) | SuperVINS (2024) | SVIn2 (2018) |
|---|---|---|---|---|
| EuRoC | ATE, RPE | Outperforms or matches all filter-based methods; with loop closure, better than VO baselines (Du et al., 24 Nov 2025) | Outperforms VINS-Fusion in 6/11 sequences; only system to finish V202 (Luo et al., 31 Jul 2024) | Lower RMSE than OKVIS-stereo (Rahman et al., 2018) |
| TUM-VI | ATE | Best results on all corridor and room sequences with loop closure | Not benchmarked | Not benchmarked |
| KAIST | ATE, RPE | Major drift reduction vs. baselines | Not benchmarked | Not benchmarked |
| Underwater | Loop closure, drift | Not benchmarked | Not benchmarked | Sub-meter relocalization (Rahman et al., 2018) |
SP-VINS systems thus offer state-of-the-art accuracy-computation trade-offs in visual-inertial navigation, with domain-specific advances in implicit mapping, deep-feature robustness, and multi-modal tight coupling.
7. Limitations and Future Research Directions
Main limitations are noted in all principal works:
- The 2025 SP-VINS: loop closure remains vulnerable to repetitive or textureless scenes due to the reliance on classical BoW and RANSAC. The absence of global BA constrains map consistency, suggesting future integration of semantic or learning-aided priors and lightweight pose-graph smoothing (Du et al., 24 Nov 2025).
- SuperVINS: Lacks an ablation table but stresses that the synergy of all components is required for maximal robustness in adverse conditions (Luo et al., 31 Jul 2024).
- SVIn2: System throughput is a limitation (5–10 Hz CPU), and association of sonar and vision can fail in non-coplanar geometries. Enhancements under exploration include higher-rate 3D acoustic sensors and collaborative SLAM in multi-vehicle scenarios (Rahman et al., 2018).
These systems collectively advance the field in minimizing drift, increasing deployment reliability, and reducing hardware demands—while providing flexible frameworks for future integration of semantic, learning-based, and cross-modal enhancements.