Multi-modal Loop Closure Detection with Foundation Models in Severely Unstructured Environments (2511.05404v1)
Abstract: Robust loop closure detection is a critical component of Simultaneous Localization and Mapping (SLAM) algorithms in GNSS-denied environments, such as in the context of planetary exploration. In these settings, visual place recognition often fails due to aliasing and weak textures, while LiDAR-based methods suffer from sparsity and ambiguity. This paper presents MPRF, a multimodal pipeline that leverages transformer-based foundation models for both vision and LiDAR modalities to achieve robust loop closure in severely unstructured environments. Unlike prior work limited to retrieval, MPRF integrates a two-stage visual retrieval strategy with explicit 6-DoF pose estimation, combining DINOv2 features with SALAD aggregation for efficient candidate screening and SONATA-based LiDAR descriptors for geometric verification. Experiments on the S3LI dataset and S3LI Vulcano dataset show that MPRF outperforms state-of-the-art retrieval methods in precision while enhancing pose estimation robustness in low-texture regions. By providing interpretable correspondences suitable for SLAM back-ends, MPRF achieves a favorable trade-off between accuracy, efficiency, and reliability, demonstrating the potential of foundation models to unify place recognition and pose estimation. Code and models will be released at github.com/DLR-RM/MPRF.
Explain it Like I'm 14
What this paper is about
Imagine a rover exploring the Moon or Mars. There’s no GPS to tell it where it is. It builds its own map as it moves, and it must notice when it comes back to a place it has already been. That moment is called a “loop closure,” and it helps correct its map and position.
This paper introduces MPRF, a system that helps robots reliably spot loop closures in very difficult places, like rocky, dusty, or look‑alike terrains. It uses two kinds of sensors:
- A camera (vision)
- A laser scanner called LiDAR (it measures distances and shapes)
MPRF uses powerful “foundation models” (big AI models trained on lots of data) for both images and 3D point clouds to both find likely matches and compute the exact 3D movement between two visits to the same place.
The big questions the paper asks
- How can a robot recognize it has been in the same place before when the ground looks confusing, repetitive, or has very few visual clues?
- Can combining camera images and LiDAR shape data work better than using only one?
- Instead of just saying “these two places look similar,” can we also calculate the full 3D change in position and direction (called a 6‑DoF pose) so the result can plug directly into a mapping system?
How they did it
Think of MPRF as a two-step process: first, quickly find good candidates; second, verify them precisely in 3D.
Step 1: Fast visual search (finding candidates)
- They use a vision “transformer” model called DINOv2. You can think of it as a very smart tool that breaks an image into small pieces (patches) and turns them into numbers (features) that capture what’s in the picture.
- These features are combined using a method called SALAD. In simple terms, SALAD groups important visual clues and ignores useless ones, creating one compact “signature” for each image.
- With these signatures, the system uses a fast search engine (FAISS) to find a short list of past images that look most similar to the current view.
- Then it refines that list using richer DINOv2 features from several transformer layers to keep the best candidates. This two-stage check is like first scanning quickly, then taking a closer look. (A minimal sketch of the first-stage search appears after this list.)
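As a concrete illustration, here is a minimal sketch of the first-stage candidate search, assuming global descriptors already come from a DINOv2 + SALAD-style model (the 8192-dimensional descriptor size follows the paper; the function names and normalization are illustrative, not the authors' implementation):

```python
# Hedged sketch: fast candidate retrieval with FAISS over global descriptors.
import numpy as np
import faiss

D = 8192                       # SALAD global descriptor size reported in the paper
index = faiss.IndexFlatIP(D)   # inner product = cosine similarity on unit vectors

def add_place(desc: np.ndarray) -> None:
    """Store one L2-normalized place descriptor in the database."""
    desc = desc / np.linalg.norm(desc)
    index.add(desc.reshape(1, D).astype("float32"))

def query_candidates(desc: np.ndarray, k: int = 10):
    """Return the top-k most similar past frames for second-stage re-ranking."""
    desc = (desc / np.linalg.norm(desc)).reshape(1, D).astype("float32")
    sims, ids = index.search(desc, k)
    return ids[0], sims[0]
```

The shortlist returned here would then be re-ranked with the richer multi-layer DINOv2 features before any geometric verification.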
Why not use LiDAR here? In these very rough, empty terrains, LiDAR alone didn’t help much for the initial search. Vision worked better for this first step.
Step 2: Precise 3D check (verifying and computing pose)
- Now they bring in LiDAR. They use a 3D model called SONATA to describe the shapes and structures in the point cloud.
- They “fuse” the camera and LiDAR information: image features are lifted into 3D using the LiDAR’s depth, then combined with the LiDAR’s own 3D features. This creates paired visual‑plus‑shape descriptors for the same physical points (a minimal sketch of this lifting step appears after this list).
- They match these 3D points between the two visits and run RANSAC, a robust method that tries many hypotheses and keeps the one that fits the most matches. This computes the full 6‑DoF pose (3 for position: x, y, z; and 3 for rotation: roll, pitch, yaw).
- Finally, they can refine the alignment with ICP, a method that carefully nudges two 3D point clouds to best fit each other.
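To make the fusion step concrete, here is a hedged sketch of lifting image features into 3D using LiDAR depth, assuming a pinhole camera with 3×3 intrinsics matrix K and LiDAR points already transformed into the camera frame (all names are illustrative):

```python
# Hedged sketch: project LiDAR points into the image and attach visual features.
import numpy as np

def lift_features(lidar_cam: np.ndarray, img_feat: np.ndarray, K: np.ndarray):
    """lidar_cam: (N, 3) points in the camera frame; img_feat: (H, W, C) feature map."""
    H, W, _ = img_feat.shape
    in_front = lidar_cam[:, 2] > 0                 # keep points in front of the camera
    pts = lidar_cam[in_front]
    uvw = (K @ pts.T).T                            # pinhole projection
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    pts3d = pts[inside]                            # 3D points visible in the image
    vis_feat = img_feat[v[inside], u[inside]]      # sampled per-pixel visual features
    return pts3d, vis_feat                         # to be paired with SONATA descriptors
```

Each returned 3D point now carries a visual descriptor that can be concatenated with the LiDAR feature computed at the same location.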
Key ideas explained simply:
- SLAM: The robot makes a map while figuring out where it is on that map.
- Loop closure: Realizing “I’ve been here before,” which helps correct drift in the map and position.
- Foundation models: Large AI models trained on tons of data, so they work well in many places without much extra training.
- 6‑DoF pose: Full 3D movement and rotation between two positions.
- RANSAC: Try‑and‑test approach that ignores bad matches and keeps the most consistent solution.
- ICP: A fine‑tuning step to make two 3D shapes line up more tightly. (A short code sketch combining RANSAC and ICP appears below.)
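Putting the last two definitions together, a minimal sketch of RANSAC followed by ICP using the open-source Open3D library (this mirrors the verification step described above but is not the authors' code; the 0.05 m distance follows the paper's reported RANSAC correspondence distance, everything else is a placeholder):

```python
# Hedged sketch: 6-DoF pose from matched points via RANSAC, refined with ICP.
import numpy as np
import open3d as o3d

def estimate_pose(src_pts, dst_pts, matches, dist=0.05):
    """src_pts/dst_pts: (N, 3) arrays; matches: (K, 2) index pairs from matching."""
    src = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(src_pts))
    dst = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(dst_pts))
    corres = o3d.utility.Vector2iVector(np.asarray(matches, dtype=np.int32))
    # RANSAC keeps the rigid transform consistent with the most correspondences.
    ransac = o3d.pipelines.registration.registration_ransac_based_on_correspondence(
        src, dst, corres, dist,
        o3d.pipelines.registration.TransformationEstimationPointToPoint(False),
        ransac_n=3)
    # ICP nudges the clouds into tighter alignment, starting from the RANSAC pose.
    icp = o3d.pipelines.registration.registration_icp(
        src, dst, dist, ransac.transformation,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return icp.transformation, len(ransac.correspondence_set)  # pose + inlier count
```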
What they found and why it matters
- Stronger retrieval (finding the right place again): The visual part of MPRF, using DINOv2 + SALAD, was very accurate and fast. On one dataset, it picked the correct place first about 76% of the time, and the whole retrieval step took under half a second per query on their hardware.
- LiDAR helps precisely align poses: While LiDAR wasn’t very helpful for the first quick search, it was crucial for the precise 3D check afterward. By fusing vision and LiDAR features, MPRF could compute reliable 6‑DoF poses and offer actual point‑to‑point matches that a SLAM system can trust.
- More robust in tough scenes: In places with weak textures (few visual details) or repetitive patterns (lots of look‑alike rocks), vision‑only or LiDAR‑only methods struggled. The fusion did better, with yaw (turning) errors around 8 degrees on average—competitive with fast learning-based baselines but with interpretable, checkable matches.
- Works on new terrains: MPRF also performed well on a different volcanic area (Vulcano Island), showing it generalizes to new, tricky environments.
Why this matters:
- A rover or robot can not only find candidate matches but also compute the exact 3D correction. This “all the way to 6‑DoF” result plugs directly into mapping systems, improving reliability.
What this could change
- More dependable exploration: Robots on the Moon, Mars, or in GPS‑denied places on Earth can navigate with fewer mistakes, even when the ground looks confusing.
- Less task‑specific training: Using foundation models means the system benefits from large-scale pretraining, so it needs less custom data to work well in new environments.
- Easier to trust: Because MPRF provides actual point correspondences and a computed pose, engineers can verify why a loop closure was accepted. This interpretability is safer than black‑box predictions.
- Future directions: Speeding up the 3D pose step and integrating the pipeline tightly into full SLAM systems could make real-time, multi-sensor mapping even more reliable.
In short, MPRF shows that combining smart image models and LiDAR shape understanding can help robots recognize places and compute precise 3D motion in some of the hardest environments—bringing us closer to safer, more robust autonomous exploration.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of what remains missing, uncertain, or unexplored in the paper, framed as concrete, actionable items for future research.
- Full 6-DoF evaluation is absent: ground truth limitations led to assessing only yaw and planar translations; a dataset or protocol enabling precise 6-DoF benchmarking (including pitch, roll, and z) is needed to validate metric pose estimation claims.
- Loop closure acceptance is weakly constrained: the system confirms closures solely on a “valid RANSAC” result, without thresholds on inlier count, residuals, or uncertainty; investigate robust acceptance criteria (e.g., inlier ratio, reprojection/registration error, covariance estimation) to reduce false positives. (An illustrative acceptance gate appears after this list.)
- Pose uncertainty is not modeled: no covariance or confidence score accompanies the 6-DoF estimates; develop uncertainty quantification for downstream SLAM validation, gating, and optimization.
- Translation errors remain large (≈8–14 m): identify why translation accuracy lags (e.g., sparse geometry, descriptor fusion limitations) and evaluate corrective steps such as depth-aware PnP, multi-view constraints, ICP refinement, or local map alignment; quantify improvements.
- ICP refinement is mentioned but not analyzed: provide controlled experiments assessing whether and when ICP (or other geometric refinement) reduces pose errors and runtime.
- Parameter sensitivity is unstudied: key hyperparameters (e.g., FAISS k, similarity threshold=0.90, Hungarian matching strategy, RANSAC correspondence distance=0.05 m, number of screened candidates) lack sensitivity analyses; characterize their impact on precision, pose accuracy, and runtime to guide robust defaults.
- Calibration and synchronization dependence: the pipeline assumes accurate camera–LiDAR calibration and timing for 2D→3D projection; quantify performance degradation under realistic calibration errors/time drift and explore self-calibration or calibration-robust matching.
- Scalability to very large maps is unclear: memory/runtime scaling for 8192-d SALAD descriptors and FAISS indexing is not reported beyond S3LI; evaluate performance with 10⁶–10⁷ frames, incremental indexing, compression (e.g., PQ), and memory footprint on rover-class hardware.
- Real-time viability on resource-constrained platforms: pose estimation (~3.1 s/query) may be prohibitive on rover onboard compute; report CPU-only performance, energy usage, and end-to-end latency budgets, and propose accelerations (e.g., descriptor sparsification, early rejection, FPGA/ASIC paths).
- SLAM-back-end integration is not demonstrated: impact of MPRF loop closures on trajectory drift, map consistency, and robustness within a full multimodal SLAM system remains untested; perform closed-loop experiments and report long-term consistency and failure recovery.
- Failure mode characterization is limited: systematically analyze false matches and catastrophic outliers (e.g., aliasing, repetitive structure), identify root causes (visual vs geometric), and propose targeted mitigations (e.g., map priors, temporal consistency checks).
- LiDAR-only retrieval was dismissed without exploring stronger geometric priors: assess alternative LiDAR representations (e.g., range images, scan descriptors, submap-level retrieval), multi-scale geometry, or hybrid indexing that may benefit retrieval in texture-poor scenes.
- Fusion strategy is simplistic (descriptor concatenation): compare against learned fusion (cross-attention, gating), modality weighting (context-dependent), or metric-learning approaches to better exploit complementary appearance–geometry cues.
- Matching strategy may be brittle: one-to-one Hungarian matching could fail under occlusion/partial overlap; evaluate many-to-one/robust graph matching and ratio tests, and study match pruning with geometric consistency constraints.
- Domain adaptation remains ad hoc: fine-tuning DINOv2 helped, retraining SALAD hurt; investigate principled adaptation (e.g., sparse supervision, self-training, continual learning) and conditions under which aggregation retraining helps or harms.
- Generalization remains under-explored: validation only on Mt. Etna and Vulcano; test across wider planetary analogs (sand/dust storms, extreme lighting, specular surfaces), different sensors, and trajectory profiles; report cross-domain drop and mitigation strategies.
- Evaluation metrics are narrow: retrieval is reported as P@k without recall, mAP, or calibration of similarity scores; add comprehensive retrieval metrics and calibration analyses (e.g., precision–recall curves, ECE for similarity).
- Descriptor and resolution choices are fixed: assess the effect of image resolution (beyond 224×224), ViT scale (e.g., ViT-L/H), patch size, and SONATA configurations on accuracy and runtime; explore adaptive downsampling and multi-resolution pipelines.
- Dataset overlap definition may bias results: yaw-based overlap with position correction may not capture complex 3D overlap under elevation changes; propose overlap metrics reflecting full 6-DoF visibility and occlusion, and re-label positives/negatives accordingly.
- Reliance on LiDAR depth for 2D→3D projection is restrictive: evaluate alternatives (stereo depth, monocular depth from foundation models) for camera-only deployments or limited LiDAR FOV; quantify accuracy trade-offs.
- Handling dynamics is unaddressed: assess robustness to moving elements (e.g., dust plumes, vegetation motion), seasonal changes, and transient artifacts; consider dynamic object filtering in fusion/matching.
- Memory and storage constraints are not quantified: report descriptor storage per frame, index size, and map growth implications; study compression schemes and their impact on retrieval/pose accuracy.
- Comparative baselines are limited: include additional strong cross-modal matchers (e.g., SuperGlue/LightGlue with depth, modern 3D local descriptors, map-level registration) and recent foundation-model retrieval systems to strengthen empirical claims.
- Code and models are to be released: ensure reproducibility by documenting preprocessing, parameter settings, and dataset splits; provide scripts for calibration robustness tests and large-scale indexing experiments.
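As a concrete illustration of the acceptance-criteria gap noted above, a deliberately simple gate of the kind future work could tune and validate (all thresholds here are invented placeholders, not values from the paper):

```python
# Hedged sketch: gate a candidate loop closure on inlier statistics.
def accept_loop_closure(num_inliers: int, num_matches: int, rmse: float,
                        min_inliers: int = 25, min_ratio: float = 0.3,
                        max_rmse: float = 0.10) -> bool:
    """Accept only if inlier count, inlier ratio, and residual all pass."""
    if num_matches == 0:
        return False
    inlier_ratio = num_inliers / num_matches
    return (num_inliers >= min_inliers
            and inlier_ratio >= min_ratio
            and rmse <= max_rmse)
```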
Practical Applications
Immediate Applications
The following applications can be deployed now by leveraging the released MPRF codebase and off-the-shelf camera–LiDAR hardware, along with standard SLAM back-ends and tooling such as FAISS, RANSAC/ICP, and ROS.
- Sector: Robotics (Planetary Analogs, Field Robotics)
- Use case: Robust loop closure and re-localization for autonomous rovers in visually weak, unstructured terrains (lava fields, deserts, polar regions).
- What emerges: A ROS2 node or SLAM plug-in that adds two-stage visual retrieval (DINOv2 + SALAD) and geometric 6-DoF verification (DINOv2 + SONATA + RANSAC/ICP) to existing SLAM stacks.
- Tools/workflows: Integration with ORB-SLAM3/VINS-Fusion/LIO-SAM/Cartographer; FAISS-powered place database; pose-validated loop closures fed to back-ends.
- Assumptions/dependencies: Calibrated, time-synchronized camera–LiDAR; sufficient LiDAR coverage; GPU for feature extraction; known extrinsics; map database management.
- Sector: Mining and Subterranean Robotics (Tunnels, Caves, Underground Facilities)
- Use case: Drift reduction and reliable loop closure for UGVs/UAVs in GPS-denied, repetitive corridors and low-texture shafts.
- What emerges: A subsystem for loop closure and global relocalization within existing autonomy stacks (e.g., DARPA SubT-style platforms).
- Tools/workflows: Batch map optimization with MPRF re-ranking by pose; safety logs with interpretable correspondences for after-action validation.
- Assumptions/dependencies: Dust/smoke may reduce LiDAR/vision quality; ruggedized sensors; consistent extrinsics over time.
- Sector: Construction, AEC, and Mobile Mapping
- Use case: Improved loop closure in repetitive interiors (corridors, parking garages) for backpack or trolley mappers to build digital twins.
- What emerges: An MPRF-based “loop-closure enhancer” that plugs into mobile mapping pipelines to reduce drift and rework.
- Tools/workflows: Offline FAISS indexing of site scans; pose-validated loop closures before bundle adjustment.
- Assumptions/dependencies: Consistent sensor calibration; controlled traversal overlap; adequate lighting or LiDAR density.
- Sector: Industrial Inspection and Energy (Refineries, Power Plants, Substations)
- Use case: Reliable relocalization for inspection robots in GNSS-denied, texture-poor metallic environments with aliasing.
- What emerges: A place-recognition module with explicit 6-DoF verification that supports repeatable inspection routes and change detection.
- Tools/workflows: Persistent place database per facility; threshold-based acceptance of loop closures; integration with digital twin and CMMS platforms.
- Assumptions/dependencies: Safety-compliant hardware; access to facility map database; periodic recalibration checks.
- Sector: Warehousing and Logistics
- Use case: Loop closure for AGVs/AMRs navigating long, look-alike aisles and pallet stacks where vision-only methods alias.
- What emerges: A reliability layer for re-localization with explainable correspondences for operations debugging and safety audits.
- Tools/workflows: Real-time FAISS store updated incrementally; RANSAC inlier statistics used for acceptance/rejection logic. (See the indexing sketch after this list.)
- Assumptions/dependencies: Camera–LiDAR availability; compute budget on the robot or edge server; floor-level calibration stability.
- Sector: Healthcare Facilities (Hospital Logistics Robots)
- Use case: Robust loop closure in bland, low-texture corridors to avoid map drift and reduce operator interventions.
- What emerges: Drop-in module for hospital navigation stacks with interpretable loop-closure validation.
- Tools/workflows: Night-time map refresh with offline MPRF reranking; operational alerts when loop-closure confidence drops.
- Assumptions/dependencies: Compliance with hospital privacy and safety; sensor cleaning/maintenance to ensure LiDAR coverage.
- Sector: Outdoor GNSS-challenged Operations (Forestry, Canyons, Under-canopy Search and Rescue)
- Use case: SLAM stabilization for UGV/UAV platforms where GNSS is intermittent and textures are repetitive (rocks, foliage).
- What emerges: Field-deployable multimodal loop-closure block that reduces accumulated drift during long traverses.
- Tools/workflows: Mission debrief pipelines using pose-validated matches; map stitching across sorties using FAISS and geometric checks.
- Assumptions/dependencies: Adequate LiDAR returns in foliage; resilient synchronization under vibration and temperature.
- Sector: AR/VR and Prosumer Scanning
- Use case: Better loop closure in consumer/prosumer indoor scans made with iOS devices that include LiDAR (white walls, hallways).
- What emerges: An SDK add-on for ARKit/ARCore that replaces or augments relocalization with SALAD + DINOv2 retrieval and fused 6-DoF verification.
- Tools/workflows: On-device or edge FAISS index; periodic geometric verification for anchor persistence.
- Assumptions/dependencies: Device LiDAR availability (iPad Pro/iPhone Pro); app-level access to intrinsics/extrinsics; energy constraints.
- Sector: Academia and Research
- Use case: A strong baseline for multimodal place recognition with explicit 6-DoF estimation on unstructured datasets.
- What emerges: Reproducible benchmarks and ablations on S3LI/S3LI-Vulcano; method comparisons for fusion strategies, aggregation (SALAD), and fine-tuning.
- Tools/workflows: Open-source code and models; experiment scripts; standardized evaluation (Precision@k, yaw/translation thresholds).
- Assumptions/dependencies: GPU access; dataset licensing; careful train/test traversal splits to avoid leakage.
- Sector: Safety and Operations (Org-level, pre-standards)
- Use case: Engineering process for validating loop closures using interpretable correspondences and RANSAC inlier stats before accepting constraints.
- What emerges: Internal guidance and test protocols for GNSS-denied navigation solutions in field robots.
- Tools/workflows: Audit trails storing matched patches/points and inlier sets; thresholded acceptance criteria tuned to risk tolerance.
- Assumptions/dependencies: No formal regulatory standard yet; relies on internal safety engineering practices.
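Several of the workflows above mention an incrementally updated, FAISS-backed place database; a hedged sketch of what such a component could look like (the ID mapping and dimensionality are illustrative, not part of the released code):

```python
# Hedged sketch: incremental place database mapping descriptors to keyframe IDs.
import numpy as np
import faiss

D = 8192
db = faiss.IndexIDMap(faiss.IndexFlatIP(D))  # attach external keyframe IDs

def insert_keyframe(frame_id: int, desc: np.ndarray) -> None:
    """Add one keyframe descriptor under its own ID as the map grows."""
    desc = (desc / np.linalg.norm(desc)).reshape(1, D).astype("float32")
    db.add_with_ids(desc, np.array([frame_id], dtype="int64"))

def relocalize(desc: np.ndarray, k: int = 5):
    """Return (keyframe_id, similarity) pairs for downstream geometric checks."""
    desc = (desc / np.linalg.norm(desc)).reshape(1, D).astype("float32")
    sims, ids = db.search(desc, k)
    return list(zip(ids[0].tolist(), sims[0].tolist()))
```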
Long-Term Applications
The following rely on further research, engineering for real-time/embedded constraints, domain adaptation, or standardization.
- Sector: Space/Planetary Missions
- Use case: Onboard, radiation-hardened, real-time multimodal loop closure for planetary rovers in dust, lighting extremes, and long traverses.
- What emerges: TRL-advanced MPRF variant with quantized/optimized DINOv2/SONATA, and low-power FAISS alternatives.
- Tools/workflows: In-flight map databases; autonomous loop-closure acceptance policies using uncertainty and inlier statistics.
- Assumptions/dependencies: Rad-hard compute; thermal/vibration robustness; rigorous verification/validation and fault tolerance.
- Sector: Autonomous Vehicles and Off-road Mobility
- Use case: Off-road AV mapping and relocalization in visually repetitive natural terrains; robust map updates in GNSS-poor areas.
- What emerges: Fusion module that complements LiDAR odometry with foundation-model retrieval; closed-loop map maintenance.
- Tools/workflows: Multi-session FAISS indices; cross-season relocalization with adaptive thresholds; fleet-level map services.
- Assumptions/dependencies: Automotive-grade sensors; real-time latency budgets; long-term domain shifts (weather, seasons).
- Sector: Multi-robot and Swarm Mapping
- Use case: Cross-agent place recognition and pose-graph merging using compact global descriptors and pose-verified correspondences.
- What emerges: Distributed FAISS/index sharding; bandwidth-aware descriptor sharing; consensus-based geometric verification.
- Tools/workflows: Edge-cloud map fusion; conflict resolution using inlier statistics; collaborative SLAM back-ends.
- Assumptions/dependencies: Communication constraints; descriptor compression; time-sync across platforms.
- Sector: Beyond LiDAR–Vision Fusion (New Modalities)
- Use case: Robust loop closure in smoke/fog/night using radar–thermal–vision fusion; underwater mapping with sonar–vision analogs.
- What emerges: Foundation-model extensions for radar/sonar/thermal descriptors; cross-modal projection and fusion akin to DINOv2–SONATA.
- Tools/workflows: Modality-specific pretraining; multi-modal calibration toolchains; domain-specific RANSAC variants.
- Assumptions/dependencies: Availability of pretrained foundation backbones for new modalities; accurate multi-sensor extrinsics.
- Sector: Edge/Embedded Acceleration and Real-time Guarantees
- Use case: Deploy MPRF on embedded SoCs for small robots and drones with tight power/latency constraints.
- What emerges: INT8/FP8 quantized models; mixed-precision SALAD; learned or hardware-accelerated indexing; approximate geometric verification.
- Tools/workflows: Compiler toolchains (TensorRT, TVM); on-chip vector search; scheduler co-design with perception stack.
- Assumptions/dependencies: Accuracy retention after compression; deterministic latency; thermal envelopes.
- Sector: Standards, Certification, and Policy
- Use case: Safety cases and certification frameworks for explainable loop closures in GNSS-denied navigation.
- What emerges: Standardized metrics (inlier counts, residuals), datasets, and acceptance criteria for loop-closure constraints in safety-critical robots.
- Tools/workflows: Conformance test suites; logging formats for correspondences; procurement language requiring interpretable pose verification.
- Assumptions/dependencies: Multi-stakeholder consensus; public benchmarks; regulator engagement.
- Sector: Lifelong and Continual Mapping
- Use case: Long-term operations with environment changes (seasonal, structural) and hardware drift (sensor aging).
- What emerges: Continual fine-tuning pipelines for DINOv2-like backbones; self-supervised updates to descriptor spaces without catastrophic forgetting.
- Tools/workflows: Scheduled reindexing; drift-aware recalibration; active learning for challenging segments.
- Assumptions/dependencies: Data governance; compute for periodic retraining; safeguards against map corruption.
- Sector: Consumer AR Cloud and Large Indoor Navigation
- Use case: Persistent, privacy-preserving place recognition and relocalization across devices and sessions in malls, campuses, airports.
- What emerges: Cloud FAISS services with pose-verified anchors; cross-device calibration handling; map sharing with minimal raw image transfer.
- Tools/workflows: Federated descriptor aggregation; edge verification; anchor lifecycle management.
- Assumptions/dependencies: Reliable device sensors (including ToF/LiDAR or depth); privacy and data policies; scalable back-end.
- Sector: Cultural Heritage, Archaeology, and Hazardous Sites
- Use case: Robust mapping of caves/tunnels/ruins where textures are weak and GPS is unavailable; safe, repeatable scans.
- What emerges: Drone/UGV kits with MPRF-based mapping; provenance tracking via interpretable correspondences.
- Tools/workflows: Offline consolidation of multi-session scans; confidence-based acceptance to protect fragile sites.
- Assumptions/dependencies: Site permissions; careful sensor calibration; low-light robustness via LiDAR.
- Sector: Education and Workforce Training
- Use case: Teaching multimodal SLAM with explainable loop closures.
- What emerges: Course modules and lab kits built around MPRF and S3LI/S3LI-Vulcano; assignments on fusion, retrieval, and verification.
- Tools/workflows: Dockerized stacks; notebooks for ablation; simulated environments (Gazebo/Isaac Sim) with unstructured scenes.
- Assumptions/dependencies: GPU-enabled lab resources; dataset access; instructor expertise.
Cross-cutting assumptions and dependencies to consider
- Sensor calibration and synchronization: Accurate intrinsics/extrinsics and time alignment of camera–LiDAR are critical for projection and fusion.
- Compute resources: Current pipeline timings assume a desktop-class GPU; embedded deployment requires optimization.
- Environmental fit: Gains are strongest in unstructured, low-texture, or aliased environments; performance in highly dynamic scenes may require additional handling.
- Data management: FAISS index maintenance, map session handling, and reindexing strategies affect scalability and reliability.
- Validation policy: Loop-closure acceptance thresholds (e.g., inlier counts, residuals) should be tuned to the risk profile of the application.
- Licensing and model availability: Use of DINOv2, SALAD, SONATA, and the MPRF codebase must align with their respective licenses; pretrained weights for target modalities may be required.
Glossary
- 6-DoF: Six degrees of freedom; a full 3D pose comprising three translations and three rotations. "explicit 6-DoF pose estimation"
- approximate nearest-neighbor search: An efficient technique to find close vector matches in high-dimensional spaces. "approximate nearest-neighbor search [38]"
- CLS token: A special transformer token used as a global representation of an input image. "DINOv2 (b) (CLS Token)"
- cosine similarity: A vector similarity measure based on the cosine of the angle between two embeddings. "compared using cosine similarity"
- D-GNSS: Differential Global Navigation Satellite System; a high-precision localization method using corrections to GNSS signals. "D-GNSS measurements"
- DINOv2: A self-supervised Vision Transformer that produces robust patch-level image descriptors. "We employ DINOv2 [15]"
- DSAC: Differentiable RANSAC; a learning framework integrating RANSAC into neural training for pose estimation. "DSAC [33]"
- FAISS: Facebook AI Similarity Search; a library for fast similarity search and clustering of dense vectors. "FAISS (Facebook AI Similarity Search)"
- FPFH: Fast Point Feature Histograms; a hand-crafted local 3D point cloud descriptor for registration. "FPFH + RANSAC"
- FoundPose: A foundation model-based estimator for unseen object pose using DINOv2 features and PnP+RANSAC. "FoundPose [34]"
- GeM: Generalized Mean Pooling; a pooling strategy that improves compactness and retrieval accuracy. "pooling strategies such as GeM [11]"
- GNSS-denied environments: Scenarios where satellite navigation signals are unavailable or unreliable. "GNSS-denied environments"
- ICP: Iterative Closest Point; an algorithm that refines rigid alignment between point clouds. "PnP+RANSAC and ICP."
- LiDAR: Light Detection and Ranging; a sensor producing 3D point clouds via laser scanning. "LiDAR pointcloud"
- LoFTR: A detector-free transformer-based dense feature matcher for visual correspondence. "LoFTR [31]"
- MLP-Mixer: An architecture that mixes features using multilayer perceptrons instead of convolution or attention. "MLP-Mixer architectures like MixVPR [13]"
- MinkLoc3D: A point cloud-based large-scale place recognition model using sparse convolutions. "MinkLoc3D [22]"
- MinkLoc++: A multimodal place recognition method fusing LiDAR and monocular images. "MinkLoc++ underperforms compared to visual-only"
- NetVLAD: A CNN with a differentiable VLAD layer for place recognition. "NetVLAD [8] introducing a differentiable VLAD layer"
- Optimal transport clustering: A clustering approach leveraging optimal transport to aggregate local descriptors. "optimal transport clustering"
- PCA: Principal Component Analysis; a dimensionality reduction technique often used for visualization. "PCA colored"
- PnP: Perspective-n-Point; an algorithm to estimate camera pose from 2D–3D correspondences. "PnP+RANSAC"
- Precision@1: A retrieval metric indicating whether the top-ranked candidate is correct. "Precision@1"
- RANSAC: Random Sample Consensus; a robust estimator that fits models by rejecting outliers through sampling. "RANSAC-based point-to-point registration"
- SALAD: Sinkhorn Algorithm for Locally Aggregated Descriptors; an optimal transport-based global descriptor aggregator. "SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors)"
- SE(3): The Lie group of 3D rigid transformations (rotation and translation). "SE(3)"
- SONATA: A self-supervised point cloud transformer yielding reliable multi-scale 3D descriptors. "SONATA [25]"
- TransVPR: A transformer-based place recognition model with multi-level attention aggregation. "TransVPR [16]"
- Triplet margin loss: A metric learning objective encouraging an anchor to be closer to a positive than a negative by a margin. "triplet margin loss (m = 0.2)"
- ViT-B/14: Vision Transformer Base with 14×14 patch size, used as the DINOv2 backbone. "ViT-B/14 DINOv2 backbone"
- VLAD: Vector of Locally Aggregated Descriptors; aggregates residuals of local features to cluster centers. "VLAD-style clustering"
- VPR: Visual Place Recognition; matching images to places under viewpoint or appearance changes. "Transformers have significantly advanced VPR"
- yaw: Rotation around the vertical axis, used here as an angular component in pose evaluation. "yaw-based angular alignment"
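For reference, two of the entries above reduce to short standard formulas; these are the textbook definitions, not anything specific to MPRF:

```latex
% Cosine similarity between descriptors a and b:
\mathrm{sim}(a, b) = \frac{a^{\top} b}{\lVert a \rVert \, \lVert b \rVert}

% Generalized Mean (GeM) pooling over local features x_1, ..., x_N;
% p = 1 recovers average pooling, and p -> infinity approaches max pooling:
f_{\mathrm{GeM}} = \left( \frac{1}{N} \sum_{i=1}^{N} x_i^{\,p} \right)^{1/p}
```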