Open-vocabulary Semantic SLAM
- Open-vocabulary Semantic SLAM is a system that incrementally builds 3D maps while dynamically associating free-form natural language labels to regions and objects.
- It leverages advanced vision-language models like CLIP and LLMs to enable zero-shot or few-shot inference for novel, previously unseen object categories.
- This integrated approach facilitates interactive, real-time robotic navigation, manipulation, and AR/VR applications by uniting geometric reconstruction with semantic context.
Open-vocabulary Semantic SLAM refers to the integration of Simultaneous Localization and Mapping (SLAM) with open-vocabulary semantic understanding: a robotic agent incrementally builds a 3D map while assigning semantic categories, drawn from an unbounded set of free-form natural language labels, to 3D regions, objects, or voxels. Instead of being constrained to a predefined, closed set of object classes, open-vocabulary Semantic SLAM enables zero-shot or few-shot semantic inference by leveraging vision-language models (VLMs) such as CLIP, large language models (LLMs), and foundation segmentation models (e.g., SAM). The objective is not only geometric mapping and localization but also context-rich, object- or region-centric scene representations that can be queried or reasoned about using arbitrary linguistic prompts, which is critical for robotic interaction in complex, novel, and dynamic environments.
1. Core Concepts and Motivations
Open-vocabulary Semantic SLAM extends classical SLAM by enabling dense or object-centric 3D mapping tightly coupled with unconstrained semantic understanding, grounded in large pretrained vision-language models. Rather than maintaining a fixed set of semantic categories, the system represents 3D entities (points, voxels, objects, or map subgraphs) with language-embedded features, which directly supports text queries and natural language interaction. This capability is essential for:
- Navigation and manipulation in previously unseen or dynamic environments, where objects and regions may have never been encountered or labeled in training;
- Supporting zero-shot or few-shot object and region retrieval via language queries;
- Enabling lifelong and open-set scene understanding where the set of categories grows dynamically;
- Real-time integration on robotics, AR/VR, and embodied AI systems.
Traditional SLAM systems (e.g., ORB-SLAM2) are limited to mapping geometry and camera poses. Closed-set semantic extensions fuse outputs from per-pixel segmentation networks trained on a small set of classes, which is inadequate for the diversity and flexibility required by autonomous agents in open worlds. Open-vocabulary approaches address these limitations by embedding vision-language context into every step of the SLAM pipeline (Deng et al., 27 Mar 2026, Nasser et al., 1 Dec 2025, Laina et al., 11 Apr 2025).
2. Architectures and Representations
Most open-vocabulary semantic SLAM systems, such as OVI-MAP (Deng et al., 27 Mar 2026), OpenVox (Deng et al., 23 Feb 2025), LEGO-SLAM (Lee et al., 20 Nov 2025), FindAnything (Laina et al., 11 Apr 2025), and KM-ViPE (Nasser et al., 1 Dec 2025), build on the following architectural backbone:
- Geometric Backbone: Incremental 3D mapping using RGB-D SLAM, visual-inertial SLAM, monocular depth prediction, or 3D Gaussian Splatting (3DGS) (Yoo et al., 9 Dec 2025, Lee et al., 20 Nov 2025).
- Volumetric or Object-Centric Map: The map maintains either voxel grids (TSDF, occupancy, or implicit neural fields (Tie et al., 2024)), super-points (object-centric clusters), 3D Gaussian primitives, or hybrid scene graphs (Pan et al., 18 Jun 2025).
- Semantic Feature Fields: Semantic information is injected via per-voxel, per-object, or per-Gaussian vision-language features (CLIP, LLM-encoded captions, Transformer-based embeddings), or via weights accumulated in high-dimensional feature spaces.
- Instance Segmentation and Tracking: 2D mask proposals and object-centric segmentations are obtained from generalist segmentors (SAM, eSAM, CropFormer, YOLO-World + TAP (Deng et al., 23 Feb 2025)) and then lifted to 3D using depth and pose.
- Language Model Integration: Features from image-text models are extracted on dynamically selected informative views or segments (Deng et al., 27 Mar 2026, Martins et al., 2024) and aggregated via learned or heuristic fusion.
- Data Association and Dynamic Fusion: Probabilistic instance assignment combines geometric, semantic, and temporal information (Deng et al., 23 Feb 2025, Pan et al., 18 Jun 2025), often under Bayesian or MAP/ML formulations.
A key trend is the decoupling of 3D geometric reconstruction from semantic inference, reducing computational load by triggering expensive vision-language queries only for a sparse, dynamically selected set of keyviews or object centroids (Deng et al., 27 Mar 2026, Martins et al., 2024).
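The "lifted to 3D using depth and pose" step in the list above can be illustrated with a minimal sketch. It assumes a pinhole camera model, a depth image in metres where 0 marks invalid pixels, and a 4x4 camera-to-world pose; the function name `lift_mask_to_world` and the plain nested-list data layout are hypothetical choices for illustration, not the interface of any cited system.

```python
def lift_mask_to_world(mask, depth, intrinsics, cam_to_world):
    """Back-project masked pixels with valid depth into world coordinates.

    mask:         2D list of bools (one 2D instance mask)
    depth:        2D list of depths in metres (0 = invalid), same shape as mask
    intrinsics:   (fx, fy, cx, cy) of an assumed pinhole camera
    cam_to_world: 4x4 row-major nested list (rigid camera-to-world transform)
    """
    fx, fy, cx, cy = intrinsics
    points = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (inside, z) in enumerate(zip(mask_row, depth_row)):
            if not inside or z <= 0.0:
                continue  # skip pixels outside the mask or without depth
            # Pinhole back-projection into the camera frame.
            xc = (u - cx) * z / fx
            yc = (v - cy) * z / fy
            # Rigid transform of (xc, yc, z, 1) into the world frame.
            point = tuple(
                row[0] * xc + row[1] * yc + row[2] * z + row[3]
                for row in cam_to_world[:3]
            )
            points.append(point)
    return points
```

The returned points would then be integrated into whichever map representation the system uses (voxels, super-points, or Gaussians); that integration step is system-specific and is not shown here.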
3. Semantic Embedding, Fusion, and Query Mechanisms
Semantic Feature Extraction and Fusion
Semantic information can be injected at various levels:
- Per-pixel or Per-voxel: CLIP or DINOv2 visual features stored at each voxel (Deng et al., 27 Mar 2026, Nasser et al., 1 Dec 2025, Laina et al., 11 Apr 2025, Tie et al., 2024).
- Instance-level: Each 3D object/segment maintains a fused vision-language embedding, typically accumulated via weighted averaging across selected or all views (Deng et al., 27 Mar 2026, Deng et al., 23 Feb 2025).
- Caption Augmentation: Incorporating LLM-encoded captions to disambiguate instances beyond pure vision-language image features (Deng et al., 23 Feb 2025).
For object- or segment-level tracking, multi-view fusion is implemented by either weighted averaging, selecting the most representative view, or aggregating via per-dimension attention (transformer-MLP merging (Martins et al., 2024)). Fusion mechanisms are designed to support robustness against noise and cross-view inconsistency, with techniques like adaptive voxel splitting and multi-view voting improving segmentation stability (Tie et al., 2024, Yoo et al., 9 Dec 2025).
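The weighted-averaging variant of multi-view fusion can be sketched as a running weighted mean of unit-normalized per-view embeddings. This is a minimal illustration, not the fusion rule of any particular cited system: the class name `FusedEmbedding` is hypothetical, and using the visible pixel count of the mask as the per-view weight is one plausible choice assumed here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class FusedEmbedding:
    """Running weighted mean of per-view embeddings for one 3D instance."""

    def __init__(self, dim):
        self.weighted_sum = [0.0] * dim
        self.total_weight = 0.0

    def update(self, feature, weight):
        """Fold in one view; `weight` could be the mask's pixel count."""
        unit = l2_normalize(feature)
        self.weighted_sum = [
            s + weight * x for s, x in zip(self.weighted_sum, unit)
        ]
        self.total_weight += weight

    def value(self):
        """Return the fused embedding, re-normalized for cosine queries."""
        if self.total_weight == 0:
            return list(self.weighted_sum)
        mean = [s / self.total_weight for s in self.weighted_sum]
        return l2_normalize(mean)
```

Because only a sum vector and a scalar weight are stored per instance, the update is constant-memory regardless of how many views have been fused.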
Open-vocabulary Query Resolution
Queries are supported as follows:
- Map segments are semantically matched to input text via cosine similarity between the segment’s CLIP (or caption) feature and the embedding of a text prompt (Deng et al., 27 Mar 2026, Laina et al., 11 Apr 2025, Deng et al., 23 Feb 2025).
- Softmax with temperature scaling is common for converting similarity to probabilities (Deng et al., 27 Mar 2026).
- For region- or scene-level planning, object-level scores are composed into utility measures for exploration or manipulation (Laina et al., 11 Apr 2025, Qiu et al., 2024).
- LLMs may be leveraged for spatial region abstraction or for reasoning over scene graphs and object semantics for navigation and manipulation tasks (Qiu et al., 2024, Xie et al., 17 Jul 2025).
The queryability and compositionality of these semantic maps enable downstream robot behaviors including search, navigation, manipulation, and dialogue-based interaction.
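The cosine-similarity matching and temperature-scaled softmax described in this section can be sketched as follows. The embeddings are stand-in toy vectors; in a real system they would come from a VLM's image and text encoders, which are not called here. The function names and the default temperature of 0.1 are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def rank_segments(segment_features, text_embedding):
    """Rank map segments by similarity to one text query (best first)."""
    scores = {
        seg_id: cosine(feature, text_embedding)
        for seg_id, feature in segment_features.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def label_probabilities(segment_feature, prompt_embeddings, temperature=0.1):
    """Softmax over candidate text prompts for a single segment."""
    sims = {
        label: cosine(segment_feature, embedding)
        for label, embedding in prompt_embeddings.items()
    }
    peak = max(sims.values())  # subtract the max for numerical stability
    exps = {
        label: math.exp((sim - peak) / temperature)
        for label, sim in sims.items()
    }
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}
```

A lower temperature sharpens the distribution toward the best-matching prompt, while a higher one spreads probability across similar labels.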
4. Data Association, Tracking, and Real-Time Performance
Correctly associating 2D segmentations with persistent 3D instances is essential for temporal semantic consistency and robust tracking. OVI-MAP (Deng et al., 27 Mar 2026) employs voting and label buffers for super-point instances in voxel maps, with spatial overlap and geometric refinement for merging. OpenVox (Deng et al., 23 Feb 2025) uses a Dirichlet-categorical probabilistic representation for voxel-to-instance association, fusing geometric support and caption feature similarity.
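A Dirichlet-categorical representation of voxel-to-instance association can be sketched generically as a table of pseudo-counts per voxel. This is a textbook-style illustration under stated assumptions, not OpenVox's actual update rule: the class name `VoxelInstanceBelief`, the symmetric prior, and the idea of passing a combined geometric and caption-similarity score as `evidence` are all hypothetical.

```python
class VoxelInstanceBelief:
    """Dirichlet pseudo-counts over candidate instance IDs for one voxel."""

    def __init__(self, prior=1.0):
        self.prior = prior  # pseudo-count given to a newly seen instance
        self.alpha = {}     # instance id -> Dirichlet concentration

    def observe(self, instance_id, evidence=1.0):
        """Add evidence that this voxel belongs to `instance_id`.

        `evidence` could combine geometric support and feature similarity;
        how those terms are weighted is an assumption left to the caller.
        """
        current = self.alpha.get(instance_id, self.prior)
        self.alpha[instance_id] = current + evidence

    def posterior_mean(self):
        """Expected categorical probabilities under the Dirichlet."""
        total = sum(self.alpha.values())
        return {k: a / total for k, a in self.alpha.items()}

    def most_likely(self):
        """Instance with the largest concentration, or None if unobserved."""
        return max(self.alpha, key=self.alpha.get) if self.alpha else None
```

Keeping counts rather than a hard label lets a voxel change its assignment as evidence accumulates, which is the property that makes this family of representations attractive for incremental mapping.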
Real-time operation is achieved via:
- Parallel processing pipelines for segmentation, mapping, and language inference (Deng et al., 27 Mar 2026);
- Sparse updates and focus on object- or region-centric representations, rather than dense per-pixel features;
- Scene-adaptive feature compression, such as LEGO-SLAM’s 16-D language-embedded Gaussians, which reduces per-element memory and bandwidth by 32× relative to full CLIP fields (Lee et al., 20 Nov 2025);
- Informative-view selection heuristics for semantic feature extraction, often yielding large reductions in vision-language model calls (e.g., a 53% decrease in VLM queries in OVI-MAP (Deng et al., 27 Mar 2026)).
Real-time SLAM throughput in the range of 15–30 FPS is reported on mainstream GPUs, including in resource-constrained settings such as micro-drones (Laina et al., 11 Apr 2025), with system latency dominated by segmentation and language model inference.
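The 32× figure quoted for 16-D compressed features is consistent with a 512-D full feature per element, since 512 / 16 = 32. The sketch below works through that arithmetic; the 512-D full dimension, 32-bit floats, and the one-million-element map size are assumptions for illustration rather than values taken from the cited paper.

```python
def feature_memory_bytes(num_elements, dim, bytes_per_value=4):
    """Memory for one dense feature vector per map element."""
    return num_elements * dim * bytes_per_value


# Assumed setting: one million map elements, float32 features.
NUM_ELEMENTS = 1_000_000
full_bytes = feature_memory_bytes(NUM_ELEMENTS, dim=512)    # assumed full size
compact_bytes = feature_memory_bytes(NUM_ELEMENTS, dim=16)  # 16-D compressed
compression_ratio = full_bytes // compact_bytes

print(f"full:    {full_bytes / 1e6:.0f} MB")
print(f"compact: {compact_bytes / 1e6:.0f} MB")
print(f"ratio:   {compression_ratio}x")
```

Under these assumptions the feature field shrinks from about 2 GB to 64 MB, which indicates why compression matters for memory and bandwidth on embedded hardware.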
5. Quantitative Evaluation and Benchmarks
Open-vocabulary semantic SLAM performance is conventionally evaluated on public benchmarks such as Replica and ScanNet. Metrics include mean Intersection over Union (mIoU), mean accuracy (mAcc), and average precision (AP) at multiple IoU thresholds. Recent systems report:
| System | mIoU (Replica) | mAcc (Replica) | Real-time | Comments |
|---|---|---|---|---|
| OVI-MAP | 27% | --- | Yes (30 FPS) | Outperforms OVO-SLAM and Mask3D |
| OVO-SLAM | 25% | --- | Yes (~14 FPS) | Online pipeline, loop closure |
| OpenVox | 27.3% | 43.4% | Yes (20–30 FPS) | State-of-the-art open-vocab AP |
| FindAnything | --- | 48.8% | Yes | Deformable submaps, MAV support |
| LEGO-SLAM | --- | --- | Yes (15 FPS) | 3DGS, compact features |
| KM-ViPE | 3.8% | 10.9% | Yes (8 FPS) | Fully monocular, no depth |
Ablation studies confirm that advanced multi-view fusion and language-aware compression are crucial, e.g., feature fusion by pixel count outperforms clustering heuristics (Deng et al., 27 Mar 2026, Table 6).
In real robotic deployments, open-vocabulary systems enable robust search, navigation, and manipulation, with high room and object retrieval rates, strong success weighted by path length (SPL), and tolerance to scene dynamics and unmapped objects (Qiu et al., 2024, Xie et al., 17 Jul 2025). Real-time mobile manipulation experiments achieve 80.95% navigation and 73.33% task success (Qiu et al., 2024).
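The mIoU and mAcc metrics named at the start of this section can be computed from a confusion matrix over per-point (or per-vertex) labels. The sketch below is a minimal version assuming integer class IDs and averaging only over classes present in the ground truth; individual benchmarks may use different conventions for absent or ignored classes, and the function name `miou_and_macc` is hypothetical.

```python
def miou_and_macc(pred, gt, num_classes):
    """Return (mIoU, mAcc) averaged over classes present in the ground truth."""
    # confusion[g][p] counts points with ground truth g predicted as p.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for p, g in zip(pred, gt):
        confusion[g][p] += 1

    ious, accs = [], []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        gt_count = sum(confusion[c])
        pred_count = sum(confusion[r][c] for r in range(num_classes))
        if gt_count == 0:
            continue  # class absent from ground truth: skipped by assumption
        union = gt_count + pred_count - true_pos
        ious.append(true_pos / union)
        accs.append(true_pos / gt_count)

    if not ious:
        raise ValueError("ground truth contains no valid classes")
    return sum(ious) / len(ious), sum(accs) / len(accs)
```

In the open-vocabulary setting, `pred` would typically be obtained by assigning each map element the benchmark class whose text prompt scores highest, so the metric also reflects the quality of the text-to-feature matching.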
6. Limitations, Open Challenges, and Directions
Despite substantial advances, several limitations persist:
- Dependence on 2D segmentor accuracy: Small, textureless, or heavily occluded objects present segmentation challenges; improvements in the underlying segmentation backbones lead to significant AP and mIoU gains (Deng et al., 27 Mar 2026, Martins et al., 2024).
- Semantic embedding drift: VLM features may be suboptimal under occlusion, background clutter, or prompt ambiguity; misaligned or noisy language embeddings remain a bottleneck (Deng et al., 27 Mar 2026, Deng et al., 23 Feb 2025).
- Memory and scalability: Storing high-dimensional features for every voxel or Gaussian limits scalability, addressed via feature distillation and pruning but still an area of active research (Lee et al., 20 Nov 2025, Yoo et al., 9 Dec 2025).
- Dynamic scenes and lifelong operation: Handling dynamic objects, map updates for moved or unmapped entities, and cross-scene memory management are nontrivial (Nasser et al., 1 Dec 2025, Xie et al., 17 Jul 2025). Approaches range from adaptive robust optimization to text-based map abstraction and LLM-driven belief updates.
- Tail-class and open-set generalization: Recognition of rare or completely novel categories is still limited by the pretraining distribution of VLMs and LLMs, with semantic accuracy decaying for classes far from the training distribution (Nasser et al., 1 Dec 2025, Martins et al., 2024).
Current research extends open-vocabulary SLAM toward richer prompt engineering (scene graphs, object relations), robust handling of dynamics via super-point temporal coherence, uncertainty modeling for informed exploration, and integration with LLMs for semantic region abstraction, reasoning, and interaction (Qiu et al., 2024, Xie et al., 17 Jul 2025).
7. Impact and Applications
Open-vocabulary semantic SLAM is central to state-of-the-art embodied AI, enabling:
- Zero-shot navigation, language-driven exploration, and object relocation by combining semantic maps with LLM-based reasoning (Qiu et al., 2024, Xie et al., 17 Jul 2025);
- Fine-grained, interactive scene understanding and manipulation (finding, grasping, and fetching) based on free-form user instructions;
- Real-time scene labeling, region abstraction for AR/VR overlays, and context-aware robot-human collaboration;
- Robotics applications in diverse, dynamic, and open-world environments, including indoor/outdoor, ego-centric, and mobile manipulation scenarios (Yoo et al., 9 Dec 2025, Pan et al., 18 Jun 2025, Nasser et al., 1 Dec 2025).
These systems are reported to outperform closed-set or class-restricted semantic SLAM on both standard benchmarks and real-world deployments, providing scalable, robust, and versatile 3D scene representations that can serve as a foundation for the next generation of autonomous perception and reasoning systems (Deng et al., 27 Mar 2026, Lee et al., 20 Nov 2025, Deng et al., 23 Feb 2025).