
Open-vocabulary Semantic SLAM

Updated 4 April 2026
  • Open-vocabulary Semantic SLAM is a system that incrementally builds 3D maps while dynamically associating free-form natural language labels to regions and objects.
  • It leverages advanced vision-language models like CLIP and LLMs to enable zero-shot or few-shot inference for novel, previously unseen object categories.
  • This integrated approach facilitates interactive, real-time robotic navigation, manipulation, and AR/VR applications by uniting geometric reconstruction with semantic context.

Open-vocabulary Semantic SLAM refers to the integration of Simultaneous Localization and Mapping (SLAM) with open-vocabulary semantic understanding: a robotic agent incrementally builds a 3D map while assigning semantic categories to 3D regions, objects, or voxels, with the categories drawn from an unbounded set of free-form natural language labels. Instead of being constrained to a predefined, closed set of object classes, open-vocabulary Semantic SLAM enables zero-shot or few-shot semantic inference by leveraging vision-language models (VLMs) such as CLIP, large language models (LLMs), and foundation segmentation models (e.g., SAM). The objective is not only geometric mapping and localization but also context-rich, object- or region-centric scene representations that can be queried or reasoned about using arbitrary linguistic prompts, which is critical for robotic interaction in complex, novel, and dynamic environments.

1. Core Concepts and Motivations

Open-vocabulary Semantic SLAM extends classical SLAM by coupling dense or object-centric 3D mapping with unconstrained semantic understanding grounded in large pretrained vision-language models. Rather than maintaining a fixed set of semantic categories, the system represents 3D entities (points, voxels, objects, or map subgraphs) with language-embedded features, which directly allows text queries and natural language interaction. This capability is essential for:

  • Navigation and manipulation in previously unseen or dynamic environments, where objects and regions may have never been encountered or labeled in training;
  • Supporting zero-shot or few-shot object and region retrieval via language queries;
  • Enabling lifelong and open-set scene understanding where the set of categories grows dynamically;
  • Real-time integration on robotics, AR/VR, and embodied AI systems.

Traditional SLAM systems (e.g., ORB-SLAM2) are limited to mapping geometry and camera poses. Closed-set semantic extensions fuse outputs from per-pixel segmentation networks trained on a small set of classes, which is inadequate for the diversity and flexibility required by autonomous agents in open worlds. Open-vocabulary approaches address these limitations by embedding vision-language context into every step of the SLAM pipeline (Deng et al., 27 Mar 2026, Nasser et al., 1 Dec 2025, Laina et al., 11 Apr 2025).

2. Architectures and Representations

Most open-vocabulary semantic SLAM systems, such as OVI-MAP (Deng et al., 27 Mar 2026), OpenVox (Deng et al., 23 Feb 2025), LEGO-SLAM (Lee et al., 20 Nov 2025), FindAnything (Laina et al., 11 Apr 2025), and KM-ViPE (Nasser et al., 1 Dec 2025), build on the following architectural backbone:

  • Geometric Backbone: Incremental 3D mapping using RGB-D SLAM, visual-inertial SLAM, monocular depth prediction, or 3D Gaussian Splatting (3DGS) (Yoo et al., 9 Dec 2025, Lee et al., 20 Nov 2025).
  • Volumetric or Object-Centric Map: The map maintains either voxel grids (TSDF, occupancy, or implicit neural fields (Tie et al., 2024)), super-points (object-centric clusters), 3D Gaussian primitives, or hybrid scene graphs (Pan et al., 18 Jun 2025).
  • Semantic Feature Fields: Semantic information is injected via per-voxel, per-object, or per-Gaussian vision-language features (CLIP, LLM-encoded captions, Transformer-based embeddings) or accumulating weights in high-dimensional feature spaces.
  • Instance Segmentation and Tracking: 2D mask proposals and object-centric segmentations are obtained from generalist segmentors (SAM, eSAM, CropFormer, YOLO-World+TAP (Deng et al., 23 Feb 2025)) and then lifted to 3D using depth and pose.
  • Vision-Language Model Integration: Features from image-text models are extracted on dynamically selected informative views or segments (Deng et al., 27 Mar 2026, Martins et al., 2024), and aggregated via learned or heuristic fusion.
  • Data Association and Dynamic Fusion: Probabilistic instance assignment combines geometric, semantic, and temporal information (Deng et al., 23 Feb 2025, Pan et al., 18 Jun 2025), often under Bayes or MAP/ML decompositions.
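
The "lifted to 3D using depth and pose" step in the list above can be illustrated with a minimal sketch. It assumes a pinhole camera model, metric depth, and a camera-to-world pose matrix; the function name `lift_mask_to_world` and the toy values are hypothetical and are not taken from any of the cited systems.

```python
import numpy as np

def lift_mask_to_world(mask, depth, K, T_world_cam):
    """Back-project masked pixels with valid depth into world-frame 3D points.

    mask: (H, W) bool, depth: (H, W) metres, K: (3, 3) pinhole intrinsics,
    T_world_cam: (4, 4) camera-to-world pose (assumed to come from the SLAM backbone).
    """
    v, u = np.nonzero(mask & (depth > 0))        # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]              # inverse pinhole projection
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)   # (N, 4) homogeneous
    return (pts_cam @ T_world_cam.T)[:, :3]      # transform to the world frame

# Toy example: 4x4 image, 2x2 mask, constant 2 m depth, identity pose.
K = np.array([[2.0, 0.0, 2.0],
              [0.0, 2.0, 2.0],
              [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
points = lift_mask_to_world(mask, depth, K, np.eye(4))
print(points.shape)  # one 3D point per masked pixel with valid depth
```

In a full pipeline the resulting points would then be associated with voxels, super-points, or Gaussians; that step is system-specific and is not shown here.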

A key trend is the decoupling of 3D geometric reconstruction from semantic inference, reducing computational load by triggering expensive vision-language queries only for a sparse, dynamically selected set of keyviews or object centroids (Deng et al., 27 Mar 2026, Martins et al., 2024).
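
One way to picture this sparse triggering is the sketch below. It is a hypothetical heuristic, not the selection rule of OVI-MAP or any other cited system: an object is sent to the vision-language model only when the current view shows it substantially better (here, by visible pixel count) than any view already queried. The class name `KeyviewSelector` and the `gain` threshold are assumptions made for illustration.

```python
class KeyviewSelector:
    """Decide, per object, whether a new view justifies another VLM call."""

    def __init__(self, gain=1.5):
        self.gain = gain            # hypothetical: required improvement factor
        self.best_pixels = {}       # object id -> largest visible area queried so far

    def should_query(self, obj_id, visible_pixels):
        best = self.best_pixels.get(obj_id, 0)
        if visible_pixels > 0 and visible_pixels >= self.gain * best:
            self.best_pixels[obj_id] = visible_pixels
            return True             # informative view: run the expensive VLM
        return False                # skip: an equally good view was already used

selector = KeyviewSelector(gain=1.5)
observations = [("chair", 400), ("chair", 450), ("chair", 700),
                ("lamp", 120), ("chair", 710)]
decisions = [selector.should_query(obj, n) for obj, n in observations]
print(decisions)  # [True, False, True, True, False]
```

Geometry is still integrated on every frame; only the semantic query is gated, which is the decoupling the paragraph describes.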

3. Semantic Embedding, Fusion, and Query Mechanisms

Semantic Feature Extraction and Fusion

Semantic information can be injected at various levels of the map, from individual voxels or Gaussians up to segments and whole objects.

For object- or segment-level tracking, multi-view fusion is implemented by either weighted averaging, selecting the most representative view, or aggregating via per-dimension attention (transformer-MLP merging (Martins et al., 2024)). Fusion mechanisms are designed to support robustness against noise and cross-view inconsistency, with techniques like adaptive voxel splitting and multi-view voting improving segmentation stability (Tie et al., 2024, Yoo et al., 9 Dec 2025).
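
As a minimal sketch of the weighted-averaging option, the snippet below keeps a running weighted sum of unit-normalized per-view embeddings for one object. Weighting by visible pixel count is an assumption chosen for illustration, and `FusedFeature` is a hypothetical name, not an API from the cited systems.

```python
import numpy as np

class FusedFeature:
    """Running weighted average of per-view embeddings for a single object."""

    def __init__(self, dim):
        self.sum = np.zeros(dim)
        self.weight = 0.0

    def add_view(self, feature, pixel_count):
        f = np.asarray(feature, dtype=float)
        self.sum += pixel_count * f / np.linalg.norm(f)   # normalize each view first
        self.weight += pixel_count

    def value(self):
        if self.weight == 0:
            raise ValueError("no views fused yet")
        fused = self.sum / self.weight
        return fused / np.linalg.norm(fused)              # unit norm for cosine queries

obj = FusedFeature(dim=2)
obj.add_view([1.0, 0.0], pixel_count=3)   # large, clear view
obj.add_view([0.0, 2.0], pixel_count=1)   # small, partial view
fused = obj.value()
print(fused)  # points mostly along the first view's direction
```

Because only a sum and a scalar weight are stored, the update is incremental and does not require keeping per-view features in memory.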

Open-vocabulary Query Resolution

Queries are supported as follows:

The queryability and compositionality of these semantic maps enable downstream robot behaviors including search, navigation, manipulation, and dialogue-based interaction.
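
A common way to resolve such a query, sketched here under stated assumptions, is to embed the text prompt with the same vision-language model used for the map and rank map entities by cosine similarity. The embeddings below are toy 3-D vectors standing in for real text and object features, and `rank_objects` is a hypothetical helper.

```python
import numpy as np

def rank_objects(query_embedding, object_embeddings, top_k=3):
    """Return (object index, cosine similarity) pairs, best match first."""
    q = query_embedding / np.linalg.norm(query_embedding)
    feats = object_embeddings / np.linalg.norm(object_embeddings, axis=1, keepdims=True)
    scores = feats @ q                          # cosine similarity per object
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy stand-ins: one fused feature per mapped object, plus an embedded text query.
object_embeddings = np.array([[1.0, 0.0, 0.0],
                              [0.0, 1.0, 0.0],
                              [0.7, 0.7, 0.0]])
query = np.array([0.9, 0.1, 0.0])
ranking = rank_objects(query, object_embeddings, top_k=2)
print([idx for idx, _ in ranking])  # [0, 2]
```

The returned indices can then be mapped back to 3D object centroids or regions to serve as navigation or manipulation goals.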

4. Data Association, Tracking, and Real-Time Performance

Correctly associating 2D segmentations with persistent 3D instances is essential for temporal semantic consistency and robust tracking. OVI-MAP (Deng et al., 27 Mar 2026) employs voting and label buffers for super-point instances in voxel maps, with spatial overlap and geometric refinement for merging. OpenVox (Deng et al., 23 Feb 2025) uses a Dirichlet-categorical probabilistic representation for voxel-to-instance association, fusing geometric support and caption feature similarity.
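
The idea of a Dirichlet-categorical voxel-to-instance belief can be sketched as follows. This is not OpenVox's actual formulation: the evidence term (a product of a geometric-support score and a semantic-similarity score) and the soft-count update, which approximates the exact conjugate update for a hard categorical observation, are assumptions made for illustration.

```python
import numpy as np

class VoxelInstanceBelief:
    """Dirichlet concentration parameters over candidate instances for one voxel."""

    def __init__(self, num_instances, prior=1.0):
        self.alpha = np.full(num_instances, prior, dtype=float)

    def update(self, geometric_support, semantic_similarity):
        # Hypothetical evidence: elementwise product of the two cues, normalized
        # so that each observation adds exactly one pseudo-count in total.
        evidence = np.asarray(geometric_support) * np.asarray(semantic_similarity)
        self.alpha += evidence / evidence.sum()

    def posterior_mean(self):
        return self.alpha / self.alpha.sum()

    def map_instance(self):
        return int(np.argmax(self.alpha))

belief = VoxelInstanceBelief(num_instances=3)
for _ in range(3):                               # three consistent observations
    belief.update(geometric_support=[0.8, 0.2, 0.0],
                  semantic_similarity=[0.9, 0.5, 0.1])
print(belief.map_instance(), belief.posterior_mean())
```

Keeping counts rather than a hard label lets a voxel change its assignment later if contradicting evidence accumulates.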

Supporting real-time operation is achieved via:

  • Parallel processing pipelines for segmentation, mapping, and language inference (Deng et al., 27 Mar 2026);
  • Sparse updates and focus on object- or region-centric representations, rather than dense per-pixel features;
  • Scene-adaptive feature compression, such as LEGO-SLAM’s 16-D language-embedded Gaussians, which reduce per-element memory and bandwidth by 32× relative to full CLIP fields (Lee et al., 20 Nov 2025);
  • Informative-view selection heuristics for semantic feature extraction, which can substantially reduce the number of VLM calls (e.g., a 53% decrease in VLM queries in OVI-MAP (Deng et al., 27 Mar 2026)).
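
The 32× figure in the list above is consistent with compressing a 512-D feature to 16-D. The arithmetic below assumes 512-D CLIP features, 32-bit floats, and a hypothetical map of one million Gaussians; none of these specifics is stated in the source.

```python
def feature_memory_mb(num_elements, dim, bytes_per_value=4):
    """Memory for per-element feature vectors, in megabytes (10^6 bytes)."""
    return num_elements * dim * bytes_per_value / 1e6

FULL_DIM = 512        # assumed full CLIP feature dimension
COMPACT_DIM = 16      # compact dimension reported for LEGO-SLAM
NUM_GAUSSIANS = 1_000_000   # hypothetical map size

ratio = FULL_DIM // COMPACT_DIM
full_mb = feature_memory_mb(NUM_GAUSSIANS, FULL_DIM)
compact_mb = feature_memory_mb(NUM_GAUSSIANS, COMPACT_DIM)
print(ratio, full_mb, compact_mb)  # 32 2048.0 64.0
```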

Real-time SLAM throughput (typically 15–30 FPS) is demonstrated on mainstream GPUs and even in resource-constrained settings such as micro-drones (Laina et al., 11 Apr 2025), with system latency dominated by segmentation and vision-language model inference.

5. Quantitative Evaluation and Benchmarks

Open-vocabulary semantic SLAM performance is conventionally evaluated on public benchmarks such as Replica and ScanNet. Metrics include mean Intersection over Union (mIoU), mean accuracy (mAcc), and average precision (AP) at multiple IoU thresholds. Recent systems report:
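
For reference, mIoU and mAcc over labeled points can be computed as in the sketch below. It averages over classes present in the ground truth, which is one common convention; individual benchmarks and papers may differ in which classes they include, so this is not the exact protocol of any cited work.

```python
import numpy as np

def miou_macc(pred, gt, num_classes):
    """Mean IoU and mean per-class accuracy over classes present in the ground truth."""
    ious, accs = [], []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        if not g.any():
            continue                              # class absent from ground truth
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union)                # intersection over union
        accs.append(inter / g.sum())              # recall for this class
    return float(np.mean(ious)), float(np.mean(accs))

gt = np.array([0, 0, 0, 1, 1, 2])      # toy ground-truth labels per 3D point
pred = np.array([0, 0, 1, 1, 1, 1])    # toy predicted labels
miou, macc = miou_macc(pred, gt, num_classes=3)
print(round(miou, 4), round(macc, 4))  # 0.3889 0.5556
```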

System       | mIoU (Replica) | mAcc (Replica) | Real-time       | Comments
-------------|----------------|----------------|-----------------|--------------------------------
OVI-MAP      | 27%            | ---            | Yes (30 FPS)    | Outperforms OVO-SLAM and Mask3D
OVO-SLAM     | 25%            | ---            | Yes (~14 FPS)   | Online pipeline, loop closure
OpenVox      | 27.3%          | 43.4%          | Yes (20–30 FPS) | State-of-the-art open-vocab AP
FindAnything | ---            | 48.8%          | Yes             | Deformable submaps, MAV support
LEGO-SLAM    | ---            | ---            | Yes (15 FPS)    | 3DGS, compact features
KM-ViPE      | 3.8%           | 10.9%          | Yes (8 FPS)     | Fully monocular, no depth

Ablation studies indicate that advanced multi-view fusion and language-aware compression are crucial; for example, feature fusion weighted by pixel count outperforms clustering heuristics (Deng et al., 27 Mar 2026, Table 6).

In real robotic deployments, open-vocabulary systems enable robust search, navigation, and manipulation, with high room and object retrieval rates, strong success weighted by path length (SPL), and tolerance to scene dynamics and unmapped objects (Qiu et al., 2024, Xie et al., 17 Jul 2025). Real-time mobile manipulation experiments achieve 80.95% navigation and 73.33% task success (Qiu et al., 2024).
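
Success weighted by path length (SPL) is conventionally defined as the mean over episodes of S_i · l_i / max(p_i, l_i), where S_i is the binary success indicator, l_i the shortest-path length, and p_i the length of the path actually taken. A minimal sketch with made-up episode data:

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: iterable of (success, shortest_path_length, taken_path_length).
    """
    episodes = list(episodes)
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)   # 1.0 only for optimal paths
    return total / len(episodes)

# Hypothetical episodes: optimal success, detour success, failure.
episodes = [(True, 5.0, 5.0), (True, 4.0, 8.0), (False, 3.0, 10.0)]
score = spl(episodes)
print(score)  # 0.5
```

A failed episode contributes zero regardless of path length, so SPL is always bounded above by the plain success rate.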

6. Limitations, Open Challenges, and Directions

Despite substantial advances, several limitations persist.

Current research extends open-vocabulary SLAM toward richer prompt engineering (scene graphs, object relations), robust handling of dynamics via super-point temporal coherence, uncertainty modeling for informed exploration, and integration with LLMs for semantic region abstraction, reasoning, and interaction (Qiu et al., 2024, Xie et al., 17 Jul 2025).

7. Impact and Applications

Open-vocabulary semantic SLAM is central to state-of-the-art embodied AI, enabling:

  • Zero-shot navigation, language-driven exploration, and object relocation by combining semantic maps with LLM-based reasoning (Qiu et al., 2024, Xie et al., 17 Jul 2025);
  • Fine-grained, interactive scene understanding and manipulation (finding, grasping, and fetching) based on free-form user instructions;
  • Real-time scene labeling, region abstraction for AR/VR overlays, and context-aware robot-human collaboration;
  • Robotics applications in diverse, dynamic, and open-world environments, including indoor/outdoor, ego-centric, and mobile manipulation scenarios (Yoo et al., 9 Dec 2025, Pan et al., 18 Jun 2025, Nasser et al., 1 Dec 2025).

These systems currently outperform closed-set or class-restricted semantic SLAM on both standard benchmarks and real-world deployments, providing scalable, robust, and versatile 3D scene representations that can serve as a foundation for the next generation of autonomous perception and reasoning systems (Deng et al., 27 Mar 2026, Lee et al., 20 Nov 2025, Deng et al., 23 Feb 2025).
