Semantic Mapping in Indoor Embodied AI -- A Survey on Advances, Challenges, and Future Directions

Published 10 Jan 2025 in cs.RO and cs.CV | (2501.05750v3)

Abstract: Intelligent embodied agents (e.g., robots) need to perform complex semantic tasks in unfamiliar environments. Among the many skills that these agents need to possess, building and maintaining a semantic map of the environment is the most crucial for long-horizon tasks. A semantic map captures information about the environment in a structured way, allowing the agent to reference it for advanced reasoning throughout the task. While existing surveys in embodied AI focus on general advancements or specific tasks like navigation and manipulation, this paper provides a comprehensive review of semantic map-building approaches in embodied AI, specifically for indoor navigation. We categorize these approaches based on their structural representation (spatial grids, topological graphs, dense point-clouds or hybrid maps) and the type of information they encode (implicit features or explicit environmental data). We also explore the strengths and limitations of the map-building techniques, highlight current challenges, and propose future research directions. We identify that the field is moving towards open-vocabulary, queryable, task-agnostic map representations, while high memory demands and computational inefficiency remain open challenges. This survey aims to guide current and future researchers in advancing semantic mapping techniques for embodied AI systems.


Summary

The paper systematically categorizes map-building methods by evaluating spatial structures and semantic encoding techniques.
It demonstrates how integrating vision-language models enables zero-shot generalization and flexible, open-vocabulary map representations.
The survey identifies critical challenges such as scalability, real-time processing, and standardized evaluation, guiding future research.

Semantic Mapping in Indoor Embodied AI: Advances, Challenges, and Future Directions

Overview and Motivation

Semantic mapping is a foundational capability that enables embodied agents, including robots and virtual agents, to operate meaningfully in complex indoor environments. The surveyed paper systematically reviews semantic map-building methods for indoor embodied AI, centering on their structural representations (spatial grids, topological graphs, dense geometric maps, and hybrid maps) and their semantic encoding (explicit or implicit information). Prior work mostly ties map-building to particular downstream applications; this survey instead focuses on the underlying properties of the map representations themselves. It further analyzes methodological connections to classical robotics, particularly SLAM, and underlines pressing challenges in scalability, efficiency, and evaluation.

Semantic maps encode not only spatial geometry but also high-level semantic content such as object categories and affordances. This is crucial for agents tasked with long-horizon reasoning, planning, navigation, and manipulation in open-world or dynamic indoor environments. Building and maintaining such maps demands sophisticated solutions at the intersection of vision, language, memory, and multimodal sensor fusion.

Figure 1: The survey categorizes semantic map-building methods in embodied agents based on their map structure (e.g., spatial grid, topological, dense geometric, or hybrid) and the nature of semantic encoding (explicit or implicit information).

Taxonomy of Semantic Map Representations

Map Structure: Spatial, Topological, Dense Geometric, Hybrid

The survey lays out a principled taxonomy based on map structure:

Spatial grid maps are typically 2D or 3D discrete grids aligned with the physical environment, maintaining high-resolution spatial fidelity. They are widely used for navigation and spatial reasoning tasks, although they pose challenges in scalability and memory utilization.
Topological maps abstract the environment as a graph, with nodes representing landmarks or important regions and edges denoting spatial or semantic relationships. These are memory-efficient and scale well with environment size but often lack detailed global context.
Dense geometric maps store semantic and geometric information at each 3D point (as in point clouds or Gaussian splats), offering the most spatial detail and supporting per-point semantic reasoning; they enable open-vocabulary, multi-task functionality but face high memory and computational costs.
Hybrid maps attempt to combine the strengths of multiple structures, e.g., using a topometric approach with both grid and landmark-based representations. Such combinations can support spatially precise planning while enabling abstract, semantic reasoning.
Figure 3: Grid map building: spatial grid maps are constructed from segmented sensory input, projected and registered to a global allocentric frame, and updated over time.
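The grid-building steps described in the caption above (segment, project, register to a global frame, update) can be sketched as follows. This is a minimal illustration rather than any surveyed method: the function name `update_semantic_grid`, the 4x4 agent-to-world pose matrix, the choice of the world x-y plane as the floor, and the per-class count update are all assumptions made for the example.

```python
import numpy as np

def update_semantic_grid(grid, points_agent, labels, pose, cell_size, origin):
    """Accumulate labelled 3D points into a 2D allocentric class-count grid.

    grid:         (H, W, C) integer array of per-cell class counts
    points_agent: (N, 3) points in the agent frame (e.g. from segmented depth)
    labels:       (N,) integer class index for each point
    pose:         (4, 4) homogeneous agent-to-world transform (assumed known)
    cell_size:    side length of one grid cell in metres
    origin:       world (x, y) coordinate of the grid corner at index (0, 0)
    """
    n = points_agent.shape[0]
    homogeneous = np.hstack([points_agent, np.ones((n, 1))])
    world = (pose @ homogeneous.T).T[:, :3]  # register to the global frame

    # Discretise world x -> column and world y -> row (assumed floor plane).
    cols = np.floor((world[:, 0] - origin[0]) / cell_size).astype(int)
    rows = np.floor((world[:, 1] - origin[1]) / cell_size).astype(int)

    # Drop points that fall outside the map bounds.
    h, w, _ = grid.shape
    inside = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)

    # np.add.at handles repeated indices, so several points can hit one cell.
    np.add.at(grid, (rows[inside], cols[inside], labels[inside]), 1)
    return grid

# Toy usage: a 4 x 4 grid with 3 classes and 1 m cells.
grid = np.zeros((4, 4, 3), dtype=np.int64)
pose = np.eye(4)
pose[0, 3] = 1.0  # agent is translated 1 m along world x
points = np.array([[0.2, 0.3, 0.0],
                   [0.2, 0.3, 0.5],
                   [2.6, 1.1, 0.0],
                   [10.0, 0.0, 0.0]])  # last point lies outside the map
labels = np.array([1, 1, 2, 0])
update_semantic_grid(grid, points, labels, pose, cell_size=1.0, origin=(0.0, 0.0))

# Most frequently observed class per cell; unobserved cells also yield 0 here,
# so a real system would track observation counts separately.
semantic_map = grid.argmax(axis=-1)
```

Calling the function once per frame with the current pose gives the "updated over time" behaviour; the memory cost grows with the number of cells times the number of channels, which is the scalability concern noted for grid maps.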

Figure 4: Timeline showing increased diversity in map structures and a recent shift toward open-vocabulary semantic maps leveraging large vision-language models.

Semantic Encoding: Explicit vs. Implicit

Encodings are categorized as:

Explicit encoding stores predefined quantities per cell or node—such as occupancy, object class labels, visited status, or audio levels. These are interpretable and support fine-tuning to specific tasks, but are limited to a preset vocabulary.
Implicit encoding assigns latent features learned by neural models, including vision or vision-language foundation model embeddings (e.g., CLIP, BLIP-2). Implicit features can be either closed- or open-vocabulary and support generalized and task-agnostic querying, but they may be less interpretable and harder to assess directly.
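To make the distinction concrete, the sketch below contrasts the two kinds of stored values. The class names `ExplicitCell` and `ImplicitCell`, the three-word vocabulary, and the running-mean fusion rule are illustrative assumptions, not taken from the survey; real systems would store embeddings produced by a model such as CLIP rather than the toy vectors used here.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical preset vocabulary for the explicit case.
VOCAB = ["chair", "table", "sofa"]

@dataclass
class ExplicitCell:
    """Stores predefined, directly interpretable quantities."""
    occupied: bool = False
    visited: bool = False
    class_id: int = -1  # index into VOCAB; -1 means unknown

@dataclass
class ImplicitCell:
    """Stores a latent feature vector fused from model embeddings."""
    feature: np.ndarray  # running mean of the embeddings observed so far
    count: int = 0

    def fuse(self, embedding: np.ndarray) -> None:
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        self.count += 1
        self.feature += (embedding - self.feature) / self.count

# Explicit: readable at a glance, but limited to the preset vocabulary.
explicit = ExplicitCell(occupied=True, visited=True, class_id=VOCAB.index("table"))

# Implicit: the stored vector has no direct meaning until it is compared
# against a query embedding from the same model.
implicit = ImplicitCell(feature=np.zeros(4))
implicit.fuse(np.array([1.0, 0.0, 0.0, 0.0]))
implicit.fuse(np.array([0.0, 1.0, 0.0, 0.0]))
```

The explicit cell can be inspected and debugged directly, while the implicit cell defers interpretation to query time, which is what makes open-vocabulary retrieval possible.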

The paper documents a decisive recent trend: moving from closed-vocabulary, task-specific encodings to open-vocabulary, queryable, and general-purpose map representations via integration of foundation models.

Figure 6: Map encoding describes whether the map’s stored values are explicit (interpretable, pre-selected quantities) or implicit (latently learned feature representations).

Relation to SLAM and Embodied System Design

The survey clarifies the interplay between traditional SLAM and semantic mapping for embodied AI. While SLAM in robotics emphasizes precise geometric mapping and localization, Semantic SLAM and recent embodied AI approaches increasingly integrate semantics for higher-level reasoning, often relaxing strict localization for efficiency and abstraction.

Additionally, the architectural design of embodied AI systems influences map-building:

End-to-end approaches learn direct sensory-to-control mappings, sometimes with differentiable memory modules but generally lacking structured spatial/semantic memory.
Modular pipelines decouple visual encoding, mapping, exploration, and planning, enabling reusability and interpretability at the cost of greater system complexity.

Figure 7: SLAM’s core tasks (mapping, localization, planning) are tightly interwoven, with modular or end-to-end architectures possible for embodied AI systems.

Figure 8: End-to-end (top) versus modular (bottom) embodied system architectures.

Open-Vocabulary and Multimodal Semantic Mapping

A major advance highlighted by the survey is the adoption of open-vocabulary, queryable semantic maps via large foundation models. These representations, built using models like CLIP, LSeg, OWL-ViT, and BLIP-2, permit robots to represent and retrieve semantic information beyond fixed object category sets, supporting flexible querying and generalization to novel instructions or visual targets.

Recent methods structure open-vocabulary maps as spatial grids, topological graphs, or dense 3D/point cloud representations. Aggregation and querying mechanisms leverage feature similarity scores, often integrating high-level task planners powered by LLMs, allowing robust cross-modal reasoning.
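Querying by feature similarity can be sketched as a cosine-similarity lookup over the map. The function name `query_map` and the three-dimensional toy vectors are assumptions for illustration; in practice the per-cell features and the query embedding would come from the image and text encoders of the same vision-language model, which is not reproduced here.

```python
import numpy as np

def query_map(cell_features, text_embedding, top_k=1):
    """Rank grid cells by cosine similarity to a query embedding.

    cell_features:  (H, W, D) per-cell feature vectors
    text_embedding: (D,) embedding of the language query
    Returns the top_k (row, col) cells and the full (H, W) score map.
    """
    eps = 1e-8  # avoids division by zero for unobserved (all-zero) cells
    feats = cell_features / (np.linalg.norm(cell_features, axis=-1, keepdims=True) + eps)
    query = text_embedding / (np.linalg.norm(text_embedding) + eps)
    scores = feats @ query  # (H, W) cosine similarities

    best = np.argsort(scores, axis=None)[::-1][:top_k]
    cells = [tuple(int(i) for i in np.unravel_index(f, scores.shape)) for f in best]
    return cells, scores

# Toy 2 x 2 map with 3-dimensional stand-in features.
features = np.zeros((2, 2, 3))
features[0, 0] = [1.0, 0.0, 0.0]
features[0, 1] = [0.0, 1.0, 0.0]
features[1, 0] = [0.7, 0.7, 0.0]
# features[1, 1] stays all-zero, i.e. unobserved.

query = np.array([0.0, 2.0, 0.0])  # stand-in for an encoded text query
cells, scores = query_map(features, query, top_k=2)
```

Because the vocabulary lives in the encoder rather than in the map, the same stored features can answer new queries without rebuilding the map; the returned cells could then be handed to a planner as candidate goals.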

Figure 2: Open-vocabulary map building enables agents to build a reusable map for subsequent downstream tasks during inference, as demonstrated by VLMap and NLMap.

Key Numerical and Methodological Results

The survey presents several critical insights, including:

Open-vocabulary mapping approaches, when paired with off-the-shelf vision-language models, can yield zero-shot generalization to object categories and language queries unseen during training, with competitive performance on navigation and manipulation tasks [gadre2023cows, huang2023visual, conceptgraphs].
Dense geometric maps (via point clouds or Gaussians) enable fine-grained 3D reasoning but struggle to meet the requirements for real-time, memory-efficient operation in large or highly dynamic spaces [conceptfusion, qiu2024open].
Hybrid and hierarchical scene representations improve scalability and abstraction, but efficient mechanisms for dynamic scene updating and cross-structure querying remain unresolved [Rosinol20rss-dynamicSceneGraphs, hughes2024foundations].

Evaluation: Intrinsic and Extrinsic Metrics

Evaluation of semantic maps is widely recognized as a challenge:

Extrinsic evaluation (task utility) dominates, with metrics such as Success Rate, SPL, and nDTW for navigation, or task success rate in manipulation.
Intrinsic evaluation (direct map assessment) is less mature. The survey stresses the need for standardized metrics across accuracy, completeness, consistency, and robustness—especially for open-vocabulary, multimodal, and dynamic scene representations.

This gap is explicitly flagged as a major roadblock for method comparison and progress in the field.
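For reference, two of the extrinsic navigation metrics named above are simple to compute. Success weighted by Path Length (SPL) is commonly defined as

SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i)

where S_i is the binary success flag of episode i, l_i is the shortest-path length from start to goal, and p_i is the length of the path the agent actually took. The sketch below follows that definition; the function names and the episode values are made up for illustration, and shortest-path lengths are assumed to be positive.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success."""
    return sum(1 for s in successes if s) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    Failed episodes contribute 0; successful ones contribute the ratio of
    the shortest-path length to the (never shorter) length actually travelled.
    """
    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if success:
            total += shortest / max(taken, shortest)
    return total / len(successes)

# Three hypothetical episodes: optimal success, detour success, failure.
successes = [True, True, False]
shortest_lengths = [10.0, 10.0, 10.0]
path_lengths = [10.0, 20.0, 5.0]

sr = success_rate(successes)
score = spl(successes, shortest_lengths, path_lengths)
```

Both numbers measure task outcome only: two agents with very different map quality can reach the same SPL, which is why such metrics cannot stand in for intrinsic map evaluation.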

Current Challenges

The primary technical challenges identified include:

Scalability and Efficiency: Scaling to larger, denser, or persistently dynamic environments is limited by memory and compute—especially for dense or open-vocabulary maps.
Real-Time Processing: Integrating fast semantic perception with spatial updating is currently beyond the capabilities of most map-building methods in robotics-scale environments.
Noise, Uncertainty, and Robustness: Most embodied AI work presumes idealized sensing; robust uncertainty modeling (in simulation and reality) remains underexplored.
Lifelong and Dynamic Adaptation: Current maps assume static or quasi-static worlds; efficient methods for updating semantic and geometric information over long operation are missing.
Multimodal Fusion: Aligning information from vision, language, audio, and other senses for robust, queryable, and actionable memory remains unresolved.
Standardized Intrinsic Evaluation: There is no accepted, broad-coverage benchmark suite for intrinsic map assessment, especially for dynamic, open-vocabulary embodied scenarios.
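A back-of-the-envelope calculation illustrates why memory is a concern for dense, feature-based maps. All values below (a 50 m x 50 m floor, 5 cm cells, 512-dimensional float32 features) are hypothetical and chosen only to show the order of magnitude; they are not figures from the survey.

```python
def grid_map_bytes(side_m, cell_m, values_per_cell, bytes_per_value):
    """Memory of a square 2D grid map with a fixed payload per cell."""
    cells_per_side = round(side_m / cell_m)
    return cells_per_side ** 2 * values_per_cell * bytes_per_value

# Explicit map: one uint8 class label per cell.
explicit_bytes = grid_map_bytes(50.0, 0.05, values_per_cell=1, bytes_per_value=1)

# Implicit map: one 512-dimensional float32 embedding per cell.
implicit_bytes = grid_map_bytes(50.0, 0.05, values_per_cell=512, bytes_per_value=4)

print(f"explicit: {explicit_bytes / 1e6:.1f} MB")
print(f"implicit: {implicit_bytes / 1e9:.3f} GB")
```

Under these assumptions the implicit 2D map is 2048 times larger than the explicit one, and a 3D voxel or per-point representation multiplies the cost again, which is consistent with the survey's point that dense open-vocabulary maps are the hardest to scale.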

Implications and Discussion

The reviewed trends and challenges indicate pivotal shifts for embodied AI and robotics research:

Generalization and Flexibility: Agents increasingly leverage reusable, open-vocabulary semantic maps capable of supporting multiple downstream tasks and complex instruction following without retraining.
Task-Agnostic Representations: The trend is toward rich, queryable world models, unifying geometric and semantic memory with cross-modal access.
Integration of Foundation Models: Leveraging vision-language models (and LLMs for planning) is enabling stronger compositional, generalizable, and zero-shot task success; efficient, real-time adaptation of these architectures remains a challenge.
Hybrid Hierarchical Maps: Combining metric/grid and relational/scene-graph information, possibly in a layered hierarchy, may offer the best trade-offs between abstraction, scalability, and dense reasoning—provided computational and update challenges are addressed.
Figure 5: Motivation—Building and maintaining a semantic map is central for agents to reason and plan efficiently during indoor tasks requiring language understanding, perception, and navigation.

Future Directions

The survey suggests research should prioritize:

Flexible, dense, and efficient general-purpose semantic maps capable of real-time operation and broad downstream transfer;
Online, adaptive, and lifelong updating of semantic and geometric representations, with explicit uncertainty modeling;
Standardizing intrinsic evaluation metrics, moving beyond task completion to holistic map quality;
Advanced multi-modal fusion and querying architectures, reconciling real-time constraints with the complexity of large foundation models;
Hybrid and hierarchical structural designs balancing memory, scalability, spatial detail, and semantic abstraction.

Conclusion

This comprehensive survey (2501.05750) provides an authoritative categorization and critical analysis of methods for semantic map building in indoor embodied AI. By organizing research around core choices of map structure and semantic encoding, and by cross-referencing robotics and AI paradigms, it exposes major technical challenges and plots a focused trajectory for future advances. The paper strongly emphasizes the growing importance of general-purpose, open-vocabulary, and queryable semantic maps as foundational to the next generation of multi-modal, robust embodied intelligence.

References in this essay correspond to those in the original paper and should be mapped via (2501.05750).
