SENT-Map: Semantic Indoor Navigation

Updated 12 November 2025

SENT-Map is a semantically enhanced topological mapping framework that integrates JSON-encoded semantic nodes to support unambiguous indoor navigation.
It fuses symbolic scene representations and object affordance annotations with foundation model outputs to drive precise, interpretable planning.
Empirical evaluations show that semantic enhancements raise average task success across six planning FMs from 38.9% to 100% by reducing ambiguity in multi-object environments.

SENT-Map is a semantically enhanced topological mapping framework for indoor environments, designed to fuse symbolic scene representations, explicit object and affordance annotations, and the planning capabilities of foundation models (FMs). Originating in the context of autonomous robotic navigation and manipulation, SENT-Map integrates human-interpretable, JSON-encoded semantic nodes into a topological navigation graph to enable reliable, FM-driven planning and closed-loop task execution. It targets three problems in robot environments: ambiguity, FM hallucinations, and the need for efficient semantic reasoning (Kathirvel et al., 5 Nov 2025).

1. Formal Representation of SENT-Map

A SENT-Map is mathematically defined as a semantically annotated topological graph $G=(V,E)$, where:

$V = \{v_1, \ldots, v_N\}$ is the set of navigable waypoints.
$E \subseteq V\times V$ encodes feasible transitions between waypoints.
$V_{SE}\subseteq V$ are the semantic nodes, each carrying an annotation via a descriptor $S(v)$, with $S: V_{SE}\to \mathcal{S}$, where $\mathcal{S}$ is the space of human- and FM-readable JSON-encoded semantic data.

The overall SENT-Map structure is $M=(V, E, \{S(v)\,|\,v\in V_{SE}\})$. The annotation schema for each semantic node is:

{
  "node_id": "string",
  "pose": { "x": float, "y": float, "θ": float },
  "edges": [ { "to": "string", "cost": float }, ... ],
  "type": "fridge" | "table" | ...,
  "affordances": [ "open", "push", ... ],
  "objects": [
    {
      "object_id": "string",
      "class": "string",
      "state": "string",
      "owner": "string|null"
    }, ...
  ],
  "metadata": { ... }
}

Edges between nodes are represented as:

{
  "edges": [
    { "from": "v_i", "to": "v_j", "action": "move_forward", "cost": 1.0 }, ...
  ]
}

This explicit and extensible schema supports both automated FM parsing and human supervision during map creation and maintenance.
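As a rough illustration of how the structure $M=(V, E, \{S(v)\})$ might be held in memory, the following Python sketch parses a scene file into waypoints, an adjacency map, and semantic nodes. The class names `SemanticNode` and `SentMap`, the function `load_sent_map`, and the top-level layout with `"nodes"` and `"edges"` keys are assumptions made for this example; the source specifies only the per-node and per-edge fields.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SemanticNode:
    """One annotated waypoint v in V_SE (subset of the node schema fields)."""
    node_id: str
    type: str
    affordances: list = field(default_factory=list)
    objects: list = field(default_factory=list)


@dataclass
class SentMap:
    """M = (V, E, {S(v) | v in V_SE}), with E stored as an adjacency dict."""
    waypoints: set = field(default_factory=set)
    adjacency: dict = field(default_factory=dict)  # from -> {to: cost}
    semantic: dict = field(default_factory=dict)   # node_id -> SemanticNode


def load_sent_map(scene_json: str) -> SentMap:
    """Parse a scene JSON string (assumed layout) into a SentMap."""
    scene = json.loads(scene_json)
    m = SentMap()
    for e in scene.get("edges", []):
        m.waypoints.update((e["from"], e["to"]))
        m.adjacency.setdefault(e["from"], {})[e["to"]] = e.get("cost", 1.0)
    for node_id, raw in scene.get("nodes", {}).items():
        m.waypoints.add(node_id)  # semantic nodes are waypoints too
        m.semantic[node_id] = SemanticNode(
            node_id=node_id,
            type=raw.get("type", "unknown"),
            affordances=raw.get("affordances", []),
            objects=raw.get("objects", []),
        )
    return m


# Hypothetical two-waypoint scene with one semantic node.
EXAMPLE_SCENE = json.dumps({
    "nodes": {
        "v_2": {"type": "fridge", "affordances": ["open"],
                "objects": [{"object_id": "milk_1", "class": "milk",
                             "state": "closed", "owner": None}]},
    },
    "edges": [{"from": "v_1", "to": "v_2",
               "action": "move_forward", "cost": 1.0}],
})

sent_map = load_sent_map(EXAMPLE_SCENE)
```

Keeping the semantic annotations in a separate dictionary keyed by node ID reflects the definition $V_{SE}\subseteq V$: every semantic node is a waypoint, but not every waypoint carries an annotation.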

2. Mapping Stage: Methodology and Implementation

The SENT-Map construction operates in two substages:

Topological map creation: Using topological-SLAM or teleoperation, waypoints $V$ and edges $E$ are collected, covering navigable and functionally relevant poses.
Semantic node annotation: At selected locations, the operator flags a site for annotation. An RGB snapshot $I_v$ is captured, then a Vision-FM is prompted with $I_v$ and a schema template $T$ to generate a preliminary JSON node. This annotation is reviewed/edited by the operator before final inclusion.

The mapping function, $M_{FM}$, is defined as $S(v) \leftarrow M_{FM}(I_v, T)$. The implementation workflow, as formalized in pseudocode, is:

procedure Build_SENT_Map()
  V ← {} ; E ← {} ; Scene_JSON ← {}
  while operator guides robot do
    (v_prev, v_new, edge_info) ← record_navigation()
    V.add(v_new)
    E.add(edge(v_prev→v_new, edge_info))
    if operator flags_semantic_site() then
      I ← capture_RGB_snapshot()
      node_json ← VisionFM.generate_semantic_node(I, JSON_template)
      node_json ← operator_edit(node_json)
      Scene_JSON.nodes[v_new] ← node_json
    end if
  end while
  Scene_JSON.edges ← serialize_edges(E)
  return Scene_JSON
end procedure

This approach tightly couples operator supervision with FM-powered perception, ensuring semantic completeness and error correction in environments where raw perception is insufficient or ambiguous.
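The mapping loop can be sketched in Python under some simplifying assumptions. Here the robot's navigation log is replaced by a list of hypothetical step records, and the Vision-FM and the operator review are passed in as plain callables (`vision_fm`, `operator_edit`); none of these names or record fields come from the source, and a real system would call an actual model and UI instead of the stubs shown.

```python
from typing import Callable, Iterable


def build_sent_map(
    steps: Iterable[dict],
    vision_fm: Callable[[bytes, dict], dict],
    operator_edit: Callable[[dict], dict],
    template: dict,
) -> dict:
    """Build a scene dict from recorded navigation steps.

    Each step is a hypothetical record:
      {"prev": str | None, "new": str, "cost": float, "snapshot": bytes | None}
    A non-None snapshot means the operator flagged the site for annotation.
    """
    waypoints, edges, nodes = set(), [], {}
    for step in steps:
        v_prev, v_new = step["prev"], step["new"]
        waypoints.add(v_new)
        if v_prev is not None:  # the first waypoint has no incoming edge
            waypoints.add(v_prev)
            edges.append({"from": v_prev, "to": v_new, "cost": step["cost"]})
        if step.get("snapshot") is not None:
            draft = vision_fm(step["snapshot"], template)  # S(v) <- M_FM(I_v, T)
            nodes[v_new] = operator_edit(draft)            # human review step
    return {"waypoints": sorted(waypoints), "nodes": nodes, "edges": edges}


def fake_vision_fm(image: bytes, template: dict) -> dict:
    """Stub standing in for the Vision-FM; returns a fixed annotation."""
    node = dict(template)
    node.update(type="fridge", affordances=["open"])
    return node


def mark_reviewed(node: dict) -> dict:
    """Stub operator edit: records that the draft was reviewed."""
    return {**node, "metadata": {"reviewed": True}}


steps = [
    {"prev": None, "new": "v_1", "cost": 0.0, "snapshot": None},
    {"prev": "v_1", "new": "v_2", "cost": 1.0, "snapshot": b"rgb-bytes"},
]
template = {"type": None, "affordances": [], "objects": [], "metadata": {}}
scene = build_sent_map(steps, fake_vision_fm, mark_reviewed, template)
```

Passing the model and the review step as functions keeps the loop testable without a robot or a model endpoint, at the cost of hiding any latency or failure handling a deployed pipeline would need.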

3. FM-Grounded Planning and Task Execution

Planning in the SENT-Map paradigm is achieved through prompting a Planning FM with both the structured map and the task intent:

Inputs: Full Scene_JSON, Skill_API (defining feasible robot actions), robot constraints, and a user-formulated natural language query (e.g., "Get Bob’s coffee").
Processing: The FM is required to emit a sequence of high-level skills $\mathbf{p}=[s_1, \ldots, s_K]$ conforming to Skill_API, grounded by the topological graph and active semantic context:
- If $s_i=\text{move\_to}(v_j)$, $v_j$ must be on a valid path in $G$.
- If $s_i=\text{pick}(o)$, $o$ must be among the objects attached to the current semantic node.

Plan validity constraints are strictly enforced; optional plan cost minimization (summed over skill step costs) is supported within feasible plans.
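The two validity rules above lend themselves to a simple check. The sketch below is one possible reading, not the authors' validator: it assumes each plan step is a dict with a `"skill"` name and an `"args"` list, treats "on a valid path" as directed reachability from the robot's current node, and uses the hypothetical names `reachable` and `validate_plan`.

```python
from collections import deque


def reachable(adjacency: dict, start: str, goal: str) -> bool:
    """Breadth-first search over the directed edges of G."""
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        if v == goal:
            return True
        for nxt in adjacency.get(v, {}):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def validate_plan(plan: list, adjacency: dict,
                  node_objects: dict, start: str) -> list:
    """Return a list of violations; an empty list means the plan passed."""
    errors, current = [], start
    for i, step in enumerate(plan):
        skill, args = step["skill"], step["args"]
        if skill == "move_to":
            target = args[0]
            if reachable(adjacency, current, target):
                current = target  # robot is now at the target node
            else:
                errors.append(f"step {i}: no path {current} -> {target}")
        elif skill == "pick":
            if args[0] not in node_objects.get(current, set()):
                errors.append(f"step {i}: {args[0]} not at {current}")
    return errors


# Hypothetical two-node map with one object at v_2.
adjacency = {"v_1": {"v_2": 1.0}, "v_2": {"v_1": 1.0}}
node_objects = {"v_2": {"coffee_bob"}}

good_plan = [{"skill": "move_to", "args": ["v_2"]},
             {"skill": "pick", "args": ["coffee_bob"]}]
bad_plan = [{"skill": "pick", "args": ["coffee_bob"]}]

good_errors = validate_plan(good_plan, adjacency, node_objects, "v_1")
bad_errors = validate_plan(bad_plan, adjacency, node_objects, "v_1")
```

Returning a list of violations rather than raising on the first one makes it easier to feed all problems back to the planner or operator in a single pass.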

The planning prompt follows this structure:

You are a robot planner.
Scene: <Scene_JSON>
Robot skills: 
  - move_to(node_id)
  - open(object_id)
  - pick(object_id)
  - place(object_id, node_id)
Constraints:
  - Only perform skill if allowed by the Scene_JSON.
Task: <User_query>
Output: JSON plan { "steps": [ { "skill":..., "args":... }, … ] }

The planning procedure, as pseudocode, is:

procedure FM_Plan(scene, skills, query)
  prompt ← assemble_prompt(scene, skills, query)
  plan_json ← PlanningFM.complete(prompt)
  plan ← parse_and_validate(plan_json, scene, skills)
  return plan
end procedure

The resulting plan is exported as parseable JSON, which keeps it compatible with traditional robot control stacks.
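A minimal Python sketch of the prompt-and-parse procedure follows. The planning FM is represented by a caller-supplied function `complete` that maps a prompt string to a reply string, which is an assumption for this example; the names `assemble_prompt` and `fm_plan` mirror the pseudocode but the bodies are illustrative, and only a skill-name check is performed here (graph-level validity checks are omitted).

```python
import json
from typing import Callable

PROMPT_TEMPLATE = """You are a robot planner.
Scene: {scene}
Robot skills:
{skills}
Constraints:
  - Only perform skill if allowed by the Scene_JSON.
Task: {query}
Output: JSON plan {{ "steps": [ {{ "skill":..., "args":... }}, ... ] }}"""


def assemble_prompt(scene: dict, skills: list, query: str) -> str:
    """Fill the planning prompt with the scene, skill list, and user query."""
    skill_lines = "\n".join(f"  - {s}" for s in skills)
    return PROMPT_TEMPLATE.format(
        scene=json.dumps(scene), skills=skill_lines, query=query)


def fm_plan(scene: dict, skills: list, query: str,
            complete: Callable[[str], str]) -> list:
    """Prompt the planning FM and parse its reply into a list of steps."""
    reply = complete(assemble_prompt(scene, skills, query))
    plan = json.loads(reply)  # raises ValueError if the reply is not JSON
    allowed = {s.split("(")[0] for s in skills}  # "move_to(node_id)" -> "move_to"
    for step in plan["steps"]:
        if step["skill"] not in allowed:
            raise ValueError(f"unknown skill: {step['skill']}")
    return plan["steps"]


def canned_fm(prompt: str) -> str:
    """Stub planning FM that always returns the same one-step plan."""
    return '{"steps": [{"skill": "move_to", "args": ["v_2"]}]}'


SKILLS = ["move_to(node_id)", "open(object_id)",
          "pick(object_id)", "place(object_id, node_id)"]
plan_steps = fm_plan({"nodes": {}, "edges": []}, SKILLS,
                     "Get Bob's coffee", canned_fm)
```

Rejecting unknown skill names at parse time is a cheap first guard against hallucinated actions before any more expensive checks run.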

4. Empirical Validation: Experimental Setup and Results

Evaluations were performed on an indoor office–lounge–kitchen environment subdivided into 3 zones, containing 9 semantic nodes and 23 objects. Mapping utilized Llama-3.2-90B Vision Instruct. Multiple Planning FMs were benchmarked: Gemma 3 27B, Gemini Flash 2.0, Llama 3.1 (8B and 405B), and two GPT variants (4o mini, o3 mini).

Task suite: Object-centric tasks such as "Get-Sponge," "Get-Coffee," and "Get-Tissue" were tested over three trials per query. Two map variants were compared:

Baseline: node labels only (omitting objects/affordances).
Semantic enhancement (SE): full object and ownership context.
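To make the difference between the two variants concrete, the sketch below derives a baseline-style scene from a semantically enhanced one by dropping object and affordance fields. It assumes that "node labels" corresponds to the `type` field of the node schema and that the scene uses a top-level `"nodes"` dictionary; the function name `to_baseline` is hypothetical, and the source does not describe how its baseline maps were actually produced.

```python
import copy


def to_baseline(scene: dict) -> dict:
    """Keep node labels and topology; drop objects and affordances."""
    baseline = copy.deepcopy(scene)  # leave the SE map untouched
    for node in baseline.get("nodes", {}).values():
        node.pop("objects", None)
        node.pop("affordances", None)
    return baseline


# Hypothetical semantically enhanced scene with one node.
se_scene = {
    "nodes": {
        "v_2": {"type": "table", "affordances": ["place"],
                "objects": [{"object_id": "coffee_1", "class": "coffee",
                             "state": "full", "owner": "Bob"}]},
    },
    "edges": [{"from": "v_1", "to": "v_2", "cost": 1.0}],
}
baseline_scene = to_baseline(se_scene)
```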

Key performance results:

With baseline maps, mean FM task success was $38.9\%$.
Under semantic enhancement, all tested FMs achieved $100\%$ success on all retrieval tasks.

Model	Baseline avg. (%)	SE avg. (%)
Gemma 3 27B	66.7	100
Gemini Flash 2.0	0.0	100
Llama 3.1 8B	33.3	100
Llama 3.1 405B	33.3	100
GPT 4o mini	33.3	100
GPT o3 mini	66.7	100
Average	38.9	100
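The reported averages can be reproduced directly from the per-model rows. The short Python check below uses only the figures in the table; the variable names are arbitrary.

```python
# Per-model baseline success rates (%) copied from the table.
baseline = {
    "Gemma 3 27B": 66.7,
    "Gemini Flash 2.0": 0.0,
    "Llama 3.1 8B": 33.3,
    "Llama 3.1 405B": 33.3,
    "GPT 4o mini": 33.3,
    "GPT o3 mini": 66.7,
}

baseline_mean = sum(baseline.values()) / len(baseline)  # 233.3 / 6
se_mean = 100.0                                         # every model scored 100
gain_points = se_mean - baseline_mean

print(round(baseline_mean, 1))  # 38.9
print(round(gain_points, 1))    # 61.1
```

The improvement is therefore about 61.1 percentage points on average, with the largest per-model gain for Gemini Flash 2.0 (0.0 to 100).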

Indirect queries (e.g., ones requiring disambiguation by ownership or affordance) on Gemma 3 27B yielded:

Task type	Baseline	SE-only	SE+ownership
Direct	33.3%	100%	100%
Indirect	0%	100%	100%

Error analysis confirmed that FMs without semantic context were inconsistent and prone to ambiguous action plans, especially in multi-object situations. Inclusion of full semantic tags—especially object ownership—disambiguated intent entirely.
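The role of ownership tags can be illustrated with a small lookup. In the hypothetical scene below, two objects share the class `coffee`, so a query by class alone is ambiguous, while adding the owner narrows it to a single object. The function `find_objects`, the node and object IDs, and the second owner "Alice" are invented for this example; in SENT-Map the disambiguation is performed by the planning FM reading the JSON, not by explicit code like this.

```python
def find_objects(scene: dict, obj_class: str, owner=None) -> list:
    """Return (node_id, object_id) pairs matching class and, if given, owner."""
    matches = []
    for node_id, node in scene["nodes"].items():
        for obj in node.get("objects", []):
            class_ok = obj["class"] == obj_class
            owner_ok = owner is None or obj.get("owner") == owner
            if class_ok and owner_ok:
                matches.append((node_id, obj["object_id"]))
    return matches


# Hypothetical multi-object scene: two coffees with different owners.
office_scene = {
    "nodes": {
        "v_kitchen": {"type": "table", "objects": [
            {"object_id": "coffee_1", "class": "coffee",
             "state": "full", "owner": "Bob"}]},
        "v_lounge": {"type": "table", "objects": [
            {"object_id": "coffee_2", "class": "coffee",
             "state": "full", "owner": "Alice"}]},
    }
}

by_class = find_objects(office_scene, "coffee")             # two candidates
by_owner = find_objects(office_scene, "coffee", owner="Bob")  # one candidate
```

With only class information, a planner has to guess between the two candidates, which matches the inconsistency reported for baseline maps.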

5. Analysis of Strengths, Limitations, and Architectural Impact

Strengths:

Explicit object/location annotations in JSON curb FM hallucinations and, in the reported tasks, removed planning ambiguity.
Planning robustly grounds FM outputs to feasible actions and paths, with all constraints derived from introspectable map state.
Smaller FMs (Llama 3.1 8B, Gemma 3 27B) reached 100% success under the semantic regime, suggesting that resource-intensive models are unnecessary for many applications.
Human-inspectable map representation permits manual correction, error tracing, and pre-execution verification.

Limitations:

Supervision and annotation effort scales linearly with environment and object complexity, introducing potential bottlenecks in large-scale deployments.
Deeply nested or voluminous JSON may present parsing and editing challenges for both small FMs and human users, necessitating tooling support or hierarchical map summarization.

Architectural implications: SENT-Map’s explicit, symbolic-augmented approach contrasts with end-to-end neural scene understanding pipelines. Its integration of operator/foundation model-in-the-loop mapping blends classical SLAM, symbolic AI, and modern FM-driven reasoning.

6. Future Directions and Relation to the Literature

Potential advancements discussed include:

Automated semantic site suggestion using active vision or 3D environment understanding.
Hierarchical or summarized representations to maintain tractability at scale.
Bi-directional operator/FM editing and UI co-design.
Support for dynamic updates (object/agent motion).
End-to-end integration with lower-level navigation stacks, including SLAM, collision avoidance, and motion planning.
Leverage of next-generation multimodal FMs for direct perception–map–affordance coupling.

SENT-Map builds on prior formal semantic mapping proposals (Capobianco et al., 2016), which emphasize ontologies, symbolic relationships, and benchmarking, and it is conceptually contiguous with pipelines such as QueSTMap (Mehan et al., 2024) for language-driven topological query and 3D-SMNet (Cartillier et al., 2024) for object-centric scene representation and matching. What distinguishes SENT-Map is its explicit JSON-structured, FM-readable symbolic grounding, FM-driven annotation and planning loop, and empirical focus on robust, transparent, operator-controllable task specification and execution.

7. Conclusion

SENT-Map provides a reproducible, interpretable, operator-supervisable framework for representing and exploiting indoor environments in autonomous systems. By fusing symbolic, human- and FM-interpretable mapping with the planning capabilities of foundation models, it produced disambiguated and feasible task plans in every semantically enhanced trial of the reported evaluation. Scaling to larger and more diverse environments remains open, given the annotation effort noted above, but the framework establishes a platform for integrating next-generation multimodal perception, language understanding, and autonomous decision making.