SignScene: Spatial–Semantic Navigation
- SignScene is a spatial–semantic grounding system that interprets human-oriented sign instructions and converts them into actionable 3D navigation goals.
- It integrates modular components such as sign detection, semantic parsing, and spatial fusion using advanced vision-language models for precise scene mapping.
- Empirical evaluations demonstrate an 88.6% grounding accuracy in diverse real-world settings, supporting successful deployment on robotic platforms like Boston Dynamics Spot.
SignScene is a spatial–semantic grounding system for mapless robotic navigation, enabling robots to interpret and act upon human-oriented navigational signs in previously unseen environments. The system formalizes sign grounding as the challenge of mapping semantic sign instructions (symbolic and textual content) onto corresponding elements within a 3D scene, thereby generating actionable navigation goals. Leveraging recent advances in vision–language models (VLMs), SignScene introduces a structured sign-centric representation and an integrated pipeline that includes perception, semantic parsing, spatial mapping, VLM reasoning, and action generation. Empirical evaluation demonstrates robust navigation performance in the open world, including real-world deployment on Boston Dynamics Spot (Zimmerman et al., 13 Feb 2026).
1. System Architecture and Workflow
SignScene consists of a modular, sequential pipeline converting multi-modal sensor data into executed navigation actions by detecting, parsing, and grounding semantically rich signs. The system is structured into the following high-level components:
- Perception & Mapping:
- Sign Detection & Segmentation: Open-set detector (GroundingDINO) with the “signs” prompt, followed by segmentation mask refinement (SAM2). 3D point clusters for detected signs are generated using sparse depth and fused over time for centroid and normal estimation.
- Traversable Path Extraction: Extraction of drivable regions via point cloud segmentation (GeNIE).
- Explicit Structure Detection: Oriented 3D bounding box detection of navigational objects (walls, doors, corridors, escalators, stairs) with class labels and confidence scores.
- 3D Spatial Fusion: Temporal fusion across frames yields a local map integrating explicit and implicit structures.
- Sign Understanding:
- In-Context VLM Parsing: Signs passing spatial proximity and alignment thresholds are parsed for textual and symbolic cues through a few-shot VLM prompt using an in-context symbol dictionary, with parsing outputs in a structured JSON schema.
- Temporal Filtering & Cue Merging: Stabilizes extracted cues across multiple observations for robustness.
- Abstract Top-view Map (AToM) Construction:
- Sign-centric Frame Alignment: Local map is centered and rotated to the detected sign's frame.
- Projection: 3D point representations are abstracted as a 2D top-down diagram, annotating explicit objects (e.g., doors, escalators) with labeled bounding boxes and path frontiers with symbolic letters.
- Spatial–Semantic Reasoning (VLM Prompting):
- Fuzzy Location Matching: Implements normalized Levenshtein distance for label similarity between sign content and query.
- Prompt Module: Presents AToM diagram and parsing outputs in a structured prompt to the VLM.
- Discrete Candidate Selection: VLM response selects the nearest candidate object or path by label.
- Action Generation & Execution:
- VLM-selected elements are mapped back to spatial coordinates (SE(2) pose) and passed to a local planner for obstacle avoidance and trajectory execution.
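The sign-centric frame alignment step (centering and rotating the local map to the detected sign's frame before top-down projection) can be sketched as a 2-D rigid transform. This is an illustrative implementation under assumed conventions, not the paper's code: the function name and the choice of aligning the sign's outward normal with +y are assumptions.

```python
import numpy as np

def align_to_sign_frame(points_xy: np.ndarray,
                        sign_centroid: np.ndarray,
                        sign_normal: np.ndarray) -> np.ndarray:
    """Re-express local-map points in a sign-centric frame.

    The map is translated so the sign centroid sits at the origin and
    rotated so the sign's outward normal points along +y, i.e. the
    viewer faces the sign head-on in the resulting top-down diagram.
    """
    # Yaw that would carry the world x-axis onto the sign normal,
    # offset by 90 degrees so the normal maps to +y after rotation.
    yaw = np.arctan2(sign_normal[1], sign_normal[0]) - np.pi / 2
    c, s = np.cos(-yaw), np.sin(-yaw)
    R = np.array([[c, -s],
                  [s,  c]])
    return (points_xy - sign_centroid) @ R.T
```

A point lying one meter in front of the sign along its normal lands at (0, 1) in the sign frame, which is what makes the subsequent AToM projection orientation-consistent regardless of how the sign was approached.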
2. Spatial–Semantic Representation
The foundational representation in SignScene is a 3D map aggregating three classes of elements:
- Navigational Signs: Symbolic cues and full 3D poses.
- Implicit Paths: Fused point clouds depicting traversable space, later reduced to polygonal contours.
- Explicit Structures: Navigation-relevant objects, each encoded as (centroid, size, and yaw orientation), including semantic label and confidence score.
Walls, doors, corridors, escalators, stairs, and similar features are detected with a fixed vocabulary via GroundingDINO. GeNIE segments drivable corridors and sidewalks. Navigational sign cues are structured as sets of label–direction pairs.
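A parsed cue set might look like the following JSON record; the field names here are illustrative assumptions, since the paper specifies only that outputs follow a structured schema of label–direction pairs:

```json
{
  "sign_id": 3,
  "cues": [
    {"label": "Exit C", "direction": "left"},
    {"label": "Terrace", "direction": "up-escalator"},
    {"label": "Platform 2", "direction": "forward"}
  ]
}
```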
Spatial and relational reasoning employs fuzzy text matching between query locations and sign labels, as well as convexity-based heuristics for identifying candidate frontiers in the path polygon. Unlike graph-attentive models, SignScene adopts heuristic relational reasoning, prioritizing sample efficiency and interpretability.
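The fuzzy text matching above reduces to a normalized Levenshtein similarity. A minimal sketch (function name and case-folding choice are assumptions; the paper does not specify the normalization beyond "normalized Levenshtein distance"):

```python
def normalized_levenshtein(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 - edit_distance / max(len(a), len(b))."""
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    # Standard dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```

With such a score, a query like "terrace" still matches a sign label with a minor OCR error (e.g. "terace") at similarity 6/7 ≈ 0.86, while unrelated labels score near zero.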
3. Vision–Language Model Integration
VLMs are central to both sign cue parsing and spatial-semantic reasoning:
- For sign parsing, a few-shot VLM prompt includes an in-context symbol lexicon and mandates outputs matching a JSON schema categorizing directions (“forward,” “left,” “up-escalator,” etc.).
- For spatial reasoning, the system prepares a prompt embedding the AToM diagram with candidate path entrances (letters) and explicit structures (bounding-box labels). The prompt template requests selection of the candidate closest to the specified (sign-indicated) goal, instructing reasoning about spatial layout and object affordance (e.g., “lifts, stairs, and escalators allow you to go up and down”).
VLM outputs are parsed to extract the selected answer (e.g., “[C]” or “[Escalator]”), which is mapped back to the corresponding scene element and used for subgoal pose determination. Among multiple matching signs, the one with the highest fuzzy-match score is preferentially grounded.
4. Evaluation and Benchmarking
SignScene was evaluated on a custom dataset comprising 36 sequences and 114 multiple-choice grounding queries, recorded across nine diverse environment types (including hospitals, malls, transit stations, airports, and university campuses). Input modalities included high-resolution RGB, sparse depth, and visual–inertial odometry aligned to real-world scenes.
Grounding accuracy is computed as the proportion of correctly grounded queries. SignScene achieved 88.6% accuracy, substantially outperforming ReasonNav (26.3%) and an ablation baseline (“No-Rotation,” 64.9%). A full per-environment breakdown confirms superior performance across all settings.
| Method | Accuracy (%) | Description |
|---|---|---|
| SignScene | 88.6 | Full sign-centric pipeline |
| No-Rotation ablation | 64.9 | AToM without sign-frame alignment |
| ReasonNav (CoRL 2025) | 26.3 | Baseline (Chandaka et al. 2025) |
5. Real-World Robotic Deployment
SignScene was deployed on Boston Dynamics Spot equipped with a monocular RGB camera, visual–inertial odometry, an articulated inspection arm, and an NVIDIA Jetson AGX Orin for onboard computation (with VLM queries processed off-board). The navigation control pipeline proceeds as follows:
- Detect and select signs whose parsed cues reference the user’s goal.
- Visually servo Spot to head-on alignment with the relevant sign.
- Explore environmental frontiers to construct the full AToM.
- Ground the navigation target using VLM reasoning and derive the subgoal (SE(2) pose).
- Execute navigation to the subgoal using the onboard local planner with dynamic obstacle avoidance.
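The five deployment steps above form a strictly sequential control loop. The sketch below shows that ordering with stubbed components; every function here is an assumed interface for illustration, not the system's actual API:

```python
calls = []  # records execution order for illustration

def detect_and_select_sign(goal):      # step 1: find sign referencing goal
    calls.append("detect"); return "sign"
def servo_head_on(sign):               # step 2: visual servo to face sign
    calls.append("servo")
def explore_and_build_atom(sign):      # step 3: frontier exploration -> AToM
    calls.append("explore"); return "atom"
def vlm_ground(atom, goal):            # step 4: VLM reasoning -> SE(2) subgoal
    calls.append("ground"); return (1.0, 2.0, 0.0)   # x, y, yaw
def execute(subgoal):                  # step 5: local planner with avoidance
    calls.append("execute")

def navigate_to(goal_label: str):
    """Sequential control loop mirroring the five deployment steps."""
    sign = detect_and_select_sign(goal_label)
    servo_head_on(sign)
    atom = explore_and_build_atom(sign)
    subgoal = vlm_ground(atom, goal_label)
    execute(subgoal)
```

The head-on servoing in step 2 is what makes the sign-frame alignment well conditioned before the AToM is built in step 3.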
Empirical deployment included successful live trials (e.g., navigating from an outdoor starting point to a building’s “TERRACE” via stair grounding). Reported system latencies were approximately 3 s for sign parsing (Gemini-2.5-Pro) and 20 s for sign grounding (GPT-5).
6. Limitations and Extensions
Documented limitations include:
- Potential omission or misclassification of explicit structures due to object detector recall/precision, directly impacting AToM quality.
- VLM struggles with compound instructions (“left-then-forward”).
- Partial observation in crowded scenes risks leaving key paths unmapped; the exploration module partially mitigates this.
- VLM query latency limits system reactivity in highly dynamic settings.
- Occasional mis-grounding from VLM token biases, e.g., defaulting to “Exit C.”
Proposed extensions comprise integrating dynamic scene updates for crowd handling and real-time re-planning, extending semantic parsing for multi-floor and 3D vertical navigation, onboard LLM integration for reduced latency, and end-to-end learned spatial–language attention modules to automate AToM construction without hand-coded abstraction.
7. Significance and Relation to Prior Work
SignScene establishes a sign-centric spatial–semantic representation pipeline for robust mapless navigation using real-world signage, evaluated in diverse, previously unseen environments. By closely coupling visual perception, structured map abstraction, and VLM-powered reasoning, the system outperforms prior baselines by a substantial margin and demonstrates practical viability for real-world robotic deployment. The approach highlights the promise of explicit scene abstraction and prompt engineering in leveraging general-purpose VLM capabilities for grounded AI and embodied intelligence (Zimmerman et al., 13 Feb 2026).