Verbalization of Path (VoP)
- VoP is a method that explicitly encodes navigation steps into natural language, creating cognitive maps from visual observations and landmark cues.
- It constructs multi-part prompts to extract destination, current location, and detailed walking directions, enhancing spatial reasoning in urban settings.
- Quantitative results on city-scale navigation show success rates improving from 15% to 92% for GPT-4.1, demonstrating VoP's effectiveness for robust real-world navigation.
Verbalization of Path (VoP) denotes a family of methodologies for explicit, step-wise natural language encoding of agent navigation—typically in urban, street-view, or embodied visual environments—to operationalize, probe, and ground the internal spatial reasoning of LLMs or multimodal LLMs (MLLMs). VoP pipelines transform otherwise latent model knowledge about world geography, landmark semantics, and trajectory structure into an externalized “cognitive map” comprising named entities (streets, buildings), turn directions, and visually/verbally salient cues, facilitating substantial gains in wayfinding, localization, and policy learning under conditions of weak or sparse external supervision (Dalal et al., 17 Dec 2025, Schumann et al., 2023).
1. Motivations for Explicit Path Verbalization
The core motivation for VoP derives from limitations in conventional decision policies for navigation using LLMs or MLLMs, which typically fail when required to act upon visual inputs alone, without maps, GPS, or predefined symbolic annotations. In real-world “sparsely grounded” navigation tasks—where agents must select routes across city graphs using only street-level imagery at each intersection—off-the-shelf models that simply select a direction often drift, loop, or fail to localize, largely due to the absence of explicit memory, grounding, and subgoal structure. VoP addresses this bottleneck by extracting the model’s implicit global knowledge through multi-part prompts, requiring natural language output of (a) the destination’s exact location, (b) an estimated current location, and (c) step-by-step walking directions, refreshed at every turn. This process transforms raw visual observations into a persistent, interpretable, and updatable cognitive map (Dalal et al., 17 Dec 2025).
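For concreteness, a minimal sketch of such a three-part prompt builder is shown below; the wording, field names, and return format are illustrative assumptions rather than the exact AgentNav prompt.

```python
# Illustrative three-part VoP prompt builder (hypothetical wording and message
# format; not the exact prompt used in AgentNav).
def build_vop_prompt(goal_description: str, image_refs: list, memory_summary: str) -> dict:
    """Assemble a multimodal query that forces the model to verbalize its path."""
    questions = (
        "1. Where exactly is the destination located?\n"
        "2. Based on the attached street-view images, where do you estimate "
        "you are right now?\n"
        "3. Give step-by-step walking directions from your estimated location "
        "to the destination, then state the single compass direction to walk next."
    )
    text = (
        f"You are navigating a city on foot toward: {goal_description}.\n"
        f"Memory of recent steps:\n{memory_summary}\n\n"
        f"{questions}"
    )
    # Images are attached in whatever multimodal format the target MLLM API expects.
    return {"text": text, "images": image_refs}
```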
2. Formal Architectures and Algorithmic Pipeline
The canonical VoP algorithm (exemplified in AgentNav) operates over a graph $G = (V, E)$, where each vertex $v \in V$ corresponds to a city intersection and $O_v$ is the set of street-view images for its outgoing edges. At time $t$, the agent maintains:
- $m_t$: a Markovian memory of past steps,
- $C_t = (L_t, D_t)$: the cognitive map (with $L_t$ the verbalized landmark set, $D_t$ the sequence of directional cues).
At each step, the following sequence is executed:
- Prompt Construction: Compile a prompt $p_t$ that requests (i) destination location, (ii) current estimated location, (iii) walking directions from current location to goal, appending recent image observations and the memory state $m_t$.
- Model Execution: Query the MLLM, $y_t = \mathrm{MLLM}(p_t)$.
- Parsing Verbal Output: Extract from $y_t$ the destination ($g$), current location ($\hat{\ell}_t$), and a natural-language path ($\pi_t$).
- Landmark and Direction Extraction: Parse all entities (e.g., “5th Ave”, “Bryant Park”) and step-wise directions (e.g., [north, east, …]) from $\pi_t$.
- Cognitive Map Update: Merge newly mentioned landmarks into $L_t$, set $D_t$ to the extracted direction sequence, and form $C_t = (L_t, D_t)$.
- Action Selection: Choose action $a_t$ by mapping the first direction in $D_t$ to the outgoing edge whose heading best matches.
- Step Execution and Memory Update: Execute $a_t$, advance to the next intersection $v_{t+1}$, and update the memory $m_{t+1}$.
The loop continues until the estimated current location $\hat{\ell}_t$ falls within the goal region. The verbalization function $\nu$ maps the cognitive map $C_t$ to a natural-language string encoding the current planned route (e.g., “Walk north on 5th Ave past Bryant Park, then turn east on 42nd St…”) (Dalal et al., 17 Dec 2025). A simplified code sketch of this loop follows.
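The sketch below assumes an illustrative graph interface (`graph.images`, `graph.neighbors`), MLLM wrapper (`query_mllm`), and output parser (`parse_path`); none of these names come from the cited work.

```python
from dataclasses import dataclass, field
from typing import Callable

COMPASS = {"N": 0, "NE": 45, "E": 90, "SE": 135,
           "S": 180, "SW": 225, "W": 270, "NW": 315}

@dataclass
class CognitiveMap:
    landmarks: set = field(default_factory=set)     # L_t: verbalized landmarks
    directions: list = field(default_factory=list)  # D_t: planned directional cues

def heading_error(edge_heading: str, target: str) -> float:
    """Smallest angular difference between an outgoing edge and the verbalized direction."""
    diff = abs(COMPASS.get(edge_heading, 0) - COMPASS.get(target, 0)) % 360
    return min(diff, 360 - diff)

def navigate(start_node, goal_text: str, graph,
             query_mllm: Callable,   # (prompt_text, images) -> y_t (free-form answer)
             parse_path: Callable,   # y_t -> (destination, current_loc, landmarks, directions)
             at_goal: Callable,      # current_loc -> bool
             max_steps: int = 200):
    node, memory, cmap = start_node, [], CognitiveMap()
    for _ in range(max_steps):
        images = graph.images(node)                # street-view images at this intersection
        # (i)-(iii): prompt for destination, estimated location, and walking directions.
        prompt = (f"Destination: {goal_text}\n"
                  f"Recent steps: {memory[-5:]}\n"
                  "State the destination's location, your estimated current location, "
                  "and step-by-step walking directions to the goal.")
        y_t = query_mllm(prompt, images)           # model execution
        _destination, current_loc, landmarks, directions = parse_path(y_t)
        cmap.landmarks |= set(landmarks)           # cognitive map update: L_t
        cmap.directions = directions               # cognitive map update: D_t
        if at_goal(current_loc):
            break
        # Action selection: map the first verbalized direction to the best-matching edge.
        neighbors = graph.neighbors(node)          # {edge_heading: next_node}
        target = cmap.directions[0] if cmap.directions else "N"
        action = min(neighbors, key=lambda h: heading_error(h, target))
        memory.append((node, action))              # Markovian memory update
        node = neighbors[action]
    return memory
```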
3. Cognitive Map Formalism and Prompt Construction
The VoP approach represents the cognitive map at time $t$ as $C_t = (L_t, D_t)$, with $L_t$ the set of all landmarks mentioned so far and $D_t$ the corresponding directions (drawn from $\{N, NE, E, \ldots, NW\}$). The verbalization function $\nu$ translates these into a single string for prompting or memory storage.
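As an illustration, a toy verbalization function $\nu$ might render $(L_t, D_t)$ as a single route string; the one-to-one pairing of directions with landmarks below is a simplifying assumption, not the scheme used in the cited systems.

```python
# Toy rendering of nu(C_t): turn the cognitive map (L_t, D_t) into one
# natural-language route string (assumed, simplified pairing of cues).
def verbalize(landmarks: list, directions: list) -> str:
    steps = []
    for i, direction in enumerate(directions):
        if i < len(landmarks):
            steps.append(f"walk {direction} past {landmarks[i]}")
        else:
            steps.append(f"walk {direction}")
    if not steps:
        return "You have arrived."
    route = ", then ".join(steps) + "."
    return route[0].upper() + route[1:]

# verbalize(["Bryant Park"], ["north", "east"])
# -> "Walk north past Bryant Park, then walk east."
```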
In VELMA, another VoP realization, the verbalization function $\phi$ aggregates at each step the agent’s panoramic observation $o_t$ and static landmark list $\mathcal{L}$ into a trajectory summary $s_t = \phi(o_1, \ldots, o_t, \mathcal{L})$, where each verbalized observation includes both the intersection type (“There is a 4-way intersection”) and the results of landmark visibility scoring (“There is a Starbucks on your right”) computed using CLIP-based vision–language similarity (Schumann et al., 2023).
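A per-step observation in this scheme can be rendered with simple templates; the sketch below is an assumed simplification of the VELMA-style verbalizer output, not its exact implementation.

```python
# Assumed simplification of a VELMA-style per-step observation verbalizer:
# intersection type plus any landmarks judged visible, with their relative side.
def verbalize_observation(num_outgoing_streets: int, visible_landmarks: list) -> str:
    """visible_landmarks: list of (name, side) pairs, e.g. [("Starbucks", "right")]."""
    parts = [f"There is a {num_outgoing_streets}-way intersection."]
    for name, side in visible_landmarks:
        parts.append(f"There is a {name} on your {side}.")
    return " ".join(parts)

# verbalize_observation(4, [("Starbucks", "right")])
# -> "There is a 4-way intersection. There is a Starbucks on your right."
```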
Prompt construction thus combines these incremental text snippets—along with action history—to maximize the model’s contextual memory and ensure step-wise alignment with environmental cues.
4. Landmark Extraction and Visual Grounding
Landmark extraction in VoP systems is typically achieved via LLM-based, few-shot prompt pipelines, which parse navigation instructions or model-generated path descriptions to extract named entities and phrases (e.g., “Starbucks”, “bank”, “library”). Landmark visibility is determined using vision–language similarity: for each panoramic view, the CLIP image encoder output is compared to the text encoder output for “picture of [landmark]”, yielding a similarity score. Standardized z-scoring (against a background corpus) enables robust detection (e.g., a z-score above a fixed threshold signals visibility), after which the relative direction ("left", "right", etc.) is logged (Schumann et al., 2023).
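One possible implementation of this visibility check, assuming the Hugging Face `transformers` CLIP checkpoint `openai/clip-vit-base-patch32` and an illustrative threshold (neither is claimed here to be the exact VELMA configuration), is sketched below.

```python
# Hedged sketch of CLIP-based landmark visibility scoring with z-score
# normalization; model name, prompt wording, and threshold are assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def landmark_visible(view_image, landmark: str,
                     background_scores: torch.Tensor, z_threshold: float = 3.0) -> bool:
    """Return True if the landmark's CLIP similarity stands out from background scores."""
    inputs = processor(text=[f"picture of {landmark}"], images=view_image,
                       return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    sim = torch.cosine_similarity(image_emb, text_emb).item()
    # Standardize against similarity scores collected on a background corpus.
    z = (sim - background_scores.mean().item()) / (background_scores.std().item() + 1e-8)
    return z > z_threshold
```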
This feature set is then verbalized and appended to the agent’s trajectory history, upon which the LLM is repeatedly queried to predict the next action. The resulting prompt contains all incremental environmental evidence, supporting robust step-wise spatial reasoning.
5. Quantitative Results and Comparative Analysis
In CityNav experiments using the AgentNav system, VoP yields significant advances over baseline policies. The table below summarizes selected metrics for GPT-4.1:
| Model / Setting | Success (%) | SPL | Decision Acc. (%) |
|---|---|---|---|
| GPT-4.1 (base) | 15 | 0.097 | 42.3 |
| GPT-4.1 + VoP | 92 | 0.557 | 75.3 |
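For reference, SPL presumably denotes the standard Success weighted by Path Length metric used in embodied navigation:

$$\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i, \ell_i)},$$

where $S_i$ is the binary success indicator for episode $i$, $\ell_i$ the shortest-path length from start to goal, and $p_i$ the length of the path actually taken.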
Comparable boosts are observed across other models: GPT-4o (13→88%), GPT-5 (54→94%), Gemini 2.5 (12→73%), and Qwen 2.5 (7→32%) in New York, with strong but more modest gains in Tokyo, Vienna, and São Paulo (AgentNav reaching 17–56% success) (Dalal et al., 17 Dec 2025).
Ablation studies attribute incremental performance increases to individual memory and verbalization modules. Markovian memory alone raises success to 23%, decision history to 29%, and previous-visit tracking to 35%. Prompting for destination only (“partially verbalized path”) yields 66%, whereas the complete three-part VoP (destination, current location, walking directions) plus memory achieves 92% success, demonstrating the necessity of full spatiolinguistic grounding.
6. Integration in Embodied Navigation and Related Approaches
VoP methods have been integrated both in the reproduction of human instruction following and in autonomous, goal-agnostic navigation. In VELMA, VoP produces an incremental verbal record of observed intersections and landmark alignments, maintaining an explicit trajectory history to guide LLM-driven policy prediction at each step. The visual-verbal consistency and trajectory memory allow agents to surpass prior embodied vision–language navigation benchmarks (Schumann et al., 2023).
Distinct from pure chain-of-thought or reflection-based reasoning, VoP’s explicit querying and storage of natural-language cognitive maps enables substantially enhanced localization, prevention of cyclic or drifting behavior, and several-fold improvements in long-horizon city navigation success, particularly in situations characterized by sparse grounding cues (Dalal et al., 17 Dec 2025).
7. Context, Limitations, and Prospective Directions
The efficacy of Verbalization of Path implies the necessity of explicit knowledge grounding for robust agent navigation in unstructured, real-world urban graphs. The approach serves as a bridge between latent world knowledge in web-scale MLLMs and the demands of practical, step-wise environmental interaction. Current results indicate that simple addition of memory modules is insufficient to achieve near-perfect navigation; only when the agent is compelled to verbalize destination, location, and end-to-end path structure per step are dramatic performance gains realized.
No published evaluation to date has reported negative transfer from VoP to other contextual reasoning tasks, but further investigation in more diverse, less visually structured environments is warranted. A plausible implication is that VoP-like explicit verbalization and grounding architectures may play a central role in bridging the gap between open-ended LLMs and robust, real-world sequential decision-making agents.
References: (Dalal et al., 17 Dec 2025, Schumann et al., 2023)