Verbalization of Path (VoP)

Updated 20 December 2025
  • VoP is a method that explicitly encodes navigation steps into natural language, creating cognitive maps from visual observations and landmark cues.
  • It constructs multi-part prompts to extract destination, current location, and detailed walking directions, enhancing spatial reasoning in urban settings.
  • Quantitative results show success rates improving from 15% to 92% for GPT-4.1 on CityNav, demonstrating VoP's effectiveness for robust real-world navigation.

Verbalization of Path (VoP) denotes a family of methodologies for explicit, step-wise natural language encoding of agent navigation—typically in urban, street-view, or embodied visual environments—to operationalize, probe, and ground the internal spatial reasoning of LLMs or multimodal LLMs (MLLMs). VoP pipelines transform otherwise latent model knowledge about world geography, landmark semantics, and trajectory structure into an externalized “cognitive map” comprising named entities (streets, buildings), turn directions, and visually/verbally salient cues, facilitating substantial gains in wayfinding, localization, and policy learning under conditions of weak or sparse external supervision (Dalal et al., 17 Dec 2025, Schumann et al., 2023).

1. Motivations for Explicit Path Verbalization

The core motivation for VoP derives from limitations in conventional decision policies for navigation using LLMs or MLLMs, which typically fail when required to act upon visual inputs alone, without maps, GPS, or predefined symbolic annotations. In real-world “sparsely grounded” navigation tasks—where agents must select routes across city graphs using only street-level imagery at each intersection—off-the-shelf models that simply select a direction often drift, loop, or fail to localize, largely due to the absence of explicit memory, grounding, and subgoal structure. VoP addresses this bottleneck by extracting the model’s implicit global knowledge through multi-part prompts, requiring natural language output of (a) the destination’s exact location, (b) an estimated current location, and (c) step-by-step walking directions, refreshed at every turn. This process transforms raw visual observations into a persistent, interpretable, and updatable cognitive map (Dalal et al., 17 Dec 2025).
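
As a concrete illustration, a three-part VoP query might be assembled as follows. This is a minimal sketch in Python; the prompt wording and the `recent_observations`/`memory` fields are illustrative assumptions rather than the exact template used in the paper.

```python
def build_vop_prompt(recent_observations, memory, goal_description):
    """Assemble a three-part VoP prompt (illustrative wording, not the paper's exact template)."""
    return (
        f"Goal: {goal_description}\n"
        f"Recent street-view observations: {recent_observations}\n"
        f"Memory of past steps: {memory}\n\n"
        "Answer in natural language:\n"
        "1. Where exactly is the destination located?\n"
        "2. Where do you estimate you are right now?\n"
        "3. Give step-by-step walking directions from your current location to the destination."
    )
```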

2. Formal Architectures and Algorithmic Pipeline

The canonical VoP algorithm (exemplified in AgentNav) operates over a graph $G=(V,E)$, where each vertex $v_t$ corresponds to a city intersection and $o_t$ is the set of street-view images for outgoing edges. At time $t$, the agent maintains:

  • $m_{t-1}$: a Markovian memory of past steps,
  • $M_{t-1} = (L_{t-1}, D_{t-1})$: the cognitive map (with $L$ the verbalized landmark set, $D$ the sequence of directional cues).

At each step, the following sequence is executed:

  1. Prompt Construction: Compile a prompt $P_t$ that requests (i) destination location, (ii) current estimated location, (iii) walking directions from current location to goal, appending recent image observations and memory state.
  2. Model Execution: Query the MLLM: $\text{output}_t = \text{MLLM}(P_t)$.
  3. Parsing Verbal Output: Extract from $\text{output}_t$ the destination ($\operatorname{loc}_{dest}$), current location ($\operatorname{loc}_{cur}$), and a natural-language path ($\mathrm{NL}_{path}$).
  4. Landmark and Direction Extraction: Parse all entities (e.g., “5th Ave”, “Bryant Park”) and step-wise directions (e.g., [north, east, …]) from $\mathrm{NL}_{path}$.
  5. Cognitive Map Update: Merge new items to update $L_t, D_t$, and set $M_t = (L_t, D_t)$.
  6. Action Selection: Choose action $a_t$ by mapping the first direction in $\mathrm{NL}_{path}$ to the outgoing edge whose heading best matches.
  7. Step Execution and Memory Update: Execute $a_t$, advance to $v_{t+1}$, and update $m_t$.

The loop continues until the estimated current location is within the goal region. The verbalization function $V: (L, D) \to S$ maps the cognitive map to a natural-language string encoding the current planned route (e.g., “Walk north on 5th Ave past Bryant Park, then turn east on 42nd St…”) (Dalal et al., 17 Dec 2025).
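
The control flow above can be condensed into a short schematic. The following Python sketch is a minimal, self-contained rendering of the loop under stated assumptions: the `mllm` callable is assumed to return the three parsed text fields directly, the graph is a plain adjacency map of headed edges, and the landmark and direction parsers are deliberately naive stand-ins for the LLM-based extraction described in Section 4.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

COMPASS = {"north": 0, "northeast": 45, "east": 90, "southeast": 135,
           "south": 180, "southwest": 225, "west": 270, "northwest": 315}

@dataclass
class Edge:
    heading: float   # compass heading of the outgoing street, in degrees
    target: str      # id of the next intersection

def extract_directions(nl_path: str) -> List[str]:
    """Naive stand-in parser: pull compass words, in order, out of the verbalized path."""
    return [w.strip(".,") for w in nl_path.lower().split() if w.strip(".,") in COMPASS]

def best_edge(direction: str, edges: List[Edge]) -> Edge:
    """Map a verbalized direction to the outgoing edge with the closest heading."""
    gap = lambda e: min(abs(e.heading - COMPASS[direction]),
                        360 - abs(e.heading - COMPASS[direction]))
    return min(edges, key=gap)

def vop_navigate(mllm: Callable[[str], Tuple[str, str, str]],
                 graph: Dict[str, List[Edge]],
                 start: str, goal: str, max_steps: int = 50) -> bool:
    """Schematic VoP loop; `mllm(prompt)` is assumed to return
    (destination, current location, natural-language path) as text."""
    node, memory, landmarks = start, [], set()
    for _ in range(max_steps):
        prompt = (f"Goal: {goal}\nMemory so far: {memory}\n"
                  f"Known landmarks: {sorted(landmarks)}\n"
                  "State the destination, your current location, "
                  "and step-by-step walking directions.")
        loc_dest, loc_cur, nl_path = mllm(prompt)                 # output_t = MLLM(P_t)
        landmarks |= {w for w in nl_path.split() if w.istitle()}  # crude landmark update (L_t)
        if loc_cur == loc_dest:                                   # simplified goal test
            return True
        dirs = extract_directions(nl_path)                        # directional cues (D_t)
        if not dirs or not graph.get(node):
            return False
        node = best_edge(dirs[0], graph[node]).target             # execute a_t, move to v_{t+1}
        memory.append((loc_cur, dirs[0]))                         # Markovian memory m_t
    return False
```

In practice, the parsing in steps 3–4 is performed with LLM few-shot prompting rather than the keyword matching used in this sketch.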

3. Cognitive Map Formalism and Prompt Construction

The VoP approach represents the cognitive map at time $t$ as $M_t = (L_t, D_t)$, with $L_t = \{\ell_1, \ldots, \ell_k\}$ as the set of all landmarks mentioned so far, and $D_t = \{\delta_1, \ldots, \delta_k\}$ the corresponding directions (drawn from {N, NE, E, ..., NW}). The verbalization function $V$ translates these into a single string $S$ for prompting or memory storage.
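
As a toy rendering of $V$, one might pair each landmark with its directional cue and join the pairs into a route string; the pairing and phrasing below are illustrative assumptions, not the paper's exact formulation.

```python
def verbalize(landmarks, directions):
    """Toy V: (L, D) -> S, assuming landmarks and directions are aligned lists."""
    steps = [f"head {d} toward {l}" for l, d in zip(landmarks, directions)]
    return "Planned route: " + ", then ".join(steps) + "."

# Example:
# verbalize(["5th Ave", "Bryant Park", "42nd St"], ["north", "north", "east"])
# -> "Planned route: head north toward 5th Ave, then head north toward Bryant Park,
#     then head east toward 42nd St."
```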

In VELMA, another VoP realization, the function $v$ aggregates at each step the agent’s panoramic observations ($P_{1:t}$) and static landmark list ($L$) into a trajectory summary: $S_t = v(P_{1:t}; L) = o_1 \circ \ldots \circ o_t$, where each $o_t$ includes both the intersection type (“There is a 4-way intersection”) and the results of landmark visibility scoring (“There is a Starbucks on your right”) computed using CLIP-based vision–language similarity (Schumann et al., 2023).
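
A minimal sketch of this per-step verbalization and concatenation, with the snippet phrasing taken from the examples above and the data layout assumed for illustration:

```python
def verbalize_step(intersection_type, visible_landmarks):
    """Render one observation snippet o_t from the intersection type and visible landmarks.

    `visible_landmarks` is assumed to be a list of (name, relative_direction) pairs
    produced by the CLIP-based visibility scoring described in Section 4.
    """
    parts = [f"There is a {intersection_type}."]
    parts += [f"There is a {name} on your {side}." for name, side in visible_landmarks]
    return " ".join(parts)

def trajectory_summary(snippets):
    """S_t = o_1 ∘ ... ∘ o_t: concatenate the per-step snippets into the running summary."""
    return " ".join(snippets)

# Example:
# verbalize_step("4-way intersection", [("Starbucks", "right")])
# -> "There is a 4-way intersection. There is a Starbucks on your right."
```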

Prompt construction thus combines these incremental text snippets—along with action history—to maximize the model’s contextual memory and ensure step-wise alignment with environmental cues.

4. Landmark Extraction and Visual Grounding

Landmark extraction in VoP systems is typically achieved via LLM-based, few-shot prompt pipelines, which parse navigation instructions or model-generated path descriptions to extract named entities and phrases (e.g., “Starbucks”, “bank”, “library”). Landmark visibility is determined using vision–language similarity: for each panoramic view, the CLIP image-encoder output is compared to the text-encoder output for “picture of [landmark]”, yielding a similarity score. Standardized $z$-scoring (against a background corpus) enables robust detection (e.g., $z > 3.5$ signals visibility), after which the relative direction (“left”, “right”, etc.) is logged (Schumann et al., 2023).
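
A minimal sketch of this visibility test, assuming the Hugging Face transformers implementation of CLIP (the model checkpoint, prompt wording, and handling of the background-score corpus are assumptions made for illustration):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image, landmark):
    """Cosine similarity between a panoramic view and the text 'picture of [landmark]'."""
    inputs = processor(text=[f"picture of {landmark}"], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

def is_visible(image, landmark, background_scores, threshold=3.5):
    """Standardize the similarity as a z-score against a background corpus of scores."""
    mu = sum(background_scores) / len(background_scores)
    sigma = (sum((s - mu) ** 2 for s in background_scores) / len(background_scores)) ** 0.5
    z = (clip_similarity(image, landmark) - mu) / sigma
    return z > threshold
```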

This feature set is then verbalized and appended to the agent’s trajectory history, upon which the LLM is repeatedly queried to predict the next action. The resulting prompt contains all incremental environmental evidence, supporting robust step-wise spatial reasoning.

5. Quantitative Results and Comparative Analysis

In CityNav experiments using the AgentNav system, VoP yields significant advances over baseline policies. The table below summarizes selected metrics for GPT-4.1:

Model / Setting    Success (%)   SPL     Decision Acc. (%)
GPT-4.1 (base)     15            0.097   42.3
GPT-4.1 + VoP      92            0.557   75.3

Comparable boosts are observed across other models: GPT-4o (13→88%), GPT-5 (54→94%), Gemini 2.5 (12→73%), and Qwen 2.5 (7→32%) in New York, with strong but more modest gains in Tokyo, Vienna, and São Paulo (AgentNav reaching 17–56% success) (Dalal et al., 17 Dec 2025).

Ablation studies attribute the incremental performance increases to individual memory and verbalization modules. Markovian memory alone raises success to 23%, decision history to 29%, and previous-visit tracking to 35%. Prompting for the destination alone (“partially verbalized path”) yields 66%, whereas the complete three-part VoP (destination, current location, walking directions) plus memory achieves 92% success, demonstrating the necessity of full spatiolinguistic grounding.

6. Integration Modes and Relation to Other Reasoning Approaches

VoP methods have been applied both to reproducing human instruction following and to autonomous, goal-agnostic navigation. In VELMA, VoP produces an incremental verbal record of observed intersections and landmark alignments, maintaining an explicit trajectory history to guide LLM-driven policy prediction at each step. The visual–verbal consistency and trajectory memory allow agents to surpass prior embodied vision–language navigation benchmarks (Schumann et al., 2023).

Distinct from pure chain-of-thought or reflection-based reasoning, VoP’s explicit querying and storage of natural-language cognitive maps enables substantially better localization, prevents cyclic or drifting behavior, and yields order-of-magnitude improvements in long-horizon city navigation, particularly in situations characterized by sparse grounding cues (Dalal et al., 17 Dec 2025).

7. Context, Limitations, and Prospective Directions

The efficacy of Verbalization of Path points to the necessity of explicit knowledge grounding for robust agent navigation in unstructured, real-world urban graphs. The approach serves as a bridge between the latent world knowledge of web-scale MLLMs and the demands of practical, step-wise environmental interaction. Current results indicate that simply adding memory modules is insufficient for near-perfect navigation; dramatic performance gains are realized only when the agent is compelled to verbalize the destination, current location, and end-to-end path structure at every step.

No published evaluation to date has reported negative transfer from VoP to other contextual reasoning tasks, but further investigation in more diverse, less visually structured environments is warranted. A plausible implication is that VoP-like explicit verbalization and grounding architectures may play a central role in bridging the gap between open-ended LLMs and robust, real-world sequential decision-making agents.

References: (Dalal et al., 17 Dec 2025, Schumann et al., 2023)
