Verbalization of Path (VoP)

Updated 20 December 2025
  • VoP is a method that explicitly encodes navigation steps into natural language, creating cognitive maps from visual observations and landmark cues.
  • It constructs multi-part prompts to extract destination, current location, and detailed walking directions, enhancing spatial reasoning in urban settings.
  • Quantitative results show success rates improving from 15% to 92% for GPT-4.1 on CityNav, demonstrating VoP's effectiveness for robust real-world navigation.

Verbalization of Path (VoP) denotes a family of methodologies for explicit, step-wise natural language encoding of agent navigation—typically in urban, street-view, or embodied visual environments—to operationalize, probe, and ground the internal spatial reasoning of LLMs or multimodal LLMs (MLLMs). VoP pipelines transform otherwise latent model knowledge about world geography, landmark semantics, and trajectory structure into an externalized “cognitive map” comprising named entities (streets, buildings), turn directions, and visually/verbally salient cues, facilitating substantial gains in wayfinding, localization, and policy learning under conditions of weak or sparse external supervision (Dalal et al., 17 Dec 2025, Schumann et al., 2023).

1. Motivations for Explicit Path Verbalization

The core motivation for VoP derives from limitations in conventional decision policies for navigation using LLMs or MLLMs, which typically fail when required to act upon visual inputs alone, without maps, GPS, or predefined symbolic annotations. In real-world “sparsely grounded” navigation tasks—where agents must select routes across city graphs using only street-level imagery at each intersection—off-the-shelf models that simply select a direction often drift, loop, or fail to localize, largely due to the absence of explicit memory, grounding, and subgoal structure. VoP addresses this bottleneck by extracting the model’s implicit global knowledge through multi-part prompts, requiring natural language output of (a) the destination’s exact location, (b) an estimated current location, and (c) step-by-step walking directions, refreshed at every turn. This process transforms raw visual observations into a persistent, interpretable, and updatable cognitive map (Dalal et al., 17 Dec 2025).
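
As a concrete illustration, a three-part VoP query might be assembled as follows. This is a minimal sketch in Python; the prompt wording and the `recent_observations`/`memory` fields are illustrative assumptions rather than the exact template used in the paper.

```python
def build_vop_prompt(recent_observations, memory, goal_description):
    """Assemble a three-part VoP prompt (illustrative wording, not the paper's exact template)."""
    return (
        f"Goal: {goal_description}\n"
        f"Recent street-view observations: {recent_observations}\n"
        f"Memory of past steps: {memory}\n\n"
        "Answer in natural language:\n"
        "1. Where exactly is the destination located?\n"
        "2. Where do you estimate you are right now?\n"
        "3. Give step-by-step walking directions from your current location to the destination."
    )
```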

2. Formal Architectures and Algorithmic Pipeline

The canonical VoP algorithm (exemplified in AgentNav) operates over a graph $G=(V,E)$, where each vertex $v_t$ corresponds to a city intersection and $o_t$ is the set of street-view images for outgoing edges. At time $t$, the agent maintains:

  • $m_{t-1}$: a Markovian memory of past steps,
  • $M_{t-1} = (L_{t-1}, D_{t-1})$: the cognitive map (with $L$ the verbalized landmark set, $D$ the sequence of directional cues).

At each step, the following sequence is executed:

  1. Prompt Construction: Compile a prompt $P_t$ that requests (i) destination location, (ii) current estimated location, (iii) walking directions from current location to goal, appending recent image observations and memory state.
  2. Model Execution: Query the MLLM: $\text{output}_t = \text{MLLM}(P_t)$.
  3. Parsing Verbal Output: Extract from $\text{output}_t$ the destination ($\operatorname{loc}_{dest}$), current location ($\operatorname{loc}_{cur}$), and a natural-language path ($\mathrm{NL}_{path}$).
  4. Landmark and Direction Extraction: Parse all entities (e.g., “5th Ave”, “Bryant Park”) and step-wise directions (e.g., [north, east, …]) from $\mathrm{NL}_{path}$.
  5. Cognitive Map Update: Merge new items to update $L_t, D_t$, and set $M_t = (L_t, D_t)$.
  6. Action Selection: Choose action $a_t$ by mapping the first direction in $\mathrm{NL}_{path}$ to the outgoing edge whose heading best matches.
  7. Step Execution and Memory Update: Execute $a_t$, advance to $v_{t+1}$, and update $m_t$.

The loop continues until the estimated current location is within the goal region. The verbalization function $V: (L, D) \to S$ maps the cognitive map to a natural-language string encoding the current planned route (e.g., “Walk north on 5th Ave past Bryant Park, then turn east on 42nd St…”) (Dalal et al., 17 Dec 2025).
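
The control flow above can be condensed into a short schematic. The following Python sketch is a minimal, self-contained rendering of the loop under stated assumptions: the `mllm` callable is assumed to return the three parsed text fields directly, the graph is a plain adjacency map of headed edges, and the landmark and direction parsers are deliberately naive stand-ins for the LLM-based extraction described in Section 4.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

COMPASS = {"north": 0, "northeast": 45, "east": 90, "southeast": 135,
           "south": 180, "southwest": 225, "west": 270, "northwest": 315}

@dataclass
class Edge:
    heading: float   # compass heading of the outgoing street, in degrees
    target: str      # id of the next intersection

def extract_directions(nl_path: str) -> List[str]:
    """Naive stand-in parser: pull compass words, in order, out of the verbalized path."""
    return [w.strip(".,") for w in nl_path.lower().split() if w.strip(".,") in COMPASS]

def best_edge(direction: str, edges: List[Edge]) -> Edge:
    """Map a verbalized direction to the outgoing edge with the closest heading."""
    gap = lambda e: min(abs(e.heading - COMPASS[direction]),
                        360 - abs(e.heading - COMPASS[direction]))
    return min(edges, key=gap)

def vop_navigate(mllm: Callable[[str], Tuple[str, str, str]],
                 graph: Dict[str, List[Edge]],
                 start: str, goal: str, max_steps: int = 50) -> bool:
    """Schematic VoP loop; `mllm(prompt)` is assumed to return
    (destination, current location, natural-language path) as text."""
    node, memory, landmarks = start, [], set()
    for _ in range(max_steps):
        prompt = (f"Goal: {goal}\nMemory so far: {memory}\n"
                  f"Known landmarks: {sorted(landmarks)}\n"
                  "State the destination, your current location, "
                  "and step-by-step walking directions.")
        loc_dest, loc_cur, nl_path = mllm(prompt)                 # output_t = MLLM(P_t)
        landmarks |= {w for w in nl_path.split() if w.istitle()}  # crude landmark update (L_t)
        if loc_cur == loc_dest:                                   # simplified goal test
            return True
        dirs = extract_directions(nl_path)                        # directional cues (D_t)
        if not dirs or not graph.get(node):
            return False
        node = best_edge(dirs[0], graph[node]).target             # execute a_t, move to v_{t+1}
        memory.append((loc_cur, dirs[0]))                         # Markovian memory m_t
    return False
```

In practice, the parsing in steps 3–4 is performed with LLM few-shot prompting rather than the keyword matching used in this sketch.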

3. Cognitive Map Formalism and Prompt Construction

The VoP approach represents the cognitive map at time $t$ as $M_t = (L_t, D_t)$, with $L_t = \{\ell_1, \ldots, \ell_k\}$ as the set of all landmarks mentioned so far, and $D_t = \{\delta_1, \ldots, \delta_k\}$ the corresponding directions (drawn from {N, NE, E, ..., NW}). The verbalization function $V$ translates these into a single string $S$ for prompting or memory storage.
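
As a toy rendering of $V$, one might pair each landmark with its directional cue and join the pairs into a route string; the pairing and phrasing below are illustrative assumptions, not the paper's exact formulation.

```python
def verbalize(landmarks, directions):
    """Toy V: (L, D) -> S, assuming landmarks and directions are aligned lists."""
    steps = [f"head {d} toward {l}" for l, d in zip(landmarks, directions)]
    return "Planned route: " + ", then ".join(steps) + "."

# Example:
# verbalize(["5th Ave", "Bryant Park", "42nd St"], ["north", "north", "east"])
# -> "Planned route: head north toward 5th Ave, then head north toward Bryant Park,
#     then head east toward 42nd St."
```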

In VELMA, another VoP realization, the function $v$ aggregates at each step the agent’s panoramic observations ($P_{1:t}$) and static landmark list ($L$) into a trajectory summary: $S_t = v(P_{1:t}; L) = o_1 \circ \ldots \circ o_t$, where each $o_t$ includes both the intersection type (“There is a 4-way intersection”) and the results of landmark visibility scoring (“There is a Starbucks on your right”) computed using CLIP-based vision–language similarity (Schumann et al., 2023).
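
A minimal sketch of this per-step verbalization and concatenation, with the snippet phrasing taken from the examples above and the data layout assumed for illustration:

```python
def verbalize_step(intersection_type, visible_landmarks):
    """Render one observation snippet o_t from the intersection type and visible landmarks.

    `visible_landmarks` is assumed to be a list of (name, relative_direction) pairs
    produced by the CLIP-based visibility scoring described in Section 4.
    """
    parts = [f"There is a {intersection_type}."]
    parts += [f"There is a {name} on your {side}." for name, side in visible_landmarks]
    return " ".join(parts)

def trajectory_summary(snippets):
    """S_t = o_1 ∘ ... ∘ o_t: concatenate the per-step snippets into the running summary."""
    return " ".join(snippets)

# Example:
# verbalize_step("4-way intersection", [("Starbucks", "right")])
# -> "There is a 4-way intersection. There is a Starbucks on your right."
```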

Prompt construction thus combines these incremental text snippets—along with action history—to maximize the model’s contextual memory and ensure step-wise alignment with environmental cues.

4. Landmark Extraction and Visual Grounding

Landmark extraction in VoP systems is typically achieved via LLM-based, few-shot prompt pipelines, which parse navigation instructions or model-generated path descriptions to extract named entities and phrases (e.g., “Starbucks”, “bank”, “library”). Landmark visibility is determined using vision–language similarity: for each panoramic view, the CLIP image-encoder output is compared to the text-encoder output for “picture of [landmark]”, yielding a similarity score. Standardized $z$-scoring (against a background corpus) enables robust detection (e.g., $z > 3.5$ signals visibility), after which the relative direction (“left”, “right”, etc.) is logged (Schumann et al., 2023).
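
A minimal sketch of this visibility test, assuming the Hugging Face transformers implementation of CLIP (the model checkpoint, prompt wording, and handling of the background-score corpus are assumptions made for illustration):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image, landmark):
    """Cosine similarity between a panoramic view and the text 'picture of [landmark]'."""
    inputs = processor(text=[f"picture of {landmark}"], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

def is_visible(image, landmark, background_scores, threshold=3.5):
    """Standardize the similarity as a z-score against a background corpus of scores."""
    mu = sum(background_scores) / len(background_scores)
    sigma = (sum((s - mu) ** 2 for s in background_scores) / len(background_scores)) ** 0.5
    z = (clip_similarity(image, landmark) - mu) / sigma
    return z > threshold
```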

This feature set is then verbalized and appended to the agent’s trajectory history, upon which the LLM is repeatedly queried to predict the next action. The resulting prompt contains all incremental environmental evidence, supporting robust step-wise spatial reasoning.

5. Quantitative Results and Comparative Analysis

In CityNav experiments using the AgentNav system, VoP yields significant advances over baseline policies. The table below summarizes selected metrics for GPT-4.1:

Model / Setting    Success (%)   SPL     Decision Acc. (%)
GPT-4.1 (base)     15            0.097   42.3
GPT-4.1 + VoP      92            0.557   75.3

Comparable boosts are observed across other models: GPT-4o (13→88%), GPT-5 (54→94%), Gemini 2.5 (12→73%), and Qwen 2.5 (7→32%) in New York, with strong but more modest gains in Tokyo, Vienna, and São Paulo (AgentNav reaching 17–56% success) (Dalal et al., 17 Dec 2025).

Ablation studies attribute the incremental performance increases to individual memory and verbalization modules. Markovian memory alone raises success to 23%, decision history to 29%, and previous-visit tracking to 35%. Prompting for the destination alone (“partially verbalized path”) yields 66%, whereas the complete three-part VoP (destination, current location, walking directions) plus memory achieves 92% success, demonstrating the necessity of full spatiolinguistic grounding.

6. Integration Modes and Relation to Other Reasoning Approaches

VoP methods have been applied both to reproducing human instruction following and to autonomous, goal-agnostic navigation. In VELMA, VoP produces an incremental verbal record of observed intersections and landmark alignments, maintaining an explicit trajectory history to guide LLM-driven policy prediction at each step. The visual–verbal consistency and trajectory memory allow agents to surpass prior embodied vision–language navigation benchmarks (Schumann et al., 2023).

Distinct from pure chain-of-thought or reflection-based reasoning, VoP’s explicit querying and storage of natural-language cognitive maps enables substantially better localization, prevents cyclic or drifting behavior, and yields order-of-magnitude improvements in long-horizon city navigation, particularly in situations characterized by sparse grounding cues (Dalal et al., 17 Dec 2025).

7. Context, Limitations, and Prospective Directions

The efficacy of Verbalization of Path points to the necessity of explicit knowledge grounding for robust agent navigation in unstructured, real-world urban graphs. The approach serves as a bridge between the latent world knowledge of web-scale MLLMs and the demands of practical, step-wise environmental interaction. Current results indicate that simply adding memory modules is insufficient for near-perfect navigation; dramatic performance gains are realized only when the agent is compelled to verbalize the destination, current location, and end-to-end path structure at every step.

No published evaluation to date has reported negative transfer from VoP to other contextual reasoning tasks, but further investigation in more diverse, less visually structured environments is warranted. A plausible implication is that VoP-like explicit verbalization and grounding architectures may play a central role in bridging the gap between open-ended LLMs and robust, real-world sequential decision-making agents.

References: (Dalal et al., 17 Dec 2025, Schumann et al., 2023)
