DreamNav: Zero-Shot Vision-Language Navigation

Updated 21 September 2025
  • DreamNav is a trajectory-centric, zero-shot vision-language navigation framework that integrates egocentric perception correction, diffusion-based trajectory planning, and imagination-driven anticipatory planning.
  • Its egocentric perception corrector employs macro and micro adjustments to realign views and enhance semantic alignment with language instructions.
  • The framework achieves state-of-the-art navigation performance by fusing trajectory-level planning with proactive imagination, setting new benchmarks under zero-shot VLN protocols.

DreamNav is a trajectory-centric, imaginative framework for zero-shot vision-and-language navigation (VLN) in continuous environments. It unifies egocentric perception correction, trajectory-level planning using diffusion policies, and anticipatory planning through active rollout imagination. This architecture is designed to improve the semantic alignment of actions to language instructions, reduce the cost of perception, and support long-horizon, proactive navigation with minimal reliance on panoramic sensing, thus setting new state-of-the-art results under zero-shot VLN protocols (Wang et al., 14 Sep 2025).

1. Egocentric Perception Correction

The EgoView Corrector is a hierarchical perception realignment module that stabilizes and reorients the agent’s view to maximize semantic alignment with instructions and navigable affordances. It has two nested controllers:

  • Macro-Adjust Expert:

Operates at episode initialization and under coarse error conditions. It selects coarse rotations (e.g., repeated 90° turns) when objects or landmarks mentioned in the instruction L are occluded in the egocentric observation Iₜᴿᴳᴮ. The alignment is based on semantic reasoning that compares observation content with instruction-derived cues.

  • Micro-Adjust Controller:

Refines the agent’s heading after trajectory execution to correct drift-induced misalignments. It segments the walkable area in Iₜᴿᴳᴮ using CLIP-prompted FastSAM to form a binary mask 𝑀̃ₜ over the image lattice Ω and computes the mask’s occupancy dₜ. If dₜ falls below a threshold θ, it compares the normalized occupancy of the left (Ωᴸ) and right (Ωᴿ) halves:

$$u_t = \mathrm{sign}\left(\frac{|\tilde{M}_t \cap \Omega^L|}{|\Omega^L|} - \frac{|\tilde{M}_t \cap \Omega^R|}{|\Omega^R|}\right)$$

Based on uₜ, a corrective 30° turn (left or right) is executed. Adjustments repeat until dₜ = 0 or a maximum number of adjustments is reached (a minimal code sketch of this rule is given after the summary below).

This two-stage correction pipeline ensures the agent consistently faces high-affordance, instruction-consistent views, directly addressing the instability of egocentric sensors.
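
As an illustration, the micro-adjustment rule can be sketched as follows. The walkable-area mask is assumed to have been produced already (e.g., by CLIP-prompted FastSAM), and the threshold value and turn-direction convention are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np

def micro_adjust_direction(walkable_mask: np.ndarray, theta: float = 0.5) -> int:
    """Decide whether a corrective 30-degree turn is needed.

    walkable_mask: binary (H, W) array, 1 where a pixel is walkable.
    theta: occupancy threshold below which a correction is triggered
           (illustrative value, not taken from the paper).
    Returns +1 (turn toward the left half), -1 (toward the right half), or 0 (no turn).
    """
    _, w = walkable_mask.shape
    d_t = walkable_mask.mean()              # occupancy of the mask over the lattice
    if d_t >= theta:
        return 0                            # view already faces enough walkable space

    left = walkable_mask[:, : w // 2]       # Omega^L
    right = walkable_mask[:, w // 2:]       # Omega^R
    # u_t = sign(|M ∩ Omega^L| / |Omega^L| - |M ∩ Omega^R| / |Omega^R|)
    return int(np.sign(left.mean() - right.mean()))

# In practice the controller applies 30° turns repeatedly, re-segmenting the new
# view each time, until no correction is needed or a maximum count is reached.
```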

2. Trajectory-Level Planning via Diffusion Policy

The Trajectory Predictor advances from point-level action selection to trajectory-level planning, delivering a globally aligned sequence of control commands that better match the semantics of natural language instructions.

  • Feature Extraction:

Egocentric RGB-D observations Iₜ are encoded into RGB (f_RGB-ViT) and depth (f_DEP-ViT) features, which are concatenated, fused by a transformer decoder G_dec, and then encoded by G_enc:

$$v_t = G_{dec}\left(f_{\mathrm{RGB-ViT}}\left(I_t^{\mathrm{RGB}}\right) \,\|\, f_{\mathrm{DEP-ViT}}\left(I_t^{\mathrm{depth}}\right)\right)$$

$$k_t = G_{enc}\left(v_t\right)$$

  • Multimodal Trajectory Generation:

A conditional U-Net ϵ_θ with DDPM sampling denoises from Gaussian noise to produce a 24-step trajectory $\tau = \{\Delta x_t, \Delta y_t, \Delta \mathrm{yaw}_t\}_{t=1}^{24}$:

$$P_{t-1} = \alpha \left(P_{t} - \gamma \cdot \epsilon_\theta(k_t, P_t, t) + \mathcal{N}(0, \sigma^2)\right)$$

  • Trajectory Filtering:

Filters candidate trajectories for diversity using the average per-step Euclidean distance and a farthest-first (max–min) selection mechanism (both the sampling and filtering steps are sketched in code below):

$$d\left(\tau^i, \tau^j\right) = \frac{1}{T} \sum_{t=1}^{T} \left\|\left[\Delta x_t^i - \Delta x_t^j,\; \Delta y_t^i - \Delta y_t^j\right]\right\|_2$$

A compact, diverse set (defined by the Candidate Trajectory Number, CTN) is retained for subsequent selection.

This approach enables efficient sampling of semantically distinct, instruction-aligned long-horizon plans, better reflecting the intent of complex language instructions than pointwise policies.
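
A minimal sketch of these two steps follows, assuming the fused observation embedding kₜ has already been computed. The noise predictor eps_theta, the constants alpha, gamma, and sigma, and the candidate count are placeholders standing in for the paper's trained conditional U-Net and its schedules, so this illustrates the control flow rather than the actual model:

```python
import numpy as np

T = 24           # trajectory horizon: one (Δx, Δy, Δyaw) triple per step, as in the paper
N_STEPS = 50     # illustrative number of reverse-diffusion steps

def eps_theta(k_t, P, step):
    """Placeholder for the conditional U-Net noise predictor ε_θ(k_t, P_t, t)."""
    return np.zeros_like(P)  # a trained model would predict the injected noise here

def sample_trajectory(k_t, rng, alpha=0.99, gamma=0.1, sigma=0.05):
    """Reverse diffusion following P_{t-1} = α (P_t − γ ε_θ(k_t, P_t, t) + N(0, σ²))."""
    P = rng.standard_normal((T, 3))                       # start from Gaussian noise
    for step in range(N_STEPS, 0, -1):
        noise = rng.normal(0.0, sigma, size=P.shape) if step > 1 else 0.0
        P = alpha * (P - gamma * eps_theta(k_t, P, step) + noise)
    return P

def pairwise_distance(tau_i, tau_j):
    """Average per-step Euclidean distance over the (Δx, Δy) components."""
    return np.linalg.norm(tau_i[:, :2] - tau_j[:, :2], axis=1).mean()

def farthest_first(trajectories, ctn):
    """Max–min (farthest-first) selection of a diverse subset of size CTN."""
    ctn = min(ctn, len(trajectories))
    selected = [0]                                        # seed with the first candidate
    while len(selected) < ctn:
        # pick the candidate whose nearest already-selected neighbour is farthest away
        best_i, best_d = None, -1.0
        for i in range(len(trajectories)):
            if i in selected:
                continue
            d = min(pairwise_distance(trajectories[i], trajectories[j]) for j in selected)
            if d > best_d:
                best_i, best_d = i, d
        selected.append(best_i)
    return [trajectories[i] for i in selected]

# Usage sketch: sample many candidates, keep a compact diverse set for imagination.
rng = np.random.default_rng(0)
k_t = None                                               # fused RGB-D embedding (placeholder)
candidates = [sample_trajectory(k_t, rng) for _ in range(32)]
diverse = farthest_first(candidates, ctn=5)
```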

3. Imagination-Driven Anticipatory Planning

The Imagination Predictor integrates proactive reasoning by simulating the future sensory aftermath of candidate trajectories:

  • Dream Walker Module:

For each predicted trajectory, the module employs a generative world model trained on large-scale, multi-view data to render a rollout sequence:

$$\mathcal{W}: \left(I_t^{\mathrm{RGB}}, C_{0:\mathrm{IRL}}\right) \rightarrow V^{(\mathrm{IRL})} = \left(V_1^{\mathrm{RGB}}, \ldots, V_{\mathrm{IRL}}^{\mathrm{RGB}}\right)$$

where C is the sequence of relative camera poses and IRL is the imagination horizon.

  • Narration Expert:

Converts raw visual rollouts into succinct, instruction-relevant text summaries using targeted prompts (e.g., summarizing walking direction, encountered landmarks, or spatial layout). These summaries are then scored for alignment with the task objective, allowing the Navigation Manager to select the rollout best matched to the instruction (an interface sketch is given below).

This mechanism explicitly incorporates imagination into policy selection, enabling the agent to anticipate consequences and plan accordingly—analogous to “mental simulation” in biological systems.
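
The wiring of these components for a single candidate trajectory can be sketched at the interface level as below. The callables world_model, narrate, and score_alignment are hypothetical stand-ins for the pretrained generative world model, the prompted narrator, and the Navigation Manager's alignment scoring; they are not APIs from the paper:

```python
from typing import Callable, List, Sequence, Tuple

def imagine_and_score(
    rgb_frame,                       # current egocentric observation I_t^RGB
    relative_poses: Sequence,        # camera poses C_0:IRL along one candidate trajectory
    instruction: str,                # the natural-language instruction L
    world_model: Callable,           # hypothetical: (frame, poses) -> imagined frames V_1..V_IRL
    narrate: Callable,               # hypothetical: frames -> short instruction-relevant summary
    score_alignment: Callable,       # hypothetical: (summary, instruction) -> alignment score
) -> Tuple[float, str]:
    """Roll out one candidate trajectory in imagination and score its narration."""
    imagined_frames: List = world_model(rgb_frame, relative_poses)
    summary = narrate(imagined_frames)   # e.g. "walk forward, pass the sofa on the left ..."
    return score_alignment(summary, instruction), summary

def select_trajectory(rgb_frame, candidates, poses_per_candidate, instruction,
                      world_model, narrate, score_alignment):
    """Navigation-Manager-style selection: keep the candidate whose imagined
    rollout is best aligned with the instruction."""
    scored = []
    for trajectory, poses in zip(candidates, poses_per_candidate):
        score, _ = imagine_and_score(rgb_frame, poses, instruction,
                                     world_model, narrate, score_alignment)
        scored.append((score, trajectory))
    return max(scored, key=lambda pair: pair[0])[1]
```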

4. Zero-Shot Protocol and Performance Results

DreamNav is evaluated under strict zero-shot VLN-CE protocols, using only egocentric RGB-D input without explicit panoramic or map context. Performance is measured primarily using:

  • Success Rate (SR): fraction of episodes in which the goal is reached within a set distance (e.g., 3 m).
  • Success weighted by Path Length (SPL): success weighted by the ratio of the shortest-path length to the longer of the executed-path length and the shortest-path length.
  • Additional metrics: Trajectory Length (TL), Navigation Error (NE), Oracle Success Rate (OSR).
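
For concreteness, SR and SPL can be computed per episode as in the short sketch below; this follows the standard definitions of these metrics rather than any DreamNav-specific code, and the example numbers are made up:

```python
def success_rate(successes):
    """SR: fraction of episodes in which the agent stops within the goal radius (e.g., 3 m)."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """SPL: success weighted by shortest_length / max(path_length, shortest_length)."""
    per_episode = [
        s * (l / max(p, l))
        for s, l, p in zip(successes, shortest_lengths, path_lengths)
    ]
    return sum(per_episode) / len(per_episode)

# Three illustrative episodes: the first two succeed, the second takes a long detour.
print(success_rate([1, 1, 0]))                                   # ≈ 0.667
print(spl([1, 1, 0], [10.0, 8.0, 12.0], [11.0, 16.0, 5.0]))      # ≈ 0.47
```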

On R2R-CE, DreamNav outperforms the strongest panoramic baseline (InstructNav; +1.79% SR, +4.95% SPL) and the best egocentric baseline (CA-Nav; +7.49% SR, +18.15% SPL), even though CA-Nav uses extra odometry and panoramic information. In real-world tests, DreamNav’s success rate exceeds that of prior baselines (e.g., 12/20 successes vs. 6/20 for CA-Nav).

5. Comparison to Conventional and Contemporary Baselines

DreamNav demonstrates several distinguishing advantages relative to existing zero-shot VLN approaches:

| Aspect | Conventional Baselines | DreamNav Approach |
|---|---|---|
| Perception | Panoramic or egocentric | Egocentric with EgoView Corrector |
| Action Granularity | Point-level | Trajectory-level (global planning) |
| Foresight/Imagination | Passive (none) | Active Imagination Predictor |
| Planning Horizon | Short-sighted | Long-horizon, instruction-aligned |
| Zero-shot Capability | Varies | Unified, strong zero-shot SOTA |

Unlike methods that rely on expensive panoramic inputs or passive scene understanding—often resulting in high computational and sensory costs or short-sighted planning—DreamNav leverages low-cost egocentric inputs and fuses proactive imagination with trajectory-level expert planning.

6. System Implications and Prospective Directions

DreamNav’s unified imaginative-perceptual framework demonstrates that long-horizon, semantically guided navigation can be achieved efficiently and robustly with only egocentric vision and without specialized retraining on navigation tasks. This suggests that future embodied AI agents can exploit pretrained foundation models for perception, reasoning, and imagination in a zero-shot regime, significantly lowering deployment costs and data requirements.

A plausible implication is that proactive, trajectory-level planning with integrated imagination may enable robotics and assistive agents to generalize more robustly in the wild. The synthesis of corrective perception, multimodal trajectory generation, and active world modeling sets a strong precedent for further research on scalable, data-efficient, and generalizable navigation policies in continuous and real-world environments.

7. Limitations and Extensions

While DreamNav achieves strong empirical results, several limitations are noted:

  • Imagination quality, especially visual rollouts, depends on the fidelity and generalizability of the world model. Errors in prediction or misalignment with rare or out-of-distribution scenarios may affect downstream performance.
  • The hierarchical perception correctors may incur latency in ambiguous or highly dynamic scenes.
  • The diffusion-based trajectory sampling, coupled with rollout imagination, can become computationally expensive if the number of candidate trajectories or rollout horizon is large.

A proposed extension is to combine imagination-driven reward modeling with improved memory management, augmenting the framework with inverse reinforcement learning to estimate rewards for unseen states, as explored in foundational generative RL work (Andersen et al., 2018). This could further enhance sample efficiency and robustness.


DreamNav represents the synthesis of egocentric perception stabilization, global trajectory policy generation, and proactive imagination, setting a state-of-the-art foundation for zero-shot vision-and-language navigation under continuous, real-world constraints (Wang et al., 14 Sep 2025).
