
Vision-and-Language Navigation in Continuous Environments

Updated 21 September 2025
  • VLN-CE is defined as autonomous navigation using natural language in continuous, photorealistic 3D environments with egocentric observations and incremental motion commands.
  • Contemporary modeling approaches fuse multimodal representations, spatial memory, and hierarchical planning to overcome challenges like localization uncertainty and collision avoidance.
  • Advances in VLN-CE have led to robust embodied agents that integrate world modeling and lookahead planning, significantly improving success rates in unseen settings.

Vision-and-Language Navigation in Continuous Environments (VLN-CE) is a foundational embodied AI problem in which autonomous agents interpret natural language instructions and execute sequences of low-level actions to navigate through realistic, continuous 3D environments. Formalized to address the limitations of earlier discrete nav-graph settings (where navigation was reduced to transitions among a network of panoramic nodes), VLN-CE eliminates strong structural assumptions such as perfect localization, known connectivity graphs, and oracle subgoal transitions. Instead, agents rely solely on egocentric, partial-view RGB-D observations to plan and act, directly confronting the challenges of spatial ambiguity, high-frequency low-level control, collision avoidance, and accumulating uncertainty over long trajectories. This paradigm shift from graph-based abstraction to continuous state and perception spaces has catalyzed a substantial body of research into robust, generalizable, and semantically grounded navigation policies, as exemplified in leading works (Krantz et al., 2020, Hong et al., 2022, An et al., 2023, Wang et al., 2023, Wang et al., 2 Apr 2024, Zhang et al., 19 Aug 2024, Dai et al., 25 Nov 2024, Chen et al., 13 Dec 2024, Shi et al., 13 Mar 2025, Yue et al., 14 Apr 2025).

1. Defining the VLN-CE Paradigm

VLN-CE is characterized by the agent’s requirement to follow natural language route instructions in photorealistic, physics-based simulation environments such as Habitat+Matterport3D, executing a sequence of motion primitives (e.g., move forward 0.25 m, turn left 15°) from arbitrary, continuous locations. The continuous setting lifts key constraints imposed by discrete nav-graph paradigms:

  • No known environment topology: The agent receives no a priori navigation graph or spatial map.
  • No “perfect” localization: Localization must emerge from onboard proprioception and egocentric visual cues.
  • No “teleportation” between subgoals: Navigation is accomplished through a series of incremental, low-level actions subject to physical constraints (collision, occlusions, motion noise).
  • No oracle stopping criteria: The agent alone must determine both path-following and task completion.

Table 1 summarizes the fundamental contrasts:

| Feature | Discrete Nav-Graph VLN | Continuous VLN-CE |
| --- | --- | --- |
| Topology | Predefined graph | Unknown, unconstrained |
| Control | High-level hops between nodes | Low-level motion primitives |
| Localization | Oracle, global | Egocentric, partial |
| Observations | 360° panoramic views + map | Narrow FOV, noisy |
| Success criterion | Stopping at a graph node | Free stopping, metric threshold |

This conceptual redefinition exposes the brittleness of previously reported VLN progress and grounds the setting in the realities facing real-world mobile robotics (Krantz et al., 2020).
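
The resulting agent-environment contract can be summarized as a short interaction loop. The sketch below is illustrative only: the `env` and `agent` objects, their methods (`reset`, `step`, `act`, `get_metrics`), and the observation keys are hypothetical stand-ins for a Habitat-style simulator, and the action set mirrors the low-level primitives described above.

```python
# Minimal sketch of a VLN-CE episode loop. The `env` and `agent` interfaces are
# hypothetical stand-ins for a Habitat-style simulator and a navigation policy;
# only the structure of the loop is the point here.

ACTIONS = ["STOP", "MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT"]  # e.g., 0.25 m / 15 deg steps

def run_episode(env, agent, max_steps=500):
    obs = env.reset()                                   # egocentric RGB-D + instruction
    agent.reset(obs["instruction"])
    for _ in range(max_steps):
        action = agent.act(obs["rgb"], obs["depth"])    # one low-level primitive per step
        if action == "STOP":                            # the agent alone decides task completion
            break
        obs = env.step(action)                          # physics may block motion (collisions)
    return env.get_metrics()                            # e.g., NE, SR, SPL for this episode
```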

2. Modeling Approaches and Fusion Architectures

Contemporary VLN-CE agents combine advances in multimodal representation learning, spatial memory, cross-modal attention, topological abstraction, and world modeling to meet the challenges imposed by continuous operation.

Seq2Seq and Cross-Modal Attention Models:

Significant baselines include recurrent sequence-to-sequence models that integrate mean-pooled RGB/depth features and a language-encoded state vector, with action selection governed by a recurrent policy: $h_t^{(a)} = \mathrm{GRU}([\bar{v}_t, \bar{d}_t, s], h_{t-1}^{(a)})$ and $a_t = \arg\max_a \mathrm{softmax}(W_a h_t^{(a)} + b_a)$ (Krantz et al., 2020). Cross-modal attention introduces dynamic alignment, with the perception and language streams attending over each other: $\hat{s}_t = \mathrm{Attn}(S, h_t^{(\mathrm{attn})})$ and $\hat{v}_t, \hat{d}_t = \mathrm{Attn}(\mathcal{V}_t / \mathcal{D}_t, \hat{s}_t)$. These early models enable ablations that isolate the impact of individual modalities (vision vs. language vs. depth).
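
A minimal PyTorch sketch of this recurrent policy is given below; the feature dimensions, the four-way action space, and the module names are illustrative assumptions rather than the baseline's exact configuration.

```python
import torch
import torch.nn as nn

class Seq2SeqPolicy(nn.Module):
    """Sketch of the recurrent baseline: mean-pooled visual (v_bar) and depth
    (d_bar) features plus an instruction encoding s are concatenated, fed to a
    GRU cell, and a linear head scores the low-level actions."""

    def __init__(self, vis_dim=512, depth_dim=128, instr_dim=256,
                 hidden_dim=512, num_actions=4):
        super().__init__()
        self.gru = nn.GRUCell(vis_dim + depth_dim + instr_dim, hidden_dim)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, v_bar, d_bar, s, h_prev):
        # h_t = GRU([v_bar_t, d_bar_t, s], h_{t-1})
        h = self.gru(torch.cat([v_bar, d_bar, s], dim=-1), h_prev)
        logits = self.action_head(h)   # a_t = argmax softmax(W_a h_t + b_a)
        return logits, h

# Illustrative usage with random features for a batch of one
policy = Seq2SeqPolicy()
h = torch.zeros(1, 512)
logits, h = policy(torch.randn(1, 512), torch.randn(1, 128), torch.randn(1, 256), h)
action = logits.argmax(dim=-1)
```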

Semantic Mapping and Memory Augmented Models:

More advanced models (e.g., SASRA (Irshad et al., 2021)) introduce semantic top-down mapping using projective fusion of RGB-D data and semantic segmentation, creating spatially grounded memory modules. These are aligned with syntactic instruction structure via transformer-based cross-modal attention, and temporal context is maintained through recurrent processing and positional encoding. The architecture fuses spatially distributed semantic representations with linguistic context, achieving significant gains in SPL (+22%) in unseen environments.
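
A hedged sketch of the projective-fusion step is shown below: depth back-projects each pixel into the camera frame, and its semantic label is scattered into an egocentric top-down grid. Grid size, resolution, field of view, and class count are illustrative assumptions, not SASRA's actual parameters.

```python
import numpy as np

def egocentric_semantic_map(depth, semantics, hfov_deg=90.0,
                            map_size=64, cell_m=0.1, num_classes=27):
    """Project per-pixel semantic labels into a top-down egocentric grid.
    depth: (H, W) metres; semantics: (H, W) integer class ids."""
    h, w = depth.shape
    f = 0.5 * w / np.tan(0.5 * np.deg2rad(hfov_deg))       # focal length (pixels)
    us, _ = np.meshgrid(np.arange(w), np.arange(h))
    x = (us - w / 2.0) * depth / f                         # lateral offset (m)
    z = depth                                              # forward distance (m)

    col = (x / cell_m + map_size / 2).astype(int)          # lateral cell index
    row = (map_size - 1 - z / cell_m).astype(int)          # forward cell index (agent at bottom row)
    valid = (depth > 0) & (row >= 0) & (row < map_size) & (col >= 0) & (col < map_size)

    grid = np.zeros((map_size, map_size, num_classes), dtype=np.float32)
    np.add.at(grid, (row[valid], col[valid], semantics[valid]), 1.0)  # accumulate class counts
    return grid
```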

Candidate Waypoint Predictors and Hierarchical Planning:

To reconcile the “discrete-to-continuous” gap, candidate waypoint predictors (as in (Hong et al., 2022)) process panoramic RGB-D to produce a heatmap $H_{\text{pred}}(\theta, d)$ over local egocentric coordinates, with high-quality supervision provided by refined connectivity graphs. This enables high-level “discretization” within continuous space, yielding large SPL improvements (e.g., +18.24% over low-level policies). Hierarchical planners (e.g., ETPNav (An et al., 2023)) build online topological maps of traversed space, using transformer-based cross-modal reasoning and a trial-and-error low-level controller to avoid obstacles and deadlocks.
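
The sketch below illustrates how candidate waypoints could be read out of such a heatmap: greedy peak selection over the polar bins with a simple angular suppression window, converted to egocentric metric offsets. The bin layout, range, and suppression width are illustrative assumptions.

```python
import numpy as np

def select_waypoints(heatmap, k=5, max_dist_m=3.0, suppress_bins=2):
    """Pick k candidate waypoints from a polar heatmap H[theta_bin, dist_bin]
    by greedy peak selection with angular suppression; returns egocentric
    (x, y) offsets in metres."""
    angle_bins, dist_bins = heatmap.shape
    H = heatmap.astype(float).copy()
    waypoints = []
    for _ in range(k):
        a_bin, d_bin = np.unravel_index(np.argmax(H), H.shape)
        theta = 2.0 * np.pi * a_bin / angle_bins              # heading angle (rad)
        dist = max_dist_m * (d_bin + 0.5) / dist_bins          # range (m)
        waypoints.append((dist * np.cos(theta), dist * np.sin(theta)))
        for a in range(a_bin - suppress_bins, a_bin + suppress_bins + 1):
            H[a % angle_bins, :] = -np.inf                     # suppress nearby headings
    return waypoints
```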

3. World Modeling, Lookahead, and Mental Simulation

Recent advances focus on explicit world models and future anticipation to address exploration and partial observability.

Mental Planning and World-Model Agents:

DREAMWALKER (Wang et al., 2023) constructs an episodic environment graph augmented with topological/geometric encodings and a scene synthesizer that predicts plausible future observations from novel viewpoints. Action selection is performed by mental planning using MCTS over the internal model, guided by distance functions parameterized via graph attention conditioned on instructions. This yields interpretable, globally-informed trajectories and superior navigation metrics (+5–7% SR over state-of-the-art).
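
As a rough intuition for model-based action selection, the sketch below performs a greedy lookahead over imagined observations; it is a heavily simplified stand-in for the MCTS-based mental planning described above, and `world_model.imagine` and `value_fn` are hypothetical interfaces.

```python
def mental_plan(world_model, state, candidate_actions, value_fn, depth=3):
    """Greedy lookahead over imagined futures (a simplification of MCTS-style
    mental planning): score each first action by rolling the world model
    forward greedily for `depth` imagined steps."""
    best_action, best_value = None, float("-inf")
    for first_action in candidate_actions:
        s = world_model.imagine(state, first_action)   # synthesize a plausible next state
        total = value_fn(s)
        for _ in range(depth - 1):
            a = max(candidate_actions,
                    key=lambda b: value_fn(world_model.imagine(s, b)))
            s = world_model.imagine(s, a)
            total += value_fn(s)
        if total > best_value:
            best_action, best_value = first_action, total
    return best_action
```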

Neural Radiance and Future View Synthesis:

The HNR model (Wang et al., 2 Apr 2024) extends this “lookahead” paradigm by representing the agent’s perceptual knowledge as a 3D feature cloud, encoding both regional and panoramic semantic structure via hierarchical volume rendering. For a candidate action, the agent predicts region-level latent features $R_{h,w}$ and aggregates these for path evaluation. High gains in SR and SPL (e.g., a 4% improvement on val-unseen) demonstrate the value of efficient, robust semantic anticipation of unseen space.

3D Gaussian Splatting and Joint Appearance-Semantic Fusion:

UnitedVLN (Dai et al., 25 Nov 2024) unifies two rendering paths—fast 3DGS-based photorealistic synthesis and NeRF-based semantic rendering—via “search-then-query” neural primitive selection and separate-then-united fusion. The approach fuses panoramic RGB and feature embeddings through cross-attention, producing a rich, robust representation for downstream navigation, yielding improved NE, SR, and SPL.
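
A minimal sketch of cross-attention fusion between two rendered feature streams is given below, assuming appearance tokens (from the photorealistic path) attend over semantic feature tokens; the dimensions, merge rule, and class names are illustrative, not UnitedVLN's implementation.

```python
import torch
import torch.nn as nn

class AppearanceSemanticFusion(nn.Module):
    """Sketch of 'separate-then-united' fusion: appearance tokens query
    semantic tokens via cross-attention, then the two streams are merged."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, appearance_tokens, semantic_tokens):
        # appearance queries attend to semantic keys/values
        attended, _ = self.cross_attn(appearance_tokens, semantic_tokens, semantic_tokens)
        fused = self.merge(torch.cat([appearance_tokens, attended], dim=-1))
        return fused   # one token sequence for downstream candidate scoring
```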

4. Specialized Modules: Collision Avoidance, Error Handling, and Continual Learning

VLN-CE systems increasingly incorporate explicit mechanisms for operational safety, robustness to imperfect language, and lifelong adaptation.

Collision Avoidance:

Safe-VLN (Yue et al., 2023) augments waypoint prediction with 2D LiDAR-derived occupancy masking, $H^*_t = \mathrm{norm}(H_t + \delta\,M_t)$, penalizing candidate locations that fall within obstacles. A re-selection navigator maintains backup candidates, rapidly recovering from collisions via a multi-waypoint selection and replanning strategy. This design significantly reduces collision rates and improves success rates in benchmark evaluations.
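
A hedged sketch of the masking step: an occupancy grid aligned with the waypoint bins penalizes obstructed candidates before renormalization. The sign and scale of the penalty and the normalization are illustrative choices, not Safe-VLN's exact values.

```python
import numpy as np

def mask_waypoint_heatmap(heatmap, occupancy, delta=-10.0):
    """Apply occupancy-aware masking, H*_t = norm(H_t + delta * M_t):
    bins marked obstructed (occupancy == 1) are penalized (delta < 0 here),
    then the heatmap is renormalized to sum to one."""
    penalized = heatmap + delta * occupancy
    penalized = penalized - penalized.min()          # shift to non-negative values
    total = penalized.sum()
    return penalized / total if total > 0 else penalized
```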

Instruction Error Detection and Localization:

The limitation of assuming perfect language is addressed in (Taioli et al., 15 Mar 2024), which injects synthetic instruction errors and defines detection/localization tasks. A cross-modal transformer model fuses trajectory and language features, producing token-level error predictions and achieving high AUC and precise localization under real-world ambiguities.
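
A compact sketch of token-level error prediction is shown below: instruction and trajectory token features are concatenated, fused by a small transformer encoder, and each instruction token receives an error logit. Layer sizes and the fusion scheme are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class InstructionErrorLocalizer(nn.Module):
    """Fuse instruction and trajectory token features and score each
    instruction token as erroneous or not."""

    def __init__(self, dim=256, layers=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.token_head = nn.Linear(dim, 1)            # per-token error logit

    def forward(self, instr_tokens, traj_tokens):
        n_instr = instr_tokens.size(1)
        fused = self.encoder(torch.cat([instr_tokens, traj_tokens], dim=1))
        return self.token_head(fused[:, :n_instr]).squeeze(-1)   # logits per instruction token
```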

Continual and Zero-Shot Learning:

The CVLN (Jeong et al., 22 Mar 2024) paradigm exposes catastrophic forgetting when agents train sequentially on domain streams. Perplexity Replay and Episodic Self-Replay rehearsals balance decisions across evolving distributions. Constraint-aware zero-shot navigation (Chen et al., 13 Dec 2024) introduces a sub-instruction manager and semantic value mapping, guiding progression and local value refinement in training-free regimes, outperforming previous methods by 12–13% SR in unseen splits.

5. Emerging Topics: Cognitive Memory, LLM-based Planning, and Self-Evolving World Models

Advances in externalized memory representations and integration of LLMs are defining the next generation of VLN-CE systems.

Cognitive Mapping and Reflective Planning:

Cog-GA (Li et al., 4 Sep 2024) constructs a structured, graph-based cognitive map combining spatial–temporal labeling with dual-channel (“what,” “where”) scene descriptions. Waypoint selection and reflection-driven memory updates enable interpretable, human-like planning, yielding state-of-the-art SR (48%) and OSR (59%).

LVLMs for End-to-End Navigation:

Frameworks such as VLN-R1 (Qi et al., 20 Jun 2025) and modular plug-and-play systems (Duan et al., 11 Jun 2025) leverage LVLMs (e.g., Qwen2.5-VL-7B), integrating vision-language perception with lightweight, decoupled planners. Reinforcement fine-tuning with time-decayed rewards promotes long-horizon action selection, with Group Relative Policy Optimization (GRPO) for efficient, critic-free training. Comprehensive prompt engineering and historical context encoding enable action sequence generation in continuous space, closely matching or exceeding supervised large-model baselines.
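
Two of these ingredients can be sketched compactly, under the assumption of an exponential decay schedule and the standard group-relative advantage used by GRPO; neither is VLN-R1's exact formulation.

```python
import numpy as np

def time_decayed_return(step_rewards, decay=0.9):
    """Later prediction steps contribute with exponentially decayed weight
    (the schedule is illustrative)."""
    return sum((decay ** k) * r for k, r in enumerate(step_rewards))

def grpo_advantages(group_returns, eps=1e-8):
    """Group Relative Policy Optimization scores each sampled trajectory
    relative to its group: (return - group mean) / group std, with no
    learned critic."""
    r = np.asarray(group_returns, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```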

Self-Evolving World Models:

NavMorph (Yao et al., 30 Jun 2025) proposes self-evolving variational latent world models, trained by maximizing an evidence lower bound (ELBO, obtained via Jensen’s inequality) and adapted on-policy in a no-regret fashion (Ross et al., 2011). Through compact scene representations and a contextual evolution memory, agents dynamically update their environmental models and navigation behaviors as the scene evolves.
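
For reference, a generic sequential-latent ELBO of the kind such world models maximize is written below; the notation (observations $o_t$, actions $a_t$, latents $z_t$) is illustrative rather than NavMorph's exact objective.

```latex
% Generic sequential-latent ELBO (illustrative, not NavMorph's exact objective):
\log p_\theta(o_{1:T} \mid a_{1:T}) \;\ge\; \sum_{t=1}^{T}
  \mathbb{E}_{q_\phi(z_t \mid o_{\le t},\, a_{<t})}\!\bigl[\log p_\theta(o_t \mid z_t)\bigr]
  \;-\; \mathrm{KL}\!\bigl(q_\phi(z_t \mid o_{\le t},\, a_{<t}) \,\|\, p_\theta(z_t \mid z_{t-1}, a_{t-1})\bigr)
```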

6. Evaluation Protocols, Metrics, and Limitations

VLN-CE evaluation employs rigorously defined metrics:

| Metric | Description |
| --- | --- |
| Navigation Error (NE) | Final distance to the goal (metres) |
| Success Rate (SR) | Fraction of episodes ending within a threshold (e.g., ≤ 3 m) of the goal |
| Oracle Success Rate (OSR) | SR under an idealized (oracle) stopping decision |
| SPL | SR weighted by the ratio of shortest-path length to actual path length |
| nDTW / SDTW (RxR-CE) | Trajectory fidelity via (success-weighted) dynamic time warping |
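
As a concrete reference, SR and SPL can be computed from per-episode results as in the sketch below; the field names of the episode records are illustrative assumptions, while the SPL weighting follows the standard shortest-path-over-taken-path definition.

```python
def success_and_spl(episodes, success_dist_m=3.0):
    """Compute SR and SPL over episode records, each assumed to be a dict with
    'dist_to_goal' (final metres to goal), 'shortest_path', and 'agent_path'
    lengths in metres (field names are illustrative)."""
    sr_sum, spl_sum = 0.0, 0.0
    for ep in episodes:
        success = 1.0 if ep["dist_to_goal"] <= success_dist_m else 0.0
        sr_sum += success
        # SPL: success weighted by shortest-path / max(agent-path, shortest-path)
        spl_sum += success * ep["shortest_path"] / max(ep["agent_path"], ep["shortest_path"])
    n = len(episodes)
    return sr_sum / n, spl_sum / n
```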

Major findings highlight significantly lower absolute performance for all models in continuous settings vs. discrete graph settings, pronounced sensitivity to language and perceptual noise, and high variance in out-of-distribution generalization (Krantz et al., 2020, Taioli et al., 15 Mar 2024). Despite rapid progress, limitations remain regarding long-horizon exploration, elevation changes, and multi-agent robustness.

7. Significance and Research Impact

VLN-CE has catalyzed:

  • The development of multi-stage, multi-modal reasoning systems that jointly optimize for language grounding, spatial planning, and safe actuation.
  • A shift toward embodied world models and explicit future prediction, as opposed to purely myopic policies.
  • Emergence of modular and scalable pipelines leveraging foundation models, structured priors, and self-supervised adaptation.
  • Real-world robot deployments demonstrating effective zero-shot transfer and online adaptation, with benchmarks such as R2R-CE and RxR-CE as community standards.

Current research trajectories aim to integrate richer environmental models, augment robustness to linguistic/perceptual uncertainty, and realize persistent, adaptive agents capable of lifelong learning and scalable deployment in unconstrained 3D spaces.


References

  • Krantz et al., 2020
  • Irshad et al., 2021
  • Hong et al., 2022
  • Krantz et al., 2022
  • Wang et al., 2023
  • An et al., 2023
  • Wang et al., 2023
  • Yue et al., 2023
  • Taioli et al., 15 Mar 2024
  • Jeong et al., 22 Mar 2024
  • Wang et al., 2 Apr 2024
  • Zhang et al., 19 Aug 2024
  • Li et al., 4 Sep 2024
  • Dai et al., 25 Nov 2024
  • Chen et al., 13 Dec 2024
  • Shi et al., 13 Mar 2025
  • Yue et al., 14 Apr 2025
  • Duan et al., 11 Jun 2025
  • Qi et al., 20 Jun 2025
  • Yao et al., 30 Jun 2025
