ImagineNav++: Mapless VLM Navigation

Updated 4 July 2026

ImagineNav++ is a mapless, imagination-driven framework for open-vocabulary indoor navigation that leverages a frozen vision-language model to predict future observations.
The system reformulates long-horizon navigation as an image selection problem by synthesizing multiple candidate views and using compact visual memory to guide decision-making.
Key modules like Where2Imagine and PolyOculus generate human-like trajectory predictions and imagined views, enhancing navigation efficiency and success in unseen indoor environments.

ImagineNav++ is a mapless, imagination-driven framework for open-vocabulary goal-oriented visual navigation that prompts a frozen vision-language model (VLM) with imagined future observations and a compact visual memory, using only onboard RGB/RGB-D streams and without fine-tuning the VLM (Wang et al., 19 Dec 2025). It addresses both category-level ObjectNav and instance-level navigation specified by a reference image, reformulating long-horizon navigation in previously unseen indoor environments as a sequence of visually grounded best-view selection decisions rather than explicit map construction or text-only spatial planning.

1. Problem setting and design motivation

ImagineNav++ targets indoor navigation tasks in which the robot must search for an arbitrary object goal specified either by a category label, as in ObjectNav, or by a reference image, as in InsINav (Wang et al., 19 Dec 2025). The robot starts from a random pose in a previously unseen environment, has no ground-truth 3D coordinates of the goal, and acts through discrete primitive actions such as MoveAhead, TurnLeft, and Stop. The system is explicitly open-vocabulary: target categories or instances may not have been seen during training.

The framework is motivated by limitations of the common mapping $\rightarrow$ language translation $\rightarrow$ LLM planning $\rightarrow$ path planning pipeline. In that pipeline, SLAM and semantic perception can accumulate error, continual dense detection and segmentation are computationally expensive, and translating spatial structure into text discards metric occupancy, occlusions, free-space structure, and fine object appearance (Wang et al., 19 Dec 2025). The paper therefore treats text-only planning as a poor interface for navigation decisions that depend on visual geometry.

This design choice is especially salient for instance-conditioned navigation. Instance ImageGoal Navigation requires the agent to reach the same physical object instance depicted in a goal image under viewpoint changes and in the presence of distractors, rather than merely matching category or coarse view similarity (Lei et al., 2024). ImagineNav++ does not adopt explicit instance verification modules of the kind used in IEVE, but it addresses the same broad problem class by pushing goal reasoning into VLM-guided visual selection (Lei et al., 2024).

2. Reformulation as imagined-view selection

The central idea of ImagineNav++ is to recast high-level navigation as an image selection problem that a VLM can solve directly (Wang et al., 19 Dec 2025). Instead of asking the VLM to output waypoints, paths, or textual plans, the framework first imagines several candidate future observations from plausible robot viewpoints. These imagined views are then presented to the VLM together with the goal and a compact history memory, and the VLM is prompted to answer a multiple-choice question: which candidate future view is the most informative to move toward?

The VLM returns a structured JSON output containing "Choice" and "Reason". In this formulation, the VLM is used as a visual discriminator over candidate scenes, a semantic matcher for either category labels or reference images, and a commonsense selector that can prefer doors, openings, or room transitions without performing explicit 3D geometry computation. The selected candidate viewpoint is converted into a local point-goal, and the original long-horizon navigation problem is reduced to a sequence of point-goal navigation subproblems.
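
The parsing step for this structured reply can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `parse_vlm_choice`, the assumption that candidates are numbered from 1 in the prompt, and the tolerance for extra text around the JSON object are all hypothetical choices.

```python
import json


def parse_vlm_choice(raw_reply: str, num_candidates: int) -> tuple[int, str]:
    """Parse a reply such as {"Choice": 3, "Reason": "..."}.

    Assumption: the prompt numbers candidate views 1..num_candidates.
    Returns a zero-based candidate index and the stated reason.
    """
    # Tolerate prose around the JSON object by slicing from the first '{'
    # to the last '}' before decoding.
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in VLM reply")
    reply = json.loads(raw_reply[start:end + 1])

    choice = int(reply["Choice"])
    if not 1 <= choice <= num_candidates:
        raise ValueError(f"choice {choice} outside 1..{num_candidates}")
    return choice - 1, str(reply.get("Reason", ""))


# Illustrative reply text, not taken from the paper.
reply = 'Sure. {"Choice": 3, "Reason": "The doorway likely leads onward."}'
index, reason = parse_vlm_choice(reply, num_candidates=6)
```

Rejecting out-of-range choices explicitly lets the caller decide how to recover, for example by re-prompting, instead of silently moving toward an arbitrary view.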

This differs from map-based visual navigation systems that maintain explicit renderable or semantic world models. For example, IGL-Nav localizes the goal in an incremental 3D Gaussian representation through coarse-to-fine 3D-aware pose search (Guo et al., 1 Aug 2025), while RNR-Nav localizes over a renderable BEV latent map with correlation and a particle filter (Kim et al., 2024). ImagineNav++ instead remains mapless and delegates the main high-level decision to imagined visual prompts (Wang et al., 19 Dec 2025).

3. Future-view imagination

The future-view imagination module consists of two parts: Where2Imagine, which predicts where a human-like navigator would move next, and PolyOculus, which synthesizes the corresponding future observation (Wang et al., 19 Dec 2025).

Where2Imagine is trained from human demonstration trajectories in Habitat-Web, using approximately 80k ObjectNav trajectories and 12k Pick&Place trajectories in MP3D scenes. Training pairs are constructed as

$\{(O_t, P_{t+T})\},$

where $O_t$ is the RGB observation at time $t$, and

$P_{t+T} = (\Delta x, \Delta y, \Delta \theta)$

is the relative pose from frame $t$ to a future frame at time $t+T$. To remove uninformative views, the training set retains $O_t$ only if

$\bar{d}(O_t) > d_{\min},$

where $\bar{d}(O_t)$ is the average depth of the view and $d_{\min}$ is a minimum-depth threshold, and excludes large rotations by enforcing

$|\Delta \theta| < \theta_{\max},$

with $\theta_{\max}$ an upper bound on the rotation.

The waypoint regressor is a ResNet-18 trained from scratch with a regression loss between the predicted relative pose $\hat{P}_{t+T}$ and the demonstrated pose $P_{t+T}$.

At navigation time, the robot captures six panoramic views $\{O_t^{i}\}_{i=1}^{6}$, and each is mapped to a candidate relative waypoint $\hat{P}_{t+T}^{i}$.

Given the current RGB view $O_t^{i}$ and predicted relative pose $\hat{P}_{t+T}^{i}$, the system applies the pre-trained diffusion-based novel view synthesis model PolyOculus to generate an imagined future observation $\hat{O}_{t+T}^{i}$. PolyOculus is used without fine-tuning on Gibson, HM3D, or HSSD (Wang et al., 19 Dec 2025). The resulting imagined views are not assumed to be pixel-perfect; rather, they are expected to preserve enough global structure and semantics for the VLM to reason over them.

The paper reports that the predicted waypoints cluster in semantically meaningful directions such as doors, openings, and corridors, whereas naive uniform sampling often points toward walls or other uninformative regions (Wang et al., 19 Dec 2025). This suggests that the role of Where2Imagine is not merely kinematic prediction, but distillation of human exploration preferences into a compact visual proposal mechanism.
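
How a training target $(\Delta x, \Delta y, \Delta \theta)$ might be derived from two logged poses, and how the depth and rotation filters could be applied, is sketched below. Everything here is an assumption for illustration: the planar frame convention (x forward, y left, counterclockwise angles in radians), the function names, and the threshold arguments `min_depth` and `max_turn`, whose actual values are not given here.

```python
import math


def relative_pose(pose_t, pose_future):
    """Express pose_future in the frame of pose_t.

    Poses are (x, y, theta) in a shared world frame, theta in radians.
    Returns (dx, dy, dtheta) with dtheta wrapped to [-pi, pi).
    """
    x0, y0, th0 = pose_t
    x1, y1, th1 = pose_future
    wx, wy = x1 - x0, y1 - y0
    # Rotate the world-frame offset by -th0 into the frame at time t.
    dx = math.cos(th0) * wx + math.sin(th0) * wy
    dy = -math.sin(th0) * wx + math.cos(th0) * wy
    dtheta = (th1 - th0 + math.pi) % (2 * math.pi) - math.pi
    return dx, dy, dtheta


def keep_pair(mean_depth, dtheta, min_depth, max_turn):
    """Keep a pair only if the view is open enough and the turn is small.

    min_depth and max_turn are hypothetical thresholds supplied by the caller.
    """
    return mean_depth > min_depth and abs(dtheta) < max_turn


# A robot facing +y that later stands 2 m further along +y has moved
# 2 m straight ahead in its own frame.
dx, dy, dtheta = relative_pose((1.0, 1.0, math.pi / 2), (1.0, 3.0, math.pi / 2))
```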

4. Selective foveation memory

To support long-horizon navigation without overwhelming the VLM context window, ImagineNav++ introduces Selective Foveation Memory, a sparse-to-dense keyframe memory inspired by foveated perception (Wang et al., 19 Dec 2025). Its purpose is to preserve long-range global landmarks, medium-term scene context, and dense recent visual detail in a single compact history.

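Given historical observations

$\{O_1, O_2, \dots, O_t\},$

the system computes DINOv2 features

$f_i = \mathrm{DINOv2}(O_i)$

and uses cosine similarity between adjacent frames,

$s_i = \frac{f_i \cdot f_{i+1}}{\lVert f_i \rVert \, \lVert f_{i+1} \rVert},$

to segment the trajectory into semantic segments. For a segment

$S = \{O_a, \dots, O_b\}$

with features $\{f_a, \dots, f_b\}$, the segment centroid is

$c = \frac{1}{b - a + 1} \sum_{j=a}^{b} f_j,$

and the representative keyframe is chosen by

$j^{*} = \arg\max_{a \le j \le b} \frac{f_j \cdot c}{\lVert f_j \rVert \, \lVert c \rVert}.$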

The memory is organized into three temporal bands: recent, medium, and distant. Different similarity thresholds are used for each band so that recent history is stored densely and distant history sparsely. In the final configuration, each band has its own threshold, which keeps the memory to a compact set of keyframes per episode (Wang et al., 19 Dec 2025).
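
The segment-then-select idea behind this memory can be sketched in plain Python. This is a hedged illustration, not the authors' implementation: features are short toy vectors standing in for DINOv2 embeddings, a new segment is assumed to start whenever adjacent similarity drops below the band threshold, the keyframe is assumed to be the frame most cosine-similar to the segment centroid, and the band spans and thresholds are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def segment(features, threshold):
    """Group frame indices into runs of visually similar adjacent frames."""
    if not features:
        return []
    segments = [[0]]
    for i in range(1, len(features)):
        if cosine(features[i - 1], features[i]) < threshold:
            segments.append([i])      # similarity dropped: start a new segment
        else:
            segments[-1].append(i)
    return segments


def keyframe(features, indices):
    """Pick the frame closest (by cosine) to the segment centroid."""
    dim = len(features[indices[0]])
    centroid = [sum(features[i][d] for i in indices) / len(indices)
                for d in range(dim)]
    return max(indices, key=lambda i: cosine(features[i], centroid))


def select_keyframes(features, threshold):
    return [keyframe(features, seg) for seg in segment(features, threshold)]


def foveated_memory(features, bands):
    """bands: (span_in_frames, threshold) pairs ordered recent -> distant.

    Under the segmentation rule above, a higher threshold cuts more often,
    so recent bands would use higher thresholds than distant ones.
    """
    kept, newer_edge = [], len(features)
    for span, threshold in bands:
        older_edge = max(0, newer_edge - span)
        chunk = features[older_edge:newer_edge]
        kept += [older_edge + k for k in select_keyframes(chunk, threshold)]
        newer_edge = older_edge
    return sorted(kept)


# Toy 2-D features: three similar frames, then an abrupt scene change.
feats = [(1.0, 0.0), (1.0, 0.2), (1.0, 0.4), (0.0, 1.0)]
memory = foveated_memory(feats, bands=[(1, 0.9), (10, 0.9)])
```

Segmenting each band independently keeps the sketch short; a real system would also need to decide how segments that straddle a band boundary are handled.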

On HM3D ObjectNav with NVS, the paper compares four memory configurations by SR and SPL (Wang et al., 19 Dec 2025):

No memory
Full memory (all frames)
Uniform keyframe thresholds
Selective foveation (band-specific thresholds)

These results indicate that compression alone is not sufficient; the specific sparse-to-dense temporal allocation matters. The reported behavior is that keyframes concentrate on intersections, corners, and room transitions, which allows the VLM to recognize loops and reduce repeated wandering.

5. Navigation loop and low-level control

At the beginning of an episode, the agent is initialized at a random pose, receives either a category goal or a reference image, and starts with an empty memory (Wang et al., 19 Dec 2025). At each high-level planning cycle, executed at a fixed step interval, it captures a $360^\circ$ panorama as six RGB views $\{O_t^{i}\}_{i=1}^{6}$. For each view, Where2Imagine predicts a relative future pose, PolyOculus synthesizes the corresponding imagined view, and the selective foveation memory is updated from recent observations.

The VLM prompt contains four elements: the goal, the memory keyframes, the six imagined future views, and an instruction asking which candidate is the best choice for finding the goal. The VLM output is parsed, and the selected candidate’s relative pose becomes a 2D sub-goal for the low-level controller. The framework therefore turns goal-oriented navigation into repeated local goal reaching.
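
One way the selected candidate could be turned into a local point-goal is sketched below. The details are assumptions rather than facts from the paper: the six views are taken to be evenly spaced in yaw with view 0 aligned to the robot heading, each predicted offset is taken to be expressed in its own view's frame (x forward, y left), and the sub-goal is passed on as a (distance, bearing) pair, a common input format for PointNav-style policies. The function name is hypothetical.

```python
import math


def subgoal_in_robot_frame(view_index, rel_pose, num_views=6):
    """Convert a per-view waypoint into (distance, bearing) for the robot.

    rel_pose is (dx, dy, dtheta) in the frame of the chosen view.
    Assumes evenly spaced views, counterclockwise from the robot heading.
    """
    dx, dy, _dtheta = rel_pose
    yaw = 2 * math.pi * view_index / num_views
    # Rotate the offset from the view frame into the robot frame.
    gx = math.cos(yaw) * dx - math.sin(yaw) * dy
    gy = math.sin(yaw) * dx + math.cos(yaw) * dy
    return math.hypot(gx, gy), math.atan2(gy, gx)


# A waypoint 2 m ahead in the rear-facing view (index 3 of 6) lies 2 m
# behind the robot.
distance, bearing = subgoal_in_robot_frame(3, (2.0, 0.0, 0.0))
```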

For low-level execution, the system uses the VER PointNav policy. The controller maps the current observation and selected sub-goal into discrete primitive actions such as MoveAhead, TurnLeft, and Stop.

Episodes are capped at a fixed step budget. Success is recorded when the agent issues Stop within a threshold geodesic distance of the target (Wang et al., 19 Dec 2025). Path efficiency is measured by

$\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i, \ell_i)},$

where $N$ is the number of episodes, $S_i$ is the success indicator, $p_i$ the realized path length, and $\ell_i$ the shortest path length for episode $i$.
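
As a worked illustration of this metric, the sketch below computes Success weighted by Path Length in its commonly used form: each successful episode contributes the ratio of shortest to realized path length, capped at 1, and failures contribute 0. The function name and the episode tuples are illustrative only and are not results from the paper.

```python
def spl(episodes):
    """Mean of success * shortest / max(taken, shortest) over episodes.

    episodes: iterable of (success, shortest_path_length, taken_path_length),
    with shortest_path_length assumed to be positive.
    """
    episodes = list(episodes)
    if not episodes:
        raise ValueError("at least one episode is required")
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Made-up episodes: one detour, one optimal path, one failure.
score = spl([(True, 5.0, 10.0), (True, 4.0, 4.0), (False, 3.0, 9.0)])
```

Here the three episodes contribute 0.5, 1.0, and 0.0, so the average is 0.5. The `max` in the denominator prevents a path shorter than the reference shortest path, for example due to measurement noise, from scoring above 1.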

The framework uses GPT-4o-mini as its default VLM in the final system (Wang et al., 19 Dec 2025). In ablations on HM3D ObjectNav in the Oracle + memory setting, the paper compares GPT-4o-mini against GPT-4-Turbo and LLaVA on SR and SPL, and also reports an approximate token cost for GPT-4-Turbo (Wang et al., 19 Dec 2025).

6. Empirical performance, ablations, and relation to adjacent methods

ImagineNav++ is evaluated in Habitat 3.0 on ObjectNav and InsINav benchmarks (Wang et al., 19 Dec 2025). ObjectNav experiments cover Gibson, HM3D, and HSSD; InsINav experiments follow HM3D protocols from PSL and UniGoal. Main results are reported over a fixed set of evaluation episodes per dataset, with ablations run on a separate episode budget.

Benchmark	ImagineNav++ metrics	Selected reference points
Gibson ObjectNav	SR, SPL	Oracle variant (SR, SPL)
HM3D ObjectNav	SR, SPL	VLFM, UniGoal, SG-Nav
HSSD ObjectNav	SR, SPL	Oracle variant (SR, SPL)
HM3D InsINav	SR, SPL	UniGoal, Mod-IIN, PSL

On HM3D ObjectNav, ImagineNav++ reports a higher success rate than the listed map-based open-vocabulary methods VLFM, UniGoal, GOAT, and SG-Nav, while remaining mapless and without VLM fine-tuning (Wang et al., 19 Dec 2025). On HSSD, it achieves the best listed SR and SPL among the methods reported in the paper. On HM3D InsINav, it achieves state-of-the-art SPL, even relative to map-based methods such as Mod-IIN, GOAT, and UniGoal, and substantially exceeds PSL among methods that are open-vocabulary, mapless, and universal (Wang et al., 19 Dec 2025).

The ablation studies isolate three effects. First, imagination itself matters: on HM3D, removing imagination lowers SR and SPL on both ObjectNav and InsINav, while adding multiple future views produces large gains (Wang et al., 19 Dec 2025). Second, Where2Imagine matters: in the Oracle setting without memory, waypoints from Where2Imagine outperform fixed-offset views on ObjectNav. Third, memory matters: in the same Oracle regime, adding memory raises both ObjectNav and InsINav performance. The final ablation configuration with NVS is reported on both HM3D ObjectNav and HM3D InsINav (Wang et al., 19 Dec 2025).

The paper also reports a backbone ablation for Where2Imagine in which ResNet-18 trained from scratch attains the best HM3D ObjectNav SR in the Oracle, no-memory setting, outperforming ViT, DINOv2, and MAE variants in that study (Wang et al., 19 Dec 2025). This result is presented as evidence that explicit learning of human navigation habits is more effective than simply adopting a stronger generic visual backbone.

Several limitations are stated explicitly. ImagineNav++ depends on NVS quality; hallucinated future views can mislead the VLM, and the Oracle variant indicates nontrivial headroom (Wang et al., 19 Dec 2025). The framework also incurs nontrivial runtime cost because NVS and VLM inference are both on the critical path. In addition, the system avoids explicit mapping, which simplifies deployment but implies approximate geometry rather than a precise global metric map. Evaluation is conducted in Habitat simulation rather than on a real robot (Wang et al., 19 Dec 2025).

Within the broader literature, ImagineNav++ occupies a distinct point in the design space. IEVE emphasizes explicit verification for instance discrimination under viewpoint change (Lei et al., 2024); IGL-Nav uses incremental 3D Gaussian localization for image-goal navigation, including free-view settings (Guo et al., 1 Aug 2025); GauScoreMap performs hierarchical scoring over a 3D Gaussian map for instance image-goal navigation (Deng et al., 9 Jun 2025); and SGImagineNav shifts imaginative navigation toward symbolic hierarchical scene graphs and LLM/VLM-based region prediction (Hu et al., 9 Aug 2025). A later line of work, AnyImageNav, pushes image-goal navigation toward precise last-meter pose recovery with a semantic-to-geometric cascade and exact 6-DoF localization (Deng et al., 7 Apr 2026). Against that background, ImagineNav++ is most accurately characterized as a mapless VLM navigation system that treats scene imagination and compact visual memory as the primary interface between perception and high-level decision making (Wang et al., 19 Dec 2025).