Temporal Encoding for VLM-Based View Selection

Develop an effective representation and encoding strategy for the continuously expanding stream of egocentric visual observations, such that pre-trained vision-language models (VLMs) can perform robust temporal reasoning when selecting future viewpoints for mapless open-vocabulary visual navigation.
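To make the interface implied by this task concrete, the sketch below shows a minimal, hypothetical way to accumulate an expanding egocentric observation stream and compress it into a bounded textual context for a VLM view-selection query. It is not the paper's implementation; the names ObservationStream, Observation, and to_prompt_context, and the naive truncation policy, are assumptions introduced for illustration.

```python
# Hypothetical sketch (not from the paper): accumulate egocentric observations
# and compress them into a bounded context string for a VLM prompt.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Observation:
    """One egocentric frame plus the pose it was captured from."""
    step: int
    pose: Tuple[float, float, float]  # (x, y, yaw) in the agent's odometry frame
    caption: str                      # short textual summary of the frame
    image_path: str                   # reference to the raw RGB frame on disk


@dataclass
class ObservationStream:
    """Continuously expanding history that must be compressed for the VLM."""
    history: List[Observation] = field(default_factory=list)

    def append(self, obs: Observation) -> None:
        self.history.append(obs)

    def to_prompt_context(self, budget: int = 8) -> str:
        """Naive bounded encoding: keep only the most recent `budget` frames.

        This truncation baseline is exactly the kind of policy that loses
        long-term spatial context; a real system needs a smarter selection
        or summarization strategy.
        """
        recent = self.history[-budget:]
        lines = [f"t={o.step}, pose={o.pose}: {o.caption}" for o in recent]
        return "\n".join(lines)


if __name__ == "__main__":
    stream = ObservationStream()
    for t in range(20):
        stream.append(Observation(t, (t * 0.5, 0.0, 0.0),
                                  f"frame {t}: corridor view", f"frame_{t}.png"))
    print(stream.to_prompt_context(budget=4))
```

The truncation policy in to_prompt_context stands in for the open question itself: which parts of the growing history should survive compression so that the VLM can still reason temporally.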

Background

The paper reformulates open-vocabulary visual navigation as an imagination-powered best-view selection problem for vision-language models (VLMs). A critical bottleneck it identifies is how to incorporate long-term observation history so that pre-trained VLMs, which are not designed for continuous 3D spatial reasoning, can nonetheless reason temporally over past context when choosing among imagined future views.

Prior approaches typically either (i) summarize frames into text and aggregate the captions over time, which risks discarding important spatial detail, or (ii) feed long image sequences directly to the VLM, which handles long-term dependencies poorly and degrades robustness. The authors propose a Selective Foveation Memory as one solution, but explicitly flag the broader challenge of encoding expanding observation streams for temporal reasoning in VLM-based navigation.
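To illustrate the tension between these baselines, the toy heuristic below selects a fixed-budget set of keyframes by greedily maximizing spatial spread between their capture positions, so that older but spatially distinctive views can survive as the stream grows. This is only loosely in the spirit of a foveation-style memory and is not the paper's Selective Foveation Memory; the function name select_keyframes and the farthest-point heuristic are assumptions.

```python
# Toy keyframe selection under a fixed budget (not the paper's method):
# greedy farthest-point selection over the positions at which frames were
# captured, always keeping the most recent frame.
import math
from typing import List, Tuple

Pose = Tuple[float, float]  # (x, y) position of a captured frame


def select_keyframes(poses: List[Pose], budget: int) -> List[int]:
    """Return indices of at most `budget` frames whose capture positions
    are maximally spread out; a crude proxy for 'spatially informative'."""
    if not poses:
        return []
    selected = [len(poses) - 1]  # always keep the most recent frame
    while len(selected) < min(budget, len(poses)):
        best_idx, best_dist = None, -1.0
        for i, p in enumerate(poses):
            if i in selected:
                continue
            # distance from candidate frame to the closest selected frame
            d = min(math.dist(p, poses[j]) for j in selected)
            if d > best_dist:
                best_idx, best_dist = i, d
        selected.append(best_idx)
    return sorted(selected)


if __name__ == "__main__":
    trajectory = [(0.1 * t, math.sin(0.2 * t)) for t in range(50)]
    print(select_keyframes(trajectory, budget=6))
```

A heuristic like this keeps the prompt bounded, but it ignores semantic relevance to the navigation goal, which is part of what makes the encoding problem open.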

References

A major open challenge that arises in the context of VLM-based view selection is how to effectively encode the continuously expanding stream of observations to endow pre-trained VLMs with temporal reasoning capabilities.

ImagineNav++: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination (2512.17435 - Wang et al., 19 Dec 2025) in Section 1, Introduction