Spatiotemporally Grounded Multimodal History for Long-Horizon UAV VLN

Develop a unified method to elevate multimodal historical information (specifically, historical visual observations and flight trajectory history) from static memory into a spatiotemporally grounded context that is tightly coupled with natural language instructions and navigation decisions for long-horizon unmanned aerial vehicle (UAV) vision-and-language navigation (VLN).

Background

The paper reviews recent efforts in UAV vision-and-language navigation that incorporate memory mechanisms and large models to address long-horizon dependencies. While both trajectory history and visual history have been recognized as essential, existing approaches commonly model these histories separately and treat them as static cues, failing to align them with language instructions and the spatiotemporal structure of the navigation process.

This limitation leads to inconsistent spatiotemporal context, semantic drift, and unstable planning when navigating long distances in complex environments. The authors therefore identify a key open problem: the need for a unified, instruction-aligned, spatiotemporally grounded representation of multimodal historical information to support robust long-horizon UAV VLN.

References

Overall, how to elevate multimodal historical information from static memory to a spatiotemporally grounded context that is tightly coupled with language and navigation remains a key open problem in long-horizon UAV VLN.

LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration (2512.22010 - Jiang et al., 26 Dec 2025) in Section 2.2 (Long-horizon vision-and-language navigation for UAVs)