LongFly: Aerial UAV Vision-Language Navigator

Updated 4 July 2026

LongFly is a spatiotemporal context modeling framework that compresses historical UAV observations into structured semantic memory for long-horizon navigation.
It integrates slot-based visual memory and explicit trajectory encoding with a multimodal large language model to fuse historical and current data.
The framework improves success rates and reduces navigation errors compared to state-of-the-art UAV VLN baselines in complex outdoor environments.

LongFly is a spatiotemporal context modeling framework for long-horizon unmanned aerial vehicle vision-and-language navigation (UAV VLN). It addresses the setting in which a UAV must follow natural-language instructions in complex 3D outdoor environments characterized by high information density, rapid changes in viewpoint, and dynamic structures, with particular emphasis on long-horizon navigation. The framework models historical observations and trajectory evolution explicitly through a history-aware spatiotemporal modeling strategy that transforms fragmented and redundant historical data into structured, compact, and expressive representations, and integrates those representations with current observations through a multimodal LLM to predict continuous 3D waypoints (Jiang et al., 26 Dec 2025).

1. Definition and research context

LongFly was introduced to address a specific weakness in prior UAV VLN systems: current methods struggle to model long-horizon spatiotemporal context in complex environments, which leads to inaccurate semantic alignment and unstable path planning. In this formulation, the central problem is not merely visual grounding or short-range control, but maintaining semantically consistent and temporally coherent navigation over trajectories that may extend to 400 m in outdoor 3D scenes (Jiang et al., 26 Dec 2025).

The framework is positioned within aerial VLN rather than generic embodied navigation. The relevant task differs from standard indoor VLN in three respects stated in the literature: observations are multi-view aerial images, the environment is large-scale and outdoor, and the agent must predict continuous 3D waypoints rather than discrete motion primitives. LongFly therefore targets a regime where history compression, temporal structure, and multimodal reasoning become first-order design constraints rather than auxiliary enhancements (Jiang et al., 26 Dec 2025).

LongFly also belongs to a broader methodological shift in aerial VLN. Closely related work such as OpenFly defines aerial VLN on a comprehensive platform with 100,000 trajectories across 18 scenes and emphasizes keyframe-aware navigation using a video-based vision-language-action model (Gao et al., 25 Feb 2025). LongFly differs in focusing specifically on long-horizon context modeling through slot-based visual memory, explicit trajectory encoding, and prompt-guided multimodal integration (Jiang et al., 26 Dec 2025).

2. Task formulation and operational setting

At time step $t$, the LongFly agent receives a natural language instruction $L$, the current UAV pose

$Q_t = [x_t, y_t, z_t, \varphi_t, \theta_t, \psi_t],$

current RGB observations

$R_t = \{R_t^1, \dots, R_t^5\},$

and historical observations consisting of images $\{R_1, \dots, R_{t-1}\}$ and past predicted waypoints $\{P_1, \dots, P_{t-1}\}$. It must output the next continuous waypoint

$P_t = [x_t, y_t, z_t]$

in world coordinates, or a Stop signal. A navigation episode counts as successful if the final waypoint is within 20 m of the goal (Jiang et al., 26 Dec 2025).
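The per-step interface described above can be sketched as a pair of plain data containers plus the 20 m success check. This is a minimal illustration, not the authors' code: the class names, field names, the use of file paths for images, and the inclusive boundary at exactly 20 m are all assumptions.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Pose6 = Tuple[float, float, float, float, float, float]

SUCCESS_RADIUS_M = 20.0  # success threshold stated in the text


@dataclass
class StepInput:
    """Everything the agent sees at time step t (hypothetical container)."""
    instruction: str              # natural language instruction L
    pose: Pose6                   # Q_t = [x, y, z, phi, theta, psi]
    views: List[str]              # R_t: five RGB views (paths are an assumption)
    past_views: List[List[str]]   # R_1 .. R_{t-1}
    past_waypoints: List[Vec3]    # P_1 .. P_{t-1}


@dataclass
class StepOutput:
    """Next waypoint in world coordinates, or None to signal Stop."""
    waypoint: Optional[Vec3]

    @property
    def is_stop(self) -> bool:
        return self.waypoint is None


def is_success(final_waypoint: Vec3, goal: Vec3,
               radius: float = SUCCESS_RADIUS_M) -> bool:
    # ASSUMPTION: the boundary (exactly 20 m) is treated as a success.
    return dist(final_waypoint, goal) <= radius
```

Representing Stop as a missing waypoint keeps the output type small; a real system might prefer an explicit action tag.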

The experimental setting uses AirSim in the OpenUAV benchmark environment. The OpenUAV dataset contains 12,149 human-operated trajectories, with trajectory length 50–400 m, multi-view RGB images per step, language goal descriptions refined by experts, waypoint sequences, and 89 object classes. LongFly evaluates both seen and unseen environments, with additional unseen-object and unseen-map splits, thereby treating long-horizon reasoning and generalization as core evaluation dimensions rather than secondary stress tests (Jiang et al., 26 Dec 2025).

A useful comparison is provided by OpenFly, which formalizes aerial VLN in a 3D continuous environment but uses a discrete action space: $\mathcal{A} = \{\text{Forward(3m)}, \text{Forward(6m)}, \text{Forward(9m)}, \text{Turn Left}, \text{Turn Right}, \text{Move Up}, \text{Move Down}, \text{Stop}\}.$ OpenFly’s benchmark uses NE, SR, OSR, and SPL with the same 20 m success threshold, but organizes navigation around macro-actions rather than continuous waypoint prediction (Gao et al., 25 Feb 2025). This distinction is central: LongFly is a waypoint-level planner for long-horizon aerial VLN, whereas OpenFly provides a broader platform and a discrete-action aerial VLN agent.

3. Core architecture

LongFly comprises three principal modules: Slot-based Historical Image Compression (SHIC), Spatiotemporal Trajectory Encoding (STE), and Prompt-Guided Multimodal Integration (PGM). Together, these modules compress long visual histories, represent trajectory dynamics explicitly, and fuse history with instruction and current observation through a Qwen2.5-3B multimodal LLM (Jiang et al., 26 Dec 2025).

SHIC compresses multi-view historical RGB observations into a fixed number $K$ of slots. Historical images are first encoded by CLIP ViT-L/14,

$Z_i = \mathcal{F}_v(R_i),$

and the slot memory at step $i$ is a set of $K$ slot vectors,

$S_i = \{s_i^1, \dots, s_i^K\}.$

Initialization is

[equation missing in source]

At each step, slots query historical tokens through cross-attention and are updated recurrently by a GRU, schematically

$S_i = \mathrm{GRU}\big(\mathrm{CrossAttn}(S_{i-1}, Z_i),\ S_{i-1}\big).$

The resulting representation is compact because inference retains only $K$ slots per viewpoint rather than an ever-growing buffer of image tokens, while temporal consistency is preserved because slot identities persist over time (Jiang et al., 26 Dec 2025).
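The cross-attention plus GRU update described above can be illustrated with a small NumPy sketch. It is a toy version under loud assumptions: random untrained weights, random slot initialization, a single attention head, no normalization layers, and random arrays standing in for CLIP tokens. The class name `SlotMemory` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class SlotMemory:
    """Fixed-size memory of K slots updated by cross-attention and a GRU cell."""

    def __init__(self, num_slots, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.dim = dim
        # Query/key/value projections for slot-to-token cross-attention.
        self.Wq, self.Wk, self.Wv = (
            rng.normal(0.0, scale, (dim, dim)) for _ in range(3))
        # GRU cell weights: update gate, reset gate, candidate state.
        self.Wz, self.Uz, self.Wr, self.Ur, self.Wh, self.Uh = (
            rng.normal(0.0, scale, (dim, dim)) for _ in range(6))
        # ASSUMPTION: random initialization; the paper's scheme is not given here.
        self.slots = rng.normal(0.0, 1.0, (num_slots, dim))

    def update(self, tokens):
        """tokens: (N, dim) features of one historical frame."""
        q = self.slots @ self.Wq
        k = tokens @ self.Wk
        v = tokens @ self.Wv
        # Each slot attends over the frame's tokens (softmax over tokens).
        attn = softmax(q @ k.T / np.sqrt(self.dim), axis=-1)   # (K, N)
        u = attn @ v                                           # (K, dim)
        # GRU cell: the attended summary u is the input, the slots are the state.
        z = sigmoid(u @ self.Wz + self.slots @ self.Uz)
        r = sigmoid(u @ self.Wr + self.slots @ self.Ur)
        h = np.tanh(u @ self.Wh + (r * self.slots) @ self.Uh)
        self.slots = (1.0 - z) * self.slots + z * h
        return self.slots


# Usage: memory size stays (K, dim) no matter how many frames are absorbed.
memory = SlotMemory(num_slots=4, dim=8)
frames = np.random.default_rng(1).normal(size=(50, 10, 8))  # 50 frames, 10 tokens each
for frame_tokens in frames:
    final_slots = memory.update(frame_tokens)
```

The point of the sketch is the memory footprint: after 50 frames the state is still a 4 by 8 array, which mirrors the fixed-size property the text attributes to SHIC.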

STE converts past waypoints into motion tokens. For waypoint history $\{P_1, \dots, P_{t-1}\}$, LongFly computes displacements

$\Delta P_i = P_i - P_{i-1},$

then factorizes each displacement into step length and unit direction:

$r_i = \lVert \Delta P_i \rVert_2, \qquad d_i = \frac{\Delta P_i}{\lVert \Delta P_i \rVert_2}.$

These are concatenated as

$m_i = [r_i;\ d_i],$

augmented with a time embedding $\tau_i$,

$\tilde{m}_i = [m_i;\ \tau_i],$

and encoded by an MLP:

$T_i = \mathrm{MLP}(\tilde{m}_i).$

This representation reduces sensitivity to global coordinates and makes local motion continuity explicit (Jiang et al., 26 Dec 2025).
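The displacement factorization above can be worked through in a few lines of standard-library Python. This is a sketch under assumptions: the sinusoidal time embedding, its dimension, the concatenation order, and the zero-direction convention for a stationary step are illustrative choices, and the final MLP encoder is omitted.

```python
import math


def sinusoidal(index, dim):
    """Hypothetical time embedding: sin/cos pairs at geometrically spaced frequencies."""
    emb = []
    for k in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * k / dim))
        emb += [math.sin(index * freq), math.cos(index * freq)]
    return emb


def motion_tokens(waypoints, time_dim=4):
    """Turn a waypoint history into [step_length, direction(3), time_embedding] rows."""
    tokens = []
    for i in range(1, len(waypoints)):
        # Displacement between consecutive waypoints.
        delta = [b - a for a, b in zip(waypoints[i - 1], waypoints[i])]
        step = math.sqrt(sum(c * c for c in delta))
        # ASSUMPTION: a zero-length step gets a zero direction vector.
        direction = [c / step for c in delta] if step > 0 else [0.0, 0.0, 0.0]
        tokens.append([step, *direction, *sinusoidal(i, time_dim)])
    return tokens


history = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 0.0)]
tokens = motion_tokens(history)
```

For the first step of `history`, the displacement (3, 4, 0) factorizes into a step length of 5 and a unit direction of (0.6, 0.8, 0). Shifting every waypoint by the same offset leaves the tokens unchanged, which is the insensitivity to global coordinates noted in the text.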

PGM organizes language, compressed visual slots, trajectory tokens, and current observation into a structured prompt for Qwen2.5-3B. The instruction $L$ is encoded by BERT and projected into Qwen’s 2048-dimensional latent space, while SHIC slots and STE tokens are projected into the same space. The prompt is conceptually written as

$\mathrm{Prompt}_t = \big[\ \text{Task: } L\ \big|\ \text{History: visual slots, trajectory tokens}\ \big|\ \text{Current observation: } R_t\ \big].$

The final conditioned input is, schematically,

$X_t = [E_L;\ E_S;\ E_T;\ E_{R_t}],$

where $E_L$, $E_S$, $E_T$, and $E_{R_t}$ denote the projected instruction, slot, trajectory, and current-observation embeddings arranged according to the prompt, and waypoint prediction is

$P_t = \mathrm{MLLM}(X_t).$

The prompt is explicitly structured into task, history, and current observation sections so that the MLLM can perform time-based reasoning rather than simple feature concatenation (Jiang et al., 26 Dec 2025).
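The three-section layout can be illustrated with a small template builder. This is a hypothetical sketch: the section headers and the placeholder strings `<slot>`, `<traj>`, and `<image>` are invented for illustration and are not the tokens used by LongFly or Qwen2.5; in a real system each placeholder would be replaced by a projected embedding.

```python
def build_prompt(instruction, num_slot_tokens, num_traj_tokens, num_views=5):
    """Assemble a task / history / current-observation prompt with placeholders."""
    slots = "<slot>" * num_slot_tokens   # stand-ins for compressed visual memory
    traj = "<traj>" * num_traj_tokens    # stand-ins for motion tokens
    views = "<image>" * num_views        # stand-ins for the current RGB views
    return "\n".join([
        "### Task",
        f"Instruction: {instruction}",
        "### History",
        f"Visual memory: {slots}",
        f"Trajectory: {traj}",
        "### Current observation",
        f"Views: {views}",
    ])


prompt = build_prompt("Fly to the red rooftop past the bridge.",
                      num_slot_tokens=4, num_traj_tokens=3)
```

Keeping history and current observation in separate, labeled sections is what distinguishes this layout from plain concatenation, where the model would receive the same tokens without any indication of their temporal role.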

4. Optimization, evaluation, and empirical results

LongFly is trained with supervised imitation learning and scheduled sampling. The implementation freezes the CLIP vision encoder, the BERT text encoder, and the Qwen backbone, while training SHIC, STE, projection layers to the 2048-dimensional space, and LoRA adapters inside Qwen. Optimization uses AdamW with initial learning rate [value missing in source], batch size 8, and ZeRO Stage-2. Training is conducted on 4× RTX 4090, and inference on 4× A40. Scheduled sampling uses decay frequency 3000 steps and decay ratio 0.75 (Jiang et al., 26 Dec 2025).
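One common reading of the scheduled-sampling settings is a staircase decay of the teacher-forcing probability, sketched below. The starting probability of 1.0, the staircase form, and the function names are assumptions; the source states only the decay frequency (3000 steps) and ratio (0.75).

```python
import random


def teacher_forcing_prob(step, decay_every=3000, decay_ratio=0.75, p0=1.0):
    """Probability of feeding the expert waypoint at a given training step."""
    # ASSUMPTION: staircase decay starting from p0 = 1.0.
    return p0 * decay_ratio ** (step // decay_every)


def pick_history_waypoint(step, expert_wp, predicted_wp, rng=random):
    """Use the expert waypoint with the scheduled probability, else the model's own."""
    if rng.random() < teacher_forcing_prob(step):
        return expert_wp
    return predicted_wp
```

Under these assumptions the model sees only expert history for the first 3000 steps, then expert history 75% of the time, then about 56% of the time, gradually exposing it to its own prediction errors.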

Evaluation uses Navigation Error (NE), Success Rate (SR), Oracle Success Rate (OSR), and Success weighted by Path Length (SPL). LongFly reports gains over state-of-the-art UAV VLN baselines, with the abstract summarizing improvements of 7.89% in success rate and 6.33% in success weighted by path length that hold consistently across seen and unseen environments (Jiang et al., 26 Dec 2025).
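The four metrics can be computed from executed paths as in the sketch below, which follows the standard VLN definitions. It is not the benchmark's evaluation script: the episode dictionary layout, the field names, and the inclusive 20 m boundary are assumptions, and each reference length is assumed to be positive.

```python
from math import dist


def path_length(path):
    """Total Euclidean length of a polyline of 3D points."""
    return sum(dist(a, b) for a, b in zip(path, path[1:]))


def evaluate(episodes, radius=20.0):
    """episodes: dicts with 'path' (executed xyz points), 'goal', 'ref_length' (> 0)."""
    n = len(episodes)
    ne = sr = osr = spl = 0.0
    for ep in episodes:
        path, goal, ref = ep["path"], ep["goal"], ep["ref_length"]
        final_error = dist(path[-1], goal)
        success = final_error <= radius
        ne += final_error
        sr += success
        # Oracle success: any point on the executed path is within the radius.
        osr += any(dist(p, goal) <= radius for p in path)
        # SPL: success weighted by reference length over the longer of the two paths.
        spl += success * ref / max(path_length(path), ref)
    return {"NE": ne / n, "SR": sr / n, "OSR": osr / n, "SPL": spl / n}


episodes = [
    {"path": [(0, 0, 0), (100, 0, 0)], "goal": (100, 0, 0), "ref_length": 100.0},
    {"path": [(0, 0, 0), (100, 0, 0), (200, 0, 0)], "goal": (100, 0, 0), "ref_length": 100.0},
]
scores = evaluate(episodes)
```

In the second episode the agent passes through the goal but overshoots by 100 m, so it counts toward OSR but not SR, which is why OSR is always at least as high as SR in the reported results.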

Split	Method	NE (m)	SR (%)	OSR (%)	SPL (%)
Test Seen Full	LongFly	60.02	36.39	65.87	31.07
Test Seen Full	NavFoM (comparator)	93.05	29.17	49.24	25.03
Test Unseen Full	LongFly	91.84	24.19	43.86	20.84
Test Unseen Full	NavFoM (comparator)	118.34	15.63	30.46	14.21
Unseen Object Full	LongFly	66.74	43.87	64.56	38.39
Unseen Object Full	NavFoM (comparator)	108.04	29.83	47.99	27.20
Unseen Map Full	LongFly	108.32	11.27	30.27	9.32
Unseen Map Full	NavFoM (comparator)	125.10	6.30	18.95	5.68
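The per-split margins over NavFoM follow directly from the table above. The snippet below simply subtracts the reported numbers; the dictionary layout and function names are illustrative.

```python
METRICS = ("NE", "SR", "OSR", "SPL")

# Values copied from the results table: (NE, SR, OSR, SPL).
RESULTS = {
    "Test Seen Full": {"LongFly": (60.02, 36.39, 65.87, 31.07),
                       "NavFoM": (93.05, 29.17, 49.24, 25.03)},
    "Test Unseen Full": {"LongFly": (91.84, 24.19, 43.86, 20.84),
                         "NavFoM": (118.34, 15.63, 30.46, 14.21)},
    "Unseen Object Full": {"LongFly": (66.74, 43.87, 64.56, 38.39),
                           "NavFoM": (108.04, 29.83, 47.99, 27.20)},
    "Unseen Map Full": {"LongFly": (108.32, 11.27, 30.27, 9.32),
                        "NavFoM": (125.10, 6.30, 18.95, 5.68)},
}


def gains(split):
    """LongFly minus NavFoM for each metric (negative NE means lower error)."""
    ours, base = RESULTS[split]["LongFly"], RESULTS[split]["NavFoM"]
    return {m: round(o - b, 2) for m, o, b in zip(METRICS, ours, base)}


def mean_gain(metric, splits):
    """Average gain on one metric across the given splits."""
    return sum(gains(s)[metric] for s in splits) / len(splits)


test_splits = ["Test Seen Full", "Test Unseen Full"]
mean_sr_gain = mean_gain("SR", test_splits)
```

The SR margins are 7.22 points on Test Seen Full and 8.56 points on Test Unseen Full, whose mean is 7.89 points, the same figure quoted from the abstract. Whether the abstract's number was computed as this average is an assumption.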

The ablation studies isolate the contribution of each module. On the unseen split, the base system without structured spatiotemporal context yields NE 106.08, SR 13.99%, SPL 12.16%. Adding STE gives NE 102.62, SR 19.97%, SPL 17.10%, adding SHIC gives NE 99.24, SR 21.05%, SPL 18.15%, and combining both yields LongFly at NE 91.84, SR 24.19%, SPL 20.84%. This establishes that SHIC and STE are complementary, with SHIC showing slightly larger individual impact (Jiang et al., 26 Dec 2025).

Prompt structure is also critical. Removing prompt-guided organization and replacing it with simple concatenation degrades unseen performance from NE 91.84, SR 24.19%, SPL 20.84% to NE 102.45, SR 15.06%, SPL 13.26%. History length experiments further show that using all frames remains beneficial when compression is available: a 10-frame history yields SR 18.65% and SPL 16.21%, a 60-frame history yields SR 20.17% and SPL 17.16%, and all frames yield SR 24.19% and SPL 20.84% (Jiang et al., 26 Dec 2025).

5. Relation to benchmarks, memory models, and adjacent system components

LongFly is best understood in relation to two neighboring lines of work: aerial VLN platforms such as OpenFly and system-level components relevant to deployed UAV autonomy. OpenFly provides a comprehensive platform with Unreal Engine, GTA V, Google Earth, and 3D Gaussian Splatting, a versatile toolchain, and a large-scale benchmark for aerial VLN with 100k trajectories across 18 scenes. Its OpenFly-Agent is keyframe-aware and uses DINO-SigLIP, Llama-2 7B, motion-based keyframe selection, visual token merging, and a history bank of capacity [value missing in source] (Gao et al., 25 Feb 2025).

LongFly departs from that design in two specific ways. First, it predicts continuous 3D waypoints rather than discrete macro-actions. Second, it does not rely on a sparse keyframe bank alone; instead it uses all past steps but compresses them into fixed-size semantic slots and trajectory tokens. This suggests a different answer to the long-horizon memory problem: OpenFly emphasizes event- and landmark-aware keyframe retention, whereas LongFly emphasizes history compression into persistent semantic memory units and motion tokens (Jiang et al., 26 Dec 2025).

A common misconception is to interpret LongFly as a complete UAV autonomy stack. The published framework is narrower and more precise: it is a long-horizon UAV VLN model for instruction-conditioned waypoint prediction. Communication, low-level flight control, and onboard state estimation are outside its direct scope. The surrounding literature indicates plausible complementary subsystems for a deployed LongFly-style platform, but they are not part of the LongFly architecture itself. In FANET routing, RLPR uses relative speed, signal strength, energy, geographic distance, and a forwarding angle to reduce undesirable control messages and improve network lifetime in multi-source UAV networks (Usman et al., 2020). In insect-inspired onboard perception, FLIVVER uses monocular optic flow and acceleration to directly estimate absolute forward ground velocity and ranging, proposing a lightweight alternative to SLAM-like scale recovery for insect-sized robots (Lingenfelter et al., 2020). These neighboring systems occupy different layers of the UAV autonomy stack than LongFly’s waypoint-level VLN reasoning.

6. Limitations, misconceptions, and future directions

LongFly’s results remain well below the human upper bound. On Test Seen Full, the human upper bound is NE 14.15, SR 94.51%, and SPL 77.84%, leaving a substantial gap relative to LongFly’s NE 60.02, SR 36.39%, and SPL 31.07% (Jiang et al., 26 Dec 2025). The most difficult generalization regime is unseen maps rather than unseen objects, which the authors identify as evidence that environmental distribution shift is more challenging than object shift (Jiang et al., 26 Dec 2025).

The framework also incurs nontrivial computational overhead because it uses a 3B-parameter Qwen model. Although SHIC and STE are lightweight relative to the backbone, the use of Qwen2.5-3B makes real-time onboard UAV deployment a separate engineering problem rather than a resolved property of the published system (Jiang et al., 26 Dec 2025). OpenFly identifies an analogous limitation for Llama-2 7B–based aerial VLN models and explicitly calls for lighter, efficient VLN models for real-time UAV deployment (Gao et al., 25 Feb 2025).

Another limitation is scope. LongFly focuses on semantic and structural context but does not explicitly model dynamic obstacles, and it is evaluated in AirSim rather than on real UAV hardware. The paper’s implied future directions are therefore clear: porting the framework from simulation to real platforms, improving handling of dynamic environments, increasing training diversity to reduce unseen-map failure, and integrating waypoint-level planning with low-level control pipelines (Jiang et al., 26 Dec 2025).

In the current literature, LongFly’s significance lies in treating long-horizon aerial VLN as a spatiotemporal context integration problem. Its central contribution is not merely higher benchmark scores, but a specific representational claim: long, redundant histories can be compressed into a fixed-size semantic memory and an explicit motion history, then fused with language through a structured multimodal prompt to stabilize long-range waypoint prediction (Jiang et al., 26 Dec 2025).