LongFly: UAV Navigation Framework
- LongFly is a spatiotemporal framework that compresses historical multi-view UAV images, encodes trajectory data, and integrates multimodal language instructions for extended navigation tasks.
- It utilizes dedicated modules—SHIC, STE, and PGM—that work in sequence to efficiently capture visual history, spatial trajectories, and guidance cues in real time.
- The framework reports gains of 7.89% in success rate and 6.33% in success weighted by path length over prior baselines, which makes it well suited to complex post-disaster search and rescue operations.
LongFly is a spatiotemporal context modeling framework developed for long-horizon unmanned aerial vehicle (UAV) vision-and-language navigation (VLN) tasks, particularly under conditions prevalent in post-disaster search and rescue where high information density, dynamic structures, and rapidly varying viewpoints challenge existing methods. It introduces history-aware context modeling to overcome the limitations of inaccurate semantic alignment and unstable path prediction characteristic of prior VLN approaches in complex, long-horizon environments. At the core of LongFly are three interlinked modules: Slot-based Historical Image Compression (SHIC), Spatiotemporal Trajectory Encoder (STE), and Prompt-Guided Multimodal (PGM) integration, collectively optimized for robust waypoint prediction via efficient integration of spatiotemporal and instruction contexts (Jiang et al., 26 Dec 2025).
1. Architectural Components of LongFly
LongFly’s framework is organized into three sequential modules:
- Slot-based Historical Image Compression (SHIC): Converts sequences of past multi-view UAV images into a compact, fixed-size “slot” representation, supporting efficient retrieval of spatiotemporal information without memory growth over time.
- Spatiotemporal Trajectory Encoder (STE): Encodes the 3D coordinates of previously visited waypoints to capture trajectory dynamics and spatial structure relevant to the navigation task.
- Prompt-Guided Multimodal Integration (PGM): Fuses the compressed visual slots, encoded trajectory tokens, and language instruction to facilitate context-aware, temporally coherent waypoint reasoning and prediction.
This architectural pipeline is designed so that SHIC provides the compact visual memory, STE supplies trajectory context, and PGM achieves holistic multimodal integration, enabling temporally-aware decision-making over extended navigation horizons.
2. Slot-based Historical Image Compression (SHIC) Module
SHIC summarizes arbitrarily long sequences of multi-view observations into a fixed-length, expressive visual memory for subsequent multimodal integration:
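- Inputs: For each past time step $i$ ($i = 1, \dots, t-1$), the UAV acquires $V = 5$ RGB images $R_i^v$ ($v = 1, \dots, V$), corresponding to front, rear, left, right, and bottom cameras.
- Feature Extraction: Each image is processed by a CLIP-based ViT-L/14 visual encoder $F_v$, producing tokens $Z = F_v(R_i^v) \in \mathbb{R}^{N_i \times d}$, with $d = 768$ or $1024$.
- Slot Memory: For each view $v$, SHIC maintains a slot set $S^v \in \mathbb{R}^{K \times d}$, initialized as $\Phi^v$, where $K$ (default $32$) is the per-view slot count.
- Recurrent Slot Attention: At each past step, the slot memory is updated via attention between slots (queries) and visual tokens (keys/values), followed by GRU-based recurrence:
  - Project slots and tokens via learned matrices ($W_q$, $W_k$, $W_v$): $Q = S^v W_q^\top$, $K_{\text{mat}} = Z W_k^\top$, $V_{\text{mat}} = Z W_v^\top$
  - Compute per-slot attention weights: $\alpha_{k,j} = \operatorname{softmax}_j\left( Q_k \cdot K_{\text{mat},j} / \sqrt{d} \right)$
  - Aggregate updates for each slot: $\hat{s}_k = \sum_{j=1}^{N_i} \alpha_{k,j} \, V_{\text{mat},j}$
  - Update slots jointly via GRU: $S^v \leftarrow \operatorname{GRU}(S^v, \hat{S}^v)$, where $\hat{S}^v = \{\hat{s}_1, \dots, \hat{s}_K\}$.
- Output: The final compressed memory after processing all past steps is $S = [S^1; \dots; S^V]$ (of size $VK \times d$), a sequence-length-invariant representation of the entire history.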
No additional loss term is applied to SHIC; its parameters are optimized end-to-end via the downstream navigation regression objective.
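The attention read inside one SHIC update can be illustrated with a small NumPy sketch. It follows the projection, softmax-over-tokens, and aggregation steps described above; the function name `slot_attention_read`, the toy sizes, and the random weights are assumptions made for illustration and are not taken from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_read(slots, tokens, W_q, W_k, W_v):
    """Slots (K, d) attend over tokens (N, d).

    Returns the attention weights (K, N) and the aggregated updates (K, d).
    Hypothetical helper, not the authors' implementation.
    """
    d = slots.shape[1]
    Q = slots @ W_q.T                      # queries, (K, d)
    Kmat = tokens @ W_k.T                  # keys, (N, d)
    Vmat = tokens @ W_v.T                  # values, (N, d)
    logits = Q @ Kmat.T / np.sqrt(d)       # scaled dot products, (K, N)
    alpha = softmax(logits, axis=1)        # normalise over tokens j
    updates = alpha @ Vmat                 # weighted sums of values, (K, d)
    return alpha, updates

rng = np.random.default_rng(0)
K, N, d = 4, 10, 8                         # toy sizes, far smaller than the real model
slots = rng.normal(size=(K, d))
tokens = rng.normal(size=(N, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

alpha, updates = slot_attention_read(slots, tokens, W_q, W_k, W_v)
print(alpha.shape, updates.shape)          # (4, 10) (4, 8)
```

Each row of `alpha` sums to one, so every slot update is a convex combination of the value vectors of the current image.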
3. Hyperparameters, Inputs/Outputs, and Complexity
Key SHIC hyperparameters and their roles:
| Parameter | Typical Value(s) | Role/Trade-off |
|---|---|---|
| Number of slots ($K$) | $8$, $24$, $32$ | Larger $K$ captures more visual detail; costs more compute |
| Slot dimension ($d$) | $768$ or $1024$ | CLIP token size; affects memory and representation power |
| Number of views ($V$) | $5$ | Reflects sensor configuration (multi-camera UAV) |
| Attention scaling | $1/\sqrt{d}$ | Ensures stable gradients in softmax attention |
| GRU hidden size | $d$ | Matches slot dimension; shared across views |
Inputs to SHIC are the past images $R_i^v$ ($i = 1, \dots, t-1$; $v = 1, \dots, V$), while the output is a compact slot matrix of size $VK \times d$. The computational cost per navigation step is independent of the time horizon. Because the slot representation has constant size, the memory footprint for historical context does not grow with the number of past steps, in stark contrast to naive concatenation of past tokens, whose footprint grows linearly with the history length. In the reported example configuration, the effective compression ratio is about $153$.
A smaller $K$ may lead to loss of crucial visual details; a larger $K$ decreases compression efficiency. Empirical findings support the default slot count as a favorable compromise.
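The contrast between constant-size slot memory and naive token concatenation can be made concrete with a short calculation. The sketch below assumes the ratio is simply the number of stored history tokens without compression divided by the number of slots; the function name and the value of 256 tokens per image are illustrative assumptions and do not correspond to the configuration behind the reported figure.

```python
def history_token_counts(num_past_steps, num_views, tokens_per_image, slots_per_view):
    """Return (naive token count, slot count, ratio) for a given history length.

    Hypothetical helper: 'naive' keeps every token of every past image,
    while the slot memory keeps a fixed number of slots per view.
    """
    naive = num_past_steps * num_views * tokens_per_image
    slot = num_views * slots_per_view
    return naive, slot, naive / slot

for steps in (10, 100, 1000):
    naive, slot, ratio = history_token_counts(
        steps, num_views=5, tokens_per_image=256, slots_per_view=32
    )
    print(f"steps={steps:5d}  naive={naive:8d}  slots={slot}  ratio={ratio:.0f}")
```

Under these assumptions the slot count stays at 160 while the naive count grows linearly, so the ratio scales with the length of the flight.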
4. SHIC Algorithmic Flow and Pseudocode
The SHIC procedure is outlined below:
```
for each view v in 1…V:
    S^v ← Φ^v                          # shape (K×d)

for i = 1 to t-1:
    for each view v in 1…V:
        # Extract tokens from current image
        Z ← F_v(R_i^v)                 # Z shape (N_i×d)
        # Project queries, keys, values
        Q ← S^v W_q^⊤                  # (K×d)
        Kmat ← Z W_k^⊤                 # (N_i×d)
        Vmat ← Z W_v^⊤                 # (N_i×d)
        # Compute attention and updates
        for k = 1 to K:
            for j = 1 to N_i:
                α_{k,j} = softmax_j(Q[k] · Kmat[j] / √d)
            \hat s_k = sum_j(α_{k,j} * Vmat[j])
        # GRU update
        S^v ← GRU(prev_state=S^v, input={\hat s_1,…,\hat s_K})
```
At each navigation step, the historical multi-view images are thus progressively distilled into a fixed set of slots. Slots are updated recurrently and independently for each view, and the per-view slot sets are finally concatenated to form the visual memory available for high-level multimodal reasoning.
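A runnable counterpart of the pseudocode above is sketched below in NumPy. It is a minimal illustration, not the authors' code: the bias-free `GRUCell`, the sharing of projection matrices across views, the random inputs standing in for CLIP tokens, and the toy sizes are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class GRUCell:
    """Minimal bias-free GRU cell applied row-wise to K slots (assumed form)."""
    def __init__(self, d, rng):
        scale = 1.0 / np.sqrt(d)
        self.W = rng.normal(scale=scale, size=(3, d, d))  # input weights: update, reset, candidate
        self.U = rng.normal(scale=scale, size=(3, d, d))  # hidden-state weights

    def __call__(self, x, h):
        z = sigmoid(x @ self.W[0] + h @ self.U[0])        # update gate
        r = sigmoid(x @ self.W[1] + h @ self.U[1])        # reset gate
        h_new = np.tanh(x @ self.W[2] + (r * h) @ self.U[2])
        return (1.0 - z) * h + z * h_new

def compress_history(history, init_slots, W_q, W_k, W_v, gru):
    """history[i][v] is an (N, d) token array; init_slots has shape (V, K, d)."""
    slots = init_slots.copy()
    d = slots.shape[-1]
    for step_tokens in history:                  # past steps i = 1 .. t-1
        for v, Z in enumerate(step_tokens):      # views v = 1 .. V
            Q = slots[v] @ W_q.T
            alpha = softmax(Q @ (Z @ W_k.T).T / np.sqrt(d), axis=1)
            s_hat = alpha @ (Z @ W_v.T)          # aggregated updates, (K, d)
            slots[v] = gru(s_hat, slots[v])      # recurrent slot update
    return slots.reshape(-1, d)                  # concatenated memory, (V*K, d)

rng = np.random.default_rng(0)
V, K, N, d = 5, 4, 10, 8                         # toy sizes
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
gru = GRUCell(d, rng)
init_slots = rng.normal(size=(V, K, d))

mem_short = compress_history(rng.normal(size=(3, V, N, d)), init_slots, W_q, W_k, W_v, gru)
mem_long = compress_history(rng.normal(size=(30, V, N, d)), init_slots, W_q, W_k, W_v, gru)
print(mem_short.shape, mem_long.shape)           # (20, 8) (20, 8)
```

The memory has the same shape after 3 and after 30 past steps, which is the property that keeps the history cost fixed over long flights.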
5. Integration with Spatiotemporal and Instructional Context
After SHIC, the slot-based visual summary is linearly projected and provided as input to the Prompt-Guided Multimodal (PGM) integration module alongside trajectory tokens from the Spatiotemporal Trajectory Encoder and textual instructions. The integration is performed using a Qwen-based LLM backbone, supporting robust, temporally-informed waypoint prediction. The training objective for the entire system is to minimize a regression loss on next-waypoint coordinates, optionally incorporating an instruction-alignment loss as part of the multimodal large language modeling (MLLM) component. No loss is explicitly attributed to the SHIC module; all parameters are updated end-to-end for downstream navigation performance (Jiang et al., 26 Dec 2025).
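How the three inputs might be assembled for the language model can be sketched as follows. The source states only that the slots are linearly projected and combined with trajectory tokens and the instruction; the token ordering, the single linear map standing in for STE, the hidden size `h`, and the mean squared error loss are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, d = 5, 32, 1024          # views, slots per view, slot dimension
h = 64                         # hypothetical LLM hidden size, kept tiny here
T_past, L_text = 6, 12         # hypothetical counts of past waypoints and instruction tokens

visual_slots = rng.normal(size=(V * K, d))     # stand-in for the SHIC output
waypoints = rng.normal(size=(T_past, 3))       # previously visited 3D coordinates
text_emb = rng.normal(size=(L_text, h))        # stand-in for instruction token embeddings

W_vis = rng.normal(size=(d, h)) / np.sqrt(d)   # linear projection of the slots
W_traj = rng.normal(size=(3, h))               # assumed linear stand-in for STE

visual_tokens = visual_slots @ W_vis           # (V*K, h)
traj_tokens = waypoints @ W_traj               # (T_past, h)

# Assumed ordering: instruction, then visual memory, then trajectory.
sequence = np.concatenate([text_emb, visual_tokens, traj_tokens], axis=0)
print(sequence.shape)                          # (178, 64)

def waypoint_regression_loss(pred, target):
    """Mean squared error on 3D coordinates (the source says only 'regression loss')."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

print(waypoint_regression_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Because the visual part of the sequence always contributes `V * K` tokens, the prompt length grows only with the trajectory and instruction, not with the number of past images.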
6. Empirical Performance and Applications
LongFly demonstrates significant improvements over prior UAV VLN methods, outperforming state-of-the-art baselines by 7.89% in success rate and 6.33% in success weighted by path length. Gains are sustained across both seen and unseen environments, emphasizing the framework’s robustness and generalization. Applications are concentrated in post-disaster search and rescue but extend to any long-horizon UAV navigation scenarios requiring efficient, context-aware spatial reasoning.
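For readers unfamiliar with the two metrics, the sketch below computes success rate and success weighted by path length using the definition that is standard in embodied navigation, where each successful episode is weighted by the ratio of the shortest path length to the path actually flown. Whether the benchmark used for LongFly applies exactly this formula is an assumption, and the episode data are invented for illustration.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success (successes are 0 or 1)."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by path length, averaged over episodes.

    Assumes all lengths are positive. An episode contributes
    shortest / max(actual, shortest) when successful and 0 otherwise.
    """
    total = 0.0
    for s, shortest, actual in zip(successes, shortest_lengths, path_lengths):
        total += s * shortest / max(actual, shortest)
    return total / len(successes)

# Invented toy episodes: the third succeeds but flies twice the shortest distance.
successes = [1, 0, 1, 1]
shortest = [10.0, 20.0, 30.0, 40.0]
flown = [10.0, 25.0, 60.0, 40.0]

print(success_rate(successes))          # 0.75
print(spl(successes, shortest, flown))  # 0.625
```

The gap between the two numbers shows why both are reported: a method can reach the goal often while still taking inefficient routes.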
7. Considerations and Trade-offs
The slot-based compression strategy in SHIC enables unbounded temporal context with a fixed computational and memory budget. A recognized trade-off is the choice of slot count $K$: inadequate capacity results in lossy scene summarization, while excessive slots undermine compression efficiency. The design balances dynamic context retention with computational tractability for long-horizon VLN, anchoring the broader LongFly pipeline’s efficacy in environments characterized by visual and structural complexity (Jiang et al., 26 Dec 2025).