LongFly: UAV Navigation Framework

Updated 13 March 2026

LongFly is a spatiotemporal framework that compresses historical multi-view UAV images, encodes trajectory data, and integrates multimodal language instructions for extended navigation tasks.
It uses three dedicated modules (SHIC, STE, and PGM) that work in sequence to efficiently capture visual history, spatial trajectories, and guidance cues in real time.
The framework reports gains in waypoint prediction accuracy and navigation success, and targets complex post-disaster search and rescue operations.

LongFly is a spatiotemporal context modeling framework for long-horizon unmanned aerial vehicle (UAV) vision-and-language navigation (VLN). It targets conditions common in post-disaster search and rescue, where high information density, dynamic structures, and rapidly varying viewpoints challenge existing methods. It introduces history-aware context modeling to address the inaccurate semantic alignment and unstable path prediction that characterize prior VLN approaches in complex, long-horizon environments. At the core of LongFly are three interlinked modules: Slot-based Historical Image Compression (SHIC), the Spatiotemporal Trajectory Encoder (STE), and Prompt-Guided Multimodal Integration (PGM). Together they are optimized for robust waypoint prediction by integrating spatiotemporal and instruction contexts (Jiang et al., 26 Dec 2025).

1. Architectural Components of LongFly

LongFly’s framework is organized into three sequential modules:

Slot-based Historical Image Compression (SHIC): Converts sequences of past multi-view UAV images into a compact, fixed-size “slot” representation, supporting efficient retrieval of spatiotemporal information without memory growth over time.
Spatiotemporal Trajectory Encoder (STE): Encodes the 3D coordinates of previously visited waypoints to capture trajectory dynamics and spatial structure relevant to the navigation task.
Prompt-Guided Multimodal Integration (PGM): Fuses the compressed visual slots, encoded trajectory tokens, and language instruction to facilitate context-aware, temporally coherent waypoint reasoning and prediction.

This pipeline is designed so that SHIC provides the compact visual memory, STE supplies trajectory context, and PGM fuses both with the instruction, enabling temporally aware decision-making over extended navigation horizons.

2. Slot-based Historical Image Compression (SHIC) Module

SHIC is designed to summarize arbitrarily long sequences of multi-view observations into a fixed-length, expressive visual memory for subsequent multimodal integration:

Inputs: For each past time step $i$ ( $i=1,…,t-1$ ), the UAV acquires $V=5$ RGB images ( $R_i = \{R_i^1,…,R_i^5\}$ ), corresponding to front, rear, left, right, and bottom cameras.
Feature Extraction: Each image $R_i^v$ is processed by a CLIP-based ViT-L/14 visual encoder $F_v$ , producing $N_i$ tokens $Z_i^v = \{z_{i,j}^v \in \mathbb{R}^d | j=1…N_i\}$ , with $d=768$ or $1024$.
Slot Memory: For each view $v$ , SHIC maintains a slot set $S^v_i = \{s_{i,k}^v \in \mathbb{R}^d | k=1…K\}$ , where $K$ (default $32$) is the per-view slot count.
Recurrent Slot Attention: At each past step, the slot memory is updated via attention between slots (queries) and visual tokens (keys/values), followed by GRU-based recurrence:

1. Project slots and tokens via learned matrices ( $W_q$ , $W_k$ , $W_v \in \mathbb{R}^{d \times d}$ ):

$q_{i-1,k}^v = W_q s_{i-1,k}^v, \quad k_{i,j}^v = W_k z_{i,j}^v, \quad v_{i,j}^v = W_v z_{i,j}^v$

2. Compute per-slot attention weights:

$\alpha_{i,k,j}^v = \frac{\exp(q_{i-1,k}^{v\top}k_{i,j}^v/\sqrt{d})}{\sum_{j'=1}^{N_i} \exp(q_{i-1,k}^{v\top}k_{i,j'}^v/\sqrt{d})}$

3. Aggregate updates for each slot:

$\hat{s}_{i,k}^v = \sum_{j=1}^{N_i} \alpha_{i,k,j}^v v_{i,j}^v$

4. Update slots jointly via GRU:

$S_i^v = \mathrm{GRU}(S_{i-1}^v, \hat{S}_i^v)$

where $\hat{S}_i^v = \{\hat{s}_{i,1}^v, \ldots, \hat{s}_{i,K}^v\}$ .

Output: The final compressed memory after processing all past steps is $S_{t-1} = \bigcup_{v=1}^{V} S_{t-1}^v \in \mathbb{R}^{(V \cdot K) \times d}$ , a fixed-size representation of the entire history whose size does not depend on the number of past steps.

No additional loss term is applied to SHIC; its parameters are optimized end-to-end via the downstream navigation regression objective.
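Steps 1 to 3 of the slot update (projection, per-slot softmax over tokens, weighted aggregation) can be sketched for a single view in NumPy. This is a minimal illustration, not the authors' implementation: the function name `slot_attention_read`, the random weights, and the toy sizes are assumptions chosen for readability, and the GRU step is left out.

```python
import numpy as np

def slot_attention_read(slots, tokens, W_q, W_k, W_v):
    """One attention read for a single view (hypothetical helper).

    slots:  (K, d) previous slot memory S_{i-1}^v
    tokens: (N, d) visual tokens Z_i^v for the current step
    Returns (alpha, s_hat): weights of shape (K, N), updates of shape (K, d).
    """
    d = slots.shape[1]
    q = slots @ W_q.T                      # row k is W_q s_k
    k = tokens @ W_k.T                     # row j is W_k z_j
    v = tokens @ W_v.T                     # row j is W_v z_j
    logits = q @ k.T / np.sqrt(d)          # scaled dot products, (K, N)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability only
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=1, keepdims=True)     # softmax over tokens j
    s_hat = alpha @ v                      # weighted sum of values per slot
    return alpha, s_hat

rng = np.random.default_rng(0)
K, N, d = 4, 6, 8                          # toy sizes, not K=32 and d=768/1024
slots = rng.normal(size=(K, d))
tokens = rng.normal(size=(N, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

alpha, s_hat = slot_attention_read(slots, tokens, W_q, W_k, W_v)
print(alpha.shape, s_hat.shape)            # (4, 6) (4, 8)
```

Each row of `alpha` sums to one, so every slot update is a convex combination of the projected tokens from the current frame.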

3. Hyperparameters, Inputs/Outputs, and Complexity

Key SHIC hyperparameters and their roles:

Parameter	Typical Value(s)	Role/Trade-off
Number of slots ( $K$ )	$8$, $24$, $32$	Larger $K$ captures more visual detail; costs more compute
Slot dimension ( $d$ )	$768$ or $1024$	CLIP token size; affects memory and representation power
Number of views ( $V$ )	$5$	Reflects sensor configuration (multi-camera UAV)
Attention scaling	$1/\sqrt{d}$	Ensures stable gradients in softmax attention
GRU hidden size	$d$	Matches slot dimension; shared across views

Inputs to SHIC are past images $R_i^v$ , while the output is a compact slot matrix of size $(V \cdot K) \times d$ . The overall computational complexity per navigation step is $O(V \cdot K \cdot N_i \cdot d)$ , independent of the time horizon. The memory footprint for historical context is bounded by $O(V \cdot K \cdot d)$ because the slot representation has constant size, in contrast to $O(t \cdot N_i \cdot d)$ for naive concatenation of past tokens. The effective compression ratio is approximately $(t \cdot N_i)/K$ ; for $t=100$ , $N_i=49$ , $K=32$ , this gives $4900/32 \approx 153$.

A smaller $K$ may lead to loss of crucial visual details; a larger $K$ decreases compression efficiency. Empirical findings support $K=32$ as a favorable compromise.
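The arithmetic behind the compression ratio and the slot-count trade-off can be checked with a few lines of Python. The helper names are illustrative, and the values of $t$ and $N_i$ are the example figures used in this section rather than measured settings.

```python
def compression_ratio(t, n_tokens, k_slots):
    """Ratio of naively stored history tokens to slots: (t * N_i) / K."""
    return (t * n_tokens) / k_slots

def slot_memory_floats(num_views, k_slots, d):
    """Number of floats held by the slot memory: V * K * d."""
    return num_views * k_slots * d

def naive_memory_floats(num_views, t, n_tokens, d):
    """Number of floats if every past token were kept: V * t * N_i * d."""
    return num_views * t * n_tokens * d

for k_slots in (8, 24, 32):
    ratio = compression_ratio(t=100, n_tokens=49, k_slots=k_slots)
    print(f"K={k_slots}: ratio {ratio:.1f}")

slots_mem = slot_memory_floats(num_views=5, k_slots=32, d=768)
naive_mem = naive_memory_floats(num_views=5, t=100, n_tokens=49, d=768)
print(slots_mem, naive_mem)   # 122880 18816000
```

Doubling $t$ doubles the naive footprint while the slot memory stays at $V \cdot K \cdot d$ floats, which is why the ratio grows linearly with the horizon.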

4. SHIC Algorithmic Flow and Pseudocode

The SHIC procedure is outlined below:

for each view v in 1…V:
    S^v ← Φ^v      # initial slots for view v, shape (K×d)

for i = 1 to t-1:
    for each view v in 1…V:
        # Extract tokens from current image
        Z ← F_v(R_i^v)                 # Z shape (N_i×d)
        # Project queries, keys, values
        Q ← S^v W_q^⊤                  # (K×d)
        Kmat ← Z W_k^⊤                 # (N_i×d)
        Vmat ← Z W_v^⊤                 # (N_i×d)
        # Compute attention and updates
        for k = 1 to K:
            for j = 1 to N_i:
                e_{k,j} ← Q[k] · Kmat[j] / √d      # scaled dot-product score
            α_{k,·} ← softmax_j(e_{k,·})            # normalize over tokens j
            ŝ_k ← Σ_j α_{k,j} · Vmat[j]
        # GRU update
        S^v ← GRU(prev_state=S^v, input={ŝ_1,…,ŝ_K})

S_{t-1} ← concat over v of S^v         # final memory, shape ((V·K)×d)

This process ensures that at each navigation step, the historical multi-view images are progressively distilled into a set of slots. Slots are updated recurrently and independently for each view, then concatenated to form the visual memory available for high-level multimodal reasoning.
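A runnable NumPy sketch of the full procedure follows. It is an illustration under stated assumptions, not the reference implementation: random arrays stand in for the CLIP tokens and for the initial slots, the GRU is a standard bias-free cell whose weights are shared across views, and all function and parameter names (`shic`, `gru_cell`, `init_params`, `W_xr`, and so on) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gru_cell(h, x, p):
    """Standard GRU cell applied row-wise: h is the slot state, x the update."""
    r = sigmoid(x @ p["W_xr"] + h @ p["W_hr"])          # reset gate
    z = sigmoid(x @ p["W_xz"] + h @ p["W_hz"])          # update gate
    n = np.tanh(x @ p["W_xn"] + r * (h @ p["W_hn"]))    # candidate state
    return (1.0 - z) * n + z * h

def init_params(rng, d):
    """Random stand-ins for the learned attention and GRU weights."""
    scale = 1.0 / np.sqrt(d)
    attn = {n: rng.normal(size=(d, d)) * scale for n in ("W_q", "W_k", "W_v")}
    gru = {n: rng.normal(size=(d, d)) * scale
           for n in ("W_xr", "W_hr", "W_xz", "W_hz", "W_xn", "W_hn")}
    return attn, gru

def shic(frames, init_slots, attn, gru):
    """frames: (T, V, N, d) token features; init_slots: (V, K, d)."""
    slots = init_slots.copy()
    d = slots.shape[-1]
    for step_tokens in frames:                   # past steps i = 1 .. t-1
        for v, tokens in enumerate(step_tokens): # each view independently
            q = slots[v] @ attn["W_q"].T
            k = tokens @ attn["W_k"].T
            val = tokens @ attn["W_v"].T
            alpha = softmax(q @ k.T / np.sqrt(d), axis=1)  # over tokens j
            s_hat = alpha @ val
            slots[v] = gru_cell(slots[v], s_hat, gru)
    return slots.reshape(-1, d)                  # concatenated, (V*K, d)

rng = np.random.default_rng(0)
V, K, N, d = 5, 4, 6, 8                          # toy sizes
attn, gru = init_params(rng, d)
init_slots = rng.normal(size=(V, K, d))

short = shic(rng.normal(size=(3, V, N, d)), init_slots, attn, gru)
long = shic(rng.normal(size=(50, V, N, d)), init_slots, attn, gru)
print(short.shape, long.shape)                   # (20, 8) (20, 8)
```

The two calls differ only in history length (3 versus 50 steps) yet return memories of identical shape, which is the constant-footprint property the module relies on.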

5. Integration with Spatiotemporal and Instructional Context

After SHIC, the slot-based visual summary is linearly projected and provided as input to the Prompt-Guided Multimodal Integration (PGM) module alongside trajectory tokens from the Spatiotemporal Trajectory Encoder and textual instructions. The integration is performed using a Qwen-based LLM backbone, supporting robust, temporally informed waypoint prediction. The training objective for the entire system is to minimize a regression loss on next-waypoint coordinates, optionally incorporating an instruction-alignment loss as part of the multimodal large language model (MLLM) component. No loss is explicitly attributed to the SHIC module; all parameters are updated end-to-end for downstream navigation performance (Jiang et al., 26 Dec 2025).
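The data flow into PGM can be illustrated at the level of tensor shapes. Everything beyond the linear projection of the slots is an assumption made for the sketch: the token order, the small hidden size `d_llm`, the linear map standing in for the trajectory encoder, the mean-pooling that replaces the LLM backbone, and the mean-squared-error form of the regression loss are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, d = 5, 32, 768          # views, slots per view, slot dimension (from the text)
d_llm = 64                    # hypothetical LLM hidden size, kept small here
T, L = 10, 12                 # hypothetical counts: past waypoints, instruction tokens

slots = rng.normal(size=(V * K, d))        # SHIC output
waypoints = rng.normal(size=(T, 3))        # past 3D waypoint coordinates
instr = rng.normal(size=(L, d_llm))        # stand-in for embedded instruction tokens

W_slot = rng.normal(size=(d, d_llm)) / np.sqrt(d)   # linear projection of slots
W_traj = rng.normal(size=(3, d_llm))                # stand-in for the trajectory encoder

visual_tokens = slots @ W_slot             # (V*K, d_llm)
traj_tokens = waypoints @ W_traj           # (T, d_llm)

# Assumed ordering: instruction, then trajectory, then visual memory.
sequence = np.concatenate([instr, traj_tokens, visual_tokens], axis=0)

# Placeholder for the backbone: pool the sequence, then regress a 3D waypoint.
W_head = rng.normal(size=(d_llm, 3)) / np.sqrt(d_llm)
pred = sequence.mean(axis=0) @ W_head
target = np.array([1.0, 2.0, 3.0])         # made-up ground-truth waypoint
loss = float(np.mean((pred - target) ** 2))

print(sequence.shape, pred.shape)          # (182, 64) (3,)
```

Because the visual memory always contributes $V \cdot K$ tokens, the sequence length seen by the backbone grows only with the trajectory and instruction lengths, not with the number of past frames.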

6. Empirical Performance and Applications

LongFly reports improvements over prior UAV VLN methods, outperforming state-of-the-art baselines by 7.89% in success rate and 6.33% in success weighted by path length. Gains hold across both seen and unseen environments, which points to robustness and generalization. Applications are concentrated in post-disaster search and rescue but extend to any long-horizon UAV navigation scenario requiring efficient, context-aware spatial reasoning.
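For readers unfamiliar with the two metrics, the sketch below computes success rate and success weighted by path length using the commonly used definition, in which each successful episode is weighted by the ratio of the shortest-path length to the larger of the shortest and the actual path length. The episode data are invented for illustration, and the source does not spell out its exact evaluation protocol, so details such as the success threshold are not modeled.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by path length, averaged over episodes."""
    total = 0.0
    for s, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        total += s * shortest / max(taken, shortest)
    return total / len(successes)

# Made-up episodes: 1 = reached the goal, 0 = failed.
episode_success = [1, 0, 1, 1]
shortest = [10.0, 20.0, 15.0, 30.0]   # shortest-path lengths
taken = [12.5, 25.0, 15.0, 60.0]      # lengths actually flown

print(success_rate(episode_success))  # 0.75
print(round(spl(episode_success, shortest, taken), 3))  # 0.575
```

The last episode succeeds but flies twice the shortest distance, so it contributes only 0.5 to the sum; this is how the second metric penalizes inefficient routes that the first one ignores.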

7. Considerations and Trade-offs

The slot-based compression strategy in SHIC enables unbounded temporal context with a fixed computational and memory budget. A recognized trade-off is the choice of slot count $K$ : inadequate capacity results in lossy scene summarization, while excessive slots undermine compression efficiency. The design balances dynamic context retention with computational tractability for long-horizon VLN, anchoring the broader LongFly pipeline’s efficacy in environments characterized by visual and structural complexity (Jiang et al., 26 Dec 2025).

References (1)

LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration (2025)
