
LongFly: UAV Navigation Framework

Updated 13 March 2026
  • LongFly is a spatiotemporal framework that compresses historical multi-view UAV images, encodes trajectory data, and integrates multimodal language instructions for extended navigation tasks.
  • It utilizes dedicated modules—SHIC, STE, and PGM—that work in sequence to efficiently capture visual history, spatial trajectories, and guidance cues in real time.
  • The framework improves waypoint prediction and navigation success over prior UAV VLN methods (by 7.89% in success rate and 6.33% in success weighted by path length), which makes it well suited to complex post-disaster search and rescue operations.

LongFly is a spatiotemporal context modeling framework developed for long-horizon unmanned aerial vehicle (UAV) vision-and-language navigation (VLN) tasks, particularly under conditions prevalent in post-disaster search and rescue where high information density, dynamic structures, and rapidly varying viewpoints challenge existing methods. It introduces history-aware context modeling to overcome the limitations of inaccurate semantic alignment and unstable path prediction characteristic of prior VLN approaches in complex, long-horizon environments. At the core of LongFly are three interlinked modules: Slot-based Historical Image Compression (SHIC), Spatiotemporal Trajectory Encoder (STE), and Prompt-Guided Multimodal (PGM) integration, collectively optimized for robust waypoint prediction via efficient integration of spatiotemporal and instruction contexts (Jiang et al., 26 Dec 2025).

1. Architectural Components of LongFly

LongFly’s framework is organized into three sequential modules:

  1. Slot-based Historical Image Compression (SHIC): Converts sequences of past multi-view UAV images into a compact, fixed-size “slot” representation, supporting efficient retrieval of spatiotemporal information without memory growth over time.
  2. Spatiotemporal Trajectory Encoder (STE): Encodes the 3D coordinates of previously visited waypoints to capture trajectory dynamics and spatial structure relevant to the navigation task.
  3. Prompt-Guided Multimodal Integration (PGM): Fuses the compressed visual slots, encoded trajectory tokens, and language instruction to facilitate context-aware, temporally coherent waypoint reasoning and prediction.

This architectural pipeline is designed so that SHIC provides the compact visual memory, STE supplies trajectory context, and PGM achieves holistic multimodal integration, enabling temporally-aware decision-making over extended navigation horizons.

2. Slot-based Historical Image Compression (SHIC) Module

SHIC is designed to summarize arbitrarily long sequences of multi-view observations into a fixed-length, expressive visual memory for subsequent multimodal integration:

  • Inputs: For each past time step $i$ ($i = 1, \dots, t-1$), the UAV acquires $V = 5$ RGB images $R_i = \{R_i^1, \dots, R_i^5\}$, corresponding to the front, rear, left, right, and bottom cameras.
  • Feature Extraction: Each image $R_i^v$ is processed by a CLIP-based ViT-L/14 visual encoder $F_v$, producing $N_i$ tokens $Z_i^v = \{z_{i,j}^v \in \mathbb{R}^d \mid j = 1, \dots, N_i\}$, with $d = 768$ or $1024$.
  • Slot Memory: For each view $v$, SHIC maintains a slot set $S_i^v = \{s_{i,k}^v \in \mathbb{R}^d \mid k = 1, \dots, K\}$, where $K$ (default $32$) is the per-view slot count.
  • Recurrent Slot Attention: At each past step, the slot memory is updated via attention between slots (queries) and visual tokens (keys/values), followed by GRU-based recurrence:

    1. Project slots and tokens via learned matrices $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$:

       $$q_{i-1,k}^v = W_q s_{i-1,k}^v, \quad k_{i,j}^v = W_k z_{i,j}^v, \quad v_{i,j}^v = W_v z_{i,j}^v$$

    2. Compute per-slot attention weights:

       $$\alpha_{i,k,j}^v = \frac{\exp\left(q_{i-1,k}^{v\top} k_{i,j}^v / \sqrt{d}\right)}{\sum_{j'=1}^{N_i} \exp\left(q_{i-1,k}^{v\top} k_{i,j'}^v / \sqrt{d}\right)}$$

    3. Aggregate updates for each slot:

       $$\hat{s}_{i,k}^v = \sum_{j=1}^{N_i} \alpha_{i,k,j}^v \, v_{i,j}^v$$

    4. Update slots jointly via GRU:

       $$S_i^v = \mathrm{GRU}\left(S_{i-1}^v, \hat{S}_i^v\right)$$

       where $\hat{S}_i^v = \{\hat{s}_{i,1}^v, \dots, \hat{s}_{i,K}^v\}$.

  • Output: The final compressed memory after processing all past steps is $S_{t-1} = \bigcup_{v=1}^{V} S_{t-1}^v \in \mathbb{R}^{(V \cdot K) \times d}$, a representation of the entire history whose size does not depend on the sequence length.

No additional loss term is applied to SHIC; its parameters are optimized end-to-end via the downstream navigation regression objective.
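
The attention read in steps 1 to 3 above can be sketched for a single view in NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `slot_attention_step`, the random weights, and the reduced dimension `d = 64` are assumptions chosen only to keep the example small. Note that the softmax is taken over tokens $j$ for each slot, as in the formula above.

```python
import numpy as np

def slot_attention_step(slots, tokens, W_q, W_k, W_v):
    """One attention read for a single view (hypothetical helper).

    slots:  (K, d) previous slot memory S_{i-1}^v
    tokens: (N, d) visual tokens Z_i^v
    Returns (alpha, s_hat): weights (K, N) and candidate updates (K, d).
    """
    d = slots.shape[1]
    q = slots @ W_q.T                              # (K, d) queries from slots
    k = tokens @ W_k.T                             # (N, d) keys from tokens
    v = tokens @ W_v.T                             # (N, d) values from tokens
    logits = q @ k.T / np.sqrt(d)                  # (K, N) scaled dot products
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=1, keepdims=True)      # softmax over tokens j
    s_hat = alpha @ v                              # (K, d) aggregated updates
    return alpha, s_hat

rng = np.random.default_rng(0)
K, N, d = 32, 49, 64   # d is reduced for illustration (assumption)
slots = rng.normal(size=(K, d))
tokens = rng.normal(size=(N, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

alpha, s_hat = slot_attention_step(slots, tokens, W_q, W_k, W_v)
print(alpha.shape, s_hat.shape)
```

Each row of `alpha` sums to one, so every slot update is a convex combination of the projected tokens; the GRU in step 4 then decides how much of that update to write into the slot.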

3. Hyperparameters, Inputs/Outputs, and Complexity

Key SHIC hyperparameters and their roles:

| Parameter | Typical value(s) | Role / trade-off |
|---|---|---|
| Number of slots ($K$) | $8$, $24$, $32$ | Larger $K$ captures more visual detail but costs more compute |
| Slot dimension ($d$) | $768$ or $1024$ | CLIP token size; affects memory and representation power |
| Number of views ($V$) | $5$ | Reflects sensor configuration (multi-camera UAV) |
| Attention scaling | $1/\sqrt{d}$ | Ensures stable gradients in softmax attention |
| GRU hidden size | $d$ | Matches slot dimension; shared across views |

Inputs to SHIC are the past images $R_i^v$, while the output is a compact slot matrix of size $(V \cdot K) \times d$. The overall computational complexity per navigation step is $O(V \cdot K \cdot N_i \cdot d)$, independent of the time horizon. The memory footprint for historical context is constrained to $O(V \cdot K \cdot d)$ due to the constant-size slot representation, in stark contrast to $O(t \cdot N_i \cdot d)$ for naive concatenation of past tokens. The effective compression ratio is approximately $(t \cdot N_i)/K$; for $t = 100$, $N_i = 49$, $K = 32$, this yields a ratio of about $153$.

A smaller $K$ may lead to loss of crucial visual details; a larger $K$ decreases compression efficiency. Empirical findings support $K = 32$ as a favorable compromise.
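
The compression arithmetic can be checked with a few lines of Python. The helper `history_cost` is a hypothetical name, and counting stored floats with the view count $V$ included on both sides is an assumption made for illustration; the ratio itself is the per-view $(t \cdot N_i)/K$ from the text.

```python
def history_cost(t, n_tokens, k_slots, views=5, d=768):
    """Return (naive_floats, slot_floats, ratio) for a history of t steps."""
    naive_floats = t * n_tokens * views * d   # every past token is kept
    slot_floats = views * k_slots * d         # fixed-size slot memory
    ratio = (t * n_tokens) / k_slots          # per-view token-to-slot ratio
    return naive_floats, slot_floats, ratio

# Slot counts listed in the hyperparameter table, with t = 100 and N_i = 49.
for k_slots in (8, 24, 32):
    naive, slot, ratio = history_cost(t=100, n_tokens=49, k_slots=k_slots)
    print(f"K={k_slots}: naive={naive} slot={slot} ratio={ratio:.1f}")
```

For $K = 32$ the ratio is $4900 / 32 = 153.125$, matching the figure of about $153$; smaller slot counts compress more aggressively at the cost of capacity.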

4. SHIC Algorithmic Flow and Pseudocode

The SHIC procedure is outlined below:

for each view v in 1..V:
    S^v ← Φ^v      # initial slots, shape (K×d)

for i = 1 to t-1:
    for each view v in 1..V:
        # Extract tokens from current image
        Z ← F_v(R_i^v)                 # Z shape (N_i×d)
        # Project queries, keys, values
        Q ← S^v W_q^T                  # (K×d)
        Kmat ← Z W_k^T                 # (N_i×d)
        Vmat ← Z W_v^T                 # (N_i×d)
        # Compute attention and updates
        for k = 1 to K:
            α_k ← softmax_j(Q[k] · Kmat[j] / sqrt(d))   # weights over j = 1..N_i
            ŝ_k ← sum_j(α_{k,j} * Vmat[j])
        # GRU update
        S^v ← GRU(prev_state=S^v, input={ŝ_1, …, ŝ_K})

return concat_v(S^v)                   # (V·K × d)

This process ensures that at each navigation step the historical multi-view images are progressively distilled into a set of slots. The slots are updated recurrently and independently for each view, then concatenated to form the visual memory available for high-level multimodal reasoning.
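
A runnable NumPy sketch of the full recurrence is given below. It is a sketch under stated assumptions, not the paper's code: the visual encoder is replaced by random tokens, the initial slots are random rather than learned, the projection matrices are shared across views, and `GRUCell` is a minimal hand-written standard GRU cell. The point it illustrates is that the memory shape stays at $(V \cdot K) \times d$ regardless of how many past steps are processed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal standard GRU cell applied row-wise to the K slots."""
    def __init__(self, d, rng):
        s = 1.0 / np.sqrt(d)
        self.W = rng.uniform(-s, s, size=(3, d, d))  # input weights: update, reset, candidate
        self.U = rng.uniform(-s, s, size=(3, d, d))  # hidden-state weights
        self.b = np.zeros((3, d))

    def __call__(self, h, x):
        z = sigmoid(x @ self.W[0] + h @ self.U[0] + self.b[0])        # update gate
        r = sigmoid(x @ self.W[1] + h @ self.U[1] + self.b[1])        # reset gate
        n = np.tanh(x @ self.W[2] + (r * h) @ self.U[2] + self.b[2])  # candidate state
        return (1.0 - z) * n + z * h

def attend(slots, tokens, Wq, Wk, Wv):
    """Slots query the tokens; softmax runs over tokens for each slot."""
    d = slots.shape[1]
    logits = (slots @ Wq.T) @ (tokens @ Wk.T).T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)
    alpha = np.exp(logits)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ (tokens @ Wv.T)

def compress_history(history, init_slots, Wq, Wk, Wv, gru):
    """history: list over time of (V, N, d) token arrays; returns (V*K, d)."""
    slots = init_slots.copy()                      # (V, K, d)
    for tokens_t in history:                       # past steps i = 1..t-1
        for v in range(slots.shape[0]):            # each view is updated independently
            s_hat = attend(slots[v], tokens_t[v], Wq, Wk, Wv)
            slots[v] = gru(slots[v], s_hat)
    return slots.reshape(-1, slots.shape[-1])      # concatenate views

rng = np.random.default_rng(0)
V, K, N, d = 5, 32, 49, 64                         # d reduced for illustration
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
gru = GRUCell(d, rng)
init = rng.normal(size=(V, K, d))                  # stand-in for learned initial slots

for steps in (10, 100):
    history = [rng.normal(size=(V, N, d)) for _ in range(steps)]  # stand-in for encoder tokens
    memory = compress_history(history, init, Wq, Wk, Wv, gru)
    print(steps, memory.shape)
```

Both history lengths produce a memory of shape `(160, 64)`, which is the constant-size property the slot design relies on.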

5. Integration with Spatiotemporal and Instructional Context

After SHIC, the slot-based visual summary is linearly projected and provided as input to the Prompt-Guided Multimodal (PGM) integration module alongside trajectory tokens from the Spatiotemporal Trajectory Encoder and textual instructions. The integration is performed using a Qwen-based LLM backbone, supporting robust, temporally-informed waypoint prediction. The training objective for the entire system is to minimize a regression loss on next-waypoint coordinates, optionally incorporating an instruction-alignment loss as part of the multimodal large language modeling (MLLM) component. No loss is explicitly attributed to the SHIC module; all parameters are updated end-to-end for downstream navigation performance (Jiang et al., 26 Dec 2025).
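
The data flow described above (project the slots, combine them with trajectory and instruction tokens, regress the next waypoint) can be illustrated with shapes only. Everything specific in this sketch is an assumption: the token counts, the token ordering, the dimensions, and in particular the mean-pool plus linear head, which is a simple stand-in for the Qwen-based backbone and not how the model actually reasons. The mean-squared error is likewise one possible choice of regression loss.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, d_vis, d_llm = 5, 32, 64, 128     # illustrative sizes, not the paper's
n_traj, n_text = 12, 20                 # hypothetical token counts

slots = rng.normal(size=(V * K, d_vis))          # SHIC output (random stand-in)
traj_tokens = rng.normal(size=(n_traj, d_llm))   # stand-in for STE output
text_tokens = rng.normal(size=(n_text, d_llm))   # stand-in for the embedded instruction

# Linear projection of the visual slots into the language model's token space.
W_proj = rng.normal(size=(d_vis, d_llm)) / np.sqrt(d_vis)
visual_tokens = slots @ W_proj                   # (V*K, d_llm)

# Assumed ordering: instruction, visual memory, trajectory.
sequence = np.concatenate([text_tokens, visual_tokens, traj_tokens], axis=0)

# Stand-in for the LLM backbone: mean-pool, then a linear regression head.
W_head = rng.normal(size=(d_llm, 3)) / np.sqrt(d_llm)
pred_waypoint = sequence.mean(axis=0) @ W_head   # predicted (x, y, z)

target_waypoint = np.array([1.0, 2.0, 0.5])      # made-up target for illustration
loss = float(np.mean((pred_waypoint - target_waypoint) ** 2))
print(sequence.shape, pred_waypoint.shape, loss)
```

Because the visual part of the sequence always contributes $V \cdot K$ tokens, the input length seen by the backbone grows only with the trajectory and instruction tokens, not with the number of past images.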

6. Empirical Performance and Applications

LongFly demonstrates significant improvements over prior UAV VLN methods, outperforming state-of-the-art baselines by 7.89% in success rate and 6.33% in success weighted by path length. Gains are sustained across both seen and unseen environments, emphasizing the framework’s robustness and generalization. Applications are concentrated in post-disaster search and rescue but extend to any long-horizon UAV navigation scenarios requiring efficient, context-aware spatial reasoning.
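
For readers unfamiliar with the two metrics, the following sketch computes success rate and success weighted by path length (SPL) using the standard SPL definition, $\frac{1}{N}\sum_i S_i \, \ell_i / \max(p_i, \ell_i)$, where $S_i$ is the success indicator, $\ell_i$ the shortest-path length, and $p_i$ the length of the path actually flown. The episode data are invented for illustration and are not results from the paper; whether the benchmark uses exactly this SPL variant is an assumption.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success."""
    return sum(successes) / len(successes)

def spl(successes, shortest, taken):
    """Success weighted by path length over a set of episodes."""
    total = 0.0
    for s, l, p in zip(successes, shortest, taken):
        total += s * l / max(p, l)      # penalize paths longer than the shortest one
    return total / len(successes)

# Hypothetical episodes: success flag, shortest path length, flown path length.
successes = [1, 0, 1, 1]
shortest = [10.0, 20.0, 15.0, 8.0]
taken = [12.5, 30.0, 15.0, 16.0]

sr = success_rate(successes)
spl_value = spl(successes, shortest, taken)
print(f"SR={sr:.3f} SPL={spl_value:.3f}")
```

SPL can never exceed the success rate, since each successful episode contributes at most one; a gap between the two indicates that successful flights took detours.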

7. Considerations and Trade-offs

The slot-based compression strategy in SHIC enables unbounded temporal context with a fixed computational and memory budget. A recognized trade-off is the choice of slot count $K$: inadequate capacity results in lossy scene summarization, while excessive slots undermine compression efficiency. The design balances dynamic context retention with computational tractability for long-horizon VLN, anchoring the broader LongFly pipeline’s efficacy in environments characterized by visual and structural complexity (Jiang et al., 26 Dec 2025).
