
Aerial Vision-and-Language Navigation

Updated 23 December 2025
  • Aerial Vision-and-Language Navigation (AVLN) is a field that combines UAV control, natural language understanding, and visual perception so that aerial agents can autonomously navigate complex 3D spaces.
  • It leverages semantic, topological, and metric representations along with LLM-guided planning to translate free-form instructions into precise flight commands.
  • Key challenges include long-horizon planning, interpreting ambiguous language, and ensuring robust sim-to-real transfer in dynamic urban settings.

Aerial Vision-and-Language Navigation (AVLN) describes the problem of autonomously piloting unmanned aerial vehicles (UAVs) in complex environments by interpreting free-form natural language instructions and leveraging visual perception from onboard sensors. In AVLN, an agent receives a goal-oriented linguistic command and must navigate 3D continuous spaces using egocentric visual inputs, executing a sequence of actions to reach a spatial target referenced in the instruction. The field unifies embodied multimodal artificial intelligence, natural language understanding, spatial and geometric reasoning, and aerial robotic control at urban or regional scale.

1. Formal Task Definition and Key Challenges

In AVLN, the UAV's state at each time step is defined by its full 6-DoF pose $P_t = [x_t, y_t, z_t, \phi_t, \theta_t, \psi_t] \in \mathbb{R}^6$, where the position $(x, y, z)$ and orientation (Euler angles $\phi, \theta, \psi$) are typically tracked. The primary observation is a sequence of egocentric RGB and/or depth images ($I_t^R$, $I_t^D$), with some paradigms exploiting panoramic or multi-view sensors. The agent receives a natural language path description $L$, which may contain free-form references to landmarks, directional cues, and sub-goal sequencing ("Lift off, head north over the lake until you see the red building on your right...").
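The task interface implied by this formulation can be sketched minimally as follows; the class and field names here are illustrative choices, not the API of any specific benchmark or simulator:

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class UAVState:
    """6-DoF pose P_t = [x, y, z, phi, theta, psi] (position in metres, Euler angles in radians)."""
    position: np.ndarray      # shape (3,): x, y, z
    orientation: np.ndarray   # shape (3,): roll phi, pitch theta, yaw psi

@dataclass
class Observation:
    """Egocentric sensor readings at one time step."""
    rgb: np.ndarray                     # I_t^R, e.g. (H, W, 3) uint8
    depth: Optional[np.ndarray] = None  # I_t^D, e.g. (H, W) float32, metres

@dataclass
class Episode:
    """One AVLN episode: a free-form instruction plus the goal used for evaluation."""
    instruction: str                    # natural language path description L
    goal_position: np.ndarray           # (3,) target used for the success check
    observations: List[Observation] = field(default_factory=list)
```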

Action spaces vary, with a common structure of discretized high-level motion primitives parameterized by yaw and distance, e.g., $\{\text{forward}, \text{backward}, \text{left}, \text{right}, \text{up}, \text{down}\}$ with an associated $(\Delta\psi, d)$ for heading change and translation (Gao et al., 11 Oct 2024). Continuous velocity control is also explored (Zhang et al., 12 Jun 2025). Success is measured by proximity to the goal, often within a 20 m Euclidean threshold, or by standard metrics such as navigation error (NE), success rate (SR), oracle success rate (OSR), and success weighted by path length (SPL).
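As a concrete illustration of such a discretized action space and the distance-based success criterion, the sketch below converts a primitive and its $(\Delta\psi, d)$ parameters into a pose update; the primitive set and the 20 m threshold follow the description above, while the coordinate conventions and default step size are assumptions:

```python
import math
import numpy as np

PRIMITIVES = ("forward", "backward", "left", "right", "up", "down", "stop")

def apply_action(position, yaw, primitive, delta_yaw=0.0, distance=5.0):
    """Apply one discretized motion primitive parameterized by heading change and translation."""
    yaw = yaw + delta_yaw  # update heading first
    dx, dy, dz = 0.0, 0.0, 0.0
    if primitive == "forward":
        dx, dy = math.cos(yaw) * distance, math.sin(yaw) * distance
    elif primitive == "backward":
        dx, dy = -math.cos(yaw) * distance, -math.sin(yaw) * distance
    elif primitive == "left":
        dx, dy = -math.sin(yaw) * distance, math.cos(yaw) * distance
    elif primitive == "right":
        dx, dy = math.sin(yaw) * distance, -math.cos(yaw) * distance
    elif primitive == "up":
        dz = distance
    elif primitive == "down":
        dz = -distance
    return position + np.array([dx, dy, dz]), yaw

def is_success(final_position, goal_position, threshold_m=20.0):
    """Episode counts as a success if the UAV stops within the Euclidean threshold of the goal."""
    return float(np.linalg.norm(final_position - goal_position)) <= threshold_m
```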

Core challenges include:

  • Generalizing from seen to unseen environments in high-dimensional, obstacle-rich, and dynamically structured 3D aerial domains.
  • Interpreting complex, ambiguous, and often abstract spatial language with varying levels of specificity and granularity.
  • Handling long-horizon planning without predefined navigation graphs, requiring robust global and local spatial inference.
  • Precise altitude control and 3D collision avoidance, subject to real-world UAV kinematics, especially under limited sensor payloads.

2. Representation and Spatial Reasoning Paradigms

Modern AVLN systems employ diverse spatial and semantic representations to bridge the gap between visual input, ambiguous language, and action selection.

Semantic-Topo-Metric Representations (STMR)

A prominent paradigm encodes the environment as a 2D matrix around the current pose, discretized at fixed metric granularity (e.g., a $20 \times 20$ grid at 5 m per cell). Semantic masks of instruction-relevant landmarks are detected and projected onto this grid, fusing semantic (object types), topological (spatial arrangement), and metric (relative distances) information. The agent's position and heading are explicitly encoded, yielding a compact prompt for LLM-based reasoning (Gao et al., 11 Oct 2024).
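A simplified sketch of how such a semantic-topo-metric grid could be assembled and serialized into an LLM prompt is given below; the grid size and cell resolution follow the figures quoted above, whereas the detection format and serialization scheme are illustrative assumptions rather than the exact STMR implementation:

```python
import numpy as np

GRID_SIZE = 20      # 20 x 20 cells centred on the UAV
CELL_M = 5.0        # 5 m per cell

def build_stm_grid(detections, uav_xy):
    """Project detected landmarks (label, world x, world y) into an egocentric 2D grid.

    Cell value 0 encodes free space; other integers index instruction-relevant landmark classes.
    """
    grid = np.zeros((GRID_SIZE, GRID_SIZE), dtype=np.int32)
    labels = {}
    for label, wx, wy in detections:
        # metric offset from the UAV, discretized to the grid
        col = int((wx - uav_xy[0]) / CELL_M) + GRID_SIZE // 2
        row = int((wy - uav_xy[1]) / CELL_M) + GRID_SIZE // 2
        if 0 <= row < GRID_SIZE and 0 <= col < GRID_SIZE:
            idx = labels.setdefault(label, len(labels) + 1)
            grid[row, col] = idx
    return grid, labels

def serialize_for_prompt(grid, labels, heading_deg):
    """Turn the grid into a compact text block an LLM can reason over."""
    legend = ", ".join(f"{i}={name}" for name, i in labels.items())
    rows = "\n".join(" ".join(str(v) for v in row) for row in grid)
    return (f"Agent at grid centre, heading {heading_deg:.0f} deg.\n"
            f"Legend: 0=empty, {legend}\n{rows}")
```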

Map-based and Grid-based Representations

CityNav (Lee et al., 20 Jun 2024) and related models maintain multi-channel navigation maps with layers encoding current view, explored area, known landmarks (from OSM or extracted via LLM), probable target locations (via visual detectors), and contextual objects. Discrete grid maps, often constructed as bird’s-eye view (BEV) tensors, accumulate spatial context, enabling long-horizon planning and mitigating partial observability (Zhao et al., 14 Mar 2025).
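A minimal sketch of such a multi-channel navigation map follows; the channel set mirrors the layers listed above, while the map size, resolution, and update rule are assumptions for illustration:

```python
import numpy as np

CHANNELS = ("current_view", "explored", "landmarks", "target_prob", "context")

class BEVMap:
    """Accumulating bird's-eye-view map with one channel per semantic layer."""

    def __init__(self, size_cells=256, cell_m=2.0):
        self.cell_m = cell_m
        self.grid = np.zeros((len(CHANNELS), size_cells, size_cells), dtype=np.float32)

    def _to_cell(self, world_xy):
        centre = self.grid.shape[-1] // 2
        return (int(world_xy[1] / self.cell_m) + centre,
                int(world_xy[0] / self.cell_m) + centre)

    def update(self, channel, world_xy, value=1.0):
        """Write an observation (e.g. a detected landmark) into one layer of the map."""
        row, col = self._to_cell(world_xy)
        if 0 <= row < self.grid.shape[1] and 0 <= col < self.grid.shape[2]:
            ch = CHANNELS.index(channel)
            self.grid[ch, row, col] = max(self.grid[ch, row, col], value)
```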

Coarse-to-Fine and Hierarchical Models

Multi-stage frameworks such as HETT (Ding et al., 16 Dec 2025) employ an initial coarse-grained target prediction, fusing landmarks, historical trajectory memory, and instruction embeddings, followed by fine-grained local action refinement via cross-modal fusion. Hierarchical semantic planning (landmark $\to$ object $\to$ motion) further decomposes the planning space, reducing complexity from $m^n$ (a flat search over $m$ primitives across an $n$-step horizon) to a chain of manageable sub-tasks (Zhang et al., 8 May 2025).
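The complexity argument can be made concrete with a toy planner that commits to a landmark, then an object, then a motion segment, instead of searching the flat action space; the function signature and scoring hook below are placeholders rather than the HETT or hierarchical-planning implementations:

```python
def hierarchical_plan(instruction, landmarks, objects_near, motions, score):
    """Plan as a chain of small choices: landmark -> object -> motion primitive.

    `score(level, candidate, instruction)` is any cross-modal scorer (an LLM or a
    learned head); each level only ranks its own candidates, so the search cost is
    roughly |landmarks| + |objects| + |motions| instead of |actions|**horizon.
    """
    landmark = max(landmarks, key=lambda l: score("landmark", l, instruction))
    obj = max(objects_near(landmark), key=lambda o: score("object", o, instruction))
    motion = max(motions, key=lambda m: score("motion", (landmark, obj, m), instruction))
    return landmark, obj, motion
```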

3. Learning Architectures and LLM Integration

Aerial VLN architectures draw from sequence modeling, transformer-based cross-modal fusion, and LLM prompting.

  • LLM-guided Zero-Shot Planning: STMR (Gao et al., 11 Oct 2024) demonstrates that a frozen LLM (GPT-4V/o) can synthesize continuous flight commands via prompt engineering—receiving a text serialization of the semantic grid, instruction sub-goals, history, and plan hints, and outputting precise next actions. No navigation-policy training is required.
  • Unified Next-Token Paradigms: Some architectures model navigation, spatial perception, and trajectory summarization as a single next-token prediction problem, leveraging prompt-guided multi-task learning for robust cross-modal reasoning (Xu et al., 9 Dec 2025).
  • Chain-of-Thought (CoT) and Interpretable Reasoning: Architectures such as FlightGPT (Cai et al., 19 May 2025) integrate explicit CoT by prompting the VLM/LLM to reason step-by-step—parsing instructions, conducting spatial inference on semantic maps, and justifying target predictions prior to decoding coordinates. CoT chains are enforced via supervised and reinforcement losses.
  • End-to-End Vision-Language Modules: Systems like UAV-VLN (Saxena et al., 30 Apr 2025) fuse fine-tuned LLM intent parsing with open-vocabulary vision models (e.g., Grounding DINO), aligning linguistic and visual features via attention. These modules are jointly optimized, e.g., via cross-modal grounding losses, leading to robust and interpretable trajectory planning.

The rise of prompt-based, closed-loop, or learning-free controllers (e.g., SPF (Hu et al., 26 Sep 2025)) reframes action prediction as 2D spatial waypoint grounding, further simplifying the architectural stack and exploiting foundation model generality.
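A schematic closed-loop controller in this prompt-based, learning-free spirit is sketched below; the prompt layout, the `query_vlm` callable, and the expected reply format are assumptions for illustration and do not reproduce any particular system's interface:

```python
import json

def plan_step(query_vlm, instruction, grid_text, history):
    """One closed-loop step: serialize state into a prompt, query a frozen VLM/LLM,
    and parse the reply into a motion command."""
    prompt = (
        "You are piloting a UAV.\n"
        f"Instruction: {instruction}\n"
        f"Egocentric semantic grid:\n{grid_text}\n"
        f"Actions so far: {history}\n"
        'Reply as JSON: {"action": <primitive>, "delta_yaw_deg": <float>, "distance_m": <float>}'
    )
    reply = query_vlm(prompt)          # any chat-completion style call
    try:
        cmd = json.loads(reply)
    except json.JSONDecodeError:
        cmd = {"action": "stop"}       # fall back safely on unparsable output
    history.append(cmd.get("action", "stop"))
    return cmd
```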

4. Training, Evaluation Protocols, and Datasets

AVLN has catalyzed the development of large-scale, photorealistic 3D benchmarks with richly annotated language, trajectory, and map information.
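The annotations these benchmarks provide can be pictured as episode records along the following lines; the field names are generic placeholders and do not match any specific dataset's schema:

```python
example_episode = {
    "episode_id": "demo-0001",
    "instruction": "Lift off, head north over the lake until you see the red building on your right.",
    "start_pose": {"xyz": [0.0, 0.0, 30.0], "yaw_deg": 0.0},
    "goal_position": [120.0, 450.0, 35.0],           # evaluated against the distance threshold
    "reference_trajectory": [[0, 0, 30], [0, 150, 32], [40, 300, 35], [120, 450, 35]],
    "landmarks": [{"name": "lake", "xyz": [10, 200, 0]},
                  {"name": "red building", "xyz": [130, 440, 0]}],
}
```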

5. Key Results and Comparative Empirical Findings

Selected results on standard benchmarks underscore both algorithmic innovation and the persistent difficulty of AVLN.

| Model | Dataset | SR (%) | OSR (%) | SPL (%) | NE (m) | Reference |
|---|---|---|---|---|---|---|
| Human | AerialVLN | 80.8 | 80.8 | 14.2 | 9.8 | (Liu et al., 2023) |
| CMA (AerialVLN baseline) | AerialVLN | 1.6 | 4.1 | 0.5 | 359 | (Liu et al., 2023) |
| STMR + GPT-4o (zero-shot) | AerialVLN-S | 10.8 | 23.0 | 1.9 | 119.5 | (Gao et al., 11 Oct 2024) |
| FlightGPT (SFT + RL) | CityNav | 21.2 | 35.4 | 19.2 | 76.2 | (Cai et al., 19 May 2025) |
| HETT* (ann. refined) | CityNav | 31.09 | | | | (Ding et al., 16 Dec 2025) |
| OpenVLN (RL, hard set, 25% data) | TravelUAV | 13.3 | | 4.1 | | (Lin et al., 9 Nov 2025) |
| OpenFly-Agent | OpenFly | 18.5 | 50.9 | 12.2 | 115 | (Gao et al., 25 Feb 2025) |
| AeroDuo (dual-UAV) | HaL-13k | 16.6 | 28.6 | 13.9 | 84.3 | (Wu et al., 21 Aug 2025) |
| SPF (learning-free) | DRL Sim | 93.9 | | | | (Hu et al., 26 Sep 2025) |

Absolute performance remains substantially below human across all realistic city-scale datasets, especially in unseen environments and open-world, long-horizon settings. Ablations show that fusing hierarchical planning, explicit memory (experience or map), instruction decomposition, and cross-modal semantic/metric representations consistently yields additive improvements.
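For reference, the metrics reported above (SR, OSR, SPL, NE) can be computed from a predicted path, the reference shortest-path length, and the goal as in the generic sketch below, which follows the standard success-weighted-by-path-length definition of SPL and the 20 m threshold quoted earlier:

```python
import numpy as np

def navigation_error(final_pos, goal_pos):
    """NE: Euclidean distance (m) between the stopping point and the goal."""
    return float(np.linalg.norm(np.asarray(final_pos) - np.asarray(goal_pos)))

def success(final_pos, goal_pos, threshold_m=20.0):
    """SR contribution: 1 if the agent stops within the threshold, else 0."""
    return float(navigation_error(final_pos, goal_pos) <= threshold_m)

def oracle_success(path, goal_pos, threshold_m=20.0):
    """OSR contribution: 1 if any point along the path came within the threshold."""
    return float(min(navigation_error(p, goal_pos) for p in path) <= threshold_m)

def spl(success_i, path_length, shortest_length):
    """SPL term: success weighted by shortest-path length over max(taken, shortest)."""
    return success_i * shortest_length / max(path_length, shortest_length)
```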

6. Generalization, Interpretable Reasoning, and Open Challenges

The AVLN community has uncovered a gap between simulated/algorithmic agents and human-level spatial reasoning, especially regarding:

  • Generalization to out-of-distribution geography, unseen object categories, and abstract or indirect instructions.
  • Long-horizon credit assignment and action compounding without route drift or myopic failures.
  • Robustness to missing or noisy landmark detection, ambiguous language, or dynamic obstacles.
  • Sim-to-real transfer, considering real-world flight dynamics, localization drift, and computational constraints for onboard inference.

Approaches that leverage model-agnostic visual-language grounding (e.g., SPF (Hu et al., 26 Sep 2025), VLFly (Zhang et al., 12 Jun 2025)), chain-of-thought prompting (FlightGPT (Cai et al., 19 May 2025)), and collaborative dual-altitude agents (AeroDuo (Wu et al., 21 Aug 2025)) exemplify current directions addressing these limitations.

Expected advancements include stronger generalization to out-of-distribution settings, tighter integration of continuous reasoning and control, and more reliable sim-to-real transfer on real UAV platforms.

7. Impact, Datasets, and Practical Applications

AVLN is foundational for UAV autonomy in urban inspection, search-and-rescue, infrastructure monitoring, and delivery—each demanding robust language grounding, complex spatial planning, and safety assurance at city scale. Datasets such as AerialVLN (Liu et al., 2023), CityNav (Lee et al., 20 Jun 2024), OpenFly (Gao et al., 25 Feb 2025), HaL-13k (Wu et al., 21 Aug 2025), and IndoorUAV (Liu et al., 22 Dec 2025) have standardized evaluation, catalyzed techniques from multimodal representation to cross-modal prompting, and revealed both the power and current limitations of LLM+VLM-driven robotic navigation in aerial settings.

The field remains open for substantial improvement, particularly in matching human-level flexibility, integrating continuous reasoning, and achieving robust sim-to-real transfer. AVLN thus continues to drive research in scalable, interpretable, and highly generalizable embodied AI navigation (Gao et al., 11 Oct 2024, Xu et al., 9 Dec 2025, Cai et al., 19 May 2025, Ding et al., 16 Dec 2025, Zhao et al., 14 Mar 2025, Gao et al., 25 Feb 2025, Zhang et al., 12 Jun 2025, Wu et al., 21 Aug 2025, Hu et al., 26 Sep 2025).
