
Aerial Vision-and-Language Navigation

Updated 23 December 2025
  • Aerial Vision-and-Language Navigation (AVLN) is a field that combines UAV control, natural language understanding, and visual perception so that aerial agents can autonomously navigate complex 3D spaces.
  • It leverages semantic, topological, and metric representations along with LLM-guided planning to translate free-form instructions into precise flight commands.
  • Key challenges include long-horizon planning, interpreting ambiguous language, and ensuring robust sim-to-real transfer in dynamic urban settings.

Aerial Vision-and-Language Navigation (AVLN) describes the problem of autonomously piloting unmanned aerial vehicles (UAVs) in complex environments by interpreting free-form natural language instructions and leveraging visual perception from onboard sensors. In AVLN, an agent receives a goal-oriented linguistic command and must navigate 3D continuous spaces using egocentric visual inputs, executing a sequence of actions to reach a spatial target referenced in the instruction. The field unifies embodied multimodal artificial intelligence, natural language understanding, spatial and geometric reasoning, and aerial robotic control at urban or regional scale.

1. Formal Task Definition and Key Challenges

In AVLN, the UAV's state at each time step is defined by its full 6-DoF pose $P_t = [x_t, y_t, z_t, \phi_t, \theta_t, \psi_t] \in \mathbb{R}^6$, where the position $(x, y, z)$ and orientation (Euler angles $\phi, \theta, \psi$) are typically tracked. The primary observation is a sequence of egocentric RGB and/or depth images ($I_t^R$, $I_t^D$), with some paradigms exploiting panoramic or multi-view sensors. The agent receives a natural language path description $L$, which may contain free-form references to landmarks, directional cues, and sub-goal sequencing ("Lift off, head north over the lake until you see the red building on your right...").
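The task interface implied by this formulation can be sketched minimally as follows; the class and field names here are illustrative choices, not the API of any specific benchmark or simulator:

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class UAVState:
    """6-DoF pose P_t = [x, y, z, phi, theta, psi] (position in metres, Euler angles in radians)."""
    position: np.ndarray      # shape (3,): x, y, z
    orientation: np.ndarray   # shape (3,): roll phi, pitch theta, yaw psi

@dataclass
class Observation:
    """Egocentric sensor readings at one time step."""
    rgb: np.ndarray                     # I_t^R, e.g. (H, W, 3) uint8
    depth: Optional[np.ndarray] = None  # I_t^D, e.g. (H, W) float32, metres

@dataclass
class Episode:
    """One AVLN episode: a free-form instruction plus the goal used for evaluation."""
    instruction: str                    # natural language path description L
    goal_position: np.ndarray           # (3,) target used for the success check
    observations: List[Observation] = field(default_factory=list)
```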

Action spaces vary, with a common structure of discretized high-level motion primitives parameterized by yaw and distance, e.g., $\{\text{forward}, \text{backward}, \text{left}, \text{right}, \text{up}, \text{down}\}$ with an associated $(\Delta\psi, d)$ for heading change and translation (Gao et al., 11 Oct 2024). Continuous velocity control is also explored (Zhang et al., 12 Jun 2025). Success is measured by proximity to the goal, often within a 20 m Euclidean threshold, or by standard metrics such as navigation error (NE), success rate (SR), oracle success rate (OSR), and success weighted by path length (SPL).
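As a concrete illustration of such a discretized action space and the distance-based success criterion, the sketch below converts a primitive and its $(\Delta\psi, d)$ parameters into a pose update; the primitive set and the 20 m threshold follow the description above, while the coordinate conventions and default step size are assumptions:

```python
import math
import numpy as np

PRIMITIVES = ("forward", "backward", "left", "right", "up", "down", "stop")

def apply_action(position, yaw, primitive, delta_yaw=0.0, distance=5.0):
    """Apply one discretized motion primitive parameterized by heading change and translation."""
    yaw = yaw + delta_yaw  # update heading first
    dx, dy, dz = 0.0, 0.0, 0.0
    if primitive == "forward":
        dx, dy = math.cos(yaw) * distance, math.sin(yaw) * distance
    elif primitive == "backward":
        dx, dy = -math.cos(yaw) * distance, -math.sin(yaw) * distance
    elif primitive == "left":
        dx, dy = -math.sin(yaw) * distance, math.cos(yaw) * distance
    elif primitive == "right":
        dx, dy = math.sin(yaw) * distance, -math.cos(yaw) * distance
    elif primitive == "up":
        dz = distance
    elif primitive == "down":
        dz = -distance
    return position + np.array([dx, dy, dz]), yaw

def is_success(final_position, goal_position, threshold_m=20.0):
    """Episode counts as a success if the UAV stops within the Euclidean threshold of the goal."""
    return float(np.linalg.norm(final_position - goal_position)) <= threshold_m
```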

Core challenges include:

  • Generalizing from seen to unseen environments in high-dimensional, obstacle-rich, and dynamically structured 3D aerial domains.
  • Interpreting complex, ambiguous, and often abstract spatial language with varying levels of specificity and granularity.
  • Handling long-horizon planning without predefined navigation graphs, requiring robust global and local spatial inference.
  • Precise altitude control and 3D collision avoidance, subject to real-world UAV kinematics, especially under limited sensor payloads.

2. Representation and Spatial Reasoning Paradigms

Modern AVLN systems employ diverse spatial and semantic representations to bridge the gap between visual input, ambiguous language, and action selection.

Semantic-Topo-Metric Representations (STMR)

A prominent paradigm encodes the environment as a 2D matrix around the current pose, discretized at fixed metric granularity (e.g., a $20 \times 20$ grid at 5 m per cell). Semantic masks of instruction-relevant landmarks are detected and projected onto this grid, fusing semantic (object types), topological (spatial arrangement), and metric (relative distances) information. The agent's position and heading are explicitly encoded, yielding a compact prompt for LLM-based reasoning (Gao et al., 11 Oct 2024).
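A simplified sketch of how such a semantic-topo-metric grid could be assembled and serialized into an LLM prompt is given below; the grid size and cell resolution follow the figures quoted above, whereas the detection format and serialization scheme are illustrative assumptions rather than the exact STMR implementation:

```python
import numpy as np

GRID_SIZE = 20      # 20 x 20 cells centred on the UAV
CELL_M = 5.0        # 5 m per cell

def build_stm_grid(detections, uav_xy):
    """Project detected landmarks (label, world x, world y) into an egocentric 2D grid.

    Cell value 0 encodes free space; other integers index instruction-relevant landmark classes.
    """
    grid = np.zeros((GRID_SIZE, GRID_SIZE), dtype=np.int32)
    labels = {}
    for label, wx, wy in detections:
        # metric offset from the UAV, discretized to the grid
        col = int((wx - uav_xy[0]) / CELL_M) + GRID_SIZE // 2
        row = int((wy - uav_xy[1]) / CELL_M) + GRID_SIZE // 2
        if 0 <= row < GRID_SIZE and 0 <= col < GRID_SIZE:
            idx = labels.setdefault(label, len(labels) + 1)
            grid[row, col] = idx
    return grid, labels

def serialize_for_prompt(grid, labels, heading_deg):
    """Turn the grid into a compact text block an LLM can reason over."""
    legend = ", ".join(f"{i}={name}" for name, i in labels.items())
    rows = "\n".join(" ".join(str(v) for v in row) for row in grid)
    return (f"Agent at grid centre, heading {heading_deg:.0f} deg.\n"
            f"Legend: 0=empty, {legend}\n{rows}")
```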

Map-based and Grid-based Representations

CityNav (Lee et al., 20 Jun 2024) and related models maintain multi-channel navigation maps with layers encoding current view, explored area, known landmarks (from OSM or extracted via LLM), probable target locations (via visual detectors), and contextual objects. Discrete grid maps, often constructed as bird’s-eye view (BEV) tensors, accumulate spatial context, enabling long-horizon planning and mitigating partial observability (Zhao et al., 14 Mar 2025).
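A minimal sketch of such a multi-channel navigation map follows; the channel set mirrors the layers listed above, while the map size, resolution, and update rule are assumptions for illustration:

```python
import numpy as np

CHANNELS = ("current_view", "explored", "landmarks", "target_prob", "context")

class BEVMap:
    """Accumulating bird's-eye-view map with one channel per semantic layer."""

    def __init__(self, size_cells=256, cell_m=2.0):
        self.cell_m = cell_m
        self.grid = np.zeros((len(CHANNELS), size_cells, size_cells), dtype=np.float32)

    def _to_cell(self, world_xy):
        centre = self.grid.shape[-1] // 2
        return (int(world_xy[1] / self.cell_m) + centre,
                int(world_xy[0] / self.cell_m) + centre)

    def update(self, channel, world_xy, value=1.0):
        """Write an observation (e.g. a detected landmark) into one layer of the map."""
        row, col = self._to_cell(world_xy)
        if 0 <= row < self.grid.shape[1] and 0 <= col < self.grid.shape[2]:
            ch = CHANNELS.index(channel)
            self.grid[ch, row, col] = max(self.grid[ch, row, col], value)
```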

Coarse-to-Fine and Hierarchical Models

Multi-stage frameworks such as HETT (Ding et al., 16 Dec 2025) employ an initial coarse-grained target prediction, fusing landmarks, historical trajectory memory, and instruction embeddings, followed by fine-grained local action refinement via cross-modal fusion. Hierarchical semantic planning (landmark $\to$ object $\to$ motion) further decomposes the planning space, reducing complexity from $m^n$ (a flat search over $m$ primitives across an $n$-step horizon) to a chain of manageable sub-tasks (Zhang et al., 8 May 2025).
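The complexity argument can be made concrete with a toy planner that commits to a landmark, then an object, then a motion segment, instead of searching the flat action space; the function signature and scoring hook below are placeholders rather than the HETT or hierarchical-planning implementations:

```python
def hierarchical_plan(instruction, landmarks, objects_near, motions, score):
    """Plan as a chain of small choices: landmark -> object -> motion primitive.

    `score(level, candidate, instruction)` is any cross-modal scorer (an LLM or a
    learned head); each level only ranks its own candidates, so the search cost is
    roughly |landmarks| + |objects| + |motions| instead of |actions|**horizon.
    """
    landmark = max(landmarks, key=lambda l: score("landmark", l, instruction))
    obj = max(objects_near(landmark), key=lambda o: score("object", o, instruction))
    motion = max(motions, key=lambda m: score("motion", (landmark, obj, m), instruction))
    return landmark, obj, motion
```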

3. Learning Architectures and LLM Integration

Aerial VLN architectures draw from sequence modeling, transformer-based cross-modal fusion, and LLM prompting.

  • LLM-guided Zero-Shot Planning: STMR (Gao et al., 11 Oct 2024) demonstrates that a frozen LLM (GPT-4V/o) can synthesize continuous flight commands via prompt engineering—receiving a text serialization of the semantic grid, instruction sub-goals, history, and plan hints, and outputting precise next actions. No navigation-policy training is required.
  • Unified Next-Token Paradigms: Some architectures model navigation, spatial perception, and trajectory summarization as a single next-token prediction problem, leveraging prompt-guided multi-task learning for robust cross-modal reasoning (Xu et al., 9 Dec 2025).
  • Chain-of-Thought (CoT) and Interpretable Reasoning: Architectures such as FlightGPT (Cai et al., 19 May 2025) integrate explicit CoT by prompting the VLM/LLM to reason step-by-step—parsing instructions, conducting spatial inference on semantic maps, and justifying target predictions prior to decoding coordinates. CoT chains are enforced via supervised and reinforcement losses.
  • End-to-End Vision-Language Modules: Systems like UAV-VLN (Saxena et al., 30 Apr 2025) fuse fine-tuned LLM intent parsing with open-vocabulary vision models (e.g., Grounding DINO), aligning linguistic and visual features via attention. These modules are jointly optimized, e.g., via cross-modal grounding losses, leading to robust and interpretable trajectory planning.

The rise of prompt-based, closed-loop, or learning-free controllers (e.g., SPF (Hu et al., 26 Sep 2025)) reframes action prediction as 2D spatial waypoint grounding, further simplifying the architectural stack and exploiting foundation model generality.
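A schematic closed-loop controller in this prompt-based, learning-free spirit is sketched below; the prompt layout, the `query_vlm` callable, and the expected reply format are assumptions for illustration and do not reproduce any particular system's interface:

```python
import json

def plan_step(query_vlm, instruction, grid_text, history):
    """One closed-loop step: serialize state into a prompt, query a frozen VLM/LLM,
    and parse the reply into a motion command."""
    prompt = (
        "You are piloting a UAV.\n"
        f"Instruction: {instruction}\n"
        f"Egocentric semantic grid:\n{grid_text}\n"
        f"Actions so far: {history}\n"
        'Reply as JSON: {"action": <primitive>, "delta_yaw_deg": <float>, "distance_m": <float>}'
    )
    reply = query_vlm(prompt)          # any chat-completion style call
    try:
        cmd = json.loads(reply)
    except json.JSONDecodeError:
        cmd = {"action": "stop"}       # fall back safely on unparsable output
    history.append(cmd.get("action", "stop"))
    return cmd
```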

4. Training, Evaluation Protocols, and Datasets

AVLN has catalyzed the development of large-scale, photorealistic 3D benchmarks with richly annotated language, trajectory, and map information.
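The annotations these benchmarks provide can be pictured as episode records along the following lines; the field names are generic placeholders and do not match any specific dataset's schema:

```python
example_episode = {
    "episode_id": "demo-0001",
    "instruction": "Lift off, head north over the lake until you see the red building on your right.",
    "start_pose": {"xyz": [0.0, 0.0, 30.0], "yaw_deg": 0.0},
    "goal_position": [120.0, 450.0, 35.0],           # evaluated against the distance threshold
    "reference_trajectory": [[0, 0, 30], [0, 150, 32], [40, 300, 35], [120, 450, 35]],
    "landmarks": [{"name": "lake", "xyz": [10, 200, 0]},
                  {"name": "red building", "xyz": [130, 440, 0]}],
}
```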

5. Key Results and Comparative Empirical Findings

Selected results on standard benchmarks underscore both algorithmic innovation and the persistent difficulty of AVLN.

| Model | Dataset | SR (%) | OSR (%) | SPL (%) | NE (m) | Reference |
|---|---|---|---|---|---|---|
| Human | AerialVLN | 80.8 | 80.8 | 14.2 | 9.8 | (Liu et al., 2023) |
| CMA (AerialVLN baseline) | AerialVLN | 1.6 | 4.1 | 0.5 | 359 | (Liu et al., 2023) |
| STMR + GPT-4o (zero-shot) | AerialVLN-S | 10.8 | 23.0 | 1.9 | 119.5 | (Gao et al., 11 Oct 2024) |
| FlightGPT (SFT + RL) | CityNav | 21.2 | 35.4 | 19.2 | 76.2 | (Cai et al., 19 May 2025) |
| HETT* (ann. refined) | CityNav | 31.09 | | | | (Ding et al., 16 Dec 2025) |
| OpenVLN (RL, hard set, 25% data) | TravelUAV | 13.3 | | 4.1 | | (Lin et al., 9 Nov 2025) |
| OpenFly-Agent | OpenFly | 18.5 | 50.9 | 12.2 | 115 | (Gao et al., 25 Feb 2025) |
| AeroDuo (dual-UAV) | HaL-13k | 16.6 | 28.6 | 13.9 | 84.3 | (Wu et al., 21 Aug 2025) |
| SPF (learning-free) | DRL Sim | 93.9 | | | | (Hu et al., 26 Sep 2025) |

Absolute performance remains substantially below human across all realistic city-scale datasets, especially in unseen environments and open-world, long-horizon settings. Ablations show that fusing hierarchical planning, explicit memory (experience or map), instruction decomposition, and cross-modal semantic/metric representations consistently yields additive improvements.
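For reference, the metrics reported above (SR, OSR, SPL, NE) can be computed from a predicted path, the reference shortest-path length, and the goal as in the generic sketch below, which follows the standard success-weighted-by-path-length definition of SPL and the 20 m threshold quoted earlier:

```python
import numpy as np

def navigation_error(final_pos, goal_pos):
    """NE: Euclidean distance (m) between the stopping point and the goal."""
    return float(np.linalg.norm(np.asarray(final_pos) - np.asarray(goal_pos)))

def success(final_pos, goal_pos, threshold_m=20.0):
    """SR contribution: 1 if the agent stops within the threshold, else 0."""
    return float(navigation_error(final_pos, goal_pos) <= threshold_m)

def oracle_success(path, goal_pos, threshold_m=20.0):
    """OSR contribution: 1 if any point along the path came within the threshold."""
    return float(min(navigation_error(p, goal_pos) for p in path) <= threshold_m)

def spl(success_i, path_length, shortest_length):
    """SPL term: success weighted by shortest-path length over max(taken, shortest)."""
    return success_i * shortest_length / max(path_length, shortest_length)
```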

6. Generalization, Interpretable Reasoning, and Open Challenges

The AVLN community has uncovered a gap between simulated/algorithmic agents and human-level spatial reasoning, especially regarding:

  • Generalization to out-of-distribution geography, unseen object categories, and abstract or indirect instructions.
  • Long-horizon credit assignment and action compounding without route drift or myopic failures.
  • Robustness to missing or noisy landmark detection, ambiguous language, or dynamic obstacles.
  • Sim-to-real transfer, considering real-world flight dynamics, localization drift, and computational constraints for onboard inference.

Approaches that leverage model-agnostic visual-language grounding (e.g., SPF (Hu et al., 26 Sep 2025), VLFly (Zhang et al., 12 Jun 2025)), chain-of-thought prompting (FlightGPT (Cai et al., 19 May 2025)), and collaborative dual-altitude agents (AeroDuo (Wu et al., 21 Aug 2025)) exemplify current directions addressing these limitations.

Expected advancements include stronger generalization to out-of-distribution settings, tighter integration of continuous reasoning and control, and more reliable sim-to-real transfer on real UAV platforms.

7. Impact, Datasets, and Practical Applications

AVLN is foundational for UAV autonomy in urban inspection, search-and-rescue, infrastructure monitoring, and delivery—each demanding robust language grounding, complex spatial planning, and safety assurance at city scale. Datasets such as AerialVLN (Liu et al., 2023), CityNav (Lee et al., 20 Jun 2024), OpenFly (Gao et al., 25 Feb 2025), HaL-13k (Wu et al., 21 Aug 2025), and IndoorUAV (Liu et al., 22 Dec 2025) have standardized evaluation, catalyzed techniques from multimodal representation to cross-modal prompting, and revealed both the power and current limitations of LLM+VLM-driven robotic navigation in aerial settings.

The field remains open for substantial improvement, particularly in matching human-level flexibility, integrating continuous reasoning, and achieving robust sim-to-real transfer. AVLN thus continues to drive research in scalable, interpretable, and highly generalizable embodied AI navigation (Gao et al., 11 Oct 2024, Xu et al., 9 Dec 2025, Cai et al., 19 May 2025, Ding et al., 16 Dec 2025, Zhao et al., 14 Mar 2025, Gao et al., 25 Feb 2025, Zhang et al., 12 Jun 2025, Wu et al., 21 Aug 2025, Hu et al., 26 Sep 2025).
