CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory

Published 8 May 2025 in cs.RO and cs.AI | (2505.05622v1)

Abstract: Aerial vision-and-language navigation (VLN), requiring drones to interpret natural language instructions and navigate complex urban environments, emerges as a critical embodied AI challenge that bridges human-robot interaction, 3D spatial reasoning, and real-world deployment. Although existing ground VLN agents achieved notable results in indoor and outdoor settings, they struggle in aerial VLN due to the absence of predefined navigation graphs and the exponentially expanding action space in long-horizon exploration. In this work, we propose \textbf{CityNavAgent}, a LLM-empowered agent that significantly reduces the navigation complexity for urban aerial VLN. Specifically, we design a hierarchical semantic planning module (HSPM) that decomposes the long-horizon task into sub-goals with different semantic levels. The agent reaches the target progressively by achieving sub-goals with different capacities of the LLM. Additionally, a global memory module storing historical trajectories into a topological graph is developed to simplify navigation for visited targets. Extensive benchmark experiments show that our method achieves state-of-the-art performance with significant improvement. Further experiments demonstrate the effectiveness of different modules of CityNavAgent for aerial VLN in continuous city environments. The code is available at \href{https://github.com/VinceOuti/CityNavAgent}{link}.

Abstract PDF Upgrade to Chat

Authors (9)

Summary

The paper introduces CityNavAgent, which integrates open-vocabulary perception, hierarchical semantic planning, and a global memory module for effective aerial vision-and-language navigation.
The hierarchical planning module decomposes instructions into landmark-, object-, and motion-level sub-goals, reducing planning complexity in continuous urban 3D spaces.
Experimental results demonstrate significant performance gains over prior methods, with ablation studies confirming the essential roles of semantic mapping and memory graph modules.

Introduction and Motivation

CityNavAgent addresses the aerial vision-and-language navigation (VLN) problem, where an embodied agent (e.g., a drone) must interpret natural language instructions and navigate complex urban environments in continuous 3D space. Unlike ground-level VLN, aerial VLN presents unique challenges: the absence of predefined navigation graphs, exponentially expanding action spaces for long-horizon planning, and highly variable semantic density in urban scenes. Existing methods, primarily designed for indoor or ground-level navigation, are insufficient for these settings due to their reliance on discrete topological graphs and limited semantic reasoning capabilities.

System Architecture

CityNavAgent is composed of three principal modules: an open-vocabulary perception module, a hierarchical semantic planning module (HSPM), and a global memory module. The open-vocabulary perception module leverages LLMs and vision foundation models to extract instruction-relevant objects and build a semantic point cloud of the environment. HSPM decomposes navigation instructions into hierarchical sub-goals (landmark-level, object-level, and motion-level), reducing planning complexity and enabling progressive goal achievement. The global memory module stores historical trajectories as a topological graph, facilitating efficient motion planning toward previously visited targets.

Figure 1: CityNavAgent consists of three key modules: open-vocabulary perception, hierarchical semantic planning, and global memory for trajectory storage and reuse.

Open-Vocabulary Perception

The perception module utilizes panoramic RGB-D observations. Scene semantics are extracted via an LLM-based captioner (e.g., GPT-4V) and grounded using a visual detector (GroundingDINO). Semantic masks are generated for detected objects, which are then projected into 3D space using camera intrinsics and agent pose, resulting in a local semantic point cloud. This representation enables robust cross-modal grounding and spatial reasoning, critical for urban environments with high object variety and dynamic semantic density.

Hierarchical Semantic Planning

HSPM operates at three levels:

Landmark-level: LLM parses instructions to extract an ordered sequence of landmark phrases, serving as high-level sub-goals.
Object-level: For each sub-goal, the LLM reasons about the most relevant visible object regions (OROI) in the current panorama, leveraging commonsense knowledge to bridge gaps when landmarks are not directly observable.
Motion-level: The agent computes waypoints by averaging the coordinates of semantic point cloud points corresponding to the selected OROI, then decomposes the path into executable low-level actions.

This hierarchical decomposition narrows the action space and improves alignment with instructions, mitigating error accumulation in long-horizon planning.

Global Memory Module

The global memory module constructs a 3D topological graph from historical successful trajectories. Nodes store both waypoint coordinates and panoramic observations; edges are weighted by spatial distance. The memory graph is incrementally updated and merged with new trajectories, with connectivity established via proximity thresholds. During navigation, when the agent reaches a node in the memory graph, it leverages graph search (Dijkstra-based, augmented with LLM-based semantic similarity scoring) to efficiently plan paths traversing the remaining sub-goals.

Figure 2: The memory module fuses current and historical trajectory graphs, searching for paths with maximal semantic similarity to instruction landmarks.

Figure 3: Memory graphs in different scenes; node disconnections reflect long distances between historical trajectories.

Experimental Results

CityNavAgent is evaluated on AirVLN-S and a fine-grained AirVLN-Enriched benchmark, both simulating urban environments with diverse objects and long navigation paths. Metrics include Success Rate (SR), Oracle Success Rate (OSR), Navigation Error (NE), SPL, and SDTW.

Performance: CityNavAgent achieves state-of-the-art results, outperforming statistical, learning-based, and zero-shot LLM-based baselines. On AirVLN-S, it surpasses LM-Nav and STMR by 1.3% SR, 0.8% SPL, and 0.5% SDTW in seen environments, and by 0.9% SR, 0.6% SPL, and 1.3% SDTW in unseen environments. On AirVLN-Enriched, CityNavAgent achieves 28.3% SR and 23.5% SPL, with a NE of 95.1, representing a substantial improvement over prior methods.
Ablation: Removing the semantic map or memory graph results in significant performance degradation (up to 16.6% SR drop and 116.7% NE increase), highlighting the necessity of both modules. The choice of LLM is critical: GPT-4V yields the best results, with LLaVA-7B suffering from hallucination and unstructured outputs.
Waypoint Prediction: Indoor waypoint predictors (e.g., CWP) perform poorly in outdoor aerial settings due to scale and dimensionality differences, as evidenced by increased Chamfer and Hausdorff distances.
Figure 4: Qualitative result of CWP; predicted waypoints (red) diverge from reference waypoints (black) in outdoor scenes.

Qualitative Analysis

CityNavAgent demonstrates superior instruction-following and path efficiency. The hierarchical planning enables the agent to reason about invisible landmarks via semantic associations (e.g., inferring the location of an obelisk by exploring areas near trees in a park). The memory graph prevents the agent from blind exploration and dead ends, especially in long-range tasks.

Figure 5: Navigation process comparison; CityNavAgent's reasoning process aligns landmarks and semantic objects, outperforming VELMA and LM-Nav.

Failure cases are categorized as instruction ambiguity, perception failure, and reasoning errors. In AirVLN-S, ambiguous instructions dominate failures, while in AirVLN-Enriched, perception failures are more prevalent due to challenging outdoor object detection.

Figure 6: Failure cases; green boxes indicate referred landmarks, red boxes highlight misreferenced landmarks due to ambiguity, perception, or reasoning errors.

Implementation Considerations

Computational Requirements: The system requires panoramic RGB-D sensing, LLM inference (preferably GPT-4V), and vision foundation models (GroundingDINO, Segment Anything). Memory graph construction and search scale with the number of historical trajectories but are efficiently pruned via spherical subgraph extraction and non-maximum suppression.
Deployment: While validated in simulation, real-world deployment necessitates robust low-level motion control, self-pose estimation, and latency management. The absence of a backtracking mechanism is a limitation; future work should integrate path recovery strategies.
Scalability: The modular design allows for incremental expansion of the memory graph and adaptation to new environments. Semantic point cloud construction is parallelizable across panoramic views.

Theoretical and Practical Implications

CityNavAgent demonstrates that hierarchical semantic planning, combined with global memory, is effective for long-horizon navigation in continuous, unstructured environments. The integration of LLMs for instruction parsing and commonsense reasoning bridges the gap between high-level language and low-level action planning. The memory graph approach offers a scalable solution for leveraging historical experience, reducing exploration overhead and improving robustness.

The results suggest that future embodied AI systems should incorporate hierarchical reasoning, open-vocabulary perception, and memory-based planning to handle real-world complexity. The reliance on foundation models and LLMs underscores the importance of multimodal pretraining and prompt engineering for embodied tasks.

Conclusion

CityNavAgent sets a new benchmark for aerial VLN in urban environments by combining open-vocabulary perception, hierarchical semantic planning, and global memory. The system achieves strong numerical improvements over prior methods, with ablation studies confirming the necessity of each module. While simulation results are promising, real-world deployment will require addressing low-level control and path recovery. The architectural principles established here—hierarchical decomposition, semantic grounding, and memory-based planning—are likely to inform future research in embodied navigation and multimodal AI.