CityNav: Urban Aerial Navigation Dataset

Updated 23 December 2025
  • CityNav is a comprehensive dataset for language-guided aerial navigation in photorealistic 3D urban environments, featuring naturalistic city-level complexity.
  • It integrates multi-modal cues, including first-person RGB-D visuals, georeferenced landmarks, and natural language instructions from detailed human demonstrations.
  • The dataset supports robust evaluations with rigorous spatial metrics and ablation studies, advancing VLN benchmarks through map-based spatial reasoning.

CityNav is a large-scale, real-world vision-and-language navigation (VLN) dataset designed for benchmarking language-guided aerial navigation in photorealistic 3D urban environments. It emphasizes naturalistic city-level complexity, multi-modal perception, spatial reasoning with georeferenced landmarks, and language-goal grounding. The dataset targets the VLN community's need for realistic evaluation of navigation agents that integrate visual, geographic, and linguistic cues in real urban settings (Lee et al., 2024, Cai et al., 19 May 2025).

1. Dataset Construction and Geographic Scope

CityNav utilizes high-fidelity photogrammetric city-scale 3D point clouds from SensatUrban (Hu et al., 2022), covering the UK cities of Birmingham, Cambridge, and York. The 3D environments are rendered in a custom web-based aerial flight simulator with first-person 6-DoF control and an OpenStreetMap-aligned 2D minimap for real-world spatial context.

Landmark names and polygonal annotations are drawn from CityRefer (Miyanishi et al., 2023) and mapped onto 34 urban scenes (24 "seen," 10 "unseen"). Each object in this set is associated with multiple natural-language navigation goals that describe the target location using real-world referents. Amazon Mechanical Turk (MTurk) workers collect a human demonstration trajectory for each description by controlling the agent to the landmark and placing a final marker on the target. Quality control requires the marker to lie within 30 m of the ground-truth target; the process yields 32,637 valid demonstration trajectories paired with language instructions and target object polygons (Lee et al., 2024).
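
As a minimal illustration of this acceptance criterion, the sketch below checks whether a worker's final marker falls within 30 m of the target centroid; the function and argument names are hypothetical and not part of any released tooling.

```python
import math

def passes_quality_check(marker_xy, target_centroid_xy, threshold_m=30.0):
    """Return True if the worker's final marker lies within threshold_m metres
    of the ground-truth target centroid (hypothetical field names)."""
    dx = marker_xy[0] - target_centroid_xy[0]
    dy = marker_xy[1] - target_centroid_xy[1]
    return math.hypot(dx, dy) <= threshold_m

# Example: a marker placed 12 m east and 5 m north of the target centroid passes.
print(passes_quality_check((512.0, 305.0), (500.0, 300.0)))  # True
```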

2. Data Modalities, Annotation, and Structure

CityNav's annotation encompasses:

  • Visual Observations: First-person RGB-D (depth) images from an agent’s perspective within the simulator, spatially synchronized to the scene.
  • Language Goals: Natural-language descriptions (average 25 tokens, range 5–60) referencing landmark names, spatial relations, and contextual clues.
  • Spatial Grounding: Each trajectory is tightly coupled with a specific target object's polygonal coordinates, rendered both in 3D world space and as a channel on a 2D internal map. Each demonstration records full (x, y, z) navigation paths with timestamps.
  • JSON-Based Format: Annotations are provided as structured JSON entries. Each entry includes fields for the language instruction, agent trajectory coordinates, landmark bounding-box or polygon, and final target marker location (see the illustrative loading sketch below).
  • Auxiliary Maps: Baseline methods leverage five-channel internal navigation maps: current viewport, cumulative explored area, landmark occupancy (from CityRefer), target detection (via GroundingDINO+MobileSAM), and local context (surroundings channel).

Central to the dataset is its alignment of densely sampled language instructions, georeferenced spatial maps, and visually grounded 3D scenes.
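
A minimal sketch of reading one such JSON entry is shown below, assuming illustrative field names (instruction, trajectory, landmark_polygon, target_marker) rather than the dataset's published schema.

```python
import json

# Hypothetical annotation entry mirroring the fields described above;
# the actual CityNav schema and key names may differ.
raw_entry = """
{
  "instruction": "Fly to the red-brick library north of the river and stop above its entrance.",
  "trajectory": [[100.0, 250.0, 60.0], [140.0, 270.0, 55.0], [180.0, 300.0, 50.0]],
  "landmark_polygon": [[175.0, 295.0], [190.0, 295.0], [190.0, 310.0], [175.0, 310.0]],
  "target_marker": [182.0, 302.0]
}
"""

entry = json.loads(raw_entry)
print("Instruction:", entry["instruction"])
print("Waypoints in trajectory:", len(entry["trajectory"]))
print("Final marker (x, y):", entry["target_marker"])
```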

3. Scale, Splitting, and Statistical Properties

CityNav provides extensive statistical coverage of real-world, city-scale aerial navigation:

  • Trajectories and Descriptions:
    • 32,637 demonstration trajectories.
    • 5,850 unique annotated landmarks/objects.
  • Scene and Split Protocol:
    • 34 scenes, partitioned as:
      • Train: 22,002 trajectories, 24 scenes.
      • Val Seen: 2,498 trajectories, 24 scenes.
      • Val Unseen: 2,826 trajectories, 4 "unseen" scenes.
      • Test Unseen: 5,311 trajectories, 6 "unseen" scenes.
  • Path Length Distribution:
    • Human demonstration mean: 508 m (range ∼50–1200 m).
    • Shortest-path mean: 290 m.
  • Difficulty Stratification:
    • Easy: <171 m start-to-goal.
    • Medium: 171–258 m.
    • Hard: >258 m.
  • Vocabulary and Referents:
    • Descriptions average 25 tokens; 92.8% reference CityRefer objects.

Annotation required 711 MTurk worker-hours, with a two-pass quality filter enforcing the 30 m target-proximity criterion.
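
For illustration, the following sketch assigns the difficulty tier from the straight-line start-to-goal distance using the thresholds listed above; the function name and tier labels are assumptions.

```python
def difficulty_tier(start_to_goal_m: float) -> str:
    """Assign the CityNav difficulty tier from straight-line start-to-goal
    distance, using the reported thresholds (171 m and 258 m)."""
    if start_to_goal_m < 171.0:
        return "easy"
    elif start_to_goal_m <= 258.0:
        return "medium"
    return "hard"

# Example distances (metres) spanning the three tiers.
for d in (120.0, 200.0, 400.0):
    print(d, "->", difficulty_tier(d))
```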

4. Task Definition, Evaluation Protocols, and Baselines

The canonical CityNav task is language-goal aerial VLN: Given initial RGB-D observations and a language instruction, the agent must explore the continuous 3D city space and stop within a defined proximity of the described target.

Evaluation Metrics:

Let $N$ be the number of evaluation episodes.

  • Navigation Error (NE):

$$NE = \frac{1}{N}\sum_{i} \left\| x_i^{\mathrm{stop}} - x_i^{\mathrm{goal}} \right\|_2$$

  • Success Rate (SR):

$$SR = \frac{1}{N}\sum_{i} \left[\, \left\| x_i^{\mathrm{stop}} - x_i^{\mathrm{goal}} \right\|_2 \leq 20\,\mathrm{m} \,\right]$$

  • Oracle Success Rate (OSR):

$$OSR = \frac{1}{N}\sum_{i} \left[\, \min_{t \in \mathrm{trajectory}} \left\| x_i^{(t)} - x_i^{\mathrm{goal}} \right\|_2 \leq 20\,\mathrm{m} \,\right]$$

  • Success weighted by Path Length (SPL):

$$SPL = \frac{1}{N}\sum_{i=1}^{N} S_i \cdot \frac{L_i}{\max(P_i, L_i)}$$

where $S_i$ is the binary success indicator, $L_i$ the shortest start-to-goal path length, and $P_i$ the agent's actual path length.
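
A minimal sketch of these four metrics, assuming positions in metres, a 20 m success radius, and illustrative array names, is given below.

```python
import numpy as np

def vln_metrics(stops, goals, trajectories, path_lengths, shortest_lengths,
                success_radius_m=20.0):
    """Compute NE, SR, OSR, and SPL over N episodes.

    stops, goals:      (N, 3) final and goal positions in metres.
    trajectories:      list of (T_i, 3) arrays, the full flight path per episode.
    path_lengths:      (N,) actual path lengths P_i.
    shortest_lengths:  (N,) shortest start-to-goal lengths L_i.
    """
    stops, goals = np.asarray(stops, float), np.asarray(goals, float)
    errors = np.linalg.norm(stops - goals, axis=1)        # per-episode navigation error
    success = errors <= success_radius_m                  # S_i
    oracle = np.array([
        np.linalg.norm(np.asarray(traj, float) - goal, axis=1).min() <= success_radius_m
        for traj, goal in zip(trajectories, goals)
    ])
    P = np.asarray(path_lengths, float)
    L = np.asarray(shortest_lengths, float)
    spl = success * (L / np.maximum(P, L))
    return {"NE": errors.mean(), "SR": success.mean(),
            "OSR": oracle.mean(), "SPL": spl.mean()}

# Toy example with two episodes (coordinates are illustrative).
print(vln_metrics(
    stops=[[0, 0, 30], [100, 0, 30]],
    goals=[[10, 0, 30], [200, 0, 30]],
    trajectories=[[[0, 0, 30], [5, 0, 30]], [[0, 0, 30], [150, 0, 30]]],
    path_lengths=[15.0, 150.0],
    shortest_lengths=[10.0, 200.0],
))
```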

Baselines Tested:

  • Sequence-to-Sequence (ResNet image, LSTM instruction).
  • Cross-Modal Attention (instruction-guided visual grounding).
  • Map-based Goal Predictors (internal 2D map, CNN embedding, with/without vision-LLM–guided refinement).
  • Human performance as an upper bound.

Results reveal a substantial absolute gap (≈81%) between the success rates of the best agent and of human demonstrators, even when agents leverage map-based planning (Lee et al., 2024).

5. Map Integration and Model Architectures

CityNav's primary advance is the integration of georeferenced spatial reasoning into the VLN pipeline:

  • Landmark mapping: Five-channel internal maps encode current field-of-view, explored regions, static landmark polygons (CityRefer), detected targets, and semantic surroundings.
  • Spatial-Goal Policy: Baseline MGP (Map-based Goal Predictor) predicts 2D goal coordinates and progress, taking as input the fused map, visual, and language representations.
  • LLaVA-augmented Variant: Target localization is further refined using instruction–region alignment via a vision-LLM, embodying approaches analogous to Set-of-Mark prompting (Lee et al., 2024).

Ablation studies demonstrate that removing the landmark channel leads to a >10× drop in SR (from 6.38% to 0.47%) on unseen scenes, underscoring the necessity of semantic mapping.
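
To make the five-channel internal map concrete, the following sketch stacks toy channel masks into a single tensor; the grid resolution, channel ordering, and rasterization helper are assumptions for illustration, not the baseline's actual implementation.

```python
import numpy as np

H = W = 256  # toy grid; the actual map resolution is an implementation detail

def rasterize_box(grid, x0, y0, x1, y1, value=1.0):
    """Mark an axis-aligned box on a 2D grid (illustrative stand-in for
    polygon rasterization)."""
    grid[y0:y1, x0:x1] = value
    return grid

# Channel 0: current viewport footprint
viewport  = rasterize_box(np.zeros((H, W)), 100, 100, 140, 140)
# Channel 1: cumulative explored area
explored  = rasterize_box(np.zeros((H, W)), 60, 60, 180, 180)
# Channel 2: landmark occupancy (e.g. CityRefer polygons, here a box stand-in)
landmarks = rasterize_box(np.zeros((H, W)), 120, 90, 150, 120)
# Channel 3: detected target mask (e.g. from an open-vocabulary detector)
targets   = rasterize_box(np.zeros((H, W)), 128, 100, 134, 106)
# Channel 4: local surroundings / semantic context
context   = rasterize_box(np.zeros((H, W)), 90, 90, 170, 170)

internal_map = np.stack([viewport, explored, landmarks, targets, context])
print(internal_map.shape)  # (5, 256, 256)
```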

6. Dataset Limitations and Prospective Directions

CityNav provides city-scale realism but remains limited to static city snapshots (no moving vehicles, pedestrians, or dynamic weather). Physical UAV flight dynamics are abstracted to discrete 6-DoF control, and all interactions are limited to navigation, with no onboard manipulation.

Future research directions suggested include:

  • Incorporation of dynamic scene content (temporal weather, moving agents).
  • Joint policy learning with explicit vision-language-control circuits and map priors.
  • End-to-end adaptation and scaling of large vision-LLMs for direct goal localization.
  • Extending to multi-modal/semantic annotation formats, including traffic signage and urban infrastructure.

These constraints position CityNav as a pure navigation and instruction-following benchmark: well suited to studying the intersection of geo-aware spatial grounding and language-based navigation, but not yet covering interactive or physically dynamic scenarios.

CityNav addresses the persistent gap in aerial VLN resources by combining large-scale, human-annotated, city-level navigation with georeferenced spatial and linguistic cues. Its construction methodology departs from synthetic or street-level datasets (e.g., StreetLearn (Mirowski et al., 2019), DeepNav (Brahmbhatt et al., 2017), UrbanNav (Mei et al., 10 Dec 2025)) by emphasizing realistic, open-world tasks at altitude, expansive landmark sets, high annotation quality, and formal out-of-distribution evaluation splits.

Comparative evaluation with StreetLearn and DeepNav highlights CityNav’s contributions:

  • Realistic aerial perspective with continuous 3D navigation; prior datasets are mostly egocentric, ground-based, and graph-constrained.
  • Landmarks and goals are defined by natural language and geopolygonal spatial cues, not pre-specified points or synthetic regions.
  • Evaluation is explicitly split into “scene-seen” and “scene-unseen” regimes to robustly quantify generalization beyond training data.

This positions CityNav as a critical VLN resource for researchers seeking ecologically valid benchmarks for the next generation of grounded, map-aware, and language-instructed autonomous navigation systems (Lee et al., 2024, Cai et al., 19 May 2025).
