CityEQA-EC: Urban Embodied EQA Benchmark

Updated 17 December 2025
  • CityEQA-EC is a benchmark that defines over 1,400 human-validated tasks in a continuous 3D urban simulator to evaluate hierarchical planning and spatial reasoning.
  • It employs detailed annotation and validation pipelines to address challenges like environmental ambiguity, open-vocabulary queries, and perceptual variability in urban settings.
  • Evaluation metrics such as navigation accuracy, question answering accuracy, and mean time steps provide rigorous benchmarks for advancing embodied AI research.

CityEQA-EC is the first large-scale, open-vocabulary embodied question answering (EQA) benchmark focused on urban outdoor environments. Comprising 1,412 human-validated tasks, it operates within a single coherent 3D city world and is designed to capture the complexity of real urban scenes, agent interactions, and perceptual demands missing from prior indoor benchmarks. CityEQA-EC provides an environment for developing and evaluating embodied agents on question answering tasks that require hierarchical planning, spatial reasoning, and detailed visual inspection in an urban simulator framework (Zhao et al., 18 Feb 2025).

1. Motivation and Distinctive Challenges

Traditional EQA datasets have emphasized indoor domains such as House3D, AI2-THOR, and Matterport3D, where navigation and perception are relatively constrained. CityEQA-EC addresses the lack of outdoor, urban-centric EQA by introducing new sources of complexity:

  • Environmental ambiguity: Urban objects such as buildings, vehicles, and street furniture are visually complex and often similar, increasing task difficulty.
  • Hierarchical and long-horizon action: Agents must traverse distances of tens to hundreds of meters (macro-navigation) and still analyze small-scale details (micro-perception).
  • Open-vocabulary tasks: Questions are grounded in landmarks and spatial relationships, often invoking conversational, real-world terminology.
  • Perceptual variability: Urban features exhibit dramatic appearance changes due to angle, occlusion, and lighting, challenging object recognition and position estimation.

This formulation establishes CityEQA-EC as a critical resource for research into urban spatial intelligence and long-horizon embodied reasoning (Zhao et al., 18 Feb 2025).

2. Task Types and Dataset Structure

CityEQA-EC consists of 1,412 distinct tasks, all human-annotated and validated, each comprising an open-vocabulary question and free-form answer. The annotation strategy yields high diversity and rigor across six canonical question categories:

| Category | Example Template | Approx. Count |
|---|---|---|
| Object Recognition | “What is the color of the car immediately east of X?” | 240 |
| Object Counting | “How many red buses are parked west of building Y?” | 225 |
| Spatial-Relation Query | “Which landmark is directly north of the clock tower?” | 230 |
| Attribute Query | “What brand is the café across from the museum?” | 235 |
| Comparison/Inference | “Is the white bus larger than the yellow bus?” | 230 |
| World Knowledge | “Which building houses the city museum?” | 252 |

Each task is:

  • Defined by a tuple $\xi = (e, q, y, p_0)$: environment $e$ (always EmbodiedCity), question string $q$, answer string $y$, and initial agent pose $p_0$ (a minimal record sketch follows this list).
  • Grounded relative to >200 uniquely annotated buildings used as reference landmarks.
  • Constrained to a 400 m × 400 m city region with spatial relationships referencing cardinal directions.
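
For concreteness, a minimal sketch of one task record in Python; the field names beyond the tuple $(e, q, y, p_0)$, and the pose layout, are assumptions rather than the released schema:

```python
from dataclasses import dataclass
from typing import Tuple

Pose = Tuple[float, float, float, float]  # x, y, z, yaw -- assumed layout

@dataclass
class CityEQATask:
    environment: str    # e: always "EmbodiedCity"
    question: str       # q: open-vocabulary question string
    answer: str         # y: free-form ground-truth answer
    initial_pose: Pose  # p0: agent start pose
    target_pose: Pose   # canonical observation pose for the referenced object(s)
```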

Statistics indicate mean question length ≈18 words (σ ≈4.5), and a mean of 2.9 objects involved per task (σ ≈1.0) (Zhao et al., 18 Feb 2025).

3. Urban Simulator and Sensory Modalities

Tasks are instantiated in the EmbodiedCity simulator (built atop Unreal Engine 4 and Microsoft AirSim). The simulator characteristics are:

  • Monolithic 3D city world: All agents and tasks operate within a single continuous environment, as opposed to disjoint “scenes.”
  • Spatial granularity: a 400 m × 400 m region with 1 m grid resolution.
  • Static world elements: Buildings, vehicles (cars, buses, trucks), street furniture, signs, landmarks.
  • Sensory streams at each timestep $t$ (see the capture sketch after this list):
    • RGB image $I^t_{rgb} \in \mathbb{R}^{H \times W \times 3}$
    • Depth image $I^t_d \in \mathbb{R}^{H \times W}$
  • Data format per task: JSON with environment, question, answer, initial and target poses, and a canonical observation pose snapshot.
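
A minimal sketch of grabbing one RGB + depth observation through the AirSim Python API; the camera name "0" and the multirotor client type are assumptions, not the benchmark's published interface (and in AirSim versions before 1.6 the planar depth type is named DepthPlanner rather than DepthPlanar):

```python
import airsim
import numpy as np

client = airsim.MultirotorClient()
client.confirmConnection()

responses = client.simGetImages([
    airsim.ImageRequest("0", airsim.ImageType.Scene, False, False),       # uint8, uncompressed
    airsim.ImageRequest("0", airsim.ImageType.DepthPlanar, True, False),  # float depth
])
rgb_resp, depth_resp = responses

# I^t_rgb: H x W x 3 uint8 array
rgb = np.frombuffer(rgb_resp.image_data_uint8, dtype=np.uint8)
rgb = rgb.reshape(rgb_resp.height, rgb_resp.width, 3)

# I^t_d: H x W float32 depth map (meters)
depth = airsim.list_to_2d_float_array(
    depth_resp.image_data_float, depth_resp.width, depth_resp.height)
```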

There are no dynamic pedestrians or traffic in this release. This constraint enables precise evaluation and control (Zhao et al., 18 Feb 2025).

4. Annotation and Validation Protocol

The dataset is constructed via a multistage annotation pipeline:

  1. Raw QA generation: Annotators explore EmbodiedCity, generate open-ended questions and answers at sampled camera poses, and record ground-truth target poses for relevant objects (443 base question-answer pairs).
  2. Supplementation: Each base QA is expanded to four variants by sampling initial agent poses within 200 m of the target and enriching question phrasing with landmark-based spatial constraints, yielding at least 2,212 candidate tasks (a pose-sampling sketch follows this list).
  3. Validation: Two reviewers per task ensure answerability, clarity, spatial disambiguation, a valid initial pose (not colliding or noisy), and correct grammar, resulting in 1,412 clean tasks.
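
A minimal sketch of the pose-sampling part of the supplementation step, under stated assumptions: the paper specifies only the 200 m constraint, so the uniform-over-area sampling, grid snapping, and random heading below are illustrative choices, not the annotation tooling.

```python
import math
import random

def sample_initial_pose(target_xy, max_r=200.0, grid=1.0):
    """Draw one candidate initial pose within max_r of the target,
    snapped to the simulator's 1 m grid."""
    r = max_r * math.sqrt(random.random())   # sqrt => uniform over the disc area
    theta = random.uniform(0.0, 2.0 * math.pi)
    x = round((target_xy[0] + r * math.cos(theta)) / grid) * grid
    y = round((target_xy[1] + r * math.sin(theta)) / grid) * grid
    yaw = random.uniform(0.0, 360.0)          # heading also assumed random
    return (x, y, yaw)

# Each base QA is expanded to four variants:
variants = [sample_initial_pose((120.0, 250.0)) for _ in range(4)]
```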

The category distribution is balanced across the six types; every task passed verification, and 20% of pose annotations were cross-checked (Zhao et al., 18 Feb 2025).

5. Evaluation Protocols and Metrics

CityEQA-EC establishes rigorous, multi-modal evaluation protocols:

  • Question Answering Accuracy (QAA) (see the computation sketch after this list):
    • Each model answer $\hat{y}$ is scored by a GPT-4 judge in a zero-shot prompt setting, which returns a rating $\theta \in \{1, 2, 3, 4, 5\}$.
    • Mean QAA is computed as $\langle \text{QAA} \rangle = \frac{1}{N} \sum_{i=1}^{N} \theta_i$.
    • Human–LLM agreement is high (Spearman $r_s = 0.85$).
  • Navigation Accuracy (NA):
    • $NA = \frac{|\{\, i : \| p_{final} - p^{tar} \|_2 < 3\,\text{m} \,\}|}{N}$
    • The mean positional error $\bar{D} = \frac{1}{N} \sum_{i=1}^{N} \| p_{final} - p^{tar} \|_2$ is also reported.
  • Mean Time Steps (MTS):
    • $MTS = \frac{1}{N} \sum_{i=1}^{N} T_i$, with $T_i$ capped at 50 steps per episode.
  • Path Efficiency / SPL (optional, prospective):
    • $SPL_i = S_i \cdot \frac{\ell_i^*}{\ell_i}$, where $S_i \in \{0, 1\}$ indicates navigation success, $\ell_i^*$ is the shortest-path length, and $\ell_i$ the length actually traversed.
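
The sketch below computes these aggregates from per-episode logs. It mirrors the formulas above rather than the released evaluation code; the record keys ('theta', 'p_final', 'p_tar', 'steps', and the optional 'l_star'/'l' path lengths) are assumptions.

```python
import numpy as np

def summarize(episodes, success_radius=3.0, max_steps=50):
    """Aggregate QAA, NA, mean error D_bar, MTS, and (optionally) SPL.
    Record keys are assumptions, not the benchmark's released schema."""
    theta = np.array([ep["theta"] for ep in episodes], dtype=float)
    err = np.array([np.linalg.norm(np.asarray(ep["p_final"], dtype=float)
                                   - np.asarray(ep["p_tar"], dtype=float))
                    for ep in episodes])
    steps = np.array([min(ep["steps"], max_steps) for ep in episodes], dtype=float)

    metrics = {
        "QAA": float(theta.mean()),                  # mean 1-5 judge rating
        "NA": float((err < success_radius).mean()),  # share of finals within 3 m
        "D_bar": float(err.mean()),                  # mean final positional error (m)
        "MTS": float(steps.mean()),                  # mean capped episode length
    }
    # SPL per the document's S_i * l*_i / l_i form, only when path lengths exist.
    if all("l_star" in ep and "l" in ep for ep in episodes):
        success = (err < success_radius).astype(float)
        ratios = np.array([ep["l_star"] / max(ep["l"], 1e-9) for ep in episodes])
        metrics["SPL"] = float((success * ratios).mean())
    return metrics
```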

These metrics jointly assess language understanding, perception, and goal-driven navigation (Zhao et al., 18 Feb 2025).

6. Data Access, Tools, and Reproducibility

CityEQA-EC is distributed under CC BY-NC-4.0 (data) and MIT (code) licenses. The project resources are:

  • Code and dataset: https://github.com/BiluYong/CityEQA.git, including JSON manifests and RGB/depth image samples distributed via Git LFS (a hypothetical loading sketch follows this list).
  • Dependencies: Python 3.8+, PyTorch 2.x, Transformers, AirSim Python API, GroundSAM, GPT-4/GPT-4o (or Qwen2.5) API, and Unreal Engine 4.
  • Reproducibility: Fixed model seeds and prompt templates ensure consistency. Playback scripts allow any of the 1,412 tasks to be re-executed in EmbodiedCity with metric logging.
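
As a usage illustration, a hypothetical loader for the JSON manifests; the glob pattern and file layout are assumptions, not the repository's documented structure.

```python
import json
from pathlib import Path

def load_tasks(repo_root: str):
    """Collect task records from JSON manifests found under the checkout."""
    tasks = []
    for manifest in sorted(Path(repo_root).glob("**/*.json")):
        with open(manifest, encoding="utf-8") as f:
            data = json.load(f)
        tasks.extend(data if isinstance(data, list) else [data])
    return tasks

tasks = load_tasks("CityEQA")
print(f"Loaded {len(tasks)} task records")
```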

This infrastructure enables systematic benchmarking and ablation studies for embodied urban agents (Zhao et al., 18 Feb 2025).

7. Significance and Future Directions

CityEQA-EC establishes a new standard for evaluating embodied agents on urban question answering, surpassing the limitations of prior indoor EQA datasets. The baseline Planner-Manager-Actor (PMA) agent reaches 60.7% of human-level accuracy, leaving a substantial gap, particularly in visual reasoning and fine-grained urban perception (Zhao et al., 18 Feb 2025). This suggests considerable scope for methodological innovation in perception-driven exploration, localization, and open-world spatial language understanding within embodied AI.

As the first comprehensive benchmark for open-vocabulary EQA in realistic city environments, CityEQA-EC offers a foundation for future research into hierarchical planning, spatial cognition, and long-horizon reasoning in embodied systems.
