CityEQA-EC: Urban Embodied EQA Benchmark
- CityEQA-EC is a benchmark that defines over 1,400 human-validated tasks in a continuous 3D urban simulator to evaluate hierarchical planning and spatial reasoning.
- It employs detailed annotation and validation pipelines to address challenges like environmental ambiguity, open-vocabulary queries, and perceptual variability in urban settings.
- Evaluation metrics such as navigation accuracy, question answering accuracy, and mean time steps provide rigorous benchmarks for advancing embodied AI research.
CityEQA-EC is the first large-scale, open-vocabulary embodied question answering (EQA) benchmark focused on urban outdoor environments. Comprising 1,412 human-validated tasks, it operates within a single coherent 3D city world and is designed to capture the complexity of real urban scenes, agent interactions, and perceptual demands missing from prior indoor benchmarks. CityEQA-EC provides an environment for developing and evaluating embodied agents on question answering tasks that require hierarchical planning, spatial reasoning, and detailed visual inspection in an urban simulator framework (Zhao et al., 18 Feb 2025).
1. Motivation and Distinctive Challenges
Traditional EQA datasets have emphasized indoor domains such as House3D, AI2-THOR, and Matterport3D, where navigation and perception are relatively constrained. CityEQA-EC addresses the lack of outdoor, urban-centric EQA by introducing new sources of complexity:
- Environmental ambiguity: Urban objects such as buildings, vehicles, and street furniture are visually complex and often similar, increasing task difficulty.
- Hierarchical and long-horizon action: Agents must traverse distances of tens to hundreds of meters (macro-navigation) and still analyze small-scale details (micro-perception).
- Open-vocabulary tasks: Questions are grounded in landmarks and spatial relationships, often invoking conversational, real-world terminology.
- Perceptual variability: Urban features exhibit dramatic appearance changes due to angle, occlusion, and lighting, challenging object recognition and position estimation.
This formulation establishes CityEQA-EC as a critical resource for research into urban spatial intelligence and long-horizon embodied reasoning (Zhao et al., 18 Feb 2025).
2. Task Types and Dataset Structure
CityEQA-EC consists of 1,412 distinct tasks, all human-annotated and validated, each comprising an open-vocabulary question and free-form answer. The annotation strategy yields high diversity and rigor across six canonical question categories:
| Category | Example Template | Approx. Count |
|---|---|---|
| Object Recognition | “What is the color of the car immediately east of X?” | 240 |
| Object Counting | “How many red buses are parked west of building Y?” | 225 |
| Spatial-Relation Query | “Which landmark is directly north of the clock tower?” | 230 |
| Attribute Query | “What brand is the café across from the museum?” | 235 |
| Comparison/Inference | “Is the white bus larger than the yellow bus?” | 230 |
| World Knowledge | “Which building houses the city museum?” | 252 |
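As listed, the per-category counts sum to the full benchmark size: 240 + 225 + 230 + 235 + 230 + 252 = 1,412 tasks.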
Each task is:
- Defined by a tuple $(E, Q, A, P_0)$: environment $E$ (always EmbodiedCity), question string $Q$, answer string $A$, and initial agent pose $P_0$ (a record sketch appears at the end of this section).
- Grounded relative to >200 uniquely annotated buildings used as reference landmarks.
- Constrained to a 400 m × 400 m city region with spatial relationships referencing cardinal directions.
Statistics indicate mean question length ≈18 words (σ ≈4.5), and a mean of 2.9 objects involved per task (σ ≈1.0) (Zhao et al., 18 Feb 2025).
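A minimal sketch of how one such task record might be represented in Python is shown below; the field names and the pose convention are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical container mirroring the task tuple (E, Q, A, P_0);
# field names and the (x, y, z, yaw) pose convention are assumptions.
@dataclass
class CityEQATask:
    environment: str                                  # always "EmbodiedCity"
    question: str                                     # open-vocabulary question string
    answer: str                                       # free-form ground-truth answer
    init_pose: Tuple[float, float, float, float]      # initial agent pose (x, y, z, yaw)
    target_pose: Tuple[float, float, float, float]    # canonical observation pose

# Illustrative example (values invented for demonstration only).
example = CityEQATask(
    environment="EmbodiedCity",
    question="What is the color of the car immediately east of the clock tower?",
    answer="Red.",
    init_pose=(120.0, 85.0, 1.7, 90.0),
    target_pose=(154.0, 85.0, 1.7, 90.0),
)
```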
3. Urban Simulator and Sensory Modalities
Tasks are instantiated in the EmbodiedCity simulator (built atop Unreal Engine 4 and Microsoft AirSim). The simulator characteristics are:
- Monolithic 3D city world: All agents and tasks operate within a single continuous environment, as opposed to disjoint “scenes.”
- Spatial granularity: a 400 m × 400 m operating region with 1 m grid resolution.
- Static world elements: Buildings, vehicles (cars, buses, trucks), street furniture, signs, landmarks.
- Sensory streams at each timestep $t$ (see the capture sketch at the end of this section):
- RGB image
- Depth image
- Data format per task: JSON with environment, question, answer, initial and target poses, and a canonical observation pose snapshot.
There are no dynamic pedestrians or traffic in this release. This constraint enables precise evaluation and control (Zhao et al., 18 Feb 2025).
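Because EmbodiedCity runs on Microsoft AirSim, per-timestep RGB and depth observations can in principle be pulled through the standard AirSim Python API. The snippet below is a generic capture sketch under that assumption, not the benchmark's own data-collection code; the camera name "0" and default connection settings are assumptions.

```python
import numpy as np
import airsim  # pip install airsim

# Connect to a running AirSim instance (default host/port assumed).
client = airsim.MultirotorClient()
client.confirmConnection()

# Request one RGB frame and one depth frame from camera "0" in a single call.
rgb_resp, depth_resp = client.simGetImages([
    airsim.ImageRequest("0", airsim.ImageType.Scene, pixels_as_float=False, compress=False),
    airsim.ImageRequest("0", airsim.ImageType.DepthPerspective, pixels_as_float=True),
])

# RGB arrives as a flat uint8 buffer; reshape into H x W x 3.
rgb = np.frombuffer(rgb_resp.image_data_uint8, dtype=np.uint8)
rgb = rgb.reshape(rgb_resp.height, rgb_resp.width, 3)

# Depth arrives as a flat float list in metres; convert to a 2-D array.
depth = airsim.list_to_2d_float_array(
    depth_resp.image_data_float, depth_resp.width, depth_resp.height
)
```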
4. Annotation and Validation Protocol
The dataset is constructed via a multistage annotation pipeline:
- Raw QA generation: Annotators explore EmbodiedCity, generate open-ended questions and answers at sampled camera poses, and record ground-truth target poses for relevant objects (443 base question-answer pairs).
- Supplementation: Each base QA is expanded to four variants by sampling initial agent poses within 200 m of the target and enriching question phrasing with landmark-based spatial constraints, yielding at least 2,212 candidate tasks.
- Validation: Two reviewers per task ensure answerability, clarity, spatial disambiguation, a valid initial pose (not colliding/noisy), and correct grammar, resulting in 1,412 clean tasks.
The category distribution is balanced; all tasks undergo verification, and 20% of pose annotations are cross-checked (Zhao et al., 18 Feb 2025).
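The supplementation step above (sampling initial agent poses within 200 m of the target) can be sketched as follows; the annulus sampling, minimum radius, and pose convention are illustrative assumptions rather than the pipeline's actual procedure.

```python
import math
import random

def sample_initial_pose(target_xy, max_radius=200.0, min_radius=20.0):
    """Sample an (x, y, yaw) start pose within max_radius metres of the target.

    min_radius keeps starts from being trivially close to the target; both
    radii and the uniform-in-area sampling are assumptions for illustration.
    """
    # sqrt of a uniform draw over squared radii gives area-uniform sampling.
    r = math.sqrt(random.uniform(min_radius ** 2, max_radius ** 2))
    theta = random.uniform(0.0, 2.0 * math.pi)
    x = target_xy[0] + r * math.cos(theta)
    y = target_xy[1] + r * math.sin(theta)
    yaw = random.uniform(0.0, 360.0)  # random heading in degrees
    return (x, y, yaw)

# Example: four candidate start poses for one annotated target location.
target = (150.0, 90.0)
candidates = [sample_initial_pose(target) for _ in range(4)]
```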
5. Evaluation Protocols and Metrics
CityEQA-EC establishes rigorous, multi-modal evaluation protocols:
- Question Answering Accuracy (QAA):
- Each model output answer is scored by a GPT-4 judge with a zero-shot prompt, which returns a rating $s_i$ for task $i$.
- Mean QAA is calculated as $\mathrm{QAA} = \frac{1}{N}\sum_{i=1}^{N} s_i$.
- Human–LLM judge agreement is high (Spearman rank correlation).
- Navigation Accuracy (NA):
- Assesses how accurately the agent reaches the ground-truth observation pose for the task.
- Also report the mean positional error $\bar{e} = \frac{1}{N}\sum_{i=1}^{N} \lVert \hat{p}_i - p_i^{*} \rVert_2$, where $\hat{p}_i$ is the final agent position and $p_i^{*}$ the annotated target position.
- Mean Time Steps (MTS):
- $\mathrm{MTS} = \frac{1}{N}\sum_{i=1}^{N} T_i$, with the per-episode step count $T_i$ capped at 50.
- Path Efficiency / SPL (optional, prospective):
- $\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(\ell_i, d_i)}$, where $S_i \in \{0,1\}$ indicates navigation success, $\ell_i$ is the shortest-path length, and $d_i$ is the length of the path actually traveled.
These metrics jointly assess language understanding, perception, and goal-driven navigation (Zhao et al., 18 Feb 2025).
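Under the definitions above, per-episode logs could be aggregated as in the following sketch; the field names, the 0–1 rating scale, and the episode dictionary layout are assumptions for illustration, not the benchmark's evaluation code.

```python
import math

def score_episodes(episodes, step_cap=50):
    """Aggregate QAA, mean navigation error, MTS, and SPL over episode logs.

    Each episode dict is assumed (hypothetically) to contain:
      llm_rating     - judge rating in [0, 1]
      final_pos      - (x, y) final agent position
      target_pos     - (x, y) ground-truth observation position
      steps          - number of time steps taken
      shortest_path  - shortest-path length to the target (metres)
      path_length    - length of the path actually traveled (metres)
      success        - 1 if navigation succeeded, else 0
    """
    n = len(episodes)
    qaa = sum(e["llm_rating"] for e in episodes) / n
    nav_err = sum(math.dist(e["final_pos"], e["target_pos"]) for e in episodes) / n
    mts = sum(min(e["steps"], step_cap) for e in episodes) / n
    spl = sum(
        e["success"] * e["shortest_path"] / max(e["shortest_path"], e["path_length"])
        for e in episodes
    ) / n
    return {"QAA": qaa, "MeanNavError": nav_err, "MTS": mts, "SPL": spl}
```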
6. Data Access, Tools, and Reproducibility
CityEQA-EC is distributed under CC BY-NC-4.0 (data) and MIT (code) licenses. The project resources are:
- Code and dataset: https://github.com/BiluYong/CityEQA.git, including JSON manifests and RGB/depth image samples distributed via Git LFS.
- Dependencies: Python 3.8+, PyTorch 2.x, Transformers, AirSim Python API, GroundSAM, GPT-4/GPT-4o (or Qwen2.5) API, and Unreal Engine 4.
- Reproducibility: Fixed model seeds and prompt templates ensure consistency. Playback scripts allow any of the 1,412 tasks to be re-executed in EmbodiedCity with metric logging.
A plausible implication is that this infrastructure enables systematic benchmarking and ablation studies for embodied urban agents (Zhao et al., 18 Feb 2025).
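A deterministic evaluation loop over the released task manifests might look like the sketch below; load_tasks, run_agent, the manifest layout, and the seeding scheme are hypothetical placeholders rather than the repository's actual interfaces.

```python
import json
import random

def load_tasks(manifest_path):
    """Load task records from a JSON manifest (hypothetical file layout)."""
    with open(manifest_path, "r", encoding="utf-8") as f:
        return json.load(f)

def evaluate(manifest_path, run_agent, seed=0):
    """Run an agent callable on every task under a fixed seed and collect logs.

    run_agent(task) is assumed to return an episode dict compatible with the
    score_episodes() sketch in the metrics section above.
    """
    random.seed(seed)  # fixed seed for any stochastic choices inside the agent
    tasks = load_tasks(manifest_path)
    return [run_agent(task) for task in tasks]
```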
7. Significance and Future Directions
CityEQA-EC establishes a new standard for evaluating embodied agents on urban question answering, surpassing limitations of prior indoor EQA datasets. Notably, the Planner-Manager-Actor (PMA) agent reaches 60.7% of human-level accuracy, leaving a clear gap to humans, particularly in visual reasoning and fine-grained urban perception (Zhao et al., 18 Feb 2025). This suggests substantial scope for methodological innovation in perception-driven exploration, localization, and open-world spatial language understanding within embodied AI.
As the first comprehensive benchmark for open-vocabulary EQA in realistic city environments, CityEQA-EC offers a foundation for future research into hierarchical planning, spatial cognition, and long-horizon reasoning in embodied systems.