CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space (2502.12532v3)

Published 18 Feb 2025 in cs.AI

Abstract: Embodied Question Answering (EQA) has primarily focused on indoor environments, leaving the complexities of urban settings (spanning environment, action, and perception) largely unexplored. To bridge this gap, we introduce CityEQA, a new task where an embodied agent answers open-vocabulary questions through active exploration in dynamic city spaces. To support this task, we present CityEQA-EC, the first benchmark dataset featuring 1,412 human-annotated tasks across six categories, grounded in a realistic 3D urban simulator. Moreover, we propose Planner-Manager-Actor (PMA), a novel agent tailored for CityEQA. PMA enables long-horizon planning and hierarchical task execution: the Planner breaks down the question answering into sub-tasks, the Manager maintains an object-centric cognitive map for spatial reasoning during the process control, and the specialized Actors handle navigation, exploration, and collection sub-tasks. Experiments demonstrate that PMA achieves 60.7% of human-level answering accuracy, significantly outperforming competitive baselines. While promising, the performance gap compared to humans highlights the need for enhanced visual reasoning in CityEQA. This work paves the way for future advancements in urban spatial intelligence. Dataset and code are available at https://github.com/BiluYong/CityEQA.git.

Summary

  • The paper introduces CityEQA, a new embodied question answering task set in complex 3D urban environments, together with CityEQA-EC, the first benchmark dataset of open-vocabulary tasks for it.
  • It proposes PMA, a hierarchical LLM agent architecture that reaches 60.7% of human-level answering accuracy on CityEQA tasks, significantly outperforming baseline methods while still trailing human performance.
  • The study highlights the significant challenges of urban EQA, particularly in visual reasoning, emphasizing the need for enhanced agent capabilities to bridge the performance gap.

The paper "CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space" proposes a novel approach for tackling the task of Embodied Question Answering (EQA) in complex urban environments through a new benchmark, CityEQA. Unlike traditional EQA tasks that focus on indoor settings, CityEQA challenges an agent to answer open-vocabulary questions by exploring dynamic city spaces, presenting unique challenges in environmental, action, and perception complexity.

Key Contributions:

  1. CityEQA-EC Dataset:
    • A benchmark dataset called CityEQA-EC is introduced, comprising 1,412 human-annotated tasks across six categories, within a realistic 3D urban simulator. The tasks are designed with open-vocabulary questions that require urban landmark identification and spatial reasoning.
  2. PMA (Planner-Manager-Actor) Agent:
    • A novel hierarchical agent architecture, PMA, is proposed to address CityEQA tasks (a control-flow sketch follows this list). It consists of three components:
      • Planner: Decomposes the question into sub-tasks (navigation, exploration, and collection) for long-horizon task execution.
      • Manager: Maintains an object-centric cognitive map for spatial reasoning and oversees task execution.
      • Actors: Specialized modules handle the navigation, exploration, and collection sub-tasks; the Collector integrates a multi-modal LLM (MM-LLM) that refines visual observations and generates the final answer.
  3. Experimental Results:
    • PMA significantly outperforms traditional frontier-based exploration agents but reaches only 60.7% of human-level answering accuracy. Its navigation and exploration strategies are effective, yet the remaining gap highlights the need for stronger visual reasoning capabilities.
  4. Challenges and Future Directions:
    • The paper acknowledges the considerable complexity of urban EQA tasks. The results underscore the need to strengthen the visual reasoning of embodied agents in urban settings in order to close the performance gap with humans.
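
For concreteness, the following is a minimal sketch of how a Planner-Manager-Actor control loop of this kind could be organized. All class and method names (`Planner.decompose`, `Manager.update_map`, `Actor.execute`, `SubTask`) are illustrative assumptions, not the paper's actual code or API; in the real system these steps are driven by LLM prompts and executed in a 3D urban simulator.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SubTask:
    kind: str           # "navigation" | "exploration" | "collection"
    target: str         # e.g. a landmark or object description
    done: bool = False


class Planner:
    """Decomposes an open-vocabulary question into ordered sub-tasks."""

    def decompose(self, question: str) -> List[SubTask]:
        # In the paper this step is LLM-driven; a fixed plan is returned
        # here purely to show the control flow.
        return [
            SubTask("navigation", "reference landmark"),
            SubTask("exploration", "target object near the landmark"),
            SubTask("collection", "answer-relevant observation"),
        ]


class Manager:
    """Maintains an object-centric cognitive map and schedules sub-tasks."""

    def __init__(self) -> None:
        self.cognitive_map: Dict[str, dict] = {}  # object id -> attributes

    def next_subtask(self, plan: List[SubTask]) -> Optional[SubTask]:
        return next((t for t in plan if not t.done), None)

    def update_map(self, observations: List[dict]) -> None:
        for obs in observations:
            self.cognitive_map[obs["id"]] = obs


class Actor:
    """Executes one sub-task and returns new observations."""

    def execute(self, subtask: SubTask, cognitive_map: Dict[str, dict]) -> List[dict]:
        # Placeholder: real actors issue movement/perception actions in the
        # simulator; the collection actor also queries a multi-modal LLM.
        subtask.done = True
        return [{"id": f"obs-{subtask.kind}", "about": subtask.target}]


def answer_question(question: str) -> Dict[str, dict]:
    planner, manager, actor = Planner(), Manager(), Actor()
    plan = planner.decompose(question)
    while (subtask := manager.next_subtask(plan)) is not None:
        observations = actor.execute(subtask, manager.cognitive_map)
        manager.update_map(observations)
    return manager.cognitive_map  # the final answer would be drawn from this map


if __name__ == "__main__":
    print(answer_question("What color is the awning of the cafe near the fountain?"))
```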

Technical Insights:

  • CityEQA environments introduce multiple complexities:
    • Environmental complexity involves ambiguous objects in the urban landscape that are often similar in appearance.
    • Action complexity demands cross-scale movement strategies to traverse and cover vast urban spaces efficiently.
    • Perception complexity arises from varied observations due to the dynamic nature of urban settings.
  • The dataset collection involves two steps: raw question-answer generation in a simulated urban environment with specified poses, followed by task supplementation and validation to ensure task clarity and uniqueness.
  • The PMA agent's design leverages LLMs for reasoning and planning, employing a detailed map-based approach to manage and process spatial information effectively.
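
As a rough illustration of the object-centric cognitive map described above, the sketch below stores observed objects with positions and supports a simple nearest-match query. The fields (`label`, `position`, `last_seen_step`) and the query logic are assumptions made for exposition, not the paper's actual data structure.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class MapObject:
    object_id: str
    label: str                             # open-vocabulary description, e.g. "cafe with red awning"
    position: Tuple[float, float, float]   # world coordinates in the simulator
    last_seen_step: int                    # timestep of the most recent observation


class CognitiveMap:
    """Stores observed objects and answers simple spatial queries."""

    def __init__(self) -> None:
        self.objects: Dict[str, MapObject] = {}

    def update(self, obj: MapObject) -> None:
        # Later observations of the same object overwrite earlier ones.
        self.objects[obj.object_id] = obj

    def nearest(self, label: str, origin: Tuple[float, float, float]) -> Optional[MapObject]:
        candidates = [o for o in self.objects.values() if label in o.label]
        if not candidates:
            return None
        return min(candidates, key=lambda o: math.dist(o.position, origin))


# Usage example with made-up objects and coordinates.
cmap = CognitiveMap()
cmap.update(MapObject("b1", "glass office building", (120.0, 3.0, -40.0), last_seen_step=12))
cmap.update(MapObject("b2", "cafe with red awning", (95.5, 1.5, -38.2), last_seen_step=30))
print(cmap.nearest("cafe", origin=(100.0, 1.5, -40.0)))
```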

In summary, the paper makes a substantial contribution to the field of embodied AI by addressing the underexplored area of question answering in city spaces. The proposed benchmarks and agent model provide a foundation for further research in urban spatial intelligence and embodied question answering.
