ELA-ZSON: Efficient Layout-Aware Zero-Shot Object Navigation Agent with Hierarchical Planning (2505.06131v1)

Published 9 May 2025 in cs.RO

Abstract: We introduce ELA-ZSON, an efficient layout-aware zero-shot object navigation (ZSON) approach designed for complex multi-room indoor environments. By planning hierarchically, leveraging a global topological map with layout information and a local imperative approach with detailed scene representation memory, ELA-ZSON achieves both efficient and effective navigation. The process is managed by an LLM-powered agent, ensuring seamless, effective planning and navigation without the need for human interaction, complex rewards, or costly training. Our experimental results on the MP3D benchmark show an 85% object navigation success rate (SR) and a 79% success rate weighted by path length (SPL) (an improvement of over 40 percentage points in SR and a 60% improvement in SPL compared to existing methods). Furthermore, we validate the robustness of our approach through virtual agent and real-world robotic deployment, showcasing its capability in practical scenarios. See https://anonymous.4open.science/r/ELA-ZSON-C67E/ for details.

Summary

An Overview of ELA-ZSON: Efficient Layout-Aware Zero-Shot Object Navigation Agent with Hierarchical Planning

The paper presents ELA-ZSON, a novel approach to zero-shot object navigation (ZSON) in complex multi-room indoor environments. ELA-ZSON plans hierarchically and is tailored for efficient and effective navigation, a key capability for household robotics. It combines a global topological map that carries layout information with a local imperative approach that draws on a detailed scene representation memory.

ELA-ZSON uses an LLM to power the navigation agent, which manages tasks autonomously without requiring human interaction, complex rewards, or expensive training regimes. This contrasts with traditional methods that rely on extensive training data and reward design, and it underlines ELA-ZSON’s efficiency and practicality for real-world deployment.

Methodology

The framework operates on a dual-level information hierarchy consisting of:

Global Topological Map: Provides the basis for coarse route planning using layout information.
Local Imperative Approach: Uses a detailed scene representation memory to adjust navigation dynamically.

The hierarchical planning paradigm begins by encoding user instructions (textual, visual, or positional) into embeddings using vision-language models. The embeddings are then used to query the implicit neural function that represents the environmental scene, which yields the position of the target object matching the user’s query.
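The query step can be illustrated with a deliberately simplified stand-in. The sketch below assumes the scene memory is a plain lookup table of object embeddings and positions and matches by cosine similarity; the paper instead queries an implicit neural function, so the table, the three-dimensional embeddings, and the names `cosine_similarity`, `locate_target`, and `SCENE_MEMORY` are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def locate_target(query_embedding, scene_memory):
    """Return (label, position, score) of the stored object closest to the query."""
    best = None
    for label, (embedding, position) in scene_memory.items():
        score = cosine_similarity(query_embedding, embedding)
        if best is None or score > best[2]:
            best = (label, position, score)
    return best


# Hypothetical toy memory: made-up 3-D embeddings and (x, y) positions in metres.
# A real system would obtain embeddings from a vision-language encoder.
SCENE_MEMORY = {
    "sofa": ([0.9, 0.1, 0.0], (2.0, 3.5)),
    "sink": ([0.0, 0.8, 0.6], (7.5, 1.0)),
    "bed": ([0.1, 0.0, 0.95], (4.0, 8.0)),
}

# A made-up query embedding that lies closest to the stored "sink" vector.
label, position, score = locate_target([0.05, 0.75, 0.65], SCENE_MEMORY)
print(label, position)
```

The returned position would then serve as the destination for route planning.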

For the global route, waypoints are identified by querying a topological graph connecting the robot’s starting point to the destination, generating a sequence of vertices the robot needs to traverse. These vertices typically denote major structures, such as room entries or connecting corridors. Local navigation is handled through dense waypoints between each pair of global waypoints, accommodating unexpected environmental changes with greater flexibility and robustness.
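The two planning levels described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the topological graph is an adjacency dictionary with 2-D vertex positions, uses Dijkstra's algorithm for the global route, and fills in local waypoints by straight-line interpolation. The vertex names, the `spacing` parameter, and the functions `shortest_route` and `densify` are hypothetical.

```python
import heapq
import math


def shortest_route(graph, positions, start, goal):
    """Dijkstra over the topological graph; edge cost is Euclidean distance."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:
                step = math.dist(positions[node], positions[neighbor])
                heapq.heappush(frontier, (cost + step, neighbor, path + [neighbor]))
    return None  # goal is unreachable from start


def densify(positions, route, spacing=0.5):
    """Insert evenly spaced local waypoints between consecutive global waypoints."""
    dense = [positions[route[0]]]
    for a, b in zip(route, route[1:]):
        (x0, y0), (x1, y1) = positions[a], positions[b]
        segments = max(1, math.ceil(math.dist((x0, y0), (x1, y1)) / spacing))
        for i in range(1, segments + 1):
            t = i / segments
            dense.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return dense


# Hypothetical layout: three room entries connected through one corridor.
positions = {
    "bedroom_door": (0.0, 0.0),
    "corridor": (3.0, 0.0),
    "kitchen_door": (3.0, 4.0),
    "living_room_door": (6.0, 0.0),
}
graph = {
    "bedroom_door": ["corridor"],
    "corridor": ["bedroom_door", "kitchen_door", "living_room_door"],
    "kitchen_door": ["corridor"],
    "living_room_door": ["corridor"],
}

route = shortest_route(graph, positions, "bedroom_door", "kitchen_door")
dense = densify(positions, route, spacing=1.0)
print(route)
print(dense)
```

Straight-line interpolation ignores obstacles. In a real system the local planner would replan each segment against the scene memory, which is where the flexibility described above comes from.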

Results

ELA-ZSON reports strong results across its experimental setups. On the MP3D benchmark, it achieves an object navigation success rate (SR) of 85% and a success rate weighted by path length (SPL) of 79%, substantially outperforming previous state-of-the-art methods. According to the abstract, this corresponds to an improvement of more than 40 percentage points in SR and 60% in SPL over existing methods, which the authors present as evidence of efficiency and robustness in diverse indoor scenes.
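For readers unfamiliar with the two metrics, the sketch below computes them under the standard embodied-navigation definitions: SR is the fraction of successful episodes, and SPL averages, over all episodes, the success indicator times the ratio of shortest-path length to the larger of the shortest and actual path lengths. It assumes the paper follows these definitions. The episode data are invented for illustration and are not results from the paper.

```python
def success_rate(episodes):
    """Fraction of episodes in which the agent reached the target."""
    return sum(1 for e in episodes if e["success"]) / len(episodes)


def spl(episodes):
    """Success weighted by path length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e["success"]:
            # A failed episode contributes 0; a success contributes at most 1.
            total += e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)


# Hypothetical episodes: path lengths in metres, invented for this example.
episodes = [
    {"success": True, "shortest": 10.0, "taken": 10.0},
    {"success": True, "shortest": 8.0, "taken": 16.0},
    {"success": False, "shortest": 5.0, "taken": 12.0},
    {"success": True, "shortest": 6.0, "taken": 8.0},
]

sr_value = success_rate(episodes)
spl_value = spl(episodes)
print(sr_value, spl_value)
```

Because each successful episode is scaled by a ratio no greater than 1, SPL can never exceed SR. An SPL close to SR therefore indicates that successful paths were close to the shortest possible.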

Additionally, the approach’s applicability and resilience were validated through both virtual agent experiments and real-world robotic implementation, showcasing ELA-ZSON’s capability to handle practical deployment scenarios effectively.

Implications and Future Work

ELA-ZSON offers insight into hierarchical planning for robotic navigation, making the case for combining high-level layout awareness with detailed local adaptability. This has implications for the development of robust, autonomous navigation systems that can be deployed in unstructured environments without prior human calibration or extensive training.

Future research could explore enhancing local planning strategies to leverage scene memory for refined obstacle avoidance and efficiency. Additionally, developing mechanisms for real-time scene updates to integrate detected changes dynamically into scene representations can further advance robustness and adaptability under varying conditions.