DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation

Published 28 May 2025 in cs.RO and cs.AI | (2505.21969v3)

Abstract: Adaptive navigation in unfamiliar environments is crucial for household service robots but remains challenging due to the need for both low-level path planning and high-level scene understanding. While recent vision-language model (VLM) based zero-shot approaches reduce dependence on prior maps and scene-specific training data, they face significant limitations: spatiotemporal discontinuity from discrete observations, unstructured memory representations, and insufficient task understanding leading to navigation failures. We propose DORAEMON (Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation), a novel cognitive-inspired framework consisting of Ventral and Dorsal Streams that mimics human navigation capabilities. The Dorsal Stream implements the Hierarchical Semantic-Spatial Fusion and Topology Map to handle spatiotemporal discontinuities, while the Ventral Stream combines RAG-VLM and Policy-VLM to improve decision-making. Our approach also develops Nav-Ensurance to ensure navigation safety and efficiency. We evaluate DORAEMON on the HM3D, MP3D, and GOAT datasets, where it achieves state-of-the-art performance on both success rate (SR) and success weighted by path length (SPL) metrics, significantly outperforming existing methods. We also introduce a new evaluation metric (AORI) to assess navigation intelligence better. Comprehensive experiments demonstrate DORAEMON's effectiveness in zero-shot autonomous navigation without requiring prior map building or pre-training.


Summary

Overview of DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation

The paper introduces DORAEMON, a novel framework aimed at improving autonomous navigation for household service robots in unfamiliar environments. The complexity of this task arises from the necessity to balance low-level path planning with high-level scene understanding. Traditional navigation strategies often rely on pre-built maps or extensive scene-specific training data, which are impractical in novel settings due to the significant time and manual effort required. Recent zero-shot approaches utilizing vision-language models (VLMs) offer an intriguing alternative by using textual descriptions and visual inputs to perform navigation without predetermined scene data. However, these methods are hampered by spatiotemporal discontinuity, unstructured memory representations, and insufficient understanding of task goals.

Key Contributions of DORAEMON

Dual-Stream Architecture: Inspired by cognitive science, DORAEMON features two distinct streams, ventral and dorsal, that emulate human navigation faculties. The Dorsal Stream employs Hierarchical Semantic-Spatial Fusion and a Topology Map to address spatiotemporal discontinuities, while the Ventral Stream combines a retrieval-augmented VLM (RAG-VLM) with a Policy-VLM for improved task comprehension and decision-making.
Memory-Oriented Navigation: The structured memory architecture within DORAEMON helps the robot maintain a coherent understanding of an unseen environment as it moves through it. The system records spatial relationships and organizes semantic information hierarchically, so that relevant observations can be retrieved and reasoned over during navigation.
Nav-Ensurance System: The framework also includes the Nav-Ensurance system, which combines multidimensional stuck detection, context-aware escape strategies, and adaptive precision navigation mechanisms. It targets the reliability and efficiency of navigation.
Evaluation Metric - AORI: A new Adaptive Online Route Index (AORI) metric is proposed to assess the system's navigation intelligence, focusing on spatial overlap and exploration density.
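To make the stuck-detection idea in the Nav-Ensurance item concrete, here is a minimal Python sketch. It is not the paper's implementation: the summary does not describe which signals Nav-Ensurance monitors, so this example assumes a single signal, net displacement over a sliding window of recent positions. The class name `StuckDetector` and the parameters `window` and `min_displacement` are hypothetical.

```python
import math
from collections import deque


class StuckDetector:
    """Hypothetical single-signal stuck detector (an assumption, not the paper's method).

    The agent is flagged as stuck when its net displacement over the last
    `window` recorded positions falls below `min_displacement` (same units
    as the positions, e.g. metres).
    """

    def __init__(self, window: int = 5, min_displacement: float = 0.2):
        self.window = window
        self.min_displacement = min_displacement
        # Only the most recent `window` positions are kept.
        self.positions = deque(maxlen=window)

    def update(self, x: float, y: float) -> bool:
        """Record a new position and report whether the agent looks stuck."""
        self.positions.append((x, y))
        if len(self.positions) < self.window:
            # Not enough history yet to make a decision.
            return False
        x0, y0 = self.positions[0]
        net = math.hypot(x - x0, y - y0)
        return net < self.min_displacement


detector = StuckDetector(window=3, min_displacement=0.5)
flags = [detector.update(x, y) for x, y in [(0.0, 0.0), (0.1, 0.0), (0.1, 0.1), (2.0, 0.0)]]
print(flags)
```

A fuller detector would presumably combine several such signals, and an escape strategy would be triggered once the flag is raised; both are outside the scope of this sketch.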

Experimental Results

On the HM3D, MP3D, and GOAT datasets, DORAEMON achieves state-of-the-art results on both success rate (SR) and success weighted by path length (SPL). The AORI metric complements these scores by penalizing redundant exploration, so it reflects how efficiently the agent explores and not only whether it reaches the goal. The paper benchmarks DORAEMON against existing zero-shot methods and reports significant improvements across various models and datasets.
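For readers unfamiliar with the two headline metrics, the following Python sketch computes SR and SPL using the standard definition from the embodied-navigation literature: SPL averages, over all episodes, the success indicator times the ratio of the shortest-path length to the larger of the agent's path length and the shortest-path length. The `Episode` record and the example values are illustrative assumptions, not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    success: bool          # did the agent reach the goal?
    shortest_path: float   # shortest start-to-goal path length (assumed > 0)
    agent_path: float      # length of the path the agent actually took


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by path length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e.success:
            # Failed episodes contribute zero; successful ones are
            # discounted when the agent's path exceeds the shortest path.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Illustrative episodes only (made-up numbers).
episodes = [
    Episode(True, 5.0, 10.0),   # success, path twice as long as needed
    Episode(True, 4.0, 4.0),    # success along the shortest path
    Episode(False, 3.0, 9.0),   # failure
    Episode(True, 2.0, 8.0),    # success, very inefficient path
]
print(success_rate(episodes), spl(episodes))
```

Because SPL only discounts successful episodes by total path length, it does not distinguish between a long direct detour and repeated coverage of the same area, which is the gap a redundancy-oriented metric such as AORI is meant to address.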

Implications and Future Directions

The proposed DORAEMON framework advances autonomous navigation by combining structured, memory-oriented methods with principles drawn from cognitive science. The implications range from improved robotic assistance in household environments to broader applications in unstructured and dynamic settings. The paper provides a foundation for further research into memory-enhanced robotic navigation and points toward efficient, autonomous exploration of previously unseen environments.

Looking ahead, progress in vision-language models, together with continued research in cognitive science, could help refine frameworks like DORAEMON. As models improve their semantic reasoning capabilities, integrating adaptable, context-aware navigation systems could substantially change how autonomous robots interact with novel environments.