
DOZE: A Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments (2402.19007v2)

Published 29 Feb 2024 in cs.CV and cs.RO

Abstract: Zero-Shot Object Navigation (ZSON) requires agents to autonomously locate and approach unseen objects in unfamiliar environments and has emerged as a particularly challenging task within the domain of Embodied AI. Existing datasets for developing ZSON algorithms lack consideration of dynamic obstacles, object attribute diversity, and scene texts, thus exhibiting noticeable discrepancies from real-world situations. To address these issues, we propose a Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments (DOZE) that comprises ten high-fidelity 3D scenes with over 18k tasks, aiming to mimic complex, dynamic real-world scenarios. Specifically, DOZE scenes feature multiple moving humanoid obstacles, a wide array of open-vocabulary objects, diverse distinct-attribute objects, and valuable textual hints. Besides, different from existing datasets that only provide collision checking between the agent and static obstacles, we enhance DOZE by integrating capabilities for detecting collisions between the agent and moving obstacles. This novel functionality enables the evaluation of the agents' collision avoidance abilities in dynamic environments. We test four representative ZSON methods on DOZE, revealing substantial room for improvement in existing approaches concerning navigation efficiency, safety, and object recognition accuracy. Our dataset can be found at https://DOZE-Dataset.github.io/.


Summary

  • The paper introduces a dataset that simulates dynamic, complex environments for zero-shot navigation by incorporating moving obstacles, diverse objects, and textual hints.
  • It evaluates existing navigation methods in dynamic scenes, revealing significant limitations in collision avoidance and open-vocabulary object generalization.
  • The study suggests that hint-assisted navigation is a promising direction for developing more adaptive and real-world-ready AI systems.

Introducing DOZE: A Dataset Tailored for Zero-Shot Object Navigation in Dynamic Environments

Overview of DOZE

The dynamic and uncertain nature of real-world environments presents a significant challenge for embodied AI systems tasked with navigation and object recognition. Most existing datasets fail to capture the complexity of navigating through environments with moving obstacles, diverse object attributes, and incidental textual hints. Addressing this gap, the Dataset for Open-Vocabulary Zero-Shot Object Navigation in Dynamic Environments (DOZE) provides ten high-fidelity 3D scenes with over 18k tasks, designed to mimic the unpredictability and diversity of the real world.

Key Features of DOZE

DOZE stands out in several respects:

  • Dynamic Obstacles: Unlike conventional datasets that predominantly focus on static environments, DOZE incorporates moving humanoid obstacles, introducing a layer of temporal complexity that requires agile and predictive navigation strategies.
  • Diversity in Object Representation: DOZE features a wide range of open-vocabulary objects, including items with distinct spatial and appearance attributes, challenging agents to generalize to unseen object categories.
  • Textual Hints for Navigation: Unique to DOZE is the integration of textual hints within the environment, representing a step towards leveraging multimodal data for enhancing navigation efficacy.
  • Enhanced Collision Detection: Moving beyond traditional static-obstacle collision checking, DOZE detects collisions between the agent and moving obstacles, enabling direct evaluation of how well an agent adapts to environmental changes (see the sketch after this list).

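The paper's actual simulator interface is not reproduced here, but the idea behind dynamic collision checking is straightforward: after every simulation step, test whether the agent's footprint overlaps that of any moving humanoid. The following is a minimal, hypothetical sketch (class and function names are illustrative, not DOZE's API) that treats the agent and obstacles as 2-D cylinders.

```python
import math
from dataclasses import dataclass

@dataclass
class Cylinder:
    """2-D footprint (x, z position and radius) of the agent or a moving obstacle."""
    x: float
    z: float
    radius: float

def check_dynamic_collisions(agent: Cylinder, obstacles: list[Cylinder]) -> list[int]:
    """Return indices of moving obstacles whose footprint currently overlaps the agent's.

    Intended to be called once per simulation step, after the agent and the
    humanoid obstacles have all been advanced to their new poses.
    """
    hits = []
    for i, obs in enumerate(obstacles):
        distance = math.hypot(agent.x - obs.x, agent.z - obs.z)
        if distance < agent.radius + obs.radius:
            hits.append(i)
    return hits

# Example: one humanoid has walked into the agent's footprint, the other is far away.
agent = Cylinder(x=0.0, z=0.0, radius=0.2)
humanoids = [Cylinder(x=0.3, z=0.0, radius=0.25), Cylinder(x=5.0, z=2.0, radius=0.25)]
print(check_dynamic_collisions(agent, humanoids))  # -> [0]
```

A per-step check along these lines is what makes it possible to report collisions against moving obstacles rather than only against static geometry.
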
Evaluation and Insights

Evaluating four representative Zero-Shot Object Navigation methods on DOZE makes it evident that existing strategies leave substantial room for improvement. Even when augmented with collision-avoidance mechanisms, the assessed methods struggled with the dataset's dynamic obstacles and diverse object types. However, a preliminary hint-assisted navigation approach shows promise in guiding agents to their goals more efficiently, suggesting an intriguing direction for future research.
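
The evaluation is framed around navigation efficiency, safety, and object recognition accuracy. Two commonly used ObjectNav-style metrics for the first two axes are Success weighted by Path Length (SPL, Anderson et al., 2018) and a collision-based safety rate; whether DOZE reports exactly these definitions is not shown here, and the `Episode` fields below are hypothetical logging fields rather than part of the dataset's API.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """Hypothetical per-episode log; field names are illustrative, not DOZE's API."""
    success: bool          # agent stopped within the success radius of the goal
    path_length: float     # length of the path the agent actually travelled (m)
    shortest_path: float   # geodesic shortest-path length to the goal (m)
    collisions: int        # number of contacts with moving obstacles

def spl(episodes: list[Episode]) -> float:
    """Success weighted by Path Length: rewards successful episodes with short paths."""
    total = 0.0
    for ep in episodes:
        if ep.success:
            total += ep.shortest_path / max(ep.path_length, ep.shortest_path)
    return total / len(episodes)

def collision_free_rate(episodes: list[Episode]) -> float:
    """Fraction of episodes completed without touching a dynamic obstacle."""
    return sum(ep.collisions == 0 for ep in episodes) / len(episodes)

episodes = [
    Episode(success=True,  path_length=12.4, shortest_path=9.8, collisions=0),
    Episode(success=True,  path_length=10.1, shortest_path=9.8, collisions=2),
    Episode(success=False, path_length=25.0, shortest_path=7.5, collisions=1),
]
print(f"SPL: {spl(episodes):.3f}, collision-free rate: {collision_free_rate(episodes):.3f}")
```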

Implications and Future Directions

The DOZE dataset not only highlights the current limitations of AI agents in dealing with dynamic and complex environments but also sets the stage for the development of more robust and adaptive navigation systems. The inclusion of moving obstacles, diverse object attributes, and textual hints underscores the necessity of multimodal perception and agile decision-making in navigation tasks.

Some speculative avenues for further exploration include:

  • Improving Situational Awareness: Developing methods that can more effectively predict the trajectories of dynamic obstacles (see the sketch after this list).
  • Enhanced Object Recognition: Refining object detection and classification models to better handle open-vocabulary items and objects with subtle attribute differences.
  • Leveraging Textual Hints: Expanding the ability of navigation systems to process and act upon environmental textual information.
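
As a concrete, deliberately simple illustration of the first point above, the usual baseline for anticipating a moving obstacle is constant-velocity extrapolation from its recent observed positions; a planner can then treat the predicted positions as temporarily occupied space. This sketch is purely illustrative and not part of DOZE.

```python
import numpy as np

def predict_constant_velocity(track: np.ndarray, horizon: int) -> np.ndarray:
    """Extrapolate future (x, z) positions of a moving obstacle, assuming it keeps
    the velocity implied by its last two observed positions.

    track:   (T, 2) array of observed positions, sampled at the planning rate.
    horizon: number of future steps to predict.
    returns: (horizon, 2) array of predicted positions.
    """
    velocity = track[-1] - track[-2]                  # displacement per step
    steps = np.arange(1, horizon + 1).reshape(-1, 1)  # 1, 2, ..., horizon
    return track[-1] + steps * velocity

# A humanoid walking in +x at 0.1 m per step; predict its next 3 positions.
observed = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]])
print(predict_constant_velocity(observed, horizon=3))
# [[0.3 0. ]
#  [0.4 0. ]
#  [0.5 0. ]]
```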

In conclusion, DOZE offers a richer, more challenging benchmark for Zero-Shot Object Navigation, with its dynamic obstacles, diverse object representations, and integration of textual hints paving the way for the development of more capable, real-world-ready AI navigation systems.
