ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation (2301.13166v3)
Abstract: The ability to accurately locate and navigate to a specific object is a crucial capability for embodied agents that operate in the real world and interact with objects to complete tasks. Such object navigation tasks usually require large-scale training in visual environments with labeled objects, which generalizes poorly to novel objects in unknown environments. In this work, we present a novel zero-shot object navigation method, Exploration with Soft Commonsense constraints (ESC), which transfers commonsense knowledge in pre-trained models to open-world object navigation without any navigation experience or any other training on the visual environments. First, ESC leverages a pre-trained vision-and-language model for open-world, prompt-based grounding and a pre-trained commonsense language model for room and object reasoning. Then ESC converts this commonsense knowledge into navigation actions by modeling it as soft logic predicates for efficient exploration. Extensive experiments on the MP3D, HM3D, and RoboTHOR benchmarks show that ESC improves significantly over baselines and achieves new state-of-the-art results for zero-shot object navigation (e.g., a 288% relative Success Rate improvement over CoW on MP3D).
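The core idea in the abstract, ranking candidate exploration frontiers with soft (graded) commonsense constraints rather than hard rules, can be illustrated with a minimal sketch. Everything below (the `Frontier` class, `score_frontier`, and the co-occurrence numbers) is hypothetical and not taken from the authors' code; it only mirrors the general recipe: a commonsense language model supplies goal-room and goal-object co-occurrence likelihoods, and Lukasiewicz-style soft logic combines them with detection confidences to score frontiers.

```python
# Illustrative sketch of frontier scoring with soft commonsense constraints.
# All names and numbers here are assumptions for demonstration purposes only.
from dataclasses import dataclass
from typing import Dict


@dataclass
class Frontier:
    """A frontier point on the exploration map with nearby detections."""
    position: tuple                    # (x, y) on the 2D semantic map
    nearby_rooms: Dict[str, float]     # room label -> detection confidence
    nearby_objects: Dict[str, float]   # object label -> detection confidence


def soft_and(a: float, b: float) -> float:
    """Lukasiewicz t-norm used in probabilistic soft logic: max(a + b - 1, 0)."""
    return max(a + b - 1.0, 0.0)


def score_frontier(frontier: Frontier,
                   goal: str,
                   room_cooccur: Dict[str, float],
                   object_cooccur: Dict[str, float]) -> float:
    """Grade how promising a frontier is for finding `goal`.

    room_cooccur / object_cooccur map each room/object label to an
    LLM-derived likelihood of co-occurring with the goal. Soft logic turns
    rules like "NearRoom(f, r) AND Cooccur(goal, r) -> Explore(f)"
    into graded scores instead of hard constraints.
    """
    room_score = max(
        (soft_and(conf, room_cooccur.get(room, 0.0))
         for room, conf in frontier.nearby_rooms.items()),
        default=0.0,
    )
    object_score = max(
        (soft_and(conf, object_cooccur.get(obj, 0.0))
         for obj, conf in frontier.nearby_objects.items()),
        default=0.0,
    )
    return room_score + object_score


# Example: choosing the next frontier when the goal object is "bed".
frontiers = [
    Frontier((1.0, 2.0), {"kitchen": 0.9}, {"refrigerator": 0.8}),
    Frontier((4.0, 1.5), {"bedroom": 0.7}, {"nightstand": 0.6}),
]
room_cooccur = {"kitchen": 0.05, "bedroom": 0.95}        # assumed LLM outputs
object_cooccur = {"refrigerator": 0.05, "nightstand": 0.9}
best = max(frontiers, key=lambda f: score_frontier(f, "bed", room_cooccur, object_cooccur))
print(best.position)  # -> (4.0, 1.5), the frontier near the bedroom
```

The agent would then navigate toward the highest-scoring frontier (frontier-based exploration) and repeat as the map grows; the soft formulation lets conflicting commonsense cues trade off smoothly instead of pruning frontiers outright.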
- Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
- Zero experience required: Plug & play modular transfer learning for semantic visual navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17031–17041, June 2022.
- On evaluation of embodied navigation agents. CoRR, abs/1807.06757, 2018. URL http://arxiv.org/abs/1807.06757.
- Hinge-loss markov random fields and probabilistic soft logic. J. Mach. Learn. Res., 18(1):3846–3912, jan 2017. ISSN 1532-4435.
- Objectnav revisited: On evaluation of embodied agents navigating to objects. CoRR, abs/2006.13171, 2020. URL https://arxiv.org/abs/2006.13171.
- A persistent spatial semantic representation for high-level natural language instruction execution. arXiv preprint arXiv:2107.05612, 2022.
- Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, jan 2011. ISSN 1935-8237. doi: 10.1561/2200000016. URL https://doi.org/10.1561/2200000016.
- Matterport3D: Learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV), 2017.
- Object goal navigation using goal-oriented semantic exploration. In Neural Information Processing Systems (NeurIPS), 2020a.
- Learning to explore using active neural slam. In International Conference on Learning Representations (ICLR), 2020b.
- Open-vocabulary queryable scene representations for real world planning. arXiv preprint arXiv:2209.09874, 2022a.
- Learning active camera for multi-object navigation. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022b. URL https://openreview.net/forum?id=iH4eyI5A7o.
- Weakly-supervised multi-granularity map learning for vision-and-language navigation. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022c. URL https://openreview.net/forum?id=gyZMZBiI9Cw.
- Integrating egocentric localization for more realistic point-goal navigation agents. CoRL, 2020.
- Robothor: An open simulation-to-real embodied ai platform. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
- Procthor: Large-scale embodied ai using procedural generation, 2022. URL https://arxiv.org/abs/2206.06994.
- Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. CVPR, 2023.
- Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. doi: 10.1109/CVPR.2016.90.
- Mask r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, 2017. doi: 10.1109/ICCV.2017.322.
- Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing, 2021.
- Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207, 2022a.
- Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022b.
- Simple but effective: Clip embeddings for embodied ai. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022.
- Unifiedqa-v2: Stronger generalization via broader cross-format training. arXiv preprint arXiv:2202.12359, 2022.
- Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.
- Grounded language-image pre-training. In CVPR, 2022.
- Neuro-symbolic causal language planning with commonsense prompting. arXiv preprint arXiv:2206.02928, 2022.
- ZSON: Zero-shot object-goal navigation using multimodal goal embeddings. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=VY1dqOF2RjC.
- Thda: Treasure hunt data augmentation for semantic navigation. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15354–15363, 2021. doi: 10.1109/ICCV48922.2021.01509.
- Memory-augmented reinforcement learning for image-goal navigation. arXiv preprint arXiv:2101.05181, 2021.
- FILM: Following instructions in language with modular methods. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=qI4542Y2s1D.
- Training language models to follow instructions with human feedback. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=TG8KACxEON.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=-v4OuqNs5P.
- Poni: Potential functions for objectgoal navigation with interaction-free learning. In Computer Vision and Pattern Recognition (CVPR), 2022 IEEE Conference on. IEEE, 2022.
- Habitat-web: Learning embodied object-search strategies from human demonstrations at scale. In CVPR, 2022.
- Tidee: Tidying up novel rooms using visuo-semantic commonsense priors. In Avidan, S., Brostow, G., Cissé, M., Farinella, G. M., and Hassner, T. (eds.), Computer Vision – ECCV 2022, pp. 480–496, Cham, 2022. Springer Nature Switzerland. ISBN 978-3-031-19842-7.
- Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626, 2017. doi: 10.1109/ICCV.2017.74.
- Robotic navigation with large pre-trained models of language, vision, and action, 2022. URL https://arxiv.org/abs/2207.04429.
- Skill induction and planning with latent language. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1713–1726, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.120. URL https://aclanthology.org/2022.acl-long.120.
- How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383, 2021.
- Yamauchi, B. A frontier-based approach for autonomous exploration. In Proceedings 1997 IEEE International Symposium on Computational Intelligence in Robotics and Automation CIRA’97. ’Towards New Computational Principles for Robotics and Automation’, pp. 146–151, 1997. doi: 10.1109/CIRA.1997.613851.
- Visual semantic navigation using scene priors. In ICLR, 2019.
- Auxiliary tasks and exploration enable objectgoal navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 16117–16126, October 2021.
- Semantic linking maps for active visual object search (extended abstract). In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pp. 4864–4868, 8 2021. doi: 10.24963/ijcai.2021/667. URL https://doi.org/10.24963/ijcai.2021/667. Sister Conferences Best Papers.
- Jarvis: A neuro-symbolic commonsense reasoning framework for conversational embodied agents. arXiv preprint arXiv:2208.13266, 2022.
Authors: Kaiwen Zhou, Kaizhi Zheng, Connor Pryor, Yilin Shen, Hongxia Jin, Lise Getoor, Xin Eric Wang