LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation (2405.05363v1)

Published 8 May 2024 in cs.CV and cs.RO

Abstract: In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for the object navigation task in complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method achieves an improvement of 1.38–13.38% in text-to-image recall across different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and the real world, with improvements of 5% and 16.67% in navigation success rate, respectively.
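
The retrieval evaluation described in the abstract centers on text-to-image recall over an object-centric image representation: a text query is matched against per-object embeddings rather than a single whole-image embedding. The sketch below illustrates that scoring scheme in Python; it is not the paper's code, and the random embeddings, dimensions, and the recall_at_k helper are hypothetical stand-ins for features produced by a fine-tuned VLM.

# Minimal sketch (not the paper's implementation): object-centric text-to-image
# retrieval scored by recall@k. Each image is represented by a set of per-object
# embeddings; a query matches an image via its best-matching object.
import numpy as np

rng = np.random.default_rng(0)
D = 512                              # embedding dimension (assumed)
num_images, objs_per_image = 100, 5

# Per-image object embeddings and one text query per image (its ground-truth match),
# all random stand-ins, L2-normalized so dot products are cosine similarities.
image_objs = rng.normal(size=(num_images, objs_per_image, D))
image_objs /= np.linalg.norm(image_objs, axis=-1, keepdims=True)
queries = rng.normal(size=(num_images, D))
queries /= np.linalg.norm(queries, axis=-1, keepdims=True)

def recall_at_k(queries, image_objs, k=10):
    """Fraction of queries whose ground-truth image appears in the top-k,
    where an image's score is the max similarity over its object embeddings."""
    # (num_queries, num_images, objs_per_image) -> (num_queries, num_images)
    sims = np.einsum('qd,nod->qno', queries, image_objs).max(axis=-1)
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(queries))[:, None]).any(axis=1)
    return hits.mean()

print(f"recall@10 (random embeddings): {recall_at_k(queries, image_objs):.3f}")

With random embeddings the recall stays near chance (k divided by the number of images); the point of the sketch is only the max-over-objects scoring that makes the retrieval object-centric rather than whole-image based.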

Authors (8)
  1. Tianrui Guan (29 papers)
  2. Yurou Yang (1 paper)
  3. Harry Cheng (14 papers)
  4. Muyuan Lin (1 paper)
  5. Richard Kim (7 papers)
  6. Rajasimman Madhivanan (5 papers)
  7. Arnie Sen (12 papers)
  8. Dinesh Manocha (366 papers)
Citations (6)

