TDANet: Target-Directed Attention Network For Object-Goal Visual Navigation With Zero-Shot Ability (2404.08353v2)

Published 12 Apr 2024 in cs.CV and cs.RO

Abstract: The generalization of end-to-end deep reinforcement learning (DRL) for object-goal visual navigation is a long-standing challenge, since object classes and placements vary in new test environments. Learning a domain-independent visual representation is critical for enabling the trained DRL agent to generalize to unseen scenes and objects. In this letter, a target-directed attention network (TDANet) is proposed to learn an end-to-end object-goal visual navigation policy with zero-shot ability. TDANet features a novel target attention (TA) module that learns both the spatial and semantic relationships among objects to help TDANet focus on the observed objects most relevant to the target. With the Siamese architecture (SA) design, TDANet distinguishes the difference between the current and target states and generates a domain-independent visual representation. To evaluate the navigation performance of TDANet, extensive experiments are conducted in the AI2-THOR embodied AI environment. The simulation results demonstrate a strong generalization ability of TDANet to unseen scenes and target objects, with higher navigation success rate (SR) and success weighted by path length (SPL) than other state-of-the-art models. TDANet is finally deployed on a wheeled robot in real scenes, demonstrating satisfactory generalization to the real world.
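
The TA module and SA design are described here only at a high level. The sketch below is a minimal, hypothetical PyTorch rendering of the two ideas: a target-conditioned attention over detected-object features, and a shared-weight (Siamese) encoder whose output difference serves as the state representation. The class names, dimensions, and the dot-product attention form are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the paper's code): target-directed attention over
# detected objects plus a Siamese comparison of current and target states.
import torch
import torch.nn as nn


class TargetAttention(nn.Module):
    """Scores detected objects by their relevance to the navigation target."""

    def __init__(self, obj_dim: int, target_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.query = nn.Linear(target_dim, hidden_dim)  # target word embedding -> query
        self.key = nn.Linear(obj_dim, hidden_dim)       # per-object features -> keys
        self.value = nn.Linear(obj_dim, hidden_dim)     # per-object features -> values

    def forward(self, obj_feats: torch.Tensor, target_emb: torch.Tensor) -> torch.Tensor:
        # obj_feats: (B, N, obj_dim) features of N detected objects (class, box, etc.)
        # target_emb: (B, target_dim) embedding of the goal object class
        q = self.query(target_emb).unsqueeze(1)              # (B, 1, H)
        k, v = self.key(obj_feats), self.value(obj_feats)    # (B, N, H)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return (attn @ v).squeeze(1)                         # (B, H) target-focused summary


class SiameseStateEncoder(nn.Module):
    """Shared-weight encoder for the current and target states; their
    difference acts as a domain-independent representation."""

    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, current: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Same encoder applied to both inputs (Siamese); return their difference.
        return self.encoder(current) - self.encoder(target)
```

On the reported metrics: SR is the fraction of episodes in which the agent reaches the target, and SPL weights each success by path efficiency, SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where l_i is the shortest-path length to the target and p_i is the length of the path the agent actually took.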

Authors (2)
  1. Shiwei Lian (3 papers)
  2. Feitian Zhang (16 papers)
Citations (1)