Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation (2308.11561v5)

Published 22 Aug 2023 in cs.CV

Abstract: This report details the methods of the winning entry of the AVDN Challenge at ICCV CLVL 2023. The competition addresses the Aerial Navigation from Dialog History (ANDH) task, which requires a drone agent to associate dialog history with aerial observations to reach the destination. To strengthen the drone agent's cross-modal grounding ability, we propose a Target-Grounded Graph-Aware Transformer (TG-GAT) framework. Concretely, TG-GAT first leverages a graph-aware transformer to capture spatiotemporal dependencies, which benefits navigation state tracking and robust action planning. In addition, an auxiliary visual grounding task is devised to boost the agent's awareness of referred landmarks. Moreover, a hybrid augmentation strategy based on LLMs is utilized to mitigate data scarcity. Our TG-GAT framework won the AVDN Challenge, with 2.2% and 3.0% absolute improvements over the baseline on the SPL and SR metrics, respectively. The code is available at https://github.com/yifeisu/TG-GAT.
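The abstract stays at the architectural level, but the core "graph-aware" idea, biasing attention between visited viewpoints by their relationship in the trajectory graph, is easy to illustrate. The sketch below is a minimal PyTorch approximation, not the authors' released code: the module name, dimensions, and the hop-distance bias scheme are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class GraphAwareAttention(nn.Module):
    """Multi-head self-attention over trajectory-graph nodes whose
    attention logits are biased by an embedding of pairwise graph
    distances -- a common way to make a transformer 'graph-aware';
    the exact scheme in TG-GAT may differ."""

    def __init__(self, dim=768, num_heads=12, max_dist=20):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learned scalar bias per (distance bucket, head).
        self.dist_bias = nn.Embedding(max_dist + 1, num_heads)

    def forward(self, nodes, dist):
        # nodes: (B, N, dim) node features for visited viewpoints
        # dist:  (B, N, N) integer hop distances along the trajectory
        B, N, _ = nodes.shape
        q, k, v = self.qkv(nodes).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # (B,H,N,N)
        bias = self.dist_bias(dist.clamp(max=self.dist_bias.num_embeddings - 1))
        logits = logits + bias.permute(0, 3, 1, 2)  # (B,N,N,H) -> (B,H,N,N)
        attn = logits.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)

# Example: four visited viewpoints on one trajectory.
# x = torch.randn(1, 4, 768)
# d = torch.randint(0, 5, (1, 4, 4))
# y = GraphAwareAttention()(x, d)   # -> (1, 4, 768)
```

In the full framework, a layer like this would sit inside a cross-modal transformer that also attends over dialog tokens, with the auxiliary visual grounding task adding a landmark-prediction head on top of the visual features.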
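For reference, the two reported metrics follow the success and efficiency-weighted-success definitions standard in vision-and-language navigation; in ANDH the binary success test S_i is whatever criterion the benchmark applies to the drone's predicted destination area.

```latex
% Over N evaluation episodes: S_i is the binary success indicator,
% \ell_i the shortest-path length, p_i the agent's actual path length.
\mathrm{SR}  = \frac{1}{N}\sum_{i=1}^{N} S_i,
\qquad
\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \,\frac{\ell_i}{\max(p_i,\ \ell_i)}
```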

Authors (5)
  1. Yifei Su (6 papers)
  2. Dong An (43 papers)
  3. Yuan Xu (123 papers)
  4. Kehan Chen (6 papers)
  5. Yan Huang (180 papers)
Citations (2)
