Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation (2302.06072v2)

Published 13 Feb 2023 in cs.CV and cs.AI

Abstract: Vision-Language Navigation (VLN) is a challenging task that requires an agent to align complex visual observations with language instructions to reach the goal position. Most existing VLN agents directly learn to align raw directional features and visual features (trained with one-hot labels) to linguistic instruction features. However, the large semantic gap among these multi-modal inputs makes the alignment difficult and therefore limits navigation performance. In this paper, we propose Actional Atomic-Concept Learning (AACL), which maps visual observations to actional atomic concepts to facilitate the alignment. Specifically, an actional atomic concept is a natural language phrase containing an atomic action and an object, e.g., ``go up stairs''. These actional atomic concepts, which serve as a bridge between observations and instructions, effectively mitigate the semantic gap and simplify the alignment. AACL contains three core components: 1) a concept mapping module that maps observations to actional atomic-concept representations through the VLN environment and the recently proposed Contrastive Language-Image Pretraining (CLIP) model, 2) a concept refining adapter that encourages more instruction-oriented object-concept extraction by re-ranking the object concepts predicted by CLIP, and 3) an observation co-embedding module that uses the concept representations to regularize the observation representations. AACL establishes new state-of-the-art results on both fine-grained (R2R) and high-level (REVERIE and R2R-Last) VLN benchmarks. Moreover, visualizations show that AACL significantly improves the interpretability of action decisions.
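To make the concept-mapping idea concrete, the sketch below scores a single candidate view against a handful of actional atomic-concept phrases with an off-the-shelf CLIP model, and adds a rough stand-in for the concept refining adapter as a small residual MLP over the text features. This is only an illustration under stated assumptions, not the paper's implementation: the phrase list, the `openai/clip-vit-base-patch32` checkpoint, the `candidate_view.jpg` path, and the `ConceptRefiningAdapter` architecture are hypothetical choices, and the instruction-conditioned training of the real adapter is omitted.

```python
# Minimal sketch of CLIP-based actional atomic-concept mapping.
# NOT the authors' code: phrases, checkpoint, path, and adapter are assumptions.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical actional atomic concepts: an atomic action plus an object.
concept_phrases = [
    "go up stairs",
    "go down stairs",
    "walk through the doorway",
    "turn left at the table",
    "stop beside the sofa",
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# One candidate view from the agent's panoramic observation (placeholder path).
view = Image.open("candidate_view.jpg")

with torch.no_grad():
    inputs = processor(text=concept_phrases, images=view,
                       return_tensors="pt", padding=True)
    image_feats = F.normalize(
        model.get_image_features(pixel_values=inputs["pixel_values"]), dim=-1)
    text_feats = F.normalize(
        model.get_text_features(input_ids=inputs["input_ids"],
                                attention_mask=inputs["attention_mask"]), dim=-1)
    logit_scale = model.logit_scale.exp()

# Zero-shot concept mapping: score the view against every concept phrase.
scores = (logit_scale * image_feats @ text_feats.T).softmax(dim=-1).squeeze(0)
for phrase, p in zip(concept_phrases, scores.tolist()):
    print(f"{phrase}: {p:.3f}")


class ConceptRefiningAdapter(torch.nn.Module):
    """Rough stand-in for the concept refining adapter: a small residual MLP
    over CLIP text features whose refined output re-ranks the predicted
    object concepts (the instruction-oriented training is omitted here)."""

    def __init__(self, dim: int = 512, hidden: int = 128, ratio: float = 0.2):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, dim),
        )
        self.ratio = ratio

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Blend refined and original features so CLIP knowledge is preserved.
        return F.normalize(self.ratio * self.mlp(feats)
                           + (1 - self.ratio) * feats, dim=-1)


adapter = ConceptRefiningAdapter(dim=text_feats.shape[-1])
refined_scores = (logit_scale * image_feats @ adapter(text_feats).T).softmax(dim=-1)
```

In the full method, per-view concept distributions like these would feed the observation co-embedding module to regularize the observation representations; here the adapter is untrained and only indicates where the re-ranking step would slot in.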

Authors (5)
  1. Bingqian Lin (19 papers)
  2. Yi Zhu (233 papers)
  3. Xiaodan Liang (318 papers)
  4. Liang Lin (318 papers)
  5. Jianzhuang Liu (91 papers)
Citations (2)