
Distilling Internet-Scale Vision-Language Models into Embodied Agents (2301.12507v2)

Published 29 Jan 2023 in cs.AI

Abstract: Instruction-following agents must ground language into their observation and action spaces. Learning to ground language is challenging, typically requiring domain-specific engineering or large quantities of human interaction data. To address this challenge, we propose using pretrained vision-language models (VLMs) to supervise embodied agents. We combine ideas from model distillation and hindsight experience replay (HER), using a VLM to retroactively generate language describing the agent's behavior. Simple prompting allows us to control the supervision signal, teaching an agent to interact with novel objects based on their names (e.g., planes) or their features (e.g., colors) in a 3D rendered environment. Few-shot prompting lets us teach abstract category membership, including pre-existing categories (food vs. toys) and ad-hoc ones (arbitrary preferences over objects). Our work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.
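The core mechanism the abstract describes, hindsight relabeling of an agent's trajectories with VLM-generated language, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `vlm_caption` is a hypothetical stand-in for a prompted captioning VLM, and observations are plain strings rather than rendered frames.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """A rollout paired with the language instruction it is trained against."""
    observations: List[str]  # stand-ins for rendered environment frames
    actions: List[int]
    instruction: str = ""    # the language goal used as supervision


def vlm_caption(final_observation: str) -> str:
    """Stub for a prompted captioning VLM (hypothetical interface).

    In the paper's setup, a pretrained VLM would be prompted to describe
    what the agent actually did; here we fake it with string formatting.
    """
    return f"lift the {final_observation}"


def hindsight_relabel(trajectory: Trajectory,
                      captioner: Callable[[str], str]) -> Trajectory:
    """HER-style relabeling: replace the original (possibly unmet) goal
    with a VLM-generated description of what the trajectory achieved."""
    achieved_goal = captioner(trajectory.observations[-1])
    return Trajectory(trajectory.observations, trajectory.actions, achieved_goal)


# A trajectory that happened to end with the agent holding a toy plane:
traj = Trajectory(observations=["empty room", "toy plane"], actions=[0, 1])
relabeled = hindsight_relabel(traj, vlm_caption)
print(relabeled.instruction)  # "lift the toy plane"
```

The relabeled trajectory can then be added to a replay buffer and used as a positive example for instruction-conditioned behavioral cloning; because the supervision signal comes from the VLM's prompt, changing the prompt (e.g., to name features or categories instead of objects) changes which groundings the agent learns.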

