MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models (2306.01311v1)

Published 2 Jun 2023 in cs.CL

Abstract: Large-scale LLMs have shown the ability to adapt to a new task by conditioning on a few demonstrations (i.e., in-context learning). However, in the vision-language domain, most large-scale pre-trained vision-language (VL) models do not possess the ability to conduct in-context learning. How can we enable in-context learning for VL models? In this paper, we study an interesting hypothesis: can we transfer the in-context learning ability from the language domain to the VL domain? Specifically, we first meta-train an LLM to perform in-context learning on NLP tasks (as in MetaICL); we then transfer this model to VL tasks by attaching a visual encoder. Our experiments suggest that in-context learning ability can indeed be transferred across modalities: our model considerably improves in-context learning capability on VL tasks and can even significantly compensate for model size. On VQA, OK-VQA, and GQA, our method outperforms the baseline model while having 20 times fewer parameters.
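The transfer recipe in the abstract is simple enough to sketch: a language model that was meta-trained for in-context learning (as in MetaICL) is reused for VL tasks by attaching a visual encoder whose features are projected into the LM's embedding space, so few-shot demonstrations can be interleaved with images in one prompt. The PyTorch sketch below is illustrative only; names such as `MetaVLSketch`, `proj`, `visual_dim`, and the HF-style `inputs_embeds` argument are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MetaVLSketch(nn.Module):
    """Minimal sketch of the MetaVL idea (assumed structure, not the paper's code):
    a meta-trained LM consumes an interleaved multimodal prompt built from
    projected image features and text embeddings."""

    def __init__(self, meta_trained_lm, visual_encoder, visual_dim, lm_dim):
        super().__init__()
        self.lm = meta_trained_lm          # meta-trained on NLP tasks (MetaICL-style)
        self.visual_encoder = visual_encoder
        # Learned projection mapping visual features into the LM embedding space.
        self.proj = nn.Linear(visual_dim, lm_dim)

    def embed_image(self, image):
        feats = self.visual_encoder(image)       # (batch, n_patches, visual_dim)
        return self.proj(feats)                  # (batch, n_patches, lm_dim)

    def forward(self, demo_images, demo_text_embeds, query_image, query_text_embeds):
        # In-context format: [img_1, qa_1, img_2, qa_2, ..., query_img, query_q].
        parts = []
        for img, txt in zip(demo_images, demo_text_embeds):
            parts.append(self.embed_image(img))
            parts.append(txt)                    # (batch, seq_len, lm_dim)
        parts.append(self.embed_image(query_image))
        parts.append(query_text_embeds)
        inputs_embeds = torch.cat(parts, dim=1)  # one long multimodal prompt
        # Assumes an HF-style LM that accepts precomputed input embeddings;
        # the LM then predicts the answer tokens for the query in-context.
        return self.lm(inputs_embeds=inputs_embeds)
```

The key design point, under these assumptions, is that only the projection (and optionally the encoder) needs VL training; the in-context learning behavior comes for free from the meta-trained LM.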

Authors (5)
  1. Masoud Monajatipoor (9 papers)
  2. Liunian Harold Li (19 papers)
  3. Mozhdeh Rouhsedaghat (9 papers)
  4. Lin F. Yang (86 papers)
  5. Kai-Wei Chang (292 papers)
Citations (8)