Towards Versatile and Efficient Visual Knowledge Integration into Pre-trained Language Models with Cross-Modal Adapters (2305.07358v4)
Abstract: Humans learn language via multi-modal knowledge. However, due to their text-only pre-training scheme, most existing pre-trained language models (PLMs) cannot exploit multi-modal information. To inject visual knowledge into PLMs, existing methods incorporate either the text or the image encoder of vision-language models (VLMs) to encode the visual information, and update all the original parameters of the PLMs for knowledge fusion. In this paper, we propose a new plug-and-play module, X-adapter, to flexibly leverage the aligned visual and textual knowledge learned in pre-trained VLMs and efficiently inject it into PLMs. Specifically, we insert X-adapters into PLMs, and only the added parameters are updated during adaptation. To fully exploit the potential of VLMs, each X-adapter consists of two sub-modules, V-expert and T-expert, which fuse the VLMs' image and text representations, respectively. We can opt to activate different sub-modules depending on the downstream task. Experimental results show that our method significantly improves performance on object-color reasoning and natural language understanding (NLU) tasks compared with PLM baselines.
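The adapter design sketched in the abstract can be illustrated with a minimal NumPy mock-up. This is an assumption-laden sketch, not the paper's implementation: the fusion operator (here, concatenating the PLM hidden state with a VLM feature before a bottleneck transform), the dimensions, and the class names `XAdapter`, `Expert` are all hypothetical; only the high-level structure (two experts, residual bottleneck, frozen host PLM with only adapter parameters trainable, task-dependent expert selection) comes from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class Expert:
    """Bottleneck adapter: fuse PLM and VLM features, down-project, up-project."""
    def __init__(self, d_model, d_bottleneck, d_vlm):
        # Only these matrices would be trained; the host PLM stays frozen.
        self.W_fuse = rng.normal(0.0, 0.02, (d_model + d_vlm, d_bottleneck))
        self.W_up = rng.normal(0.0, 0.02, (d_bottleneck, d_model))

    def __call__(self, h, vlm_feat):
        # Concatenate the PLM hidden state with the VLM representation,
        # apply the bottleneck transform, and add a residual connection.
        z = np.concatenate([h, vlm_feat], axis=-1)
        return h + relu(z @ self.W_fuse) @ self.W_up

class XAdapter:
    """Holds a V-expert and a T-expert; one is activated per downstream task."""
    def __init__(self, d_model=768, d_bottleneck=64, d_vlm=512):
        self.v_expert = Expert(d_model, d_bottleneck, d_vlm)  # image features
        self.t_expert = Expert(d_model, d_bottleneck, d_vlm)  # text features

    def __call__(self, h, vlm_feat, mode="t"):
        expert = self.v_expert if mode == "v" else self.t_expert
        return expert(h, vlm_feat)

adapter = XAdapter()
h = rng.normal(size=(4, 768))    # PLM hidden states for 4 tokens
img = rng.normal(size=(4, 512))  # CLIP-style image features (assumed shape)
out = adapter(h, img, mode="v")  # activate the V-expert
print(out.shape)  # (4, 768)
```

Because the output keeps the PLM's hidden dimension, such a module can be inserted between transformer layers without modifying the frozen backbone, which is what makes the approach parameter-efficient.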
Authors: Xinyun Zhang, Haochen Tan, Han Wu, Bei Yu