Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs (2311.15759v1)
Abstract: Recent advancements in multimodal LLMs (MLLMs) have achieved significant multimodal generation capabilities, akin to GPT-4. These models predominantly map visual information into the language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. We term this approach LLMs for Vision because it employs LLMs for visual-language understanding, yet we observe that these MLLMs neglect the potential of harnessing visual knowledge to enhance the overall capabilities of LLMs, which could be regarded as Vision Enhancing LLMs. In this paper, we propose an approach called MKS2, aimed at enhancing LLMs by empowering Multimodal Knowledge Storage and Sharing in LLMs. Specifically, we introduce the Modular Visual Memory, a component integrated into the internal blocks of LLMs and designed to store open-world visual information efficiently. Additionally, we present a soft Mixtures-of-Multimodal Experts architecture in LLMs to invoke multimodal knowledge collaboration during generation; a minimal sketch of such a block is given below. Our comprehensive experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts necessitating physical or commonsense knowledge. It also delivers competitive results on multimodal benchmarks.
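To make the architectural idea concrete, here is a minimal PyTorch sketch of a soft mixture over two feed-forward "experts" inside a transformer block: the original textual FFN and an added visual-memory module. The class names (`ModularVisualMemory`, `SoftMoMELayer`), the two-expert setup, and the per-token soft gating are assumptions made for illustration only; they are not taken from the paper's released implementation.

```python
# Hypothetical sketch of a soft Mixtures-of-Multimodal-Experts feed-forward block.
# All module names and design details are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class ModularVisualMemory(nn.Module):
    """Extra feed-forward 'expert' intended to store open-world visual knowledge."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.nn.functional.gelu(self.up(x)))


class SoftMoMELayer(nn.Module):
    """Soft mixture over the original textual FFN and the visual-memory expert.

    Each token's output is a gate-weighted sum of both experts, so textual and
    visual knowledge can collaborate during generation.
    """

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.text_ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )
        self.visual_memory = ModularVisualMemory(d_model, d_hidden)
        self.gate = nn.Linear(d_model, 2)  # one logit per expert, per token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)           # (batch, seq, 2)
        experts = torch.stack(
            [self.text_ffn(x), self.visual_memory(x)], dim=-1   # (batch, seq, d_model, 2)
        )
        return (experts * weights.unsqueeze(-2)).sum(dim=-1)    # (batch, seq, d_model)


# Example: route a batch of hidden states through the mixed block.
if __name__ == "__main__":
    layer = SoftMoMELayer(d_model=512, d_hidden=2048)
    hidden = torch.randn(2, 16, 512)
    print(layer(hidden).shape)  # torch.Size([2, 16, 512])
```

In this reading, the visual-memory expert would be trained on image-text data to absorb visual knowledge, while the soft gate lets every token blend both experts rather than hard-routing to one of them.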
Authors: Yunxin Li, Baotian Hu, Wei Wang, Xiaochun Cao, Min Zhang