Multi-Modal Generative Embedding Model (2405.19333v1)
Abstract: Most multi-modal tasks can be formulated as either generation or embedding problems. Existing models usually tackle the two by decoupling the language module into a text decoder for generation and a text encoder for embedding. To explore a more minimal multi-modal paradigm, we attempt to use only one model per modality. We propose the Multi-Modal Generative Embedding Model (MM-GEM), in which the generative and embedding objectives are encapsulated in a single LLM. We also propose a PoolAggregator to boost efficiency and enable fine-grained embedding and generation. A surprising finding is that these two objectives do not significantly conflict with each other. For example, MM-GEM instantiated from ViT-Large and TinyLlama shows competitive performance on benchmarks for multi-modal embedding models, such as cross-modal retrieval and zero-shot classification, while retaining strong image captioning ability. MM-GEM can also seamlessly perform region-level image captioning and retrieval. Moreover, the advanced text model in MM-GEM brings over 5% improvement in Recall@1 for retrieval between long texts and images.
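To make the abstract's design concrete, below is a minimal sketch of how one LLM's token features could serve both objectives: pooled features feed a contrastive embedding loss, while the same backbone's logits feed a captioning loss. Everything here is an illustrative assumption, not the paper's implementation: the masked mean pooling inside `PoolAggregator`, the helper name `joint_losses`, the InfoNCE-style symmetric contrastive loss, the 0.07 temperature, and the equal weighting of the two losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PoolAggregator(nn.Module):
    """Sketch: aggregate per-token features into one embedding via masked
    mean pooling. The paper's actual PoolAggregator internals are not
    specified in the abstract; mean pooling is an assumption."""

    def forward(self, tokens, mask=None):
        # tokens: (batch, seq_len, dim); mask: (batch, seq_len), 1 = keep.
        if mask is None:
            mask = torch.ones(tokens.shape[:2], device=tokens.device)
        mask = mask.unsqueeze(-1).to(tokens.dtype)
        # Masked mean over the sequence; a region-level embedding would
        # simply pass a mask covering only that region's tokens.
        return (tokens * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)


def joint_losses(text_feats, image_feats, caption_logits, caption_targets,
                 pool, temperature=0.07):
    """Sketch of the joint objective: a symmetric InfoNCE contrastive loss
    on pooled embeddings (embedding side) plus next-token cross-entropy
    (generation side). Equal weighting and the temperature value are
    assumptions, not values from the paper."""
    t_emb = F.normalize(pool(text_feats), dim=-1)    # (B, D)
    v_emb = F.normalize(pool(image_feats), dim=-1)   # (B, D)
    sim = t_emb @ v_emb.t() / temperature            # (B, B) similarities
    labels = torch.arange(sim.size(0), device=sim.device)
    # Embedding objective: matched text/image pairs sit on the diagonal.
    emb_loss = 0.5 * (F.cross_entropy(sim, labels) +
                      F.cross_entropy(sim.t(), labels))
    # Generation objective: standard language-modeling cross-entropy.
    # caption_logits: (B, T, vocab); caption_targets: (B, T) token ids.
    gen_loss = F.cross_entropy(caption_logits.flatten(0, 1),
                               caption_targets.flatten())
    return emb_loss + gen_loss
```

The point of the sketch is that both losses read from the same backbone's token features, which is what lets a single LLM replace the usual decoupled encoder/decoder pair; the abstract's finding is that training these two losses together does not significantly hurt either one.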
- Feipeng Ma
- Hongwei Xue
- Guangting Wang
- Yizhou Zhou
- Fengyun Rao
- Shilin Yan
- Yueyi Zhang
- Siying Wu
- Mike Zheng Shou
- Xiaoyan Sun