
VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval (2406.04292v1)

Published 6 Jun 2024 in cs.IR, cs.CL, and cs.CV

Abstract: Multi-modal retrieval is becoming increasingly popular in practice. However, existing retrievers are mostly text-oriented and lack the capability to process visual information. Despite the presence of vision-language models like CLIP, the current methods are severely limited in representing text-only and image-only data. In this work, we present VISTA, a new embedding model for universal multi-modal retrieval. Our work makes three technical contributions. First, we introduce a flexible architecture that extends a powerful text encoder with image understanding capability by introducing visual token embeddings. Second, we develop two data generation strategies that produce high-quality composed image-text data to facilitate the training of the embedding model. Third, we introduce a multi-stage training algorithm that first aligns the visual token embeddings with the text encoder using massive weakly labeled data, and then develops multi-modal representation capability using the generated composed image-text data. In our experiments, VISTA achieves superior performance across a variety of multi-modal retrieval tasks in both zero-shot and supervised settings. Our model, data, and source code are available at https://github.com/FlagOpen/FlagEmbedding.
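
The abstract describes an architecture in which image features are mapped into "visual token embeddings" that a standard text encoder consumes alongside ordinary text tokens, plus a contrastive multi-stage training recipe. Below is a minimal, self-contained PyTorch sketch of that idea. All module names, dimensions, the mean-pooling choice, and the in-batch InfoNCE objective are illustrative assumptions, not the paper's actual implementation; the authors' code in the FlagEmbedding repository is the authoritative reference.

```python
# Sketch of a VISTA-style hybrid encoder: projected image features are
# prepended to text token embeddings, so text-only, image-only, and
# composed image+text inputs all yield embeddings in one shared space.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VistaStyleEmbedder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, image_feat_dim=1024,
                 layers=4, heads=12, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_len, hidden)
        # Projects image patch features (e.g. from a pretrained ViT) into
        # the text encoder's embedding space: the "visual token embeddings".
        self.visual_proj = nn.Linear(image_feat_dim, hidden)
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, input_ids, image_feats=None):
        # input_ids: (B, T); image_feats: (B, P, image_feat_dim) or None
        x = self.tok_emb(input_ids)
        if image_feats is not None:
            # Prepend visual tokens so the encoder attends over a single
            # mixed sequence of image and text tokens.
            x = torch.cat([self.visual_proj(image_feats), x], dim=1)
        pos = torch.arange(x.size(1), device=x.device)
        x = self.encoder(x + self.pos_emb(pos))
        # Mean-pool into one vector; L2-normalize for cosine retrieval.
        return F.normalize(x.mean(dim=1), dim=-1)


def info_nce(q, c, temperature=0.05):
    """In-batch contrastive loss between query and candidate embeddings,
    a common stand-in for the alignment objective sketched in the abstract."""
    logits = q @ c.t() / temperature                  # (B, B) similarities
    labels = torch.arange(q.size(0), device=q.device) # matched pairs on diagonal
    return F.cross_entropy(logits, labels)


# Usage: a composed image+text query retrieved against text-only candidates.
model = VistaStyleEmbedder()
ids = torch.randint(0, 30522, (2, 16))      # toy token ids
feats = torch.randn(2, 49, 1024)            # toy image patch features
loss = info_nce(model(ids, feats), model(ids))
```

The key design point the abstract emphasizes is that the text encoder itself is reused unchanged: only the projection of image features into its embedding space is new, which is what lets the model keep representing text-only and image-only inputs well.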

Authors (5)
  1. Junjie Zhou
  2. Zheng Liu
  3. Shitao Xiao
  4. Bo Zhao
  5. Yongping Xiong
Citations (7)