MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer (2401.10208v2)

Published 18 Jan 2024 in cs.CV and cs.CL

Abstract: Developing generative models for interleaved image-text data has both research and practical value. It requires models to understand the interleaved sequences and subsequently generate images and text. However, existing attempts are limited by the issue that a fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in multi-image scenarios. To address this, the paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context during the generation process. MM-Interleaved is end-to-end pre-trained on both paired and interleaved image-text corpora. It is further enhanced through a supervised fine-tuning phase, wherein the model improves its ability to follow complex multi-modal instructions. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions. Code and models are available at \url{https://github.com/OpenGVLab/MM-Interleaved}.

An Overview of MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

The paper "MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer" presents a novel approach to multi-modal generative modeling focused on effectively handling interleaved image-text data. This type of data presents unique challenges and opportunities, as it intertwines image and textual information in formats common in online content. The proposed model, MM-Interleaved, addresses a key limitation of current models, which typically struggle to capture fine-grained image details because they compress each image into a fixed number of visual tokens, especially when dealing with multiple images.

Model Architecture

MM-Interleaved is built upon three primary components: a Visual Foundation Model (VFM), a Large Language Model (LLM), and a Diffusion Model (DM). This combination is strategically chosen to harness the strengths of each model type in understanding and generating both text and images. A noteworthy innovation in this work is the introduction of a Multi-modal Feature Synchronizer (MMFS). This mechanism is designed to allow efficient access to detailed image features across multiple images and scales. The MMFS is based on deformable sparse attention, which attends to multi-scale, high-resolution image features at a small, learned set of sampling locations, thereby reducing the information loss often encountered in multi-modal LLMs.
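To make the mechanism concrete, the sketch below shows one way a deformable-attention-style synchronizer could be wired up in PyTorch: each query token predicts a few sampling offsets around a reference point and pools bilinearly sampled visual features with learned weights. The class name, tensor shapes, single feature scale, and offset scaling are illustrative assumptions for exposition, not the authors' implementation (which handles multiple scales and multiple context images).

```python
# Minimal sketch of a deformable-attention-style feature synchronizer.
# Hypothetical names and shapes; single scale, single image for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureSynchronizerSketch(nn.Module):
    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(dim, num_points * 2)  # 2-D offset per point
        self.weight_head = nn.Linear(dim, num_points)      # per-point mixing weights
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, feature_map, ref_points):
        # queries:     (B, Q, C)    hidden states of the tokens being generated
        # feature_map: (B, C, H, W) high-resolution visual features from the VFM
        # ref_points:  (B, Q, 2)    reference locations in [0, 1] x [0, 1]
        B, Q, _ = queries.shape
        offsets = self.offset_head(queries).view(B, Q, self.num_points, 2)
        weights = self.weight_head(queries).softmax(dim=-1)           # (B, Q, P)
        # Small learned offsets around each reference point, mapped to the
        # [-1, 1] grid coordinates expected by grid_sample.
        locs = (ref_points.unsqueeze(2) + 0.05 * offsets.tanh()) * 2.0 - 1.0
        sampled = F.grid_sample(feature_map, locs,                    # (B, C, Q, P)
                                mode="bilinear", align_corners=False)
        sampled = sampled.permute(0, 2, 3, 1)                         # (B, Q, P, C)
        fused = (weights.unsqueeze(-1) * sampled).sum(dim=2)          # (B, Q, C)
        return queries + self.out_proj(fused)                         # residual update

# Smoke test with random tensors.
sync = FeatureSynchronizerSketch()
q, feat, ref = torch.randn(1, 8, 256), torch.randn(1, 256, 32, 32), torch.rand(1, 8, 2)
print(sync(q, feat, ref).shape)  # torch.Size([1, 8, 256])
```

Because each query samples only a handful of points rather than attending densely to every pixel, the cost stays roughly linear in the number of queries, which is what makes attending to high-resolution, multi-scale features tractable.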

Training and Evaluation

Training involves two main stages: pre-training and supervised fine-tuning. Pre-training leverages a mixture of paired and interleaved image-text data, ensuring the model encounters a diverse range of inputs. Supervised fine-tuning then sharpens the model's performance on specific tasks such as visual question answering and visual storytelling, as sketched below.
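The toy sketch below outlines that two-stage schedule: a stand-in model optimizes a next-token loss for text together with a denoising-style loss for images, first on a paired/interleaved mixture and then on instruction data. The model, data, step counts, learning rates, and equal loss weighting are all placeholder assumptions, not the paper's recipe.

```python
# Toy outline of the two-stage schedule (stand-in model and random data;
# hyperparameters and loss weighting are illustrative, not the paper's).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyInterleavedModel(nn.Module):
    """Stand-in with a shared backbone, a text head, and an image head."""
    def __init__(self, dim=64, vocab=1000):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)
        self.text_head = nn.Linear(dim, vocab)
        self.image_head = nn.Linear(dim, dim)

    def loss(self, batch):
        h = self.backbone(batch["features"])
        text_loss = F.cross_entropy(self.text_head(h), batch["next_token"])
        # Simple regression target stands in for the diffusion denoising loss.
        image_loss = F.mse_loss(self.image_head(h), batch["image_target"])
        return text_loss + image_loss

def make_batch(source, dim=64, vocab=1000, bsz=4):
    # In practice this would sample from the named corpus; here it is random.
    return {"features": torch.randn(bsz, dim),
            "next_token": torch.randint(0, vocab, (bsz,)),
            "image_target": torch.randn(bsz, dim)}

def train_stage(model, sources, steps, lr):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(steps):
        batch = make_batch(sources[step % len(sources)])  # round-robin mixture
        loss = model.loss(batch)
        opt.zero_grad(); loss.backward(); opt.step()

model = ToyInterleavedModel()
# Stage 1: pre-training on a mixture of paired and interleaved image-text data.
train_stage(model, ["paired", "interleaved"], steps=20, lr=1e-4)
# Stage 2: supervised fine-tuning on multi-modal instruction data.
train_stage(model, ["instructions"], steps=10, lr=2e-5)
```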

The model's evaluation demonstrates robust capabilities across various benchmarks. Notably, it excels in tasks requiring both text and image understanding and generation, achieving results competitive with existing multi-modal models. When fine-tuned, it reaches state-of-the-art performance on several image captioning and visual question-answering datasets. The model is also evaluated on multi-image and interleaved image-text generation tasks, where it maintains spatial semantic consistency and generates coherent, contextually aligned outputs.

Implications and Future Directions

The implications of this research are significant for the advancement of multi-modal generative models. By efficiently integrating fine-grained image features into the modeling process, MM-Interleaved broadens the potential applications of such models, particularly in areas that require detailed image comprehension alongside textual data, such as augmented reality and advanced conversational AI systems. Further development could explore scaling model and data sizes and adopting end-to-end full-parameter training to enrich the model's capabilities and robustness. Additionally, establishing a comprehensive benchmark for interleaved image-text modeling would provide a valuable resource for continued research and validation in this field.

Authors (13)
  1. Changyao Tian
  2. Xizhou Zhu
  3. Yuwen Xiong
  4. Weiyun Wang
  5. Zhe Chen
  6. Wenhai Wang
  7. Yuntao Chen
  8. Lewei Lu
  9. Tong Lu
  10. Jie Zhou
  11. Hongsheng Li
  12. Yu Qiao
  13. Jifeng Dai