MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Abstract: Developing generative models for interleaved image-text data has both research and practical value. It requires models to understand interleaved sequences and subsequently generate both images and text. However, existing attempts are limited by the fact that a fixed number of visual tokens cannot efficiently capture image details, which is especially problematic in multi-image scenarios. To address this, this paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module that provides direct access to fine-grained image features in the preceding context during generation. MM-Interleaved is pre-trained end-to-end on both paired and interleaved image-text corpora, and is further enhanced through a supervised fine-tuning phase that improves its ability to follow complex multi-modal instructions. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details under multi-modal instructions and in generating consistent images conditioned on both textual and visual inputs. Code and models are available at \url{https://github.com/OpenGVLab/MM-Interleaved}.
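The abstract does not specify the internals of the feature synchronizer; one common realization of "direct access to fine-grained image features in the preceding context" is cross-attention pooling over a pool of multi-scale features gathered from all previous images. The sketch below is a minimal, dependency-free illustration of that assumed mechanism (the function names `attend` and `synchronize` and the toy dimensions are hypothetical, not from the paper):

```python
import math

def attend(query, keys, values):
    # Scaled dot-product attention for a single query vector over a key/value pool.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]      # softmax attention weights
    # Output is the attention-weighted convex combination of the values.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

def synchronize(query, multi_image_features):
    # multi_image_features: list of images -> list of scales -> list of feature vectors.
    # Flatten every scale of every previous image into one fine-grained feature pool,
    # then let the query token attend over all of it at once.
    pool = [feat for image in multi_image_features for scale in image for feat in scale]
    return attend(query, pool, pool)

# Toy usage: one 2-d query over two previous images, each with two feature scales.
features = [
    [[[1.0, 0.0], [0.9, 0.1]], [[0.8, 0.2]]],   # image 1: coarse + fine scale
    [[[0.0, 1.0]], [[0.1, 0.9], [0.2, 0.8]]],   # image 2: coarse + fine scale
]
out = synchronize([1.0, 0.0], features)
```

Because the output is a softmax-weighted average, each of its coordinates always lies within the range spanned by the pooled features, regardless of how many images or scales contribute.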