MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer (2401.10208v2)

Published 18 Jan 2024 in cs.CV and cs.CL

Abstract: Developing generative models for interleaved image-text data has both research and practical value. It requires models to understand the interleaved sequences and subsequently generate images and text. However, existing attempts are limited by the issue that a fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in multi-image scenarios. To address this, the paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context during the generation process. MM-Interleaved is end-to-end pre-trained on both paired and interleaved image-text corpora. It is further enhanced through a supervised fine-tuning phase, wherein the model improves its ability to follow complex multi-modal instructions. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions. Code and models are available at \url{https://github.com/OpenGVLab/MM-Interleaved}.

An Overview of MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

The paper "MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer" presents a novel approach to multi-modal generative modeling focused on effectively handling interleaved image-text data. This type of data presents unique challenges and opportunities, as it intertwines image and textual information in formats common in online content. The proposed model, MM-Interleaved, addresses a key limitation of current models, which typically struggle to capture fine-grained image details because they compress each image into a fixed number of visual tokens, especially when dealing with multiple images.

Model Architecture

MM-Interleaved is built upon three primary components: a Visual Foundation Model (VFM), a Large Language Model (LLM), and a Diffusion Model (DM). This combination is strategically chosen to harness the strengths of each model type in understanding and generating both text and images. A noteworthy innovation in this work is the introduction of a Multi-modal Feature Synchronizer (MMFS). This mechanism is designed to allow efficient access to detailed image features across multiple images and scales. The MMFS is based on deformable sparse attention, which attends to multi-scale, high-resolution image features at a small, learned set of sampling locations, thereby reducing the information loss often encountered in multi-modal LLMs.
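To make the mechanism concrete, the sketch below shows one way a deformable-attention-style synchronizer could be wired up in PyTorch: each query token predicts a few sampling offsets around a reference point and pools bilinearly sampled visual features with learned weights. The class name, tensor shapes, single feature scale, and offset scaling are illustrative assumptions for exposition, not the authors' implementation (which handles multiple scales and multiple context images).

```python
# Minimal sketch of a deformable-attention-style feature synchronizer.
# Hypothetical names and shapes; single scale, single image for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureSynchronizerSketch(nn.Module):
    def __init__(self, dim=256, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(dim, num_points * 2)  # 2-D offset per point
        self.weight_head = nn.Linear(dim, num_points)      # per-point mixing weights
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, feature_map, ref_points):
        # queries:     (B, Q, C)    hidden states of the tokens being generated
        # feature_map: (B, C, H, W) high-resolution visual features from the VFM
        # ref_points:  (B, Q, 2)    reference locations in [0, 1] x [0, 1]
        B, Q, _ = queries.shape
        offsets = self.offset_head(queries).view(B, Q, self.num_points, 2)
        weights = self.weight_head(queries).softmax(dim=-1)           # (B, Q, P)
        # Small learned offsets around each reference point, mapped to the
        # [-1, 1] grid coordinates expected by grid_sample.
        locs = (ref_points.unsqueeze(2) + 0.05 * offsets.tanh()) * 2.0 - 1.0
        sampled = F.grid_sample(feature_map, locs,                    # (B, C, Q, P)
                                mode="bilinear", align_corners=False)
        sampled = sampled.permute(0, 2, 3, 1)                         # (B, Q, P, C)
        fused = (weights.unsqueeze(-1) * sampled).sum(dim=2)          # (B, Q, C)
        return queries + self.out_proj(fused)                         # residual update

# Smoke test with random tensors.
sync = FeatureSynchronizerSketch()
q, feat, ref = torch.randn(1, 8, 256), torch.randn(1, 256, 32, 32), torch.rand(1, 8, 2)
print(sync(q, feat, ref).shape)  # torch.Size([1, 8, 256])
```

Because each query samples only a handful of points rather than attending densely to every pixel, the cost stays roughly linear in the number of queries, which is what makes attending to high-resolution, multi-scale features tractable.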

Training and Evaluation

Training involves two main stages: pre-training and supervised fine-tuning. Pre-training leverages a mixture of paired and interleaved image-text data, ensuring the model encounters a diverse range of inputs. Supervised fine-tuning then sharpens the model's performance on specific tasks such as visual question answering and visual storytelling, as sketched below.
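The toy sketch below outlines that two-stage schedule: a stand-in model optimizes a next-token loss for text together with a denoising-style loss for images, first on a paired/interleaved mixture and then on instruction data. The model, data, step counts, learning rates, and equal loss weighting are all placeholder assumptions, not the paper's recipe.

```python
# Toy outline of the two-stage schedule (stand-in model and random data;
# hyperparameters and loss weighting are illustrative, not the paper's).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyInterleavedModel(nn.Module):
    """Stand-in with a shared backbone, a text head, and an image head."""
    def __init__(self, dim=64, vocab=1000):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)
        self.text_head = nn.Linear(dim, vocab)
        self.image_head = nn.Linear(dim, dim)

    def loss(self, batch):
        h = self.backbone(batch["features"])
        text_loss = F.cross_entropy(self.text_head(h), batch["next_token"])
        # Simple regression target stands in for the diffusion denoising loss.
        image_loss = F.mse_loss(self.image_head(h), batch["image_target"])
        return text_loss + image_loss

def make_batch(source, dim=64, vocab=1000, bsz=4):
    # In practice this would sample from the named corpus; here it is random.
    return {"features": torch.randn(bsz, dim),
            "next_token": torch.randint(0, vocab, (bsz,)),
            "image_target": torch.randn(bsz, dim)}

def train_stage(model, sources, steps, lr):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(steps):
        batch = make_batch(sources[step % len(sources)])  # round-robin mixture
        loss = model.loss(batch)
        opt.zero_grad(); loss.backward(); opt.step()

model = ToyInterleavedModel()
# Stage 1: pre-training on a mixture of paired and interleaved image-text data.
train_stage(model, ["paired", "interleaved"], steps=20, lr=1e-4)
# Stage 2: supervised fine-tuning on multi-modal instruction data.
train_stage(model, ["instructions"], steps=10, lr=2e-5)
```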

The model's evaluation demonstrates robust capabilities across various benchmarks. Notably, it excels in tasks requiring both text and image understanding and generation, achieving results competitive with existing multi-modal models. When fine-tuned, it reaches state-of-the-art performance on several image captioning and visual question-answering datasets. The model is also evaluated on multi-image and interleaved image-text generation tasks, where it maintains spatial semantic consistency and generates coherent, contextually aligned outputs.

Implications and Future Directions

The implications of this research are significant for the advancement of multi-modal generative models. By efficiently integrating fine-grained image features into the modeling process, MM-Interleaved broadens the potential applications of such models, particularly in areas that require detailed image comprehension alongside textual data, such as augmented reality and advanced conversational AI systems. Further development could explore scaling model and data sizes and adopting end-to-end full-parameter training to enrich the model's capabilities and robustness. Additionally, establishing a comprehensive benchmark for interleaved image-text modeling would provide a valuable resource for continued research and validation in this field.

Authors (13)
  1. Changyao Tian
  2. Xizhou Zhu
  3. Yuwen Xiong
  4. Weiyun Wang
  5. Zhe Chen
  6. Wenhai Wang
  7. Yuntao Chen
  8. Lewei Lu
  9. Tong Lu
  10. Jie Zhou
  11. Hongsheng Li
  12. Yu Qiao
  13. Jifeng Dai