The Evolution of Multimodal Model Architectures (2405.17927v1)

Published 28 May 2024 in cs.AI, cs.CL, cs.CV, cs.LG, and eess.AS

Abstract: This work uniquely identifies and characterizes four prevalent multimodal model architectural patterns in the contemporary multimodal landscape. Systematically categorizing models by architecture type facilitates monitoring of developments in the multimodal domain. Distinct from recent survey papers that present general information on multimodal architectures, this research conducts a comprehensive exploration of architectural details and identifies four specific architectural types. The types are distinguished by their respective methodologies for integrating multimodal inputs into the deep neural network model. The first two types (Type A and B) deeply fuse multimodal inputs within the internal layers of the model, whereas the following two types (Type C and D) facilitate early fusion at the input stage. Type-A employs standard cross-attention, whereas Type-B utilizes custom-designed layers for modality fusion within the internal layers. On the other hand, Type-C utilizes modality-specific encoders, while Type-D leverages tokenizers to process the modalities at the model's input stage. The identified architecture types aid the monitoring of any-to-any multimodal model development. Notably, Type-C and Type-D are currently favored in the construction of any-to-any multimodal models. Type-C, distinguished by its non-tokenizing multimodal model architecture, is emerging as a viable alternative to Type-D, which utilizes input-tokenizing techniques. To assist in model selection, this work highlights the advantages and disadvantages of each architecture type based on data and compute requirements, architecture complexity, scalability, simplification of adding modalities, training objectives, and any-to-any multimodal generation capability.

Summary

  • The paper identifies four distinct multimodal architectures based on unique fusion strategies within deep neural networks.
  • It details the use of deep fusion versus early fusion techniques to balance computational cost and model flexibility.
  • The study outlines a roadmap for future multimodal AI by analyzing trends in architecture design and integration methods.

The Evolution of Multimodal Model Architectures

The paper "The Evolution of Multimodal Model Architectures" by Shakti N. Wadekar et al. provides a comprehensive exploration and categorization of contemporary multimodal model architectures. This paper's primary contribution is the systematic identification of four distinct architectural types prevalent in the domain of multimodal models: Type-A, Type-B, Type-C, and Type-D. Each architectural type is characterized by its unique approach to integrating multimodal inputs within deep neural networks, offering a structured framework for understanding the evolution and development of multimodal models.

Categorization of Multimodal Model Architectures

The categorization is primarily based on the fusion stage of the multimodal inputs:

  • Type-A and Type-B utilize deep fusion within the internal layers of the model.
  • Type-C and Type-D emphasize early fusion at the input stage.

Type-A: Standard Cross-Attention based Deep Fusion (SCDF)

In Type-A, multimodal inputs are deeply fused using standard cross-attention within the internal layers of a pretrained LLM. This type is further divided into two subtypes:

  • Subtype A.1 integrates cross-attention layers before each self-attention layer in the decoder (e.g., Flamingo, OpenFlamingo).
  • Subtype A.2 places the cross-attention layers post self-attention in an encoder-decoder setup (e.g., VL-BART, VL-T5).

Type-A models generally require substantial computational resources and training data. Examples include Flamingo and OpenFlamingo, which handle vision-language tasks by injecting visual features into the layers of a pretrained language model through gated cross-attention.
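
To make the Type-A pattern concrete, here is a minimal PyTorch sketch of a decoder block in which a gated cross-attention sublayer over vision-encoder features sits in front of the usual self-attention and feed-forward sublayers. The class name, dimensions, and gating scheme are illustrative assumptions, not the exact configuration of Flamingo or any other published model.

```python
# Sketch of a Type-A (deep fusion) decoder block: gated cross-attention over
# visual features is inserted before the pretrained self-attention sublayer.
import torch
import torch.nn as nn

class TypeADecoderBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Cross-attention: text hidden states attend to vision-encoder features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_gate = nn.Parameter(torch.zeros(1))  # tanh gate, initialized at 0
        # Standard (pretrained) self-attention + feed-forward sublayers.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        # 1) Deep fusion: gated cross-attention into the LLM's internal stream.
        fused, _ = self.cross_attn(self.norm1(text), vision, vision)
        text = text + torch.tanh(self.cross_gate) * fused
        # 2) Original language-model sublayers (causal masking omitted for brevity).
        normed = self.norm2(text)
        attn, _ = self.self_attn(normed, normed, normed)
        text = text + attn
        return text + self.ffn(self.norm3(text))

# Usage: fuse 196 visual tokens into a 32-token text sequence.
block = TypeADecoderBlock()
out = block(torch.randn(2, 32, 512), torch.randn(2, 196, 512))
print(out.shape)  # torch.Size([2, 32, 512])
```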

Type-B: Custom Layer based Deep Fusion (CLDF)

Type-B models also achieve deep fusion within internal layers but through custom-designed layers rather than standard cross-attention. This type is categorized into:

  • Subtype B.1 uses custom cross-attention layers (e.g., CogVLM, LLaMA-Adapter).
  • Subtype B.2 employs other custom learnable layers (e.g., InternLM-XComposer2, MoE-LLaVA).

Type-B architectures balance computational efficiency against fine-grained control over modality fusion, typically requiring fewer resources than Type-A while offering more flexibility in customizing layer behavior.
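
As an illustration of the custom-layer idea, the sketch below is a hypothetical module loosely inspired by visual-expert style designs: image and text tokens share a single attention operation but are projected by modality-specific weight matrices. It is not the actual CogVLM or LLaMA-Adapter implementation, and all names and dimensions are assumptions.

```python
# Sketch of a Type-B "custom layer": modality-specific QKV projections inside
# one shared attention operation (requires PyTorch >= 2.0 for SDPA).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityExpertAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # Separate "expert" projections per modality.
        self.qkv_text = nn.Linear(d_model, 3 * d_model)
        self.qkv_img = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); is_image: (seq,) boolean mask per position.
        qkv = torch.where(is_image[None, :, None],
                          self.qkv_img(x), self.qkv_text(x))
        q, k, v = qkv.chunk(3, dim=-1)
        def split(t):  # -> (batch, heads, seq, d_head)
            return t.view(t.size(0), t.size(1), self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(split(q), split(k), split(v))
        attn = attn.transpose(1, 2).reshape(x.size(0), x.size(1), -1)
        return self.out(attn)

# Usage: a sequence of 16 image tokens followed by 48 text tokens.
layer = ModalityExpertAttention()
mask = torch.tensor([True] * 16 + [False] * 48)
y = layer(torch.randn(2, 64, 512), mask)
print(y.shape)  # torch.Size([2, 64, 512])
```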

Type-C: Non-Tokenized Early Fusion (NTEF)

Type-C is the most prevalent and modular architectural type, characterized by early fusion at the input stage without tokenizing the non-text inputs. Its subtypes are distinguished by the module that connects the modality encoders to the LLM:

  • Subtype C.1 uses a linear layer or MLP (e.g., LLaVA, PaLM-E).
  • Subtype C.2 employs a Q-Former together with a linear layer/MLP (e.g., BLIP-2, MiniGPT-4).
  • Subtype C.3 utilizes a Perceiver Resampler (e.g., Kosmos-G, Monkey).
  • Subtype C.4 includes other custom learnable layers (e.g., Video-ChatGPT, Qwen-VL).

Type-C architectures are known for their modularity, ease of construction, and efficient training processes, often leveraging pre-trained components for vision and text alignment. They offer an effective balance of simplicity and performance across diverse multimodal tasks.
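
A minimal sketch of a Type-C connector follows, assuming a LLaVA-style setup: patch features from a frozen vision encoder are projected by a small MLP into the LLM's embedding space and simply concatenated with the text token embeddings. All module names and dimensions are placeholders, not a specific checkpoint's configuration.

```python
# Sketch of a Type-C (non-tokenized early fusion) connector: project visual
# features into the LLM embedding space and prepend them to the text embeddings.
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP connector (subtype C.1-style linear/MLP projection).
        self.proj = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, n_patches, vision_dim) from a frozen encoder.
        return self.proj(patch_features)

# Early fusion: visual "soft tokens" are concatenated with text embeddings and
# the combined sequence is fed to the otherwise unmodified LLM.
batch, n_patches, n_text = 2, 256, 32
visual = VisionToLLMProjector()(torch.randn(batch, n_patches, 1024))
text_embeds = torch.randn(batch, n_text, 4096)        # stand-in for token embeddings
llm_inputs = torch.cat([visual, text_embeds], dim=1)  # (2, 288, 4096)
print(llm_inputs.shape)
```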

Type-D: Tokenized Early Fusion (TEF)

Type-D involves tokenizing the multimodal inputs and feeding them directly into the model, accommodating autoregressive training objectives across modalities. This type splits into:

  • Subtype D.1 uses an LLM as the backbone (e.g., CM3Leon, TEAL).
  • Subtype D.2 uses encoder-decoder style transformers (e.g., Unified-IO, 4M).

These models are trained to generate discrete tokens for multiple modalities, allowing comprehensive training using a standard autoregressive objective. However, this often demands extensive computational resources and sophisticated training strategies.
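
The sketch below illustrates the tokenized-early-fusion idea with a toy VQ-style image tokenizer: continuous patch features are mapped to discrete codebook indices, shifted into an extended vocabulary, and interleaved with text token ids so that a single autoregressive loss can cover both modalities. The codebook size, vocabulary offset, and shapes are assumptions for illustration only.

```python
# Sketch of Type-D (tokenized early fusion): discrete image tokens and text tokens
# share one sequence over an extended vocabulary for next-token-prediction training.
import torch
import torch.nn as nn

class ToyImageTokenizer(nn.Module):
    def __init__(self, feat_dim: int = 256, codebook_size: int = 8192):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, feat_dim)  # learned VQ codes

    @torch.no_grad()
    def encode(self, patch_features: torch.Tensor) -> torch.Tensor:
        # Nearest-codebook-entry lookup -> discrete token ids, shape (batch, n_patches).
        dists = torch.cdist(patch_features, self.codebook.weight.unsqueeze(0))
        return dists.argmin(dim=-1)

text_vocab_size = 32000
tokenizer = ToyImageTokenizer()
image_ids = tokenizer.encode(torch.randn(1, 64, 256)) + text_vocab_size  # shift ids
text_ids = torch.randint(0, text_vocab_size, (1, 16))

# One interleaved sequence; the model is trained with the usual autoregressive
# next-token loss over all modalities.
sequence = torch.cat([image_ids, text_ids], dim=1)
print(sequence.shape)  # torch.Size([1, 80])
```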

Implications and Future Directions

The categorization of multimodal architectures not only facilitates a systematic understanding of existing models but also provides a foundation for tracking and predicting future trends in multimodal AI.

Implications:

  • Practical: The detailed taxonomy assists researchers and practitioners in selecting appropriate models based on specific requirements such as data and compute efficiency, scalability, and modality fusion complexity.
  • Theoretical: The distinctions between deep versus early fusion architectures provide insights into the trade-offs between model flexibility and computational efficiency.

Future Directions:

  • The paper suggests that future developments in multimodal AI may see a convergence towards more integrated architectures, where the benefits of deep and early fusion are combined.
  • The exploration of state-space models (SSMs) as alternatives to transformer backbones presents an exciting avenue for addressing the quadratic complexity of attention, potentially leading to more efficient any-to-any multimodal models (the basic recurrence is sketched below).
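
For context, a discretized linear state-space layer updates a hidden state step by step, so sequence length enters the cost linearly rather than quadratically. In generic notation, simplified from the SSM literature rather than taken from this paper:

```latex
% Discretized linear SSM: per-step state update and readout.
% Cost is O(L) in sequence length L, versus O(L^2) for full self-attention.
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t
```

Selective variants such as Mamba additionally make the discretized parameters input-dependent while retaining the linear-time recurrent scan.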

Conclusion

This paper makes a substantial contribution to characterizing and understanding multimodal model architectures, offering a valuable framework for both academic research and practical implementation. By laying out the taxonomy and comparative advantages of the Type-A, B, C, and D architectures, the authors provide a roadmap for the evolution of multimodal models and for building systems that integrate and process diverse modalities. The paper's influence is likely to extend beyond current state-of-the-art models, informing the design of next-generation architectures and applications in multimodal AI.