The Revolution of Multimodal Large Language Models: A Survey (2402.12451v2)
Abstract: Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of LLMs, significant research effort is being devoted to the development of Multimodal LLMs (MLLMs). These models can seamlessly integrate visual and textual modalities while providing a dialogue-based interface and instruction-following capabilities. In this paper, we provide a comprehensive review of recent vision-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques. We also conduct a detailed analysis of these models across a wide range of tasks, including visual grounding, image generation and editing, visual understanding, and domain-specific applications. Additionally, we compile and describe training datasets and evaluation benchmarks, and compare existing models in terms of performance and computational requirements. Overall, this survey offers a comprehensive overview of the current state of the art, laying the groundwork for future MLLMs.
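As a reference point for the architectural choices and alignment strategies the survey discusses, the sketch below illustrates the layout most surveyed models share: a (typically frozen) visual encoder whose patch features are mapped into the LLM embedding space by a lightweight projector and then consumed by the language model alongside the text tokens. This is a minimal PyTorch illustration assuming a CLIP-style encoder and a LLaVA-style linear projector; the class, dimensions, and forward signature are illustrative, not taken from any specific model in the survey.

```python
import torch
import torch.nn as nn


class SimpleMLLM(nn.Module):
    """Minimal vision-language model: frozen visual encoder + projector + LLM (illustrative sketch)."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. a CLIP/EVA ViT, usually kept frozen
        self.projector = nn.Linear(vision_dim, llm_dim)   # multimodal alignment module
        self.llm = llm                                    # pretrained autoregressive language model

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Extract patch features and project them into the LLM token-embedding space.
        with torch.no_grad():                             # frozen encoder: no gradients needed
            patch_feats = self.vision_encoder(images)     # (B, num_patches, vision_dim)
        visual_tokens = self.projector(patch_feats)       # (B, num_patches, llm_dim)
        # Prepend the visual tokens to the text embeddings; the LLM attends over both
        # modalities and produces its output autoregressively.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs)
```

Surveyed models differ mainly in the choice of alignment module (linear layers, MLPs, or Q-Former/resampler-style components) and in which parts are kept frozen or fine-tuned during training.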
Authors: Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara