Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions (2404.07214v2)
Abstract: The advent of Large Language Models (LLMs) has significantly reshaped the trajectory of the AI revolution. Nevertheless, these LLMs exhibit a notable limitation: they are primarily adept at processing textual information. To address this constraint, researchers have endeavored to integrate visual capabilities with LLMs, resulting in the emergence of Vision Language Models (VLMs). These advanced models are instrumental in tackling more intricate tasks such as image captioning and visual question answering. In our comprehensive survey paper, we delve into the key advancements within the realm of VLMs. Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs. This classification is based on their respective capabilities and functionalities in processing and generating the various modalities of data. We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, and, wherever possible, its strengths and limitations, providing readers with a comprehensive understanding of its essential components. We also analyze the performance of VLMs on various benchmark datasets. By doing so, we aim to offer a nuanced understanding of the diverse landscape of VLMs. Additionally, we underscore potential avenues for future research in this dynamic domain, anticipating further breakthroughs and advancements.
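To make the second category concrete (multimodal input, unimodal text output), below is a minimal sketch of image captioning and visual question answering with BLIP-2, one of the models covered by the survey. It assumes the Hugging Face transformers library and the publicly released Salesforce/blip2-opt-2.7b checkpoint; the example image URL and the prompt format are illustrative choices, not taken from the paper.

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the BLIP-2 processor and model (checkpoint name is an assumption for illustration)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Example image (any RGB image works; this COCO URL is illustrative)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Image captioning: multimodal (image) input, unimodal (text) output
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip())

# Visual question answering with the same model, conditioned on a text prompt
prompt = "Question: how many animals are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
answer_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0].strip())
```

Both tasks run through the same frozen-image-encoder-plus-LLM pipeline; only the presence of a text prompt distinguishes captioning from question answering.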
Authors: Akash Ghosh, Arkadeep Acharya, Sriparna Saha, Vinija Jain, Aman Chadha