Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions (2404.07214v2)
Abstract: The advent of Large Language Models (LLMs) has significantly reshaped the trajectory of the AI revolution. Nevertheless, these LLMs exhibit a notable limitation: they are primarily adept at processing textual information. To address this constraint, researchers have endeavored to integrate visual capabilities with LLMs, resulting in the emergence of Vision Language Models (VLMs). These advanced models are instrumental in tackling more intricate tasks such as image captioning and visual question answering. In our comprehensive survey paper, we delve into the key advancements within the realm of VLMs. Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs. This classification is based on their respective capabilities and functionalities in processing and generating the various modalities of data. We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, and, wherever possible, its strengths and limitations, providing readers with a comprehensive understanding of its essential components. We also analyze the performance of VLMs on various benchmark datasets. By doing so, we aim to offer a nuanced understanding of the diverse landscape of VLMs. Additionally, we underscore potential avenues for future research in this dynamic domain, anticipating further breakthroughs and advancements.
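To make the second category concrete (multimodal input, unimodal text output), below is a minimal sketch of image captioning and visual question answering with BLIP-2, one of the models covered by the survey. It assumes the Hugging Face transformers library and the publicly released Salesforce/blip2-opt-2.7b checkpoint; the example image URL and the prompt format are illustrative choices, not taken from the paper.

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the BLIP-2 processor and model (checkpoint name is an assumption for illustration)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Example image (any RGB image works; this COCO URL is illustrative)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Image captioning: multimodal (image) input, unimodal (text) output
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(caption_ids, skip_special_tokens=True)[0].strip())

# Visual question answering with the same model, conditioned on a text prompt
prompt = "Question: how many animals are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt")
answer_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0].strip())
```

Both tasks run through the same frozen-image-encoder-plus-LLM pipeline; only the presence of a text prompt distinguishes captioning from question answering.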
Authors: Akash Ghosh, Arkadeep Acharya, Sriparna Saha, Vinija Jain, Aman Chadha