
How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges (2307.15016v2)

Published 27 Jul 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard's impressive track record in handling textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned on text questions. This exploration has the potential to unveil new insights and challenges for Bard and other forthcoming multimodal generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study, we focus on 15 diverse task scenarios encompassing regular, camouflaged, medical, underwater, and remote sensing data to comprehensively evaluate Bard's performance. Our primary finding indicates that Bard still struggles in these vision scenarios, highlighting the significant gap in vision-based understanding that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, leading to enhanced capabilities in comprehending and interpreting fine-grained visual data. Our project is released at https://github.com/htqin/GoogleBard-VisUnderstand

Authors (6)
  1. Haotong Qin (60 papers)
  2. Ge-Peng Ji (29 papers)
  3. Salman Khan (244 papers)
  4. Deng-Ping Fan (88 papers)
  5. Fahad Shahbaz Khan (225 papers)
  6. Luc Van Gool (570 papers)
Citations (14)
