
Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models (2401.00127v1)

Published 30 Dec 2023 in cs.CV and cs.SI

Abstract: The synergy of language and vision models has given rise to Large Language and Vision Assistant models (LLVAs), designed to engage users in rich conversational experiences intertwined with image-based queries. These comprehensive multimodal models seamlessly integrate vision encoders with LLMs, expanding their applications in general-purpose language and visual comprehension. The advent of Large Multimodal Models (LMMs) heralds a new era in AI assistance, extending the horizons of AI utilization. This paper takes a unique perspective on LMMs, exploring their efficacy in performing image classification tasks using tailored prompts designed for specific datasets. We also investigate the LLVAs' zero-shot learning capabilities. Our study includes a benchmarking analysis across four diverse datasets: MNIST, Cats vs. Dogs, Hymenoptera (Ants vs. Bees), and an unconventional dataset comprising Pox vs. Non-Pox skin images. The results of our experiments demonstrate the model's remarkable performance, achieving classification accuracies of 85%, 100%, 77%, and 79% on the respective datasets without any fine-tuning. To bolster our analysis, we assess the model's performance after fine-tuning for specific tasks. In one instance, fine-tuning is conducted on a dataset comprising facial images of children with and without autism. Prior to fine-tuning, the model achieved a test accuracy of 55%, which improved significantly to 83% after fine-tuning. These results, coupled with our prior findings, underscore the transformative potential of LLVAs and their versatile applications in real-world scenarios.

Authors (5)
  1. Ashhadul Islam (5 papers)
  2. Md. Rafiul Biswas (18 papers)
  3. Wajdi Zaghouani (26 papers)
  4. Samir Brahim Belhaouari (10 papers)
  5. Zubair Shah (22 papers)
Citations (3)