Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models (2401.00127v1)
Abstract: The synergy of language and vision models has given rise to Large Language and Vision Assistant models (LLVAs), designed to engage users in rich conversational experiences intertwined with image-based queries. These comprehensive multimodal models seamlessly integrate vision encoders with LLMs, expanding their applications in general-purpose language and visual comprehension. The advent of Large Multimodal Models (LMMs) heralds a new era in AI assistance, extending the horizons of AI utilization. This paper takes a unique perspective on LMMs, exploring their efficacy in performing image classification tasks using tailored prompts designed for specific datasets. We also investigate the LLVAs' zero-shot learning capabilities. Our study includes a benchmarking analysis across four diverse datasets: MNIST, Cats vs. Dogs, Hymenoptera (Ants vs. Bees), and an unconventional dataset comprising Pox vs. Non-Pox skin images. The results of our experiments demonstrate the model's remarkable performance, achieving classification accuracies of 85%, 100%, 77%, and 79% on the respective datasets without any fine-tuning. To strengthen our analysis, we also assess the model's performance after fine-tuning for specific tasks. In one instance, fine-tuning is conducted on a dataset of facial images of children with and without autism. Prior to fine-tuning, the model achieved a test accuracy of 55%, which improved significantly to 83% after fine-tuning. These results, coupled with our prior findings, underscore the transformative potential of LLVAs and their versatile applications in real-world scenarios.
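The zero-shot setup described above can be illustrated with a minimal sketch. The example below is not the authors' pipeline; it assumes the Hugging Face `transformers` LLaVA wrapper with the public `llava-hf/llava-1.5-7b-hf` checkpoint, and the image path and prompt wording are illustrative. It shows how a tailored per-dataset prompt (here, for Cats vs. Dogs) turns an LMM's free-form answer into a class label without any fine-tuning.

```python
# Minimal zero-shot classification sketch with a LLaVA-style LMM.
# Assumes the Hugging Face "transformers" LLaVA wrapper and an
# illustrative image path; not the authors' original code.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # public LLaVA-1.5 checkpoint (assumed)
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Tailored prompt for the Cats vs. Dogs dataset: constrain the answer space
# so the generated text can be mapped to a class label.
prompt = "USER: <image>\nIs this a photo of a cat or a dog? Answer with one word. ASSISTANT:"
image = Image.open("example.jpg")  # hypothetical test image

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
answer = processor.decode(output_ids[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()

# Map the free-form answer to a label; anything else is left unmatched.
label = "cat" if "cat" in answer.lower() else "dog" if "dog" in answer.lower() else "unknown"
print(label)
```

The same pattern applies to the other datasets by swapping in a dataset-specific prompt (e.g., asking for a single digit for MNIST), with the prompt chosen so the answer can be parsed deterministically.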
Authors: Ashhadul Islam, Md. Rafiul Biswas, Wajdi Zaghouani, Samir Brahim Belhaouari, Zubair Shah