Exploring Boundary of GPT-4V on Marine Analysis: A Preliminary Case Study (2401.02147v1)
Abstract: Large language models (LLMs) have demonstrated a powerful ability to answer diverse queries as general-purpose assistants. Multi-modal LLMs (MLLMs) further empower LLMs with the ability to perceive visual signals. The launch of GPT-4 (Generative Pre-trained Transformer) has generated significant interest in the research community, and GPT-4V(ision) has demonstrated significant power in both academia and industry, standing as a focal point of a new generation of artificial intelligence. Despite GPT-4V's significant success, exploring MLLMs for domain-specific analysis (e.g., marine analysis), which requires domain-specific knowledge and expertise, has received less attention. In this study, we carry out a preliminary yet comprehensive case study of utilizing GPT-4V for marine analysis. This report systematically evaluates the existing GPT-4V, assessing its performance on marine research and setting a new standard for future developments in MLLMs. Our experimental results show that the responses generated by GPT-4V still fall far short of the domain-specific requirements of marine professionals. All images and prompts used in this study will be available at https://github.com/hkust-vgd/Marine_GPT-4V_Eval
Authors:
- Ziqiang Zheng
- Yiwei Chen
- Jipeng Zhang
- Tuan-Anh Vu
- Huimin Zeng
- Yue Him Wong Tim
- Sai-Kit Yeung