Exploring Boundary of GPT-4V on Marine Analysis: A Preliminary Case Study (2401.02147v1)

Published 4 Jan 2024 in cs.CL and cs.CV

Abstract: LLMs have demonstrated a powerful ability to answer various queries as general-purpose assistants. Multi-modal LLMs (MLLMs) further empower LLMs with the ability to perceive visual signals. The launch of GPT-4 (Generative Pre-trained Transformer) has generated significant interest in the research community. GPT-4V(ision) has demonstrated significant power in both academia and industry, becoming a focal point of a new generation of artificial intelligence. Despite the significant success of GPT-4V, exploring MLLMs in domain-specific analysis (e.g., marine analysis) that requires domain-specific knowledge and expertise has received less attention. In this study, we carry out a preliminary yet comprehensive case study of utilizing GPT-4V for marine analysis. This report systematically evaluates the current GPT-4V, assessing its performance on marine research and setting a reference standard for future developments in MLLMs. The experimental results show that the responses generated by GPT-4V are still far from satisfying the domain-specific requirements of marine professionals. All images and prompts used in this study will be available at https://github.com/hkust-vgd/Marine_GPT-4V_Eval.
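
For readers who want to try this style of probing themselves, the sketch below shows how a marine image and a domain-specific question could be sent to GPT-4V through the OpenAI API. It is a minimal illustration under assumed defaults (the OpenAI Python SDK >= 1.0, the "gpt-4-vision-preview" model name of that period, and a hypothetical coral_reef.jpg image), not the authors' actual evaluation pipeline; the study's real prompts and images are those released in the GitHub repository above.

```python
# Minimal sketch of probing GPT-4V with a marine image and a domain-specific
# prompt. Illustrative only: the model name, image file, and question are
# assumptions, not the paper's released evaluation assets.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_gpt4v(image_path: str, question: str) -> str:
    # Encode the local image as a base64 data URL accepted by the vision endpoint.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
        max_tokens=512,
    )
    return response.choices[0].message.content


# Example of the kind of domain-specific query evaluated in the paper.
print(ask_gpt4v("coral_reef.jpg",
                "Identify the coral genus in this image and describe its health condition."))
```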

Authors (7)
  1. Ziqiang Zheng (16 papers)
  2. Yiwei Chen (19 papers)
  3. Jipeng Zhang (46 papers)
  4. Tuan-Anh Vu (14 papers)
  5. Huimin Zeng (25 papers)
  6. Yue Him Wong Tim (3 papers)
  7. Sai-Kit Yeung (52 papers)
Citations (3)