Gemini Pro Defeated by GPT-4V: Evidence from Education (2401.08660v1)

Published 27 Dec 2023 in cs.AI and cs.CL

Abstract: This study compared the classification performance of Gemini Pro and GPT-4V in educational settings. Using visual question answering (VQA) techniques, the study examined both models' abilities to read text-based rubrics and then automatically score student-drawn models in science education. We conducted both quantitative and qualitative analyses on a dataset of student-drawn scientific models, using the NERIF (Notation-Enhanced Rubrics for Image Feedback) prompting method. The findings reveal that GPT-4V significantly outperforms Gemini Pro in terms of scoring accuracy and Quadratic Weighted Kappa. The qualitative analysis suggests that the difference may stem from the models' differing abilities to process fine-grained text in images and from their overall image classification performance. Even when the NERIF approach was adapted by further reducing the size of the input images, Gemini Pro did not appear to perform as well as GPT-4V. The findings suggest GPT-4V's superior capability in handling complex multimodal educational tasks. The study concludes that while both models represent advancements in AI, GPT-4V's higher performance makes it a more suitable tool for educational applications involving multimodal data interpretation.
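
The abstract reports two agreement metrics, scoring accuracy and Quadratic Weighted Kappa (QWK). The sketch below shows how these are commonly computed for ordinal scores; it is not taken from the paper. The function names, the three-level 0 to 2 score coding, and the sample scores are all hypothetical, chosen only for illustration.

```python
def scoring_accuracy(human, model):
    """Fraction of items where the model score equals the human score."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, model)) / len(human)


def quadratic_weighted_kappa(human, model, num_classes):
    """QWK for integer scores in the range 0..num_classes-1.

    kappa = 1 - sum(w * O) / sum(w * E), where O is the observed
    confusion matrix, E the matrix expected under independence, and
    w[i][j] = (i - j)^2 / (num_classes - 1)^2.
    """
    if num_classes < 2:
        raise ValueError("need at least two score levels")
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")

    n = len(human)
    k = num_classes

    # Observed confusion matrix: rows are human scores, columns model scores.
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, model):
        observed[h][m] += 1

    # Marginal score counts for each rater.
    human_hist = [sum(row) for row in observed]
    model_hist = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * human_hist[i] * model_hist[j] / n

    if weighted_expected == 0:
        # Both raters used one identical score level for every item.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores for three drawings (0 = lowest level, 2 = highest).
human_scores = [0, 1, 2]
model_scores = [0, 1, 1]
print(scoring_accuracy(human_scores, model_scores))
print(quadratic_weighted_kappa(human_scores, model_scores, num_classes=3))
```

Unlike plain accuracy, QWK penalizes a disagreement by the squared distance between the two scores, so a model that is off by one level is treated more leniently than one that is off by two.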

Authors (4)
  1. Gyeong-Geon Lee (11 papers)
  2. Ehsan Latif (36 papers)
  3. Lehong Shi (6 papers)
  4. Xiaoming Zhai (48 papers)
Citations (16)
