Gemini Pro Defeated by GPT-4V: Evidence from Education (2401.08660v1)
Abstract: This study compared the classification performance of Gemini Pro and GPT-4V in educational settings. Employing visual question answering (VQA) techniques, the study examined both models' abilities to read text-based rubrics and then automatically score student-drawn models in science education. We conducted both quantitative and qualitative analyses on a dataset of student-drawn scientific models, using NERIF (Notation-Enhanced Rubrics for Image Feedback) prompting methods. The findings reveal that GPT-4V significantly outperforms Gemini Pro in terms of scoring accuracy and Quadratic Weighted Kappa. The qualitative analysis suggests that the differences may be due to the models' ability to process fine-grained text in images and their overall image classification performance. Even when the NERIF approach was adapted by further reducing the size of the input images, Gemini Pro still appears unable to perform as well as GPT-4V. The findings suggest GPT-4V's superior capability in handling complex multimodal educational tasks. The study concludes that while both models represent advancements in AI, GPT-4V's higher performance makes it a more suitable tool for educational applications involving multimodal data interpretation.
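For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how scoring accuracy and Quadratic Weighted Kappa could be computed from human and model scores. It is not the authors' evaluation code. The function names, the three-level score scale (0, 1, 2), and the sample scores are all assumptions made for illustration.

```python
def accuracy(human, model):
    """Fraction of responses where the model score equals the human score."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, model)) / len(human)


def quadratic_weighted_kappa(human, model, num_levels):
    """Quadratic Weighted Kappa for integer scores in 0..num_levels-1.

    Disagreements are penalised by the squared distance between the two
    scores, so confusing level 0 with level 2 costs more than 0 with 1.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if num_levels < 2:
        raise ValueError("need at least two score levels")

    n = len(human)
    # Observed confusion matrix: rows are human scores, columns model scores.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for h, m in zip(human, model):
        observed[h][m] += 1

    human_totals = [sum(row) for row in observed]
    model_totals = [sum(observed[i][j] for i in range(num_levels))
                    for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            # Expected count if the two raters scored independently.
            expected = human_totals[i] * model_totals[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters used one identical level throughout; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on an assumed three-level scale (not data from the study).
human_scores = [0, 1, 2, 2, 1, 0, 2, 1]
model_scores = [0, 1, 2, 1, 1, 0, 2, 2]
print("accuracy:", accuracy(human_scores, model_scores))
print("QWK:", quadratic_weighted_kappa(human_scores, model_scores, num_levels=3))
```

Accuracy counts only exact matches, whereas Quadratic Weighted Kappa corrects for chance agreement and penalises larger scoring errors more heavily, which is presumably why the two are reported together for ordinal scores.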
Authors: Gyeong-Geon Lee, Ehsan Latif, Lehong Shi, Xiaoming Zhai