Evaluating GPT-4's Vision Capabilities on Brazilian University Admission Exams (2311.14169v1)

Published 23 Nov 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Recent advancements in LLMs have showcased human-comparable performance in academic entrance exams. However, existing studies often overlook questions that require the integration of visual comprehension, thus compromising the full spectrum and complexity inherent in real-world scenarios. To address this gap, we present a comprehensive framework to evaluate LLMs on entrance exams, which incorporates both textual and visual elements. We evaluate the two most recent editions of the Exame Nacional do Ensino Médio (ENEM), the main standardized entrance examination adopted by Brazilian universities. Our study not only reaffirms the capabilities of GPT-4 as the state of the art for handling complex multidisciplinary questions, but also pioneers in offering a realistic assessment of multimodal LLMs on Portuguese examinations. One of the highlights is that text captions transcribing visual content outperform the direct use of images, suggesting that the vision model has room for improvement. Yet, despite improvements afforded by images or captions, mathematical questions remain a challenge for these state-of-the-art models. The code and data used in the experiments are available at https://github.com/piresramon/gpt-4-enem.
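The abstract's central comparison — image input versus text captions transcribing the visual content — comes down to tallying multiple-choice accuracy separately per input condition. The following is a minimal, hypothetical sketch of that tally; the function name, the condition labels, and the example predictions are all illustrative, not taken from the paper's code.

```python
from collections import defaultdict

def accuracy_by_condition(results):
    """Tally multiple-choice accuracy per input condition.

    `results` is a list of (condition, predicted, gold) tuples,
    e.g. ("caption", "B", "B"). Returns {condition: accuracy}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, predicted, gold in results:
        total[condition] += 1
        if predicted == gold:
            correct[condition] += 1
    return {c: correct[c] / total[c] for c in total}

# Fabricated predictions for two conditions, for illustration only:
results = [
    ("image", "A", "A"), ("image", "C", "B"),
    ("caption", "B", "B"), ("caption", "D", "D"),
]
print(accuracy_by_condition(results))  # {'image': 0.5, 'caption': 1.0}
```

Grouping scores this way makes the paper's headline observation directly visible: a higher accuracy under the caption condition than under the image condition points to the vision pipeline, not the language model, as the bottleneck.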

Authors (4)
  1. Ramon Pires (11 papers)
  2. Thales Sales Almeida (10 papers)
  3. Hugo Abonizio (12 papers)
  4. Rodrigo Nogueira (70 papers)
Citations (3)