Evaluation of GPT-3.5 and GPT-4 on Brazilian University Admission Exams
The paper titled "Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams" presents a comprehensive analysis of how well advanced large language models (LLMs), specifically GPT-3.5 and GPT-4, handle the complex tasks posed by the Exame Nacional do Ensino Médio (ENEM), a multifaceted admission exam widely used by Brazilian universities. The authors evaluate the models under several prompting strategies, including zero-shot, few-shot, and Chain-of-Thought (CoT) prompting. The findings contribute to the discourse on the potential applications of LLMs in educational contexts and highlight critical aspects of the models' performance on a non-English, high-difficulty assessment.
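To make the three prompting strategies concrete, the sketch below shows how zero-shot, few-shot, and CoT prompts might be assembled for a multiple-choice ENEM-style question. It is a minimal illustration under assumed conventions: the function names, wording, and option format are placeholders introduced here, not the authors' actual templates.

```python
# Hypothetical sketch of the three prompting strategies; the helper names,
# wording, and option format are illustrative, not the paper's actual templates.

def zero_shot_prompt(question: str, options: dict[str, str]) -> str:
    """Ask for the answer letter directly, with no examples and no reasoning."""
    body = "\n".join(f"{letter}) {text}" for letter, text in options.items())
    return f"Question: {question}\n{body}\nAnswer with the letter of the correct option."

def few_shot_prompt(examples: list[tuple[str, dict[str, str], str]],
                    question: str, options: dict[str, str]) -> str:
    """Prepend a few solved questions before asking the target question."""
    shots = []
    for ex_question, ex_options, ex_answer in examples:
        ex_body = "\n".join(f"{letter}) {text}" for letter, text in ex_options.items())
        shots.append(f"Question: {ex_question}\n{ex_body}\nAnswer: {ex_answer}")
    return "\n\n".join(shots) + "\n\n" + zero_shot_prompt(question, options)

def cot_prompt(question: str, options: dict[str, str]) -> str:
    """Ask the model to reason step by step before committing to a letter."""
    return (zero_shot_prompt(question, options)
            + "\nLet's think step by step, then state the final answer as a single letter.")
```

In the actual experiments the instructions and examples are in Portuguese and follow the authors' own templates; the sketch only conveys the structural difference between the three strategies.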
Key Findings
The paper reports several important outcomes of the evaluation:
- GPT-4 Performance: Among the tested models, GPT-4 demonstrated superior performance, achieving up to 94.56% accuracy on the 2009-2017 ENEM questions and 87.29% on the ENEM 2022 edition. This significant leap from GPT-3.5 (which topped out around 82.88% accuracy) underscores GPT-4's advancements in NLP capabilities.
- Impact of Prompt Engineering: CoT prompts, which instruct the model to reason through intermediate steps before committing to a final answer, yielded a marked improvement on mathematical problems, a traditionally challenging area for LLMs. For example, GPT-4's accuracy on mathematics questions increased from 50.00% to 72.73% when CoT prompts were used (a sketch of how such responses might be scored follows this list).
- Cross-Domain Language Processing: The research illustrates the robustness of GPT-4 on questions requiring interdisciplinary knowledge, as evidenced by its high scores in the human and natural sciences. This supports the notion that advanced LLMs are becoming proficient at integrating and reasoning across different knowledge domains, even in a language other than the English that dominates their training data.
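The accuracy figures above presuppose a way to turn a free-form, possibly chain-of-thought, response into a single option letter. Below is a minimal, hypothetical scoring sketch; the paper does not spell out its parsing logic, so the regular expressions and helper names are assumptions rather than the authors' implementation.

```python
import re

# Illustrative only: extract a final option letter from a (possibly chain-of-thought)
# response and score accuracy. This is an assumption, not the authors' code.

def extract_answer(response: str) -> str | None:
    """Prefer the letter after an explicit 'Answer:'/'Resposta:' marker;
    otherwise fall back to the last standalone uppercase A-E in the text."""
    marked = re.search(r"(?:answer|resposta)\s*[:\-]?\s*\(?([A-E])\)?",
                       response, re.IGNORECASE)
    if marked:
        return marked.group(1).upper()
    letters = re.findall(r"\b([A-E])\b", response)
    return letters[-1] if letters else None

def accuracy(responses: list[str], gold_letters: list[str]) -> float:
    """Fraction of responses whose extracted letter matches the gold answer."""
    hits = sum(extract_answer(r) == g for r, g in zip(responses, gold_letters))
    return hits / len(gold_letters)
```

Anchoring extraction on an explicit answer marker tends to be more robust for CoT outputs, where the intermediate reasoning may mention several option letters before the final choice.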
Implications and Future Research
The implications of this paper are manifold. Practically, the strong performance of GPT-4 suggests it is ready to support educational applications that require nuanced understanding and explanation, such as automated tutoring systems, personalized learning tools, and educational content generation. The research also points to the potential of LLMs in precision education, for instance through adaptive testing and enriched academic content tailored to varied learner needs.
From a theoretical standpoint, the investigation into CoT prompting highlights an important avenue in prompt engineering, revealing a valuable tactic for eliciting deeper reasoning in LLMs. Further refinement of these prompting techniques could yield models capable of more complex problem-solving and decision-making tasks.
Future work, as suggested by the authors, should explore the effects of multimodal capabilities in newer model architectures, such as those able to process both text and visual inputs. This would further enhance the models' utility in real-world educational scenarios, especially in assessments that demand visual comprehension.
Additionally, generating new question sets and predicting question difficulty with LLMs remain open frontiers that promise to improve the adaptability of educational assessments. The paper also underscores the need for transparency and continued verification to rule out memorization biases, ensuring that the models demonstrate genuine understanding and application of knowledge rather than recall of previously seen questions.
In conclusion, the research provides critical insights into the current state of LLM applications in challenging educational domains and opens pathways for further exploration and development of AI-driven learning solutions.