- The paper presents an empirical study assessing LLMs' ability to solve MIT Math and EECS exam questions using a curated 4,550-question dataset.
- It reveals GPT-4’s perfect performance on non-image questions when enhanced with expert prompting and refined prompt engineering strategies.
- The study demonstrates the potential for LLM integration in education, offering insights for automated grading and curriculum design through dependency analysis.
Evaluation of LLMs on the MIT Mathematics and EECS Curriculum
Summary
The paper presents an empirical study examining the abilities of various LLMs, focusing on their capability to process and solve academic-level questions from MIT's Mathematics and Electrical Engineering and Computer Science (EECS) curriculum. The authors curated a dataset of 4,550 questions and solutions drawn from problem sets, midterms, and final exams across the curriculum, and evaluated how well different models could meet graduation requirements.
The models under evaluation include GPT-3.5, GPT-4, StableVicuna, and LLaMA. Among these, GPT-4 demonstrates impressive performance, achieving a perfect solve rate on a curated test set when image-based questions are excluded. Techniques such as expert prompting and other prompt-engineering strategies (e.g., few-shot learning, chain-of-thought) improved model performance. Notably, expert prompting, in which the model is asked to give the answer that domain experts would generally agree on, emerged as a novel and effective approach.
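As a rough illustration (not code from the paper), the sketch below shows how such prompt variants might be assembled. The helper functions, expert persona, and example question are hypothetical stand-ins, not the authors' actual prompts.

```python
# Minimal sketch of the prompting strategies discussed above.
# Helper names and example content are illustrative, not taken from the paper.

def expert_prompt(question: str, expert: str) -> str:
    """Ask the model to answer as a named domain expert would."""
    return (
        f"You are {expert}. Provide the answer that experts in this field "
        f"would generally agree on.\n\nQuestion: {question}\nAnswer:"
    )

def few_shot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Prepend solved (question, answer) pairs before the target question."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"

def chain_of_thought_prompt(question: str) -> str:
    """Encourage step-by-step reasoning before the final answer."""
    return f"Question: {question}\nLet's think step by step."

if __name__ == "__main__":
    q = "Compute the determinant of [[2, 1], [1, 2]]."
    print(expert_prompt(q, "an MIT professor of linear algebra"))
```

In practice, such variants are typically tried in a cascade, falling back to a richer prompt only when a simpler one fails.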
Key Findings
GPT-3.5 and GPT-4 showed differing levels of competence on the provided question set. While GPT-3.5 solved about one-third of the curriculum, GPT-4, with meticulously refined prompts, achieved a perfect solve rate on the curated test set once image-based questions were excluded. Expert prompting, among other strategies, proved a fruitful method for enhancing performance.
A version of LLaMA fine-tuned on this dataset also exhibited improved performance, supporting the utility of fine-tuning on specialized datasets. The analysis extended to benchmark comparisons, with GPT-4 outperforming StableVicuna and LLaMA on both the MIT test set and the ReClor dataset of logical reasoning questions.
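For orientation, here is a minimal sketch of how a LLaMA-style model could be fine-tuned on question-solution pairs using Hugging Face Transformers with LoRA adapters. The paper does not specify this exact setup; the base checkpoint, hyperparameters, and data format are assumptions for illustration only.

```python
# Hedged sketch: fine-tuning a LLaMA-style causal LM on question-answer pairs
# with LoRA adapters. Base model, hyperparameters, and data are assumptions,
# not the paper's exact recipe.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "huggyllama/llama-7b"  # assumed checkpoint; substitute as needed

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Wrap the base model with low-rank adapters so only a small set of weights trains.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# Toy stand-in for the curated question-solution pairs.
pairs = [{"text": "Question: State the rank-nullity theorem.\n"
                  "Answer: dim(ker T) + dim(im T) = dim(V)."}]
dataset = Dataset.from_list(pairs).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama-mit-ft", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```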
Practical and Theoretical Implications
The paper highlights the multifaceted capabilities of LLMs to facilitate academic learning and offers a dynamic tool to assist in curriculum design. By embedding course questions as vectors and comparing them, the paper shows that dependencies among courses become discernible, offering insights for establishing prerequisites within the curriculum. This approach could enable a shift toward AI-assisted educational frameworks that coherently link different academic content.
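A minimal sketch of this idea, assuming an off-the-shelf sentence embedding model rather than the paper's actual pipeline, is to embed each course's questions, average them into a per-course centroid, and compare centroids by cosine similarity. The course labels, example questions, and choice of embedding model below are illustrative assumptions.

```python
# Hedged sketch: estimating course relatedness by embedding questions and
# comparing per-course centroid vectors. Courses, questions, and the embedding
# model are illustrative assumptions, not the paper's data or method.
import numpy as np
from sentence_transformers import SentenceTransformer

courses = {  # toy stand-ins for the curated per-course question sets
    "18.01": ["Compute the derivative of x^2 sin(x).",
              "Evaluate the integral of e^x from 0 to 1."],
    "18.02": ["Compute the gradient of f(x, y) = x^2 + y^2.",
              "Evaluate a double integral over the unit disk."],
}

model = SentenceTransformer("all-MiniLM-L6-v2")

def centroid(questions):
    """Mean embedding of a course's questions (one vector per course)."""
    vectors = model.encode(questions, normalize_embeddings=True)
    return np.mean(vectors, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

centroids = {name: centroid(qs) for name, qs in courses.items()}

# High similarity between course centroids suggests a candidate prerequisite link.
for a in courses:
    for b in courses:
        if a < b:
            print(a, b, round(cosine(centroids[a], centroids[b]), 3))
```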
Furthermore, automatic grading capabilities using LLMs like GPT-4 could streamline evaluation processes, offering instructors a scalable, efficient method to assess student performance. The development of meta-questions represents another potential avenue for educational enhancement, encouraging students to critically engage with AI-generated answers to hone their evaluative skills.
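One way to picture LLM-assisted grading is as a prompt that asks the model to compare a student answer against a reference solution and emit a score with a brief justification. The sketch below is a hypothetical framing, not the paper's grading pipeline; `call_llm` is a placeholder for whatever API client is actually used.

```python
# Hedged sketch of LLM-assisted grading. `call_llm` is a hypothetical stand-in
# for a real model client; the rubric and scale are illustrative assumptions.

GRADING_TEMPLATE = """You are grading an exam answer.
Question: {question}
Reference solution: {reference}
Student answer: {answer}

Give a score from 0 to 5 and a one-sentence justification, formatted as
"Score: <n>. Justification: <text>"."""

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (e.g., an OpenAI or local client)."""
    raise NotImplementedError("Wire this to the LLM client of your choice.")

def grade(question: str, reference: str, answer: str) -> str:
    """Build the grading prompt and return the model's raw verdict."""
    prompt = GRADING_TEMPLATE.format(
        question=question, reference=reference, answer=answer)
    return call_llm(prompt)
```

In a real deployment, the returned score would still need spot-checking by instructors, which is consistent with the paper's framing of LLMs as assistants rather than replacements.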
Future Perspectives
The paper suggests integrating LLMs into educational settings rather than prohibiting them, focusing on course design and adaptive learning. As LLM capabilities expand, they show substantial potential to reshape how curricula are structured and taught. There is also room for further research into scaling these methods to graduate-level education and beyond, including more robust evaluations on datasets spanning a broader academic spectrum.
While the current work adeptly showcases LLM capabilities in a specific curriculum context, future research should explore broader applications across different fields and institutions. As LLMs become more capable and context windows expand, integration into varied educational environments can offer greater flexibility and targeted learning support to students and educators alike.