- The paper presents an empirical study assessing LLMs' ability to solve MIT Math and EECS exam questions using a curated 4,550-question dataset.
- It reveals GPT-4’s perfect performance on non-image questions when enhanced with expert prompting and refined prompt engineering strategies.
- The study demonstrates the potential for LLM integration in education, offering insights for automated grading and curriculum design through dependency analysis.
Evaluation of LLMs on the MIT Mathematics and EECS Curriculum
Summary
The paper presents an empirical study examining the abilities of various LLMs, focusing on their capability to process and solve academic-level questions from MIT's Mathematics and Electrical Engineering and Computer Science (EECS) curriculum. The authors curated a dataset of 4,550 questions and solutions drawn from problem sets, midterms, and final exams across the curriculum, and evaluated how well different models could meet graduation requirements.
The models under evaluation include GPT-3.5, GPT-4, StableVicuna, and LLaMA. Among these, GPT-4 demonstrates impressive performance, achieving a perfect solve rate on a curated test set when image-based questions are excluded. Techniques such as expert prompting and other prompt-engineering strategies (e.g., few-shot learning, chain-of-thought) improved model performance. Notably, expert prompting, in which the model is asked to give the answer that domain experts would generally agree on, emerged as a novel and effective approach.
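As a rough illustration (not code from the paper), the sketch below shows how such prompt variants might be assembled. The helper functions, expert persona, and example question are hypothetical stand-ins, not the authors' actual prompts.

```python
# Minimal sketch of the prompting strategies discussed above.
# Helper names and example content are illustrative, not taken from the paper.

def expert_prompt(question: str, expert: str) -> str:
    """Ask the model to answer as a named domain expert would."""
    return (
        f"You are {expert}. Provide the answer that experts in this field "
        f"would generally agree on.\n\nQuestion: {question}\nAnswer:"
    )

def few_shot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Prepend solved (question, answer) pairs before the target question."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"

def chain_of_thought_prompt(question: str) -> str:
    """Encourage step-by-step reasoning before the final answer."""
    return f"Question: {question}\nLet's think step by step."

if __name__ == "__main__":
    q = "Compute the determinant of [[2, 1], [1, 2]]."
    print(expert_prompt(q, "an MIT professor of linear algebra"))
```

In practice, such variants are typically tried in a cascade, falling back to a richer prompt only when a simpler one fails.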
Key Findings
GPT-3.5 and GPT-4 showed differing levels of competence on the provided question set. While GPT-3.5 solved about one-third of the curriculum, GPT-4, with meticulously refined prompts, achieved a perfect solve rate on the curated test set once image-based questions were excluded. Expert prompting, among other strategies, proved a fruitful method for enhancing performance.
A version of LLaMA fine-tuned on this dataset also exhibited improved performance, supporting the utility of fine-tuning on specialized datasets. The analysis extended to benchmark comparisons, with GPT-4 outperforming StableVicuna and LLaMA on both the MIT test set and the ReClor dataset of logical reasoning questions.
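For orientation, here is a minimal sketch of how a LLaMA-style model could be fine-tuned on question-solution pairs using Hugging Face Transformers with LoRA adapters. The paper does not specify this exact setup; the base checkpoint, hyperparameters, and data format are assumptions for illustration only.

```python
# Hedged sketch: fine-tuning a LLaMA-style causal LM on question-answer pairs
# with LoRA adapters. Base model, hyperparameters, and data are assumptions,
# not the paper's exact recipe.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "huggyllama/llama-7b"  # assumed checkpoint; substitute as needed

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Wrap the base model with low-rank adapters so only a small set of weights trains.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# Toy stand-in for the curated question-solution pairs.
pairs = [{"text": "Question: State the rank-nullity theorem.\n"
                  "Answer: dim(ker T) + dim(im T) = dim(V)."}]
dataset = Dataset.from_list(pairs).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama-mit-ft", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```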
Practical and Theoretical Implications
The paper highlights the multifaceted capabilities of LLMs to facilitate academic learning and offers a dynamic tool to assist in curriculum design. By embedding course questions as vectors and comparing them, the paper shows that dependencies among courses become discernible, offering insights for establishing prerequisites within the curriculum. This approach could enable a shift toward AI-assisted educational frameworks that coherently link different academic content.
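A minimal sketch of this idea, assuming an off-the-shelf sentence embedding model rather than the paper's actual pipeline, is to embed each course's questions, average them into a per-course centroid, and compare centroids by cosine similarity. The course labels, example questions, and choice of embedding model below are illustrative assumptions.

```python
# Hedged sketch: estimating course relatedness by embedding questions and
# comparing per-course centroid vectors. Courses, questions, and the embedding
# model are illustrative assumptions, not the paper's data or method.
import numpy as np
from sentence_transformers import SentenceTransformer

courses = {  # toy stand-ins for the curated per-course question sets
    "18.01": ["Compute the derivative of x^2 sin(x).",
              "Evaluate the integral of e^x from 0 to 1."],
    "18.02": ["Compute the gradient of f(x, y) = x^2 + y^2.",
              "Evaluate a double integral over the unit disk."],
}

model = SentenceTransformer("all-MiniLM-L6-v2")

def centroid(questions):
    """Mean embedding of a course's questions (one vector per course)."""
    vectors = model.encode(questions, normalize_embeddings=True)
    return np.mean(vectors, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

centroids = {name: centroid(qs) for name, qs in courses.items()}

# High similarity between course centroids suggests a candidate prerequisite link.
for a in courses:
    for b in courses:
        if a < b:
            print(a, b, round(cosine(centroids[a], centroids[b]), 3))
```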
Furthermore, automatic grading capabilities using LLMs like GPT-4 could streamline evaluation processes, offering instructors a scalable, efficient method to assess student performance. The development of meta-questions represents another potential avenue for educational enhancement, encouraging students to critically engage with AI-generated answers to hone their evaluative skills.
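One way to picture LLM-assisted grading is as a prompt that asks the model to compare a student answer against a reference solution and emit a score with a brief justification. The sketch below is a hypothetical framing, not the paper's grading pipeline; `call_llm` is a placeholder for whatever API client is actually used.

```python
# Hedged sketch of LLM-assisted grading. `call_llm` is a hypothetical stand-in
# for a real model client; the rubric and scale are illustrative assumptions.

GRADING_TEMPLATE = """You are grading an exam answer.
Question: {question}
Reference solution: {reference}
Student answer: {answer}

Give a score from 0 to 5 and a one-sentence justification, formatted as
"Score: <n>. Justification: <text>"."""

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (e.g., an OpenAI or local client)."""
    raise NotImplementedError("Wire this to the LLM client of your choice.")

def grade(question: str, reference: str, answer: str) -> str:
    """Build the grading prompt and return the model's raw verdict."""
    prompt = GRADING_TEMPLATE.format(
        question=question, reference=reference, answer=answer)
    return call_llm(prompt)
```

In a real deployment, the returned score would still need spot-checking by instructors, which is consistent with the paper's framing of LLMs as assistants rather than replacements.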
Future Perspectives
The paper suggests integrating LLMs into educational settings rather than prohibiting them, focusing on course design and adaptive learning. As LLM capabilities expand, they show substantial potential to reshape how curricula are structured and taught. There is also room for further research into scaling these methods to graduate-level education and beyond, including more robust evaluations on datasets spanning a broader academic spectrum.
While the current work adeptly showcases LLM capabilities in a specific curriculum context, future research should explore broader applications across different fields and institutions. As LLMs become more capable and context windows expand, integration into varied educational environments can offer greater flexibility and targeted learning support to students and educators alike.