
Determine GPT-4o’s performance on complex handwritten math and visual statistics tasks

Determine how OpenAI’s GPT-4o responds to image-based inputs involving systems of handwritten equations, complicated multiple integrals, and the identification of differences in medians among hand-drawn boxplots, and assess whether it produces correct and useful solutions to these tasks in the context of mathematics and statistics education.


Background

The paper demonstrates that ChatGPT4 substantially outperforms ChatGPT3.5 on statistics and data science exam questions, with the gap especially large on questions that include images, since ChatGPT3.5 cannot process them. This motivates evaluating newer multimodal models for their image-understanding capabilities in educational contexts.

OpenAI announced GPT-4o with claims of GPT-4-level intelligence and the ability to read uploaded images, and demonstrated it on a simple handwritten algebra problem. However, the authors explicitly note that it is unclear how GPT-4o handles the more complex handwritten or visual tasks common in mathematics and statistics, such as systems of equations, complicated integrals, or interpreting hand-drawn boxplots. Clarifying GPT-4o’s performance on these tasks is important for understanding its utility, and its potential equity implications, in education.
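
One way to probe this question empirically is to send GPT-4o an image of a handwritten problem through OpenAI’s chat completions API, which accepts base64-encoded images alongside text prompts. The following is a minimal sketch of such a query; the file name, prompt wording, and choice of task are illustrative assumptions, not details taken from the paper.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical input: a photo of a handwritten system of equations.
with open("handwritten_system.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Solve this handwritten system of equations. "
                            "Show each step of your reasoning.",
                },
                {
                    # Images can be passed inline as data URLs.
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The same pattern extends to the other tasks above, for example attaching an image of hand-drawn boxplots and asking which group has the largest median; grading such responses against known answers would quantify the performance the authors leave open.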

References

From the source paper: “It is unclear how ChatGPT4o would respond to a system of hand-written equations, or a complicated multiple integral, or in determining differences in medians among a set of hand-drawn boxplots.”