Evaluation of LLMs in Solving High School Mathematics Questions
This paper offers a rigorous evaluation of LLMs applied to high school mathematics questions, drawn from college entrance examination papers from 2019 to 2023. The primary objective is to quantify the potential of LLMs in educational applications, particularly for solving science questions, with mathematics as the test case. Using eight different LLMs, the paper investigates a range of performance metrics, including accuracy, response time, logical reasoning, and creativity.
Evaluation Methodology
The research comprehensively evaluates LLM performance on a well-structured dataset of varied question types (multiple-choice, fill-in-the-blank, and comprehensive questions), each categorized by difficulty level. Models such as GLM-4-Flash, ERNIE-Speed-128K, and Qwen2.5-7B-Instruct were selected to represent a diverse mix of domestic and international providers.
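A minimal sketch of how such an evaluation setup could be organized is shown below; the dictionary structure, field names, and difficulty labels are assumptions for illustration, and only the six models named in this summary are listed (the paper evaluates eight in total).

```python
# Hypothetical evaluation configuration. The model names come from this summary;
# the structure, field names, and difficulty labels are illustrative assumptions.
EVAL_CONFIG = {
    "models": [
        "GLM-4-Flash",
        "ERNIE-Speed-128K",
        "Qwen2.5-7B-Instruct",
        "Hunyuan-lite",
        "Spark-lite",
        "Yi-34B",
    ],
    "question_types": ["multiple_choice", "fill_in_the_blank", "comprehensive"],
    "difficulty_levels": ["easy", "medium", "hard"],
    "exam_years": list(range(2019, 2024)),  # entrance examination papers, 2019 to 2023
}
```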
Data Processing and Classification
Each question was stored as a JSON record containing a detailed problem description and a standard answer written in LaTeX. The responses from different LLMs were assessed in both single-solution and multi-solution scenarios, and evaluative conclusions were drawn by processing the data with both AI-assisted assessment models and expert evaluations.
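For illustration only, a single question record under such a scheme might look like the sketch below; the paper does not publish its exact schema, so the field names and the sample problem are assumptions.

```python
import json

# Hypothetical question record: the field names and the sample problem are
# illustrative, not the paper's actual schema. The standard answer is LaTeX.
record = {
    "id": "2021-paper1-q12",
    "type": "fill_in_the_blank",
    "difficulty": "medium",
    "question": r"Given $f(x) = x^2 - 4x + 3$, the minimum value of $f(x)$ is ____.",
    "standard_answer": r"$-1$",
}

# Write the record to disk the way such a dataset might be stored.
with open("questions.json", "w", encoding="utf-8") as fh:
    json.dump([record], fh, ensure_ascii=False, indent=2)
```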
Key Findings
- Accuracy: Among the models evaluated, Qwen2.5-7B-Instruct and ERNIE-Speed-128K showed the best overall accuracy on comprehensive questions. Accuracy nevertheless varied significantly across question types and difficulty levels: Hunyuan-lite, for instance, excelled at fill-in-the-blank questions, while Yi-34B recorded the lowest performance on several metrics.
- Response Time: The evaluation revealed notable differences in how quickly models processed the various question types. Qwen2.5-7B-Instruct performed best on simpler problem sets, whereas GLM-4-Flash had the longest response times on question-and-answer tasks. Processing time was summarized using the 95th-percentile response time (a short computation sketch follows this list).
- Logical Reasoning and Guidance: Logical reasoning was assessed using mechanisms such as multi-round chain-of-thought prompting. Models like Hunyuan-lite needed fewer guidance rounds to turn incorrect answers into correct ones, indicating efficient logical reasoning. The effect of guiding words was also tested with varied prompts, which improved both accuracy and speed, most notably for Spark-lite (a guidance-loop sketch also follows this list).
- Creativity: Creativity was evaluated through multi-solution scenarios, which exposed differences in the richness and complexity of solutions. GLM-4-Flash led in producing diverse and complex solutions, while models such as Yi-34B showed limited variety in their outputs.
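The 95th-percentile response-time summary mentioned in the Response Time item can be computed in a few lines; this is a generic sketch with placeholder timings, not the paper's measurement code.

```python
import numpy as np

# Placeholder response times (seconds) for one model on one question type.
response_times = np.array([2.1, 3.4, 2.8, 5.0, 2.6, 4.2, 3.9, 7.5, 3.1, 2.9])

# 95th-percentile response time: 95% of responses finish within this bound,
# so a single extreme outlier does not dominate the summary the way the maximum would.
p95 = np.percentile(response_times, 95)
print(f"95th-percentile response time: {p95:.2f} s")
```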
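The multi-round guidance procedure described in the Logical Reasoning item can be sketched as a simple loop: when the model's answer is wrong, re-prompt with an added hint and count how many rounds it takes to reach the correct answer. This is an assumed outline, not the paper's implementation; the `ask_model` and `is_correct` callables are placeholders supplied by the caller, standing in for the LLM under test and for the AI-assisted or expert grading step.

```python
from typing import Callable

def guided_rounds(
    question: str,
    reference_answer: str,
    ask_model: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    max_rounds: int = 3,
) -> int:
    """Return the number of guidance rounds needed to reach a correct answer.

    Hypothetical sketch: ask_model queries the LLM under test and is_correct
    applies the grading step; both are placeholders passed in by the caller.
    """
    prompt = question
    for round_no in range(1, max_rounds + 1):
        answer = ask_model(prompt)
        if is_correct(answer, reference_answer):
            return round_no
        # Append a chain-of-thought style hint and re-ask in the next round.
        prompt = (
            question
            + "\nYour previous answer was incorrect. "
            + "Think step by step and re-derive the result carefully."
        )
    return max_rounds + 1  # correct answer not reached within the round budget
```

Averaging such a round count over a question set would give a per-model measure comparable in spirit to the "fewer attempts" observation above.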
Implications and Future Directions
The research underscores the growing capability of LLMs in addressing educational challenges, yet identifies areas that need improvement, particularly logical reasoning and creative problem-solving. Refinements to training algorithms and datasets could raise model creativity and logical processing. The paper also provides empirical evidence in favor of personalized learning modules that leverage LLM capabilities and cater to diverse student proficiency levels.
The paper recognizes the nascent stage of LLM application in the educational domain and suggests further exploration into their utility across other scientific disciplines like biology and physics. Continuous advancements in LLM technology warrant ongoing evaluation to harness their potential for long-term educational practices.
By emphasizing both the strengths and areas for improvement in current LLM applications, this research contributes substantially to the roadmap for integrating AI-driven solutions in educational systems, promoting more tailored and efficient learning methodologies.