A Report on the llms evaluating the high school questions (2505.00057v1)

Published 30 Apr 2025 in cs.CL

Abstract: This report aims to evaluate the performance of LLMs in solving high school science questions and to explore their potential applications in the educational field. With the rapid development of LLMs in the field of natural language processing, their application in education has attracted widespread attention. This study selected mathematics exam questions from the college entrance examinations (2019-2023) as evaluation data and utilized at least eight LLM APIs to provide answers. A comprehensive assessment was conducted based on metrics such as accuracy, response time, logical reasoning, and creativity. Through an in-depth analysis of the evaluation results, this report reveals the strengths and weaknesses of LLMs in handling high school science questions and discusses their implications for educational practice. The findings indicate that although LLMs perform excellently in certain aspects, there is still room for improvement in logical reasoning and creative problem-solving. This report provides an empirical foundation for further research and application of LLMs in the educational field and offers suggestions for improvement.

Summary

Evaluation of LLMs in Solving High School Mathematics Questions

This paper offers a rigorous evaluation of LLMs applied to high school mathematics questions drawn from college entrance examination papers from 2019 to 2023. The primary objective is to quantify the potential of LLMs in educational applications, particularly for solving science questions. Using eight varied LLMs, the paper investigates a spectrum of performance metrics, including accuracy, response time, logical reasoning, and creativity.

Evaluation Methodology

The research comprehensively evaluates LLM performance using a well-structured dataset comprising various question types—multiple-choice, fill-in-the-blank, and comprehensive questions—categorized by difficulty levels. Models such as GLM-4-Flash, ERNIE-Speed-128K, and Qwen2.5-7B-Instruct were selected to provide a diverse representation of both domestic and international technologies.

Data Processing and Classification

Each question was stored as a JSON record containing a detailed problem description and a standard answer written in LaTeX. The responses from the different LLMs were assessed in both single-solution and multi-solution scenarios, and evaluative conclusions were drawn by processing the data with a combination of AI-assisted assessment models and expert evaluation.
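
The paper's exact schema is not reproduced in this summary; the following is a minimal sketch of what such a JSON record might look like, with all field names and the sample question hypothetical.

```python
import json

# Hypothetical record layout; field names and content are illustrative only,
# not the paper's published schema.
question_record = {
    "year": 2023,
    "question_type": "fill-in-the-blank",  # or "multiple-choice", "comprehensive"
    "difficulty": "medium",
    "statement": r"Given $f(x) = x^2 - 2x$, find the minimum value of $f(x)$.",
    "reference_answer": r"$f(x)_{\min} = -1$",  # standard answer stored as LaTeX
}

# Serialize for the evaluation pipeline; ensure_ascii=False keeps any
# non-ASCII characters in the original exam text readable.
print(json.dumps(question_record, ensure_ascii=False, indent=2))
```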

Key Findings in LLM Performance

  1. Accuracy: Among the models evaluated, Qwen2.5-7B-Instruct and ERNIE-Speed-128K exhibited superior performance in overall accuracy for comprehensive questions. However, accuracy varied significantly across question types and difficulty levels. For instance, models like Hunyuan-lite were found to excel in fill-in-the-blank questions, while Yi-34B displayed the lowest performance across several metrics.
  2. Response Time: The evaluation revealed notable differences in how quickly models processed the various question types. Qwen2.5-7B-Instruct was fastest on simpler problem sets, whereas GLM-4-Flash showed the longest response times on question-and-answer tasks. Processing time was summarized with the 95th-percentile response time, reducing the influence of occasional slow outliers (a sketch of this statistic follows the list).
  3. Logical Reasoning and Guidance: Logical reasoning was assessed with multi-round chain-of-thought prompting. Models such as Hunyuan-lite needed fewer guided rounds to turn initially incorrect answers into correct ones, indicating efficient logical reasoning. The impact of guiding words was also tested with varied prompts, yielding gains in both accuracy and speed, most visibly for Spark-lite (see the re-prompting sketch after this list).
  4. Creativity: Evaluating the creative output through multi-solution scenarios, the paper underscores differences in solution richness and complexity. GLM-4-Flash led in providing diverse and complex solutions, while some models like Yi-34B showed limitations in generating varied outputs.
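
The summary cites the 95th-percentile response time as the latency statistic but not how it was computed; below is a minimal sketch of that measurement, assuming per-question wall-clock timings and a hypothetical `ask` callable standing in for the model API.

```python
import math
import time

def time_model(ask, questions):
    """Collect wall-clock latency per question for one model.

    `ask(question)` is a hypothetical stand-in for the API call used in the
    paper; it is assumed here, not taken from the source.
    """
    latencies = []
    for q in questions:
        start = time.perf_counter()
        ask(q)
        latencies.append(time.perf_counter() - start)
    return latencies

def p95(latencies):
    """95th-percentile latency (nearest-rank method), in seconds."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]
```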
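
The multi-round guidance procedure is described only at a high level; one plausible shape for such a loop is sketched below, with the guidance prompts and the `ask_model` / `is_correct` helpers all hypothetical rather than taken from the paper.

```python
# Hypothetical escalating hints; the paper's actual guiding words are not given here.
GUIDANCE_PROMPTS = [
    "Let's think step by step.",
    "Check each algebraic step before stating the answer.",
    "Re-read the question and verify the final answer satisfies every condition.",
]

def guided_solve(ask_model, is_correct, question, reference_answer):
    """Return the number of guided rounds needed before the answer is correct.

    `ask_model(prompt)` and `is_correct(answer, reference)` are hypothetical
    helpers standing in for the API call and the grading step.
    """
    answer = ask_model(question)
    if is_correct(answer, reference_answer):
        return 0  # correct on the first, unguided attempt
    for rounds, hint in enumerate(GUIDANCE_PROMPTS, start=1):
        prompt = f"{question}\n\n{hint}\nYour previous answer was: {answer}"
        answer = ask_model(prompt)
        if is_correct(answer, reference_answer):
            return rounds
    return len(GUIDANCE_PROMPTS) + 1  # did not converge within the allotted rounds
```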

Implications and Future Directions

The research underscores the growing capability of LLMs in addressing educational challenges, yet identifies areas needing enhancement, particularly in logical reasoning and creative problem-solving. Enhancements in training algorithms and datasets could potentially elevate model creativity and logical processing. The paper provides empirical evidence advocating for personalized learning modules leveraging LLM capabilities, catering to diverse student proficiency levels.

The paper recognizes the nascent stage of LLM application in the educational domain and suggests further exploration into their utility across other scientific disciplines like biology and physics. Continuous advancements in LLM technology warrant ongoing evaluation to harness their potential for long-term educational practices.

By emphasizing both the strengths and areas for improvement in current LLM applications, this research contributes substantially to the roadmap for integrating AI-driven solutions in educational systems, promoting more tailored and efficient learning methodologies.