Evaluative Analysis of "SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of LLMs"
The paper "SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of LLMs" by Xiaoxuan Wang et al. addresses a crucial component in the evaluation of LLMs by presenting a benchmark suite, SciBench, designed to assess the models' problem-solving skills at a collegiate level. Unlike prior benchmarks, which predominantly emphasize basic arithmetic operations within high-school contexts, SciBench is focused on advancing the evaluation of LLMs by incorporating complex scientific problems across mathematics, chemistry, and physics, inherent to undergraduate curricula.
Comprehensive Dataset and Goal
SciBench diverges from existing datasets through its inclusion of diverse and integrative problem types, including both textual and multimodal questions that incorporate visual information such as graphs and diagrams, thereby challenging LLMs in more sophisticated ways. The dataset consists of 789 open-ended questions that require multi-step reasoning and the application of domain-specific knowledge such as equations, theorems, and complex numerical computation, reflecting real-world demands in scientific work. An additional closed subset drawn from actual college exams broadens the scope and, because it is not readily available online, offers a more contamination-resistant assessment of LLM capabilities.
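To make the shape of such a benchmark concrete, the sketch below shows how a single open-ended problem might be represented and how a numeric prediction could be graded against the reference answer within a relative tolerance. The field names, the example problem, and the 1% tolerance are illustrative assumptions for this sketch, not SciBench's actual schema or grading rule.

```python
from dataclasses import dataclass

@dataclass
class SciBenchProblem:
    """Illustrative schema for one open-ended, SciBench-style problem.

    Field names are assumptions for this sketch, not the dataset's
    actual column names.
    """
    subject: str   # e.g. "physics", "chemistry", "mathematics"
    question: str  # full problem statement
    answer: float  # reference numeric answer
    unit: str      # expected unit of the answer, e.g. "J" or "mol"

def is_correct(predicted: float, reference: float, rel_tol: float = 0.01) -> bool:
    """Accept a prediction within a relative tolerance of the reference.

    The 1% tolerance is an assumed grading rule used only for illustration.
    """
    if reference == 0:
        return abs(predicted) <= rel_tol
    return abs(predicted - reference) / abs(reference) <= rel_tol

# Example usage with a made-up problem: work done against gravity.
problem = SciBenchProblem(
    subject="physics",
    question="A 2.0 kg mass is lifted 3.0 m at constant speed. "
             "How much work is done against gravity (g = 9.8 m/s^2)?",
    answer=58.8,
    unit="J",
)
print(is_correct(58.7, problem.answer))  # True: within 1% of 58.8 J
```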
Benchmark Findings and Model Insights
The authors provide a comprehensive comparative evaluation of various unimodal and multimodal LLMs under assorted prompting strategies, including zero-shot and few-shot prompting, Chain-of-Thought (CoT) prompting, and the use of external computation tools such as Python and Wolfram Language. Among the noteworthy findings, no single prompting strategy emerged as universally superior: gains on some problem types often came at the cost of declines on others. This underscores the difficulty of configuring LLMs for general-purpose scientific problem solving. Notably, even with CoT prompting and external tool support, the best model achieved an average score of only 43.22% on the textual dataset, a testament to SciBench's calibrated difficulty.
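The sketch below illustrates, under stated assumptions, how these prompting configurations might be wired into an evaluation harness. The prompt templates and the `generate` callable are placeholders rather than SciBench's exact prompts or any particular model API.

```python
# Minimal sketch of three prompting configurations discussed above.
# The templates and the `generate` function are illustrative placeholders,
# not SciBench's exact prompts or a specific model's API.

def zero_shot_prompt(question: str) -> str:
    return f"Problem: {question}\nAnswer with a single number."

def cot_prompt(question: str) -> str:
    # Chain-of-Thought: ask the model to reason step by step first.
    return (
        f"Problem: {question}\n"
        "Think through the solution step by step, then state the final "
        "numeric answer on the last line as 'Answer: <number>'."
    )

def python_tool_prompt(question: str) -> str:
    # Tool-augmented: ask for a Python program whose printed output is the
    # answer; the harness would then execute it in a sandbox.
    return (
        f"Problem: {question}\n"
        "Write a self-contained Python script that computes the final "
        "numeric answer and prints only that number."
    )

def solve(question: str, generate, strategy: str = "cot") -> str:
    """Dispatch one question to a prompting strategy; `generate` is any
    text-in, text-out LLM call supplied by the caller."""
    builders = {
        "zero_shot": zero_shot_prompt,
        "cot": cot_prompt,
        "python": python_tool_prompt,
    }
    return generate(builders[strategy](question))
```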
Additionally, the experimental results revealed considerable variability in performance across model architectures, with proprietary models such as GPT-4-Turbo generally outperforming open-source models like LLaMA-2. Performance on multimodal problems, however, remained low across all tested models, highlighting an area ripe for further research.
Evaluation of Skills and Error Profiling
A significant contribution of this paper is its approach to skill evaluation: a self-refinement method that attributes LLM errors to deficits in ten distinct problem-solving skills. These skills span varied aspects of scientific reasoning, including logical decomposition, causal reasoning, and spatial perception. Automated categorization by LLM verifiers guided by these skill definitions enables large-scale, nuanced analyses of model performance and helps pinpoint specific skill deficiencies. Crucially, this scalable method offers an efficient alternative to costly manual evaluation, positioning it as a potential standard for subsequent research.
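A minimal sketch of this idea follows: an LLM verifier is asked to attribute each failed attempt to one skill from a fixed taxonomy, and the per-skill counts form an error profile. The abbreviated skill list, the prompt wording, and the `verifier_generate` callable are assumptions for illustration, not the paper's full taxonomy or verifier prompt.

```python
# Sketch of automated error profiling: a verifier LLM compares a model's
# attempt with the reference solution and attributes the failure to one
# skill from a fixed taxonomy. The skill list below is abbreviated and
# paraphrased; `verifier_generate` stands in for whatever LLM call the
# evaluation harness actually uses.

SKILLS = [
    "logical decomposition",
    "causal reasoning",
    "spatial perception",
    "scientific literacy",
    "calculation",
]

def classify_error(question: str,
                   reference_solution: str,
                   model_solution: str,
                   verifier_generate) -> str:
    """Ask the verifier which skill deficit best explains the wrong answer."""
    prompt = (
        "A model answered the following problem incorrectly.\n"
        f"Problem: {question}\n"
        f"Reference solution: {reference_solution}\n"
        f"Model solution: {model_solution}\n"
        "Which single skill deficit best explains the error? "
        f"Choose one of: {', '.join(SKILLS)}."
    )
    label = verifier_generate(prompt).strip().lower()
    # Fall back to a generic label if the verifier's reply is off-taxonomy.
    return label if label in SKILLS else "calculation"

def error_profile(failures, verifier_generate) -> dict:
    """Aggregate per-skill error counts over (question, reference, attempt) triples."""
    counts = {skill: 0 for skill in SKILLS}
    for question, reference, attempt in failures:
        counts[classify_error(question, reference, attempt, verifier_generate)] += 1
    return counts
```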
Implications for Future AI Developments
The implications for future AI development are multifaceted. First, given the LLM limitations demonstrated by SciBench, there is a clear impetus for model architectures that support nuanced understanding and manipulation of complex, multimodal scientific problems. Second, refining prompt engineering, particularly the balance between prompting strategies and computational tool integration, could yield meaningful improvements in LLM capabilities. Finally, SciBench sets a precedent for future benchmarks that aim to ensure AI systems are robust not only in comprehending textual data but also in integrating diverse informational formats to solve sophisticated scientific problems.
In conclusion, "SciBench" provides a critical lens through which to evaluate and stress test the problem-solving competencies of LLMs. The paper systematically advances the understanding of LLM limitations and sets a rigorous standard for future AI evaluation methods, catalyzing advancements that could significantly elevate AI's role in educational contexts and beyond.