MR-GSM8K: A Meta-Reasoning Revolution in Large Language Model Evaluation (2312.17080)
Published 28 Dec 2023 in cs.CL

Overview

  • The paper introduces a new evaluation paradigm for LLMs focusing on the reasoning process rather than just solution accuracy.

  • Existing benchmarks like GSM8K are becoming less effective as state-of-the-art models exceed 80% accuracy, indicating a need for more nuanced assessments.

  • The proposed DiagGSM8K benchmark requires models to emulate the role of an educator, challenging them to identify and explain errors in provided solutions.

  • Current models, including GPT-4, show markedly lower performance on DiagGSM8K, struggling with deeper logical understanding despite high accuracies on GSM8K.

  • The paper suggests that fine-tuning on DiagGSM8K-style diagnostic data improves benchmark performance but does not guarantee a broader grasp of the underlying cognitive processes.

Introduction to a Novel Evaluation Paradigm for LLMs

The field of LLMs continues to progress with advancements such as GPT-4 and Claude from OpenAI and Anthropic, respectively. Alongside improvements in text generation and alignment with human values through techniques such as reinforcement learning, there remains an ongoing effort to refine the evaluation measures for these models. Math problem-solving is widely recognized as a challenging and informative benchmark for cognitive capability, yet current benchmark datasets like GSM8K concentrate on final-answer accuracy. As a result, the underlying reasoning process is largely overlooked, something standard methodologies capture only inadequately.

Evaluation Shortcomings and the 'Reason About Reasoning' Paradigm

Benchmarks like GSM8K are reaching saturation, with SOTA models surpassing 80% accuracy, which diminishes their differentiating power. Hungarian high school exam results further suggest possible overfitting to benchmark patterns, calling the broader cognitive capabilities of these models into question. The proposed 'reason about reasoning' framework shifts away from result-driven assessment toward process-oriented evaluation. The novel DiagGSM8K benchmark requires models to act in a role resembling that of an educator: assessing provided solutions for correctness, pinpointing the first erroneous step, and explaining the error. This method differentiates model competencies far more effectively, as seen in the sharp drop of GPT-4's performance on the new benchmark relative to its high accuracy on standard GSM8K.
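
To make the task format concrete, here is a minimal Python sketch of an educator-style grading prompt. The wording, function name, and example problem are illustrative assumptions, not the paper's exact template.

```python
# A minimal sketch of an educator-style grading prompt for a DiagGSM8K-like
# task. The prompt wording and example problem are illustrative assumptions,
# not the paper's exact template.

def build_diagnostic_prompt(question: str, candidate_solution: str) -> str:
    """Cast the model as a grader instead of a solver."""
    return (
        "You are a math teacher grading a student's solution.\n\n"
        f"Problem:\n{question}\n\n"
        f"Student solution:\n{candidate_solution}\n\n"
        "First, state whether the solution is correct or incorrect. "
        "If it is incorrect, give the index of the first wrong step "
        "and explain why that step is wrong."
    )


if __name__ == "__main__":
    prompt = build_diagnostic_prompt(
        question="A store sells pens at $2 each. How much do 7 pens cost?",
        candidate_solution="Step 1: 7 + 2 = 9. Step 2: The pens cost $9.",
    )
    print(prompt)  # feed this prompt to any LLM under evaluation
```

The key design point is that the model never solves the problem itself; it must judge someone else's reasoning, which is exactly what makes the task a test of meta-reasoning rather than answer recall.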

Evaluation Framework and Insights

The DiagGSM8K benchmark extends GSM8K with additional challenges such as Program of Thought (POT) and backward-reasoning variations. Models are tasked with confirming solution correctness and, where a solution is wrong, identifying the first error and providing a rationale, an approach more demanding than the mere replication of correct reasoning paths. The benchmark's performance statistics present sobering insights: current SOTA models struggle severely, obtaining single-digit accuracies on this more nuanced and demanding assessment framework. While they often generate superficially correct solutions, their grasp of the deep-seated logical rules is found wanting.
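
The structure of a single evaluation item can be pictured as follows; the field names are assumptions for illustration and may not match the released dataset schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiagnosticInstance:
    """One DiagGSM8K-style evaluation item (field names are illustrative)."""
    question: str                    # original, POT, or backward-reasoning variant
    question_type: str               # e.g. "original", "pot", "backward"
    candidate_solution: str          # the solution the model must grade
    solution_is_correct: bool        # ground-truth verdict
    first_error_step: Optional[int]  # None when the solution is correct
    error_reason: Optional[str]      # reference explanation of the error
```

Each instance therefore carries three grading targets: the overall verdict, the location of the first error, and the reason behind it, which is why a model cannot score well by pattern-matching a final answer.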

Experimental Assessment and Findings

When testing prominent closed-source commercial LLMs, a clear differentiation emerges on the DiagGSM8K benchmark. For example, GPT-4 is substantially more adept at diagnosing issues than GPT-3.5 and Claude-2, indicating significant disparities masked by existing benchmarks. Open-source models built on the Llama architecture, despite being fine-tuned on GSM8K, falter on DiagGSM8K, reinforcing the qualitative gap the new benchmark exposes. A fine-tuning attempt using a GPT-4 generated diagnostic dataset yields an open-source model rivalling commercial counterparts on DiagGSM8K, though with lower accuracy on the conventional GSM8K test set. This suggests that targeted training does not necessarily imply an enhanced conceptual grasp of the underlying reasoning processes.
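
One plausible way to aggregate the three graded components (verdict, first error step, error rationale) is sketched below. The paper's exact scoring protocol, including how error rationales are judged, may differ; here the rationale judgment is assumed to come from an external grader.

```python
def score_predictions(instances, predictions):
    """Aggregate the three graded components over paired items and verdicts.

    `instances` are DiagnosticInstance-like records; `predictions` are dicts
    with keys "is_correct", "first_error_step", and "error_reason_ok" (the
    last assumed to be judged externally, e.g. by humans or a strong LLM).
    """
    n_total = len(instances)
    n_incorrect = sum(not inst.solution_is_correct for inst in instances)

    verdict_hits = sum(
        pred["is_correct"] == inst.solution_is_correct
        for inst, pred in zip(instances, predictions)
    )
    step_hits = sum(
        (not inst.solution_is_correct)
        and pred["first_error_step"] == inst.first_error_step
        for inst, pred in zip(instances, predictions)
    )
    reason_hits = sum(
        (not inst.solution_is_correct) and pred.get("error_reason_ok", False)
        for inst, pred in zip(instances, predictions)
    )

    return {
        "verdict_accuracy": verdict_hits / max(n_total, 1),
        "error_step_accuracy": step_hits / max(n_incorrect, 1),
        "error_reason_accuracy": reason_hits / max(n_incorrect, 1),
    }
```

Splitting the score this way makes the reported gap visible in detail: a model can classify verdicts reasonably well yet still fail to locate or explain the first faulty step.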

The findings emphasize the significance of the 'reason about reasoning' benchmark as a more rigorous and discriminating measure of a model's aggregate cognitive capacity. The new paradigm extends beyond computational outputs to a profound interrogation of conceptual mastery and logical operation—the crux of any pursuit towards artificial general intelligence.

Authors (5)
  1. Zhongshen Zeng (4 papers)
  2. Pengguang Chen (20 papers)
  3. Haiyun Jiang (31 papers)
  4. Jiaya Jia (142 papers)
  5. Shu Liu (105 papers)