MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation (2312.17080v4)

Published 28 Dec 2023 in cs.CL

Abstract: In this work, we introduce a novel evaluation paradigm for LLMs that compels them to transition from a traditional question-answering role, akin to a student, to a solution-scoring role, akin to a teacher. This paradigm, focusing on "reasoning about reasoning," hence termed meta-reasoning, shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation that effectively distinguishes between the cognitive capabilities of different models. By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark. Our extensive analysis includes several state-of-the-art models from both open-source and commercial domains, uncovering fundamental deficiencies in their training and evaluation methodologies. Notably, while models like Deepseek-v2 and Claude3-Sonnet closely competed with GPT-4 in GSM8K, their performance disparities expanded dramatically in MR-GSM8K, with differences widening to over 20 absolute points, underscoring the significant challenge posed by our meta-reasoning approach.

Citations (7)

View on Semantic Scholar

Summary

The paper introduces a 'reason about reasoning' framework that evaluates LLMs based on their process and error identification rather than just final answers.
It extends the traditional GSM8K with DiagGSM8K, incorporating challenges like Program of Thought and backward reasoning to more effectively differentiate model capabilities.
Experimental findings show that models such as GPT-4 excel in diagnostic tasks, while open-source alternatives struggle without targeted training, revealing critical evaluation gaps.

Introduction to a Novel Evaluation Paradigm for LLMs

The field of LLMs continues to progress with advancements such as GPT-4 and Claude from OpenAI and Anthropic, respectively. Alongside improvements in text generation and alignment with human values through techniques such as reinforcement learning, there remains an ongoing effort to refine the evaluation measures for these models. It is widely recognized that while math problem-solving serves as a challenging and informative benchmark for evaluating cognitive capabilities, current benchmark datasets like GSM8K tend to concentrate on final solution accuracy. This often results in an oversight of the underlying reasoning process, something the standard methodologies inadequately capture.

Evaluation Shortcomings and the 'Reason About Reasoning' Paradigm

Benchmarks like GSM8K are reaching saturation, with SOTA models surpassing 80% accuracy, diminishing their differentiating power. Hungarian high school exam results indicate a possible overfitting to benchmark patterns, questioning the broader cognitive capabilities of these models. The proposed 'reason about reasoning' framework shifts away from result-driven assessments towards a process-oriented evaluation. The novel DiagGSM8k benchmark requires models to function in a role resembling that of an educator—assessing provided solutions for correctness, pinpointing initial errors, and elaborating on these errors. This method notably differentiates model competencies far more effectively, as seen in GPT-4's markedly superior performance in the new benchmark compared to the standard GSM8K.

Evaluation Framework and Insights

The DiagGSM8K benchmark extends GSM8K to include additional challenges like Program of Thought (POT) and backward reasoning variations. Models are now tasked with confirming solution correctness and, if applicable, identifying initial errors and providing rationale—an approach more demanding than the mere replication of correct reasoning paths. The benchmark's performance statistics present sobering insights: current SOTA models struggle severely, obtaining single-digit accuracies on this more nuanced and demanding assessment framework. While they often generate superficially correct solutions, their understanding of the deep-seated logical rules is found wanting.

Experimental Assessment and Findings

When testing prominent closed-source commercial LLMs, a clear differentiation becomes evident on the DiagGSM8K benchmark. For example, GPT4 demonstrates a substantially higher adeptness in diagnosing issues than GPT3-5 and Claude2, indicating significant disparities masked by existing benchmarks. Open-source models fine-tuned on the Llama architecture, despite their GSM8K training, falter on DiagGSM8K, reinforcing the qualitative gap the new benchmark precipitates. A fine-tuning attempt using a GPT-4 generated diagnostic dataset results in an open-source model rivalling commercial counterparts in DiagGSM8K, though with a lower accuracy on the conventional GSM8K test set. This suggests that targeted training does not necessarily imply an enhanced conceptual grasp underlying the reasoning processes.

The findings emphasize the significance of the 'reason about reasoning' benchmark as a more rigorous and discriminating measure of a model's aggregate cognitive capacity. The new paradigm extends beyond computational outputs to a profound interrogation of conceptual mastery and logical operation—the crux of any pursuit towards artificial general intelligence.

PDF Markdown

Related Papers

Tweets

https://twitter.com/susumuota/status/1745960169039630829

https://twitter.com/Ruiss1/status/1749682750356164801