Emergent Mind


In this work, we introduce a novel evaluation paradigm for Large Language Models, one that challenges them to engage in meta-reasoning. This approach addresses critical shortcomings in existing math problem-solving benchmarks, traditionally used to evaluate the cognitive capabilities of agents. Our paradigm shifts the focus from result-oriented assessments, which often overlook the reasoning process, to a more holistic evaluation that effectively differentiates the cognitive capabilities among models. For example, in our benchmark, GPT-4 demonstrates a performance five times better than GPT-3.5. The significance of this new paradigm lies in its ability to reveal potential cognitive deficiencies in LLMs that current benchmarks, such as GSM8K, fail to uncover due to their saturation and lack of effective differentiation among varying reasoning abilities. Our comprehensive analysis includes several state-of-the-art math models from both open-source and closed-source communities, uncovering fundamental deficiencies in their training and evaluation approaches. This paper not only advocates for a paradigm shift in the assessment of LLMs but also contributes to the ongoing discourse on the trajectory towards Artificial General Intelligence (AGI). By promoting the adoption of meta-reasoning evaluation methods similar to ours, we aim to facilitate a more accurate assessment of the true cognitive abilities of LLMs.


  • The paper introduces a new evaluation paradigm for LLMs focusing on the reasoning process rather than just solution accuracy.

  • Existing benchmarks like GSM8K are becoming less effective as state-of-the-art models exceed 80% accuracy, indicating a need for more nuanced assessments.

  • The proposed DiagGSM8K benchmark requires models to emulate the role of an educator, challenging them to identify and explain errors in provided solutions.

  • Current models, including GPT-4, show markedly lower performance on DiagGSM8K, struggling with deeper logical understanding despite high accuracies on GSM8K.

  • The paper suggests that process-oriented training on DiagGSM8K enhances the models, but does not guarantee a broader comprehension of underlying cognitive processes.

Introduction to a Novel Evaluation Paradigm for LLMs

The field of LLMs continues to progress with models such as GPT-4 from OpenAI and Claude from Anthropic. Alongside improvements in text generation and in alignment with human values through techniques such as reinforcement learning, there is an ongoing effort to refine how these models are evaluated. Math problem-solving is widely regarded as a challenging and informative test of cognitive capabilities, but current benchmark datasets like GSM8K concentrate on the accuracy of the final answer. As a result, the underlying reasoning process is largely overlooked, and standard evaluation methods capture it poorly.

Evaluation Shortcomings and the 'Reason About Reasoning' Paradigm

Benchmarks like GSM8K are reaching saturation: SOTA models now surpass 80% accuracy, which diminishes the benchmark's power to differentiate between them. Results on a Hungarian high school exam suggest that some models may be overfitting to benchmark patterns, which calls their broader cognitive capabilities into question. The proposed 'reason about reasoning' framework shifts away from result-driven assessment towards process-oriented evaluation. The novel DiagGSM8K benchmark requires models to act in a role resembling that of an educator: assessing whether a provided solution is correct, pinpointing the first error, and explaining that error. This method differentiates model competencies far more effectively. On the new benchmark GPT-4 performs roughly five times better than GPT-3.5, a gap that the standard GSM8K does not reveal.

Evaluation Framework and Insights

The DiagGSM8K benchmark extends GSM8K with additional challenges such as Program of Thought (POT) and backward reasoning variations. Models are tasked with judging whether a solution is correct and, when it is not, identifying the first erroneous step and giving the reason for the error. This is more demanding than merely reproducing a correct reasoning path. The benchmark's performance statistics are sobering: current SOTA models struggle severely, obtaining single-digit accuracies on this more nuanced assessment. While they often generate superficially correct solutions, their understanding of the underlying logical rules is found wanting.
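The three-part diagnostic task described above (verdict, first error step, error reason) can be illustrated with a minimal scoring sketch. This is not the paper's evaluation code: the `Diagnosis` fields, the `score` and `accuracy` helpers, and the rule that a later criterion only counts when the earlier ones are met are all assumptions made for illustration. Whether an explanation is acceptable cannot be checked mechanically, so it is passed in as an externally supplied judgment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Diagnosis:
    """A verdict on one candidate solution (field names are hypothetical)."""
    solution_correct: bool
    first_error_step: Optional[int] = None  # 1-based; None when the solution is correct


def score(gold: Diagnosis, pred: Diagnosis, reason_ok: bool = False) -> dict:
    """Score a predicted diagnosis against a gold annotation.

    reason_ok stands in for an external judgment (human or model grader)
    of whether the model's explanation of the error is acceptable.
    """
    verdict_ok = gold.solution_correct == pred.solution_correct
    # Assumed rule: the error step only counts when the solution is
    # actually wrong and the model's verdict was right.
    step_ok = (
        verdict_ok
        and not gold.solution_correct
        and gold.first_error_step == pred.first_error_step
    )
    return {
        "verdict": verdict_ok,
        "first_error_step": step_ok,
        "error_reason": step_ok and reason_ok,
    }


def accuracy(results: list, key: str) -> float:
    """Fraction of scored items for which the given criterion was met."""
    return sum(r[key] for r in results) / len(results) if results else 0.0


# A model that spots that the solution is wrong but blames the wrong step.
gold = Diagnosis(solution_correct=False, first_error_step=3)
pred = Diagnosis(solution_correct=False, first_error_step=2)
print(score(gold, pred))
```

Under this nesting, a model can score well on the verdict while scoring poorly on the step and reason criteria, which is the kind of gap between surface correctness and deeper understanding that the benchmark is meant to expose.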

Experimental Assessment and Findings

When prominent closed-source commercial LLMs are tested on the DiagGSM8K benchmark, a clear differentiation emerges. For example, GPT-4 is substantially better at diagnosing errors than GPT-3.5 and Claude 2, indicating significant disparities that existing benchmarks mask. Open-source models fine-tuned from the Llama architecture falter on DiagGSM8K despite their GSM8K training, reinforcing the qualitative gap that the new benchmark exposes. Fine-tuning an open-source model on a diagnostic dataset generated by GPT-4 yields a model that rivals its commercial counterparts on DiagGSM8K, though with lower accuracy on the conventional GSM8K test set. This suggests that targeted training does not necessarily bring a better conceptual grasp of the underlying reasoning processes.

The findings emphasize the significance of the 'reason about reasoning' benchmark as a more rigorous and discriminating measure of a model's overall cognitive capacity. The new paradigm looks beyond computed outputs to examine conceptual mastery and logical operation, which are central to any pursuit of artificial general intelligence.
