Introduction to a Novel Evaluation Paradigm for LLMs
The field of large language models (LLMs) continues to advance, with models such as GPT-4 from OpenAI and Claude from Anthropic. Alongside improvements in text generation and alignment with human values through techniques such as reinforcement learning from human feedback, there is an ongoing effort to refine how these models are evaluated. Math problem-solving is widely recognized as a challenging and informative probe of cognitive capability, yet current benchmark datasets like GSM8K concentrate on final-answer accuracy and, as a result, largely overlook the underlying reasoning process.
Evaluation Shortcomings and the 'Reason About Reasoning' Paradigm
Benchmarks like GSM8K are approaching saturation, with SOTA models surpassing 80% accuracy, which diminishes their power to differentiate models. Results on Hungarian high school exams further suggest possible overfitting to benchmark patterns, calling the broader cognitive capabilities of these models into question. The proposed 'reason about reasoning' framework shifts from result-driven assessment toward process-oriented evaluation. The new DiagGSM8K benchmark casts the model in a role resembling that of an educator: it must assess a provided solution for correctness, pinpoint the first error, and explain why it is an error. This method separates model competencies far more effectively; notably, GPT-4's lead over other models on DiagGSM8K is markedly larger than its lead on standard GSM8K.
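To make the task format concrete, here is a minimal sketch of what one instance might look like, written as a Python dictionary. The field names (question, candidate_solution, label, and so on) are illustrative assumptions rather than the benchmark's actual schema; the question itself is a well-known GSM8K training example.

```python
# A minimal sketch of a 'reason about reasoning' task instance. Field names
# are illustrative assumptions, not the paper's actual schema; the question
# is a public GSM8K training example.
task = {
    "question": (
        "Natalia sold clips to 48 of her friends in April, and then she sold "
        "half as many clips in May. How many clips did Natalia sell "
        "altogether in April and May?"
    ),
    "candidate_solution": [
        "Step 1: In May she sold 48 / 2 = 24 clips.",
        "Step 2: Altogether she sold 48 - 24 = 24 clips.",  # should be 48 + 24 = 72
    ],
    # Ground-truth diagnosis the grader compares the model's answer against.
    "label": {
        "is_correct": False,
        "first_error_step": 2,
        "error_reason": "The monthly totals must be added (48 + 24 = 72), not subtracted.",
    },
}
print(task["label"]["first_error_step"])  # 2
```

Grading a model's response against such a label can then check three things in sequence: the correctness verdict, the location of the first error, and the quality of the explanation.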
Evaluation Framework and Insights
The DiagGSM8K benchmark extends GSM8K with additional problem variants, including Program of Thought (POT) solutions and backward-reasoning formulations; a sketch of the POT format follows below. Models are tasked with judging whether a given solution is correct and, when it is not, identifying the first erroneous step and explaining the error, a task more demanding than merely reproducing a correct reasoning path. The benchmark's results are sobering: current SOTA models struggle severely, obtaining single-digit accuracies on this more nuanced and demanding assessment. While they often generate superficially correct solutions, their grasp of the underlying logical rules is found wanting.
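To make the POT variant concrete, the sketch below shows a GSM8K-style problem solved as a short program rather than a prose derivation; under the benchmark, such a program would be the candidate solution the model must verify. The exact formatting is an assumption, not taken from the paper.

```python
# Program-of-Thought (POT) sketch: instead of a natural-language chain of
# thought, the solution is a short program whose printed value is the answer.
# Illustration based on a public GSM8K training question:
# "Weng earns $12 an hour for babysitting. Yesterday, she just did
#  50 minutes of babysitting. How much did she earn?"
hourly_rate = 12                              # dollars per hour
minutes_worked = 50
earnings = hourly_rate * minutes_worked / 60  # 12 * 50 / 60 = 10.0
print(earnings)                               # 10.0 dollars
```

Verifying a POT solution requires the model to reason about the program's logic, not just its output, which is exactly the process-level competence the benchmark targets.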
Experimental Assessment and Findings
Testing prominent closed-source commercial LLMs reveals a clear separation on the DiagGSM8K benchmark. For example, GPT-4 diagnoses flawed solutions substantially better than GPT-3.5 and Claude-2, exposing disparities that existing benchmarks mask. Open-source models fine-tuned on the Llama architecture, despite their GSM8K training, falter on DiagGSM8K, reinforcing the qualitative gap the new benchmark reveals. A fine-tuning experiment using a GPT-4-generated diagnostic dataset produces an open-source model that rivals commercial counterparts on DiagGSM8K, though with lower accuracy on the conventional GSM8K test set. This suggests that targeted training does not necessarily confer a deeper conceptual grasp of the underlying reasoning process.
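As a rough illustration of what such a diagnostic fine-tuning set could contain, the sketch below writes one training record in a generic instruction-tuning JSONL style. The keys and prompt wording are assumptions for illustration, not the paper's actual data format.

```python
# Hedged sketch of a single instruction-tuning record for the diagnostic
# task, in a generic JSONL style. The structure of the GPT-4-generated
# dataset used in the paper is not reproduced here; all keys and prompt
# wording are illustrative assumptions.
import json

record = {
    "instruction": (
        "Act as a math teacher. Decide whether the solution is correct. "
        "If it is wrong, name the first incorrect step and explain why."
    ),
    "input": "Question: ...\nSolution:\nStep 1: ...\nStep 2: ...",
    "output": "Incorrect. The first error occurs in Step 2 because ...",
}
print(json.dumps(record))  # one line of a JSONL training file
```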
These findings underscore the value of the 'reason about reasoning' benchmark as a more rigorous and discriminating measure of a model's overall cognitive capacity. The new paradigm moves beyond scoring computational outputs to probing conceptual mastery and logical operation, which lie at the heart of any pursuit of artificial general intelligence.