Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems (2410.00151v3)

Published 30 Sep 2024 in cs.CL

Abstract: Benchmarks are critical for measuring progress of math reasoning abilities of LLMs. However, existing widely-used benchmarks such as GSM8K have been rendered less useful as multiple cutting-edge LLMs achieve over 94% accuracy. While harder benchmarks have been proposed, their creation is often manual and expensive. We present Scheherazade, an automated approach for producing challenging mathematical reasoning benchmarks by logically chaining mathematical reasoning problems. We propose two different chaining methods, forward chaining and backward chaining, which require reasoning forward and backward through the chain respectively. We apply Scheherazade on GSM8K to create GSM8K-Scheherazade and evaluate 3 frontier LLMs and OpenAI's o1-preview on it. We show that while frontier models' performance declines precipitously at only a few questions chained, a preliminary evaluation suggests o1-preview performance persists up to 5 questions chained backwards. In addition, while all other models perform worse when problems are chained backwards, o1-preview performs better on backward-chained benchmarks. We will release the dataset and code publicly.

Scheherazade: Evaluating LLM Math Reasoning with Chain-of-Problems

The paper introduces Scheherazade, an automated method for evaluating the mathematical reasoning abilities of LLMs by constructing harder benchmarks from existing ones. With widely used benchmarks such as GSM8K largely saturated (several frontier LLMs exceed 94% accuracy), Scheherazade addresses the need for more demanding evaluations without the manual cost of authoring new problems.

Key Contributions

  1. Automated Benchmark Generation: The paper presents an automated approach to construct complex benchmarks by logically chaining problems, thus creating a sequence that demands more from LLMs than independent problems.
  2. Chaining Techniques: Two chaining methods are proposed (illustrated in the sketch after this list):
    • Forward Chaining: each problem's statement depends on the answer to the problem before it, so the chain can be solved in the order presented.
    • Backward Chaining: each problem's statement depends on the answer to a problem that appears later, so the solver must reason backward through the chain.
  3. Evaluation of State-of-the-art Models: The authors apply Scheherazade to the GSM8K dataset, creating GSM8K-Scheherazade, and evaluate several leading LLMs, including OpenAI's o1-preview, GPT-4o, Meta Llama 3.1 70B, and Anthropic Claude 3.5 Sonnet.
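
To make the two constructions concrete, here is a minimal sketch of chaining two GSM8K-style problems in each direction. This is an illustration, not the paper's released generator: the `Problem` record, the `<QTY>` placeholder convention, and the splicing templates are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    text: str      # statement with "<QTY>" marking one numeric quantity
    quantity: int  # original value of that quantity
    answer: int    # ground-truth answer to the problem

def chain_forward(a: Problem, b: Problem) -> str:
    """Forward chain: a quantity in the SECOND problem is defined by the
    first problem's answer, so the pair is solvable in presentation order."""
    b_text = b.text.replace("<QTY>", "the answer to Problem 1")
    return (f"Problem 1: {a.text.replace('<QTY>', str(a.quantity))}\n"
            f"Problem 2: {b_text}\n"
            "What is the answer to Problem 2?")

def chain_backward(a: Problem, b: Problem) -> str:
    """Backward chain: a quantity in the FIRST problem refers to the
    second problem's answer, forcing the solver to resolve the later
    problem before the earlier one."""
    a_text = a.text.replace("<QTY>", "the answer to Problem 2")
    return (f"Problem 1: {a_text}\n"
            f"Problem 2: {b.text.replace('<QTY>', str(b.quantity))}\n"
            "What is the answer to Problem 1?")
```

Note that for the ground truth to survive the splice, the referenced answer must equal the quantity it replaces (or the receiving problem must be rescaled); the templates above are schematic about how the paper's generator guarantees this.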

Notable Findings

  • Model accuracy declines precipitously once even a few problems are chained, exposing limits in current LLMs' multi-step reasoning.
  • Notably, o1-preview's performance persists up to five questions chained backward, surpassing the other models evaluated.
  • All other models degrade more on backward chaining than forward chaining, while o1-preview actually performs better on backward-chained benchmarks, suggesting backward reasoning as a concrete target for improvement (the kind of length-sweep evaluation behind these measurements is sketched below).
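
The length sweep behind these findings can be approximated with a small harness like the one below. Here `build_chain` and `query_model` are hypothetical stand-ins for the benchmark generator and an LLM API client, and the answer extraction is a deliberately crude heuristic.

```python
import re

def extract_final_number(response: str):
    """Crude heuristic: take the last number in the model's response."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return float(matches[-1]) if matches else None

def accuracy_by_chain_length(problems, build_chain, query_model,
                             lengths=(1, 2, 3, 4, 5),
                             direction="backward"):
    """Measure one model's accuracy as a function of chain length."""
    results = {}
    for k in lengths:
        correct, total = 0, 0
        # Tile non-overlapping windows of k problems over the benchmark.
        for i in range(0, len(problems) - k + 1, k):
            prompt, answer = build_chain(problems[i:i + k],
                                         direction=direction)
            pred = extract_final_number(query_model(prompt))
            correct += int(pred == answer)
            total += 1
        results[k] = correct / total
    return results
```

Plotting `results` for each model and direction would trace the shape of the reported trends: a steep decline for most models, and a flatter backward-chaining curve for o1-preview.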

Implications and Future Directions

This research has implications for both benchmark design and LLM training. Because performance on traditional benchmarks saturates quickly, Scheherazade provides a way to keep such benchmarks challenging, sustaining meaningful progress monitoring of LLM capabilities. The duality of forward and backward chaining also probes distinct reasoning strategies, offering insights that could inspire model improvements.

Further research could focus on:

  • Applying Scheherazade to more challenging datasets beyond GSM8K to test its generality and keep the resulting benchmarks relevant.
  • Investigating other logical operators for linking problems, which could yield new types of reasoning challenges for LLMs.
  • Developing hybrid techniques that blend forward and backward chaining for more comprehensive evaluation (one such hybrid is sketched below).
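
As one illustration of the hybrid idea, the sketch below links each adjacent pair of problems in a randomly chosen direction; `build_link` is a hypothetical pairwise version of the chaining step from the earlier sketch, and the whole function is speculative rather than anything proposed in the paper.

```python
import random

def hybrid_chain(problems, build_link, seed=0):
    """Link each adjacent pair of problems forward or backward at random,
    so a solver must switch reasoning directions within one instance."""
    rng = random.Random(seed)
    chained = problems[0]
    directions = []
    for nxt in problems[1:]:
        direction = rng.choice(["forward", "backward"])
        directions.append(direction)
        chained = build_link(chained, nxt, direction=direction)
    return chained, directions
```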

Conclusion

Scheherazade emerges as a compelling tool for advancing the evaluation of mathematical reasoning in LLMs, exposing the nuanced reasoning capabilities of models like OpenAI's o1-preview. Its application reveals critical insights into the limits of current leading models and underlines areas where future LLM development can be directed.

Authors (7)
  1. Stephen Miner (3 papers)
  2. Yoshiki Takashima (2 papers)
  3. Simeng Han (20 papers)
  4. Ferhat Erata (12 papers)
  5. Timos Antonopoulos (13 papers)
  6. Ruzica Piskac (24 papers)
  7. Scott J Shapiro (1 paper)