Scheherazade: Evaluating LLM Math Reasoning with Chain-of-Problems
The paper under review introduces Scheherazade, a method for evaluating the mathematical reasoning of LLMs by automatically constructing more challenging benchmarks from existing ones. Because benchmarks like GSM8K have become less informative as advanced LLMs approach ceiling performance on them, Scheherazade addresses the need for more demanding evaluations.
Key Contributions
- Automated Benchmark Generation: The paper presents an automated approach that constructs harder benchmarks by logically chaining existing problems into sequences that demand more from LLMs than the same problems posed independently.
- Chaining Techniques: Two distinct chaining methods are proposed (a minimal sketch of one possible construction follows this list):
  - Forward Chaining: Each problem depends on the answer to the one before it, so the chain can be solved in the order presented.
  - Backward Chaining: Each problem depends on the answer to a subsequent one, forcing the solver to work through the chain in reverse.
- Evaluation of State-of-the-Art Models: The authors apply Scheherazade to the GSM8K dataset, creating GSM8K-Scheherazade, and evaluate several leading LLMs, including OpenAI's o1-preview, GPT-4o, Meta Llama 3.1 70B, and Anthropic Claude 3.5 Sonnet.
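To make the chaining concrete, the following is a minimal sketch of one plausible construction, assuming each problem exposes a single numeric quantity that can be restated as a condition on another problem's answer. The `Problem` type, the `link` template, and the decoy-branch scheme are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    text: str      # statement with "{q}" marking one numeric quantity
    quantity: int  # the true value of that quantity
    answer: int    # ground-truth answer given the true quantity

def link(source: Problem, target: Problem, ref: str) -> str:
    """Restate target's quantity as a condition on source's answer."""
    decoy = target.quantity + 1  # wrong branch, reached only via a wrong answer
    return (f"If the answer to {ref} exceeds {source.answer - 1}, "
            f"the value below is {target.quantity}; otherwise it is {decoy}. "
            + target.text.format(q="the value above"))

def forward_chain(problems: list[Problem]) -> str:
    """Each problem depends on the previous one: solvable in reading order."""
    parts = [problems[0].text.format(q=problems[0].quantity)]
    parts += [link(prev, cur, "the previous problem")
              for prev, cur in zip(problems, problems[1:])]
    return "\n".join(parts)

def backward_chain(problems: list[Problem]) -> str:
    """Each problem depends on the next one: the solver must start at the end."""
    parts = [link(nxt, cur, "the next problem")
             for cur, nxt in zip(problems, problems[1:])]
    parts.append(problems[-1].text.format(q=problems[-1].quantity))
    return "\n".join(parts)

# Example: two toy problems chained backward.
p1 = Problem("Ali has {q} apples and eats 2. How many remain?", 5, 3)
p2 = Problem("A pack holds {q} pens. Three packs hold how many?", 4, 12)
print(backward_chain([p1, p2]))
```

In this sketch, the decoy branch is what makes the chain strictly harder than its parts: a wrong intermediate answer propagates a wrong quantity into every dependent problem.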
Notable Findings
- A significant decline in accuracy is observed across models as problem chains grow longer, exposing limits in LLMs' multi-step reasoning (a sketch of how such a measurement might be run follows this list).
- Notably, o1-preview's performance is robust up to a chain length of 5 in backward chaining, surpassing all other evaluated models.
- Models relying on conventional chain-of-thought prompting struggle more with backward chaining, pointing to reverse-order reasoning as an area for improvement.
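A hedged sketch of the kind of measurement behind these findings, reusing the chaining helpers above: accuracy is estimated separately at each chain length for a given chaining scheme. Here `query_model` is a hypothetical stand-in for an LLM call that returns a final numeric answer, not an API from the paper.

```python
import random

def accuracy_by_chain_length(problems, chain_fn, gold_fn, query_model,
                             lengths=range(1, 6), trials=50):
    """Estimate accuracy at each chain length for one chaining scheme.

    chain_fn builds the chained prompt (e.g. forward_chain or backward_chain),
    gold_fn extracts the gold answer for a sampled chain, and query_model
    stands in for an LLM call returning a numeric answer.
    """
    results = {}
    for n in lengths:
        hits = 0
        for _ in range(trials):
            sample = random.sample(problems, n)
            hits += query_model(chain_fn(sample)) == gold_fn(sample)
        results[n] = hits / trials
    return results

# e.g. accuracy_by_chain_length(gsm8k_problems, forward_chain,
#                               lambda s: s[-1].answer, my_llm)
```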
Implications and Future Directions
This research has implications for both benchmark design and LLM training. Because performance on static benchmarks saturates quickly, Scheherazade offers a way to renew their difficulty and sustain meaningful monitoring of progress in LLM capabilities. The duality of forward and backward chaining also surfaces distinct reasoning strategies that could inspire model improvements.
Further research could focus on:
- Applying Scheherazade to datasets more challenging than GSM8K, to test the method's versatility and keep derived benchmarks relevant.
- Investigating other logical operators for the chaining links, which could yield new types of reasoning challenges for LLMs.
- Developing hybrid techniques that blend forward and backward chaining for more comprehensive model evaluation (one possible construction is sketched after this list).
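On the last point, one speculative way to blend the two directions under the same assumptions as the earlier sketch: backward-link every problem before a pivot and forward-link every problem after it, so the solver must first locate the self-contained pivot and then work outward both ways. This is purely illustrative of the suggested direction, not a method from the paper.

```python
def hybrid_chain(problems: list[Problem], pivot: int) -> str:
    """Mix both directions: problems before the pivot depend on their
    successor, problems after it depend on their predecessor, and only
    the pivot problem is self-contained."""
    parts = []
    for i, p in enumerate(problems):
        if i < pivot:
            parts.append(link(problems[i + 1], p, "the next problem"))
        elif i > pivot:
            parts.append(link(problems[i - 1], p, "the previous problem"))
        else:
            parts.append(p.text.format(q=p.quantity))
    return "\n".join(parts)
```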
Conclusion
Scheherazade emerges as a compelling tool for advancing the evaluation of mathematical reasoning in LLMs, and its application distinguishes models such as OpenAI's o1-preview from the rest of the field. The results reveal where current leading models break down and point to concrete directions for future LLM development.