
Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems

Published 30 Sep 2024 in cs.CL (arXiv:2410.00151v4)

Abstract: Benchmarks are critical for measuring LLM reasoning capabilities. Some benchmarks have even become the de facto indicator of such capabilities. However, as LLM reasoning capabilities improve, existing widely-used benchmarks such as GSM8K marginally encapsulate model reasoning differentials - most state-of-the-art models for example achieve over 94% accuracy on the GSM8K dataset (paperwithcode, 2024). While constructing harder benchmarks is possible, their creation is often manual, expensive, and unscalable. As such, we present Scheherazade, an automated approach to produce large quantities of challenging mathematical reasoning benchmarks by logically chaining a small starting set of problems. We propose two different chaining methods, forward chaining and backward chaining, which include randomized branching techniques to generate complex reasoning problems. We apply Scheherazade on GSM8K to create GSM8K-Scheherazade and evaluate 3 frontier LLMs and OpenAI's o1-preview on it. We show that while other frontier models' performance declines precipitously at only a few questions chained, our evaluation suggests o1-preview's performance persists, with the flagship OpenAI model the only one to perform better at backward reasoning. Our data and code are available at https://github.com/YoshikiTakashima/scheherazade-code-data.

Summary

  • The paper presents an automated approach that constructs complex chains of mathematical problems to rigorously evaluate LLMs’ reasoning abilities.
  • It introduces both forward and backward chaining techniques, demonstrating that model performance significantly declines with increased chain length.
  • The study highlights the robust performance of the o1-preview model at chain lengths of up to 5 and reveals the limitations of traditional CoT models.

Scheherazade: Evaluating LLM Math Reasoning with Chain-of-Problems

The paper under review introduces Scheherazade, a method for evaluating the mathematical reasoning abilities of LLMs through the creation of more challenging benchmarks. As existing benchmarks like GSM8K have become less informative due to high performance from advanced LLMs, Scheherazade addresses the need for more demanding evaluations.

Key Contributions

  1. Automated Benchmark Generation: The paper presents an automated approach to construct complex benchmarks by logically chaining problems, thus creating a sequence that demands more from LLMs than independent problems.
  2. Chaining Techniques: Two distinct methods are proposed:
    • Forward Chaining: Problems are logically linked in a sequence that allows solving in the given order.
    • Backward Chaining: Problems must be solved in reverse order, since each earlier problem depends on the solutions of subsequent ones.
  3. Evaluation of State-of-the-art Models: The authors apply Scheherazade to the GSM8K dataset, creating GSM8K-Scheherazade, and evaluate several leading LLMs, including OpenAI's o1-preview, GPT-4o, Meta Llama 3.1 70B, and Anthropic Claude 3.5 Sonnet.
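The forward-chaining idea can be sketched in code. The construction below is an illustrative assumption, not the authors' implementation: the function name, branch phrasing, and example problems are invented, but it captures the mechanism described above, where a later question branches on the previous answer so the solver must carry intermediate results through the whole chain.

```python
# Hypothetical sketch of forward chaining (illustrative, not the paper's code):
# each item is a (question_text, numeric_answer) pair, and each later question
# branches on the previous answer via a randomized conditional.
import random

def forward_chain(problems, seed=0):
    """Chain problems so each one references the previous answer.

    `problems` is a list of (question_text, numeric_answer) pairs.
    Only one branch arm is ever true, so the final answer is fixed
    by solving the problems in order.
    """
    rng = random.Random(seed)
    parts = [problems[0][0]]
    for i in range(1, len(problems)):
        prev_answer = problems[i - 1][1]
        # Randomized branching: place the threshold just below or above
        # the previous answer, so the condition varies chain to chain.
        threshold = prev_answer + rng.choice([-1, 1])
        parts.append(
            f"If the answer to the previous problem is greater than "
            f"{threshold}, then: {problems[i][0]} "
            f"Otherwise, the answer is 0."
        )
    return " ".join(parts)

chained = forward_chain([
    ("Tom has 3 apples and buys 4 more. How many apples does he have?", 7),
    ("A shelf holds 5 books per row across 6 rows. How many books fit?", 30),
])
print(chained)
```

Backward chaining reverses the dependency direction: an earlier-listed question's branch condition refers to the answer of a later one, so the chain must be resolved bottom-up.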

Notable Findings

  • A significant performance decline is observed in models as the length of problem chains increases, highlighting the limitations in LLMs' reasoning capacities.
  • o1-preview's performance is notably robust, remaining stable up to a chain length of 5 in backward chaining and surpassing all other evaluated models.
  • Traditional CoT models exhibit more difficulty with backward chaining, suggesting a potential area for improvements in reasoning pathways.
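The backward-chaining difficulty noted above can be illustrated with a sketch mirroring the forward case. Again this is an assumed construction, not the authors' implementation: the first-listed question branches on the answer of the question below it, so a solver cannot answer any earlier problem until the later ones are resolved.

```python
# Hypothetical sketch of backward chaining (illustrative, not the paper's
# code): the i-th listed question branches on the answer of the (i+1)-th,
# forcing bottom-up resolution of the chain.

def backward_chain(problems):
    """`problems` is a list of (question_text, numeric_answer) pairs."""
    parts = []
    for i, (question, _) in enumerate(problems[:-1]):
        next_answer = problems[i + 1][1]  # answer of the problem listed below
        parts.append(
            f"If the answer to the problem below is greater than "
            f"{next_answer - 1}, then: {question} "
            f"Otherwise, the answer to this problem is 0."
        )
    parts.append(problems[-1][0])  # base problem, solvable on its own
    return " ".join(parts)

chained = backward_chain([
    ("Sara reads 12 pages a day for 3 days. How many pages does she read?", 36),
    ("A box holds 4 rows of 5 oranges. How many oranges are in the box?", 20),
])
print(chained)
```

The reversed dependency is what makes this ordering harder for chain-of-thought models, which are trained to reason in the order the text is presented.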

Implications and Future Directions

The implications of this research suggest potential advancements in both the design of benchmarks and the training of LLMs. As performance on traditional benchmarks saturates quickly, Scheherazade offers a way to keep them challenging, enabling continued monitoring of progress in LLM capabilities. The duality of forward and backward chaining offers insights into diverse reasoning strategies that could inspire model improvements.

Further research could focus on:

  • Applying Scheherazade to more challenging datasets beyond GSM8K to explore its versatility and the enduring relevance of such benchmarks.
  • Investigating other logical operators within chaining, which could yield new types of reasoning challenges for LLMs.
  • Developing hybrid techniques that blend forward and backward chaining to enrich model evaluation comprehensively.

Conclusion

Scheherazade emerges as a compelling tool for advancing the evaluation of mathematical reasoning in LLMs, exposing the nuanced reasoning capabilities of models like OpenAI's o1-preview. Its application reveals critical insights into the limits of current leading models and underlines areas where future LLM development can be directed.
