Not All LLM Reasoners Are Created Equal
The paper "Not All LLM Reasoners Are Created Equal" by Hosseini et al. provides a comprehensive evaluation of the reasoning capabilities of LLMs on grade-school math (GSM) problems. The authors introduce the concept of compositional GSM tasks to probe the robustness of LLMs when faced with chained math problems, where solving one problem is a prerequisite for solving the next. This paper offers insights into discrepancies between LLMs’ performances on standard benchmarks and their true reasoning abilities, with a specific focus on two-hop reasoning tasks.
Core Contributions
The authors introduce Compositional GSM, a novel benchmark that chains together pairs of GSM8K problems so that the answer to the first question is needed to solve the second. This setup tests an LLM's ability to carry the result of one problem into the next. Their key findings include:
- Reasoning Gap Detection: Most models perform significantly worse on problems posed as part of a compositional pair than on the same problems posed independently. This gap is most pronounced in smaller, cost-efficient, and math-specialized models (a sketch of how such a gap can be quantified follows this list).
- Impact of Instruction Tuning: Instruction tuning affects model sizes differently, which calls for a reevaluation of standard training recipes. Smaller models improve far more on the original GSM8K test than on compositional GSM.
- Overfitting from Finetuning: Finetuning on GSM8K can lead to task-specific overfitting: GSM8K accuracy keeps improving while compositional reasoning degrades as training continues.
- Code Generation vs. Natural Language Reasoning: Smaller models benefit more from generating code solutions than from natural-language step-by-step (chain-of-thought) solutions, revealing systematic differences in how models of different sizes reason.
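A minimal sketch of how such a reasoning gap can be quantified, assuming the gap is measured as the difference between the accuracy expected if the two sub-problems were solved independently (the product of the individual accuracies) and the accuracy actually observed on the chained version; the exact definition and all numbers below are illustrative, not taken from the paper.

```python
def reasoning_gap(acc_q1: float, acc_q2: float, acc_compositional: float) -> float:
    """Gap between the accuracy expected under independence and the accuracy
    actually observed on the compositional (chained) test. A larger positive
    value indicates a larger reasoning gap."""
    expected = acc_q1 * acc_q2  # chance of getting both parts right if they were independent
    return expected - acc_compositional

# Hypothetical accuracies, for illustration only:
gap = reasoning_gap(acc_q1=0.85, acc_q2=0.85, acc_compositional=0.55)
print(gap)  # ~0.17: the model loses about 17 points relative to the independent expectation
```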
Methodology
The paper evaluates a range of state-of-the-art LLMs, including both open-weight models (e.g., LLAMA3, Mistral) and closed models (e.g., GPT-4). Models are tested on three test sets: the original GSM8K test split, a modified version of GSM8K with altered numbers, and the new compositional GSM. Accuracy is measured with 8-shot prompting.
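To make the setup concrete, here is a minimal sketch of how a compositional GSM item could be assembled and scored under few-shot prompting. The prompt wording, the `call_model` callable, and the answer-extraction rule are placeholder assumptions, not the paper's exact harness.

```python
import re
from typing import Callable, Iterable

FEW_SHOT_PREFIX = "..."  # eight worked compositional examples would go here

def make_compositional_prompt(q1: str, q2_with_x: str) -> str:
    """Chain two GSM8K-style questions: the answer to Question 1 is
    referred to as X inside Question 2."""
    return (
        f"{FEW_SHOT_PREFIX}\n\n"
        f"Question 1: {q1}\n"
        f"Question 2: Let X be the answer to Question 1. {q2_with_x}\n"
        "Answer both questions, ending with the final answer to Question 2."
    )

def extract_final_answer(completion: str) -> float | None:
    """Take the last number in the completion as the predicted answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return float(numbers[-1]) if numbers else None

def compositional_accuracy(
    call_model: Callable[[str], str],          # any inference API wrapped as prompt -> text
    items: Iterable[tuple[str, str, float]],   # (question 1, question 2 with X, gold answer)
) -> float:
    items = list(items)
    correct = 0
    for q1, q2_with_x, gold in items:
        pred = extract_final_answer(call_model(make_compositional_prompt(q1, q2_with_x)))
        correct += pred is not None and pred == gold
    return correct / len(items)
```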
Results
The paper presents several pivotal results:
- Performance Discrepancies: Most models exhibit a clear gap between their performance on GSM8K and on compositional GSM. These gaps, visualized in the paper's figures, show that even models achieving high accuracy on the standard GSM8K test can perform poorly on compositional GSM.
- Cost-Efficient Models: Smaller, more cost-efficient models, though strong on GSM8K, show substantial declines on compositional GSM. This poses a challenge for deploying them in settings where reliable reasoning is paramount.
- Instruction Tuning: Instruction-tuned small models improve significantly on the original test sets but much less on compositional tasks. Larger models, by contrast, benefit in both settings, suggesting a fundamental difference in learning dynamics between small and large models.
- Math-Specialized Models: Models specialized in mathematical reasoning, like Qwen2.5-Math, similarly show reasoning gaps, indicating that specialization does not necessarily generalize to compositional tasks.
- Finetuning Effects: Finetuning on either human or synthetic data for extended periods leads to task-specific overfitting, compromising the models' ability to generalize beyond familiar benchmark formats.
- Code Generation: Asking models to generate code solutions improves performance, notably for smaller models, suggesting that their natural-language chain-of-thought solutions are less reliable for multi-step reasoning (see the sketch after this list).
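As a concrete illustration of the last point, the sketch below contrasts the two output formats, assuming that "code solutions" means programs whose execution produces the final answer; the prompt strings and the example program are hypothetical, not taken from the paper.

```python
# Two ways of asking for the same answer (hypothetical prompt wording):
COT_INSTRUCTION = "Solve the problem step by step, then state the final numeric answer."
CODE_INSTRUCTION = (
    "Write a Python function solution() that computes and returns the final answer. "
    "Output only code."
)

def run_generated_code(code: str) -> float:
    """Execute a model-generated program in a throwaway namespace and call
    solution(). A real harness would sandbox and time-limit this step."""
    namespace: dict = {}
    exec(code, namespace)  # illustration only; never exec untrusted code unsandboxed
    return namespace["solution"]()

# What a generated code solution might look like for a chained problem,
# with X (the answer to Question 1) computed inside the program:
example_program = """
def solution():
    x = 3 + 4              # answer to Question 1
    price_per_apple = 2    # value from Question 2
    return price_per_apple * x
"""
print(run_generated_code(example_program))  # -> 14
```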
Implications and Future Directions
The findings question the validity of standard math-reasoning benchmarks as sole indicators of an LLM’s reasoning capabilities. The significant reasoning gaps, especially in smaller and specialized models, emphasize a need for developing more robust testing paradigms. Additionally, the varied impacts of instruction tuning suggest that training recipes should be revisited, particularly for smaller models.
The results also suggest that building models that can handle out-of-distribution questions and multi-hop reasoning will likely require approaches beyond current instruction-tuning recipes and beyond superficial pattern matching.
Conclusion
The paper by Hosseini et al. underscores that high performance on typical benchmarks does not necessarily translate into robust reasoning in LLMs. The reasoning gap identified here motivates further research into more comprehensive and challenging benchmarks. This work calls for a shift in focus towards ensuring that LLMs not only fit familiar benchmark formats but genuinely reason through complex, chained tasks. Future research should continue to explore and address these foundational gaps in LLM reasoning.