Not All LLM Reasoners Are Created Equal
The paper "Not All LLM Reasoners Are Created Equal" by Hosseini et al. provides a comprehensive evaluation of the reasoning capabilities of LLMs on grade-school math (GSM) problems. The authors introduce the concept of compositional GSM tasks to probe the robustness of LLMs when faced with chained math problems, where solving one problem is a prerequisite for solving the next. This paper offers insights into discrepancies between LLMs’ performances on standard benchmarks and their true reasoning abilities, with a specific focus on two-hop reasoning tasks.
Core Contributions
The authors introduce Compositional GSM, a novel benchmark that chains together pairs of GSM8K problems so that the answer to the first question is needed to solve the second. This setup tests an LLM's ability to carry the result of one problem into the next. Their key findings include:
- Reasoning Gap Detection: Most models perform significantly worse on problems posed as part of a compositional pair than on the same problems posed independently. This gap is most pronounced in smaller, cost-efficient, and math-specialized models (a sketch of how such a gap can be quantified follows this list).
- Impact of Instruction Tuning: Instruction tuning affects model sizes differently, which calls for a reevaluation of standard training recipes. Smaller models improve far more on the original GSM8K test than on compositional GSM.
- Overfitting from Finetuning: Finetuning on GSM8K can lead to task-specific overfitting: GSM8K accuracy keeps improving while compositional reasoning degrades as training continues.
- Code Generation vs. Natural Language Reasoning: Smaller models benefit more from generating code solutions than from natural-language step-by-step (chain-of-thought) solutions, revealing systematic differences in how models of different sizes reason.
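A minimal sketch of how such a reasoning gap can be quantified, assuming the gap is measured as the difference between the accuracy expected if the two sub-problems were solved independently (the product of the individual accuracies) and the accuracy actually observed on the chained version; the exact definition and all numbers below are illustrative, not taken from the paper.

```python
def reasoning_gap(acc_q1: float, acc_q2: float, acc_compositional: float) -> float:
    """Gap between the accuracy expected under independence and the accuracy
    actually observed on the compositional (chained) test. A larger positive
    value indicates a larger reasoning gap."""
    expected = acc_q1 * acc_q2  # chance of getting both parts right if they were independent
    return expected - acc_compositional

# Hypothetical accuracies, for illustration only:
gap = reasoning_gap(acc_q1=0.85, acc_q2=0.85, acc_compositional=0.55)
print(gap)  # ~0.17: the model loses about 17 points relative to the independent expectation
```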
Methodology
The paper evaluates a range of state-of-the-art LLMs, including both open-weight models (e.g., LLAMA3, Mistral) and closed models (e.g., GPT-4). Models are tested on three test sets: the original GSM8K test split, a modified version of GSM8K with altered numbers, and the new compositional GSM. Accuracy is measured with 8-shot prompting.
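To make the setup concrete, here is a minimal sketch of how a compositional GSM item could be assembled and scored under few-shot prompting. The prompt wording, the `call_model` callable, and the answer-extraction rule are placeholder assumptions, not the paper's exact harness.

```python
import re
from typing import Callable, Iterable

FEW_SHOT_PREFIX = "..."  # eight worked compositional examples would go here

def make_compositional_prompt(q1: str, q2_with_x: str) -> str:
    """Chain two GSM8K-style questions: the answer to Question 1 is
    referred to as X inside Question 2."""
    return (
        f"{FEW_SHOT_PREFIX}\n\n"
        f"Question 1: {q1}\n"
        f"Question 2: Let X be the answer to Question 1. {q2_with_x}\n"
        "Answer both questions, ending with the final answer to Question 2."
    )

def extract_final_answer(completion: str) -> float | None:
    """Take the last number in the completion as the predicted answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return float(numbers[-1]) if numbers else None

def compositional_accuracy(
    call_model: Callable[[str], str],          # any inference API wrapped as prompt -> text
    items: Iterable[tuple[str, str, float]],   # (question 1, question 2 with X, gold answer)
) -> float:
    items = list(items)
    correct = 0
    for q1, q2_with_x, gold in items:
        pred = extract_final_answer(call_model(make_compositional_prompt(q1, q2_with_x)))
        correct += pred is not None and pred == gold
    return correct / len(items)
```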
Results
The paper presents several pivotal results:
- Performance Discrepancies: Most models exhibit a clear gap between their performance on GSM8K and on compositional GSM. These gaps, visualized in the paper's figures, show that even models achieving high accuracy on the standard GSM8K test can perform poorly on compositional GSM.
- Cost-Efficient Models: Smaller, more cost-efficient models, though strong on GSM8K, show substantial declines on compositional GSM. This poses a challenge for deploying them in settings where reliable reasoning is paramount.
- Instruction Tuning: Instruction-tuned small models improve significantly on the original test sets but much less on compositional tasks. Larger models, by contrast, benefit in both settings, suggesting a fundamental difference in learning dynamics between small and large models.
- Math-Specialized Models: Models specialized in mathematical reasoning, like Qwen2.5-Math, similarly show reasoning gaps, indicating that specialization does not necessarily generalize to compositional tasks.
- Finetuning Effects: Finetuning on either human or synthetic data for extended periods leads to task-specific overfitting, compromising the models' ability to generalize beyond familiar benchmark formats.
- Code Generation: Asking models to generate code solutions improves performance, notably for smaller models, suggesting that their natural-language chain-of-thought solutions are less reliable for multi-step reasoning (see the sketch after this list).
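As a concrete illustration of the last point, the sketch below contrasts the two output formats, assuming that "code solutions" means programs whose execution produces the final answer; the prompt strings and the example program are hypothetical, not taken from the paper.

```python
# Two ways of asking for the same answer (hypothetical prompt wording):
COT_INSTRUCTION = "Solve the problem step by step, then state the final numeric answer."
CODE_INSTRUCTION = (
    "Write a Python function solution() that computes and returns the final answer. "
    "Output only code."
)

def run_generated_code(code: str) -> float:
    """Execute a model-generated program in a throwaway namespace and call
    solution(). A real harness would sandbox and time-limit this step."""
    namespace: dict = {}
    exec(code, namespace)  # illustration only; never exec untrusted code unsandboxed
    return namespace["solution"]()

# What a generated code solution might look like for a chained problem,
# with X (the answer to Question 1) computed inside the program:
example_program = """
def solution():
    x = 3 + 4              # answer to Question 1
    price_per_apple = 2    # value from Question 2
    return price_per_apple * x
"""
print(run_generated_code(example_program))  # -> 14
```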
Implications and Future Directions
The findings question the validity of standard math-reasoning benchmarks as sole indicators of an LLM’s reasoning capabilities. The significant reasoning gaps, especially in smaller and specialized models, emphasize a need for developing more robust testing paradigms. Additionally, the varied impacts of instruction tuning suggest that training recipes should be revisited, particularly for smaller models.
The results also suggest that building models that can handle out-of-distribution questions and multi-hop reasoning will likely require approaches beyond current instruction-tuning recipes and beyond superficial pattern matching.
Conclusion
The paper by Hosseini et al. underscores that high performance on typical benchmarks does not necessarily translate into robust reasoning in LLMs. The reasoning gap identified here motivates further research into more comprehensive and challenging benchmarks. This work calls for a shift in focus towards ensuring that LLMs not only fit familiar benchmark formats but genuinely reason through complex, chained tasks. Future research should continue to explore and address these foundational gaps in LLM reasoning.