Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems? (2504.00509v2)
Abstract: The rapid escalation in the difficulty of LLM benchmarks, from elementary school-level to frontier problems, has created the impression among researchers that we are only inches away from surpassing human intelligence. However, does the LLMs' remarkable reasoning ability truly reflect intelligence by human standards, or are the models simply reciting solutions witnessed during Internet-scale training? To study this problem, we propose RoR-Bench, a novel multi-modal benchmark for detecting LLMs' recitation behavior on simple reasoning problems whose conditions are subtly shifted, and conduct an empirical analysis on it. Surprisingly, we find that existing cutting-edge LLMs unanimously exhibit extremely severe recitation behavior; by changing a single phrase in a problem's conditions, top models such as OpenAI-o1 and DeepSeek-R1 can suffer a $60\%$ performance loss on elementary school-level arithmetic and reasoning problems. Such findings are a wake-up call that compels the LLM community to re-evaluate the true intelligence level of cutting-edge LLMs.
Summary
- The paper demonstrates that LLMs rely on memorized solution templates rather than genuine reasoning, leading to significant performance drops when problems are subtly altered.
- The RoR-Bench benchmark reveals that minor semantic changes produce over 60% accuracy loss on text-based tasks and notable declines in vision-language models.
- Mitigation approaches like Chain-of-Thought and in-context learning provide limited improvements, underscoring the deep-rooted recitation bias in current LLM architectures.
The paper "Recitation over Reasoning: How Cutting-Edge LLMs Can Fail on Elementary School-Level Reasoning Problems?" (2504.00509) investigates the phenomenon where state-of-the-art LLMs exhibit significant performance degradation on simple reasoning tasks when confronted with subtle modifications to familiar problem structures. The core argument is that these models often rely on retrieving and applying learned solution templates (recitation) associated with common problem patterns encountered during training, rather than engaging in genuine, condition-specific reasoning. This reliance on recitation leads to failures when minor changes invalidate the standard solution path.
Recitation vs. Reasoning in LLMs
The paper posits that LLMs, trained on vast internet-scale datasets, excel at identifying and replicating solutions for frequently occurring problems, such as standard elementary school arithmetic word problems or classic puzzles. This process is termed "recitation": the model recognizes a problem's superficial structure, maps it to a known paradigm learned from training data, and executes the associated solution steps, often without fully parsing or integrating the specific nuances of the current instance. This contrasts with "reasoning," which would involve a deeper semantic understanding of the problem statement, careful consideration of all stated conditions, identification of deviations from standard assumptions, and construction of a bespoke solution based on the unique parameters provided. The hypothesis is that high performance on many benchmarks may stem significantly from this recitation capability rather than from robust, generalizable reasoning mechanisms.
The RoR-Bench Methodology
To empirically test this hypothesis, the authors developed the RoR-Bench benchmark. This benchmark comprises pairs of problems:
- Original Problem: A standard, often elementary-level, reasoning or arithmetic problem that current high-performing LLMs typically solve correctly. These problems are assumed to resemble instances likely present in the training corpora.
- Modified Problem: A variation of the original problem where a single word, phrase, or condition is subtly altered. This modification is designed to be semantically significant, fundamentally changing the required logical steps or rendering the standard solution template invalid, while maintaining high surface similarity to the original problem.
The core experimental design involves comparing LLM performance on the original problems against their performance on the corresponding modified versions. A significant drop in accuracy on the modified problems, despite their similarity to the solvable original ones, is interpreted as evidence of reliance on recitation over reasoning. The benchmark includes both text-based problems (arithmetic, logic puzzles) and multimodal problems involving images (evaluated using Vision-LLMs, VLMs).
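As a rough illustration of this pairwise protocol, the sketch below shows one way a benchmark item and the recitation signal (the accuracy drop from original to modified problems) could be represented; the class, field, and function names are hypothetical and not taken from the authors' released code.

```python
from dataclasses import dataclass

@dataclass
class RoRPair:
    """One hypothetical benchmark item: an original problem and its subtly modified twin."""
    original: str
    modified: str
    original_answer: str
    modified_answer: str

def accuracy(answer_fn, problems, answers):
    """Fraction of problems answered correctly (exact string match, for simplicity)."""
    correct = sum(answer_fn(p).strip() == a.strip() for p, a in zip(problems, answers))
    return correct / len(problems)

def performance_drop(answer_fn, pairs):
    """Accuracy on originals minus accuracy on modified variants.

    A large drop on problems the model solves in their original form is the
    paper's evidence of recitation rather than condition-specific reasoning.
    """
    acc_orig = accuracy(answer_fn, [p.original for p in pairs], [p.original_answer for p in pairs])
    acc_mod = accuracy(answer_fn, [p.modified for p in pairs], [p.modified_answer for p in pairs])
    return acc_orig - acc_mod
```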
Empirical Evidence of Performance Degradation
The paper presents compelling quantitative results demonstrating the fragility of LLM reasoning under subtle conditional shifts.
- Text-Based LLMs: On elementary arithmetic and reasoning problems within RoR-Bench, leading models such as OpenAI-o1 and DeepSeek-R1 exhibited performance drops frequently exceeding 60% when moving from the original to the modified problem variants (Table 1). This sharp decline occurred despite the models' high accuracy on the original, standard forms of these problems.
- Vision-LLMs (VLMs): Similar trends were observed for VLMs evaluated on image-based reasoning tasks within RoR-Bench. The average performance drop across tested VLMs was over 35% when faced with modified image-based problems compared to their original counterparts (Table 2).
- Failure Mechanism: Analysis of model outputs (Fig. 1, Fig. 2, Appendix A.1) indicates that failures on modified problems often result from the model applying the standard solution logic associated with the original problem structure, effectively ignoring the critical change introduced by the modification. For example, a model might apply the relative-speed calculation for objects approaching each other even though the modified problem has them moving apart, or factor in the boat's speed in still water even though the problem states the boat is merely drifting with the current (the drifting-boat case is worked through in the sketch after this list).
- Chain-of-Thought (CoT): The paper found that employing Chain-of-Thought prompting or enabling longer generation processes did not substantially mitigate the performance drop on modified problems (Section 4.1.1, Table 1). While CoT sometimes improved performance on original problems, its effect on the modified, recitation-prone tasks was minimal, suggesting the failure occurs early in problem interpretation rather than during the step-by-step deduction process.
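The drifting-boat failure mentioned above can be made concrete with a short numerical sketch; the quantities below are illustrative and do not come from the benchmark.

```python
# Illustrative numbers, not taken from RoR-Bench.
boat_speed_still_water = 15.0  # km/h
current_speed = 3.0            # km/h
distance = 36.0                # km

# Recited template: the boat travels downstream under its own power,
# so effective speed = still-water speed + current speed.
recited_time = distance / (boat_speed_still_water + current_speed)  # 2.0 hours

# Modified condition: the boat merely drifts with the current,
# so its still-water speed is irrelevant.
correct_time = distance / current_speed                             # 12.0 hours

print(f"recited answer: {recited_time} h, condition-specific answer: {correct_time} h")
```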
Robustness Checks and Mitigation Attempts
The researchers conducted further experiments to probe the nature of these failures and explore potential mitigation strategies.
- "Forced Correct" (FC) Prompt: To rule out the possibility that LLMs were simply interpreting the modifications as typos and "auto-correcting" back to the standard problem, an explicit instruction was added: "Note: the problem is correct, please strictly follow the literal meaning of the problem for reasoning." This FC prompt, however, yielded only marginal improvements on the modified problems, with the average accuracy drop for text-based tasks remaining above 45% (Table 1). This suggests the recitation behavior is more deeply ingrained than simple input correction. An interesting side-effect was that the FC prompt sometimes decreased accuracy on the original problems, potentially by inducing over-analysis or questioning standard implicit assumptions (Appendix A.5).
- "Mental Seal of Solvability": A specific category of modified problems involved introducing conditions that rendered the problem unsolvable (e.g., asking for the smoke direction of an electric train, using contradictory measurements). LLMs performed exceptionally poorly on these, with initial accuracy often below 15% (Table 4). Models exhibited a strong tendency to force a solution, making unfounded assumptions or misapplying formulas, rather than identifying the inherent contradiction. This suggests a bias or "mental seal" towards assuming problems presented are inherently solvable using standard methods.
- In-Context Learning (ICL): Providing a single demonstration (1-shot ICL) of how to handle a modified or unsolvable problem yielded only limited gains (Section 4.2, Table 3). While 1-shot ICL improved performance over the zero-shot baseline, accuracy on modified problems remained significantly lower than on original problems, and the improvement was insufficient to close the large performance gap. Furthermore, the effectiveness of ICL varied significantly across models and problem types. For unsolvable problems, even with a 1-shot example and the FC prompt, many models, particularly less powerful ones, struggled to overcome the "solvability" bias (Section 4.3, Table 4). Both prompting interventions are sketched below.
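For concreteness, the two prompt-level interventions above can be expressed as simple string helpers; the FC wording is the instruction quoted from the paper, while the helper functions and the 1-shot template format are illustrative assumptions rather than the authors' exact prompts.

```python
# The forced-correct (FC) note, quoted from the paper.
FC_NOTE = ("Note: the problem is correct, please strictly follow the "
           "literal meaning of the problem for reasoning.")

def with_forced_correct(problem: str) -> str:
    """Append the FC instruction so the model cannot dismiss the modification as a typo."""
    return f"{problem}\n\n{FC_NOTE}"

def one_shot_prompt(demo_problem: str, demo_solution: str, target_problem: str) -> str:
    """Prepend a single worked example (1-shot ICL) that handles a modified or unsolvable condition."""
    return (
        "Example problem:\n" + demo_problem + "\n"
        "Example solution:\n" + demo_solution + "\n\n"
        "Now solve the following problem:\n" + target_problem
    )

# The two can be combined, e.g. one_shot_prompt(demo_p, demo_s, with_forced_correct(target_p)),
# mirroring the FC + 1-shot setting reported for unsolvable problems.
```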
Implications for LLM Evaluation and Reliability
The findings presented in the paper carry significant implications for the assessment and deployment of LLMs:
- Benchmark Limitations: The results challenge the interpretation of high scores on existing benchmarks, suggesting they might overestimate robust reasoning abilities by potentially containing many problems whose structures are "memorized" or recited from training data. Performance may not generalize reliably to variations outside these learned patterns.
- Robustness Concerns: The extreme sensitivity to minor, semantically crucial changes in problem statements highlights a fundamental lack of robustness in current LLM reasoning processes, even for elementary tasks. This brittleness poses risks for real-world applications where inputs may deviate slightly from canonical forms.
- Reliability in Critical Tasks: The demonstrated failures in accurately interpreting and reasoning under slightly altered conditions, particularly the inability to reliably identify impossible scenarios, raise concerns about deploying these models in domains requiring high fidelity, safety, or nuanced understanding where precise condition handling is paramount.
- Fundamental Challenge: The persistence of recitation behavior despite interventions like CoT, FC prompts, and ICL suggests it is a deep-seated issue, possibly linked to the core mechanisms of pattern matching and sequence prediction inherent in transformer architectures trained on massive, repetitive datasets. Overcoming this may require more fundamental architectural or training paradigm shifts.
In conclusion, the paper provides strong empirical evidence that cutting-edge LLMs can exhibit severe failures on elementary reasoning problems due to a preference for reciting familiar solution patterns over adaptively reasoning based on specific, potentially modified, problem conditions. The dramatic performance drops observed on the RoR-Bench benchmark under subtle variations, the difficulty in mitigating this behavior, and the failure to recognize logically impossible scenarios highlight limitations in the robustness and depth of reasoning currently achieved by these models.