Quantifying memorization versus generalized reasoning in LLM mathematical problem solving

Determine what proportion of the apparent mathematical reasoning exhibited by large language models on high-school-level word problems is attributable to memorization of training data and shallow heuristics, rather than to general principles of reasoning learned by generalizing beyond the training examples.

Background

LLMs have been observed to produce seemingly non-trivial reasoning, especially under chain-of-thought prompting, despite being primarily trained for next-token prediction. This raises a fundamental question about the mechanisms underpinning their performance on mathematical word problems.

The paper evaluates eight contemporary LLMs on 50 newly constructed high-school-level problems, analyzing both final answers and solution steps. While the models demonstrate broad mathematical knowledge, they also exhibit systematic reasoning failures. Despite these empirical observations, the authors explicitly note that the underlying balance between memorization and genuine generalization remains unresolved.
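The evaluation setup described above can be sketched as a simple harness that scores each model's final answer against ground truth and tallies step-level error categories. This is a minimal illustrative sketch, not the authors' actual code: the `Problem` and `ModelSolution` structures, the error labels, and the toy data below are all assumptions introduced for illustration.

```python
# Hypothetical sketch of an answer-and-steps evaluation harness.
# Data structures and error labels are illustrative assumptions,
# not the dataset or taxonomy from the paper.

from dataclasses import dataclass, field

@dataclass
class Problem:
    question: str
    ground_truth: str  # canonical final answer as a string

@dataclass
class ModelSolution:
    final_answer: str
    # Step-level error labels assigned by a human grader,
    # e.g. ["arithmetic", "logic"] (hypothetical categories).
    step_errors: list = field(default_factory=list)

def evaluate(problems, solutions):
    """Return final-answer accuracy and per-category step-error counts."""
    correct = 0
    error_counts = {}
    for prob, sol in zip(problems, solutions):
        if sol.final_answer.strip() == prob.ground_truth.strip():
            correct += 1
        for err in sol.step_errors:
            error_counts[err] = error_counts.get(err, 0) + 1
    return correct / len(problems), error_counts

# Toy usage with two made-up problems:
probs = [Problem("2+2?", "4"), Problem("3*5?", "15")]
sols = [ModelSolution("4"), ModelSolution("16", ["arithmetic"])]
acc, errs = evaluate(probs, sols)
print(acc, errs)  # 0.5 {'arithmetic': 1}
```

Separating answer accuracy from step-error counts matters here: a model can reach a correct final answer through flawed intermediate steps (consistent with memorization or shallow heuristics), which answer-only scoring would miss.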

References

However, it is not clear how much of these apparent reasoning capabilities can be attributed to memorization of the training material combined with shallow heuristics, as opposed to having learned actual general principles of reasoning by generalizing from the training examples.

Large Language Models and Mathematical Reasoning Failures (2502.11574 - Boye et al., 17 Feb 2025), Section 1 (Introduction)