
Extent of LLM performance attributable to reasoning vs memorized knowledge

Determine the extent to which large language models’ task performance reflects genuine reasoning rather than recall of memorized parametric world knowledge, by explicitly separating and measuring these two contributions in controlled evaluations.


Background

The paper motivates SynthWorlds by noting that many evaluations confound genuine reasoning with memorized factual recall from pretraining. Because training data are massive and often undisclosed, benchmark scores may reflect parametric knowledge rather than reasoning ability. The authors aim to disentangle these effects by constructing parallel real-mapped and synthetic-mapped corpora and tasks, which lets them quantify the “knowledge advantage gap.”
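The “knowledge advantage gap” lends itself to a simple paired-difference measurement. Below is a minimal sketch, assuming the gap is defined as the mean per-item performance difference between a task instantiated with real-world entities and its structure-preserving synthetic counterpart; the function name, data, and binary scoring scheme are illustrative assumptions, not the paper’s implementation.

```python
from statistics import mean

def knowledge_advantage_gap(real_scores, synth_scores):
    """Mean paired performance difference between real-mapped tasks
    and their synthetic-mapped counterparts (hypothetical formulation)."""
    assert len(real_scores) == len(synth_scores), "scores must be paired"
    return mean(r - s for r, s in zip(real_scores, synth_scores))

# Example: per-item accuracy (1 = correct) on parallel task pairs.
real_mapped  = [1, 1, 0, 1, 1, 0, 1, 1]   # entities match real-world facts
synth_mapped = [1, 0, 0, 1, 0, 0, 1, 0]   # same task structure, novel entities

gap = knowledge_advantage_gap(real_mapped, synth_mapped)
print(f"knowledge advantage gap: {gap:+.3f}")
```

Because each synthetic item mirrors the reasoning structure of its real-mapped twin, a positive gap under this sketch would indicate performance attributable to memorized knowledge rather than to reasoning alone.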

This open problem frames the need for methodologies that can isolate reasoning from memorization, enabling clearer scientific conclusions about LLMs’ capabilities and more reliable deployment in novel environments where prior memorized knowledge is less useful.

References

Yet, as LMs continue to be trained on massive web corpora (often with undisclosed training data), it remains unclear to what extent their performance reflects genuine reasoning versus the reciting of memorized knowledge.

SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models (arXiv:2510.24427, Gu et al., 28 Oct 2025), Introduction (Section 1).