On Memorization of Large Language Models in Logical Reasoning (2410.23123v1)

Published 30 Oct 2024 in cs.CL

Abstract: LLMs achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes. This contrasting behavior is puzzling when it comes to understanding the mechanisms behind LLMs' reasoning capabilities. One hypothesis is that the increasingly high and nearly saturated performance on common reasoning benchmarks could be due to the memorization of similar problems. In this paper, we systematically investigate this hypothesis with a quantitative measurement of memorization in reasoning tasks, using a dynamically generated logical reasoning benchmark based on Knights and Knaves (K&K) puzzles. We found that LLMs could interpolate the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet fail when those puzzles are slightly perturbed, suggesting that the models heavily rely on memorization to solve those training puzzles. On the other hand, we show that while fine-tuning leads to heavy memorization, it also consistently improves generalization performance. In-depth analyses with perturbation tests, cross difficulty-level transferability, probing model internals, and fine-tuning with wrong answers suggest that the LLMs learn to reason on K&K puzzles despite training data memorization. This phenomenon indicates that LLMs exhibit a complex interplay between memorization and genuine reasoning abilities. Finally, our analysis with per-sample memorization score sheds light on how LLMs switch between reasoning and memorization in solving logical puzzles. Our code and data are available at https://memkklogic.github.io.

Understanding Memorization and Reasoning in LLMs

The paper "On Memorization of LLMs in Logical Reasoning" explores the interplay between memorization and genuine reasoning in LLMs, particularly in the context of logical reasoning tasks. The research focuses on dissecting whether LLMs depend on memorization to solve reasoning benchmarks and how this proficiency influences their generalization abilities. Utilizing a dynamically generated benchmark based on Knights and Knaves (K) puzzles, the work provides nuanced insights into the balance between memorization and reasoning within LLMs.

The paper reveals that, although LLMs can interpolate training data with near-perfect accuracy after fine-tuning, their performance degrades significantly when these puzzles are perturbed. This behavior suggests a reliance on memorization for solving familiar problems. Nevertheless, the paper also observes that fine-tuning, while leading to heavy memorization, enhances the models' ability to generalize to unseen puzzles, implying that LLMs indeed acquire genuine reasoning capabilities alongside memorization.
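As an illustration of what a slight perturbation can mean (a toy example of ours, not one taken from the paper), changing a single statement can flip the unique solution while leaving the puzzle's surface form almost unchanged, so a model that merely memorized the original answer will respond incorrectly:

```python
# Toy illustration (assumed, not from the paper): a one-statement perturbation
# flips the unique solution of a two-person K&K puzzle.
from itertools import product

def solve(statements, n=2):
    return [a for a in product([True, False], repeat=n)
            if all(a[i] == statements[i](a) for i in range(n))]

original = [lambda a: not a[1], lambda a: a[0] and a[1]]  # B: "we are both knights"
perturbed = [lambda a: not a[1], lambda a: a[0] or a[1]]  # B: "at least one of us is a knight"

print(solve(original))   # [(True, False)]  -> A knight, B knave
print(solve(perturbed))  # [(False, True)]  -> A knave, B knight
```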

Key contributions of the research include a memorization score that quantifies the performance inconsistency of LLMs under local perturbations; this metric distinguishes reasoning-driven from memorization-driven problem-solving. The new K&K puzzle benchmark additionally supports automatic perturbation and reasoning-step synthesis, enabling a robust investigation into how models handle logical reasoning under controlled conditions.
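One plausible way to formalize such a score, given only the description above, is to multiply a model's accuracy on the original puzzles by the fraction of solved puzzles whose answers flip under a local perturbation. The sketch below is an assumption in that spirit, not necessarily the paper's exact formula.

```python
# Hedged sketch of a local-inconsistency memorization score: high accuracy that
# collapses under local perturbation is treated as evidence of memorization.
# The exact formula is an assumption, not taken verbatim from the paper.
def memorization_score(correct_original, correct_perturbed):
    """Both inputs are equal-length lists of booleans, one entry per puzzle."""
    assert len(correct_original) == len(correct_perturbed)
    solved = [i for i, ok in enumerate(correct_original) if ok]
    if not solved:
        return 0.0
    accuracy = len(solved) / len(correct_original)
    # Fraction of solved puzzles the model gets wrong once locally perturbed.
    inconsistency = sum(not correct_perturbed[i] for i in solved) / len(solved)
    return accuracy * inconsistency

# A model that stays consistent under perturbation scores 0.0; a model that
# solves every original puzzle but fails every perturbed version scores 1.0.
print(memorization_score([True, True, True, True], [True, True, True, True]))      # 0.0
print(memorization_score([True, True, True, True], [False, False, False, False]))  # 1.0
```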

In terms of experimental findings, the paper evaluates 11 open-source models and demonstrates that only advanced models can adequately tackle the K&K puzzles, with substantial memorization indicated by their performance under perturbations. Fine-tuning experiments with models such as Llama3-8B and GPT-4o-mini further illustrate that generalization improves as the extent of memorization increases, challenging the notion that memorization is solely a hindrance to learning. These results highlight that fine-tuned LLMs develop an intricate balance between memorization and genuine reasoning.

Theoretically, the paper underscores the need to distinguish memorization from reasoning when comparing LLM performance on reasoning benchmarks. Practically, these insights matter for applications that demand reliable reasoning, particularly in safety-critical and trust-sensitive settings.

Looking forward, this paper signals several potential directions for further exploration. Developing training methodologies that foster reasoning without excessive reliance on memorization remains a key challenge. Additionally, understanding the mechanisms by which LLMs toggle between reasoning and memorization when faced with perturbed tasks could lead to more robust AI systems. The dynamic benchmark introduced offers a promising springboard for such investigations, given its adaptability in generating varied reasoning scenarios.

In conclusion, the research elucidates the dual facets of LLM learning—memorization and reasoning—and provides a comprehensive framework to measure and improve reasoning capabilities in LLMs. By advancing our understanding of how these models work, particularly in logical reasoning tasks, the paper contributes significantly to both the academic discourse and practical advancements in artificial intelligence.

Authors (9)
  1. Chulin Xie (27 papers)
  2. Yangsibo Huang (40 papers)
  3. Chiyuan Zhang (57 papers)
  4. Da Yu (19 papers)
  5. Xinyun Chen (80 papers)
  6. Bill Yuchen Lin (72 papers)
  7. Bo Li (1107 papers)
  8. Badih Ghazi (78 papers)
  9. Ravi Kumar (146 papers)