On Memorization of Large Language Models in Logical Reasoning (2410.23123v1)

Published 30 Oct 2024 in cs.CL

Abstract: LLMs achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes. This contrasting behavior is puzzling when it comes to understanding the mechanisms behind LLMs' reasoning capabilities. One hypothesis is that the increasingly high and nearly saturated performance on common reasoning benchmarks could be due to the memorization of similar problems. In this paper, we systematically investigate this hypothesis with a quantitative measurement of memorization in reasoning tasks, using a dynamically generated logical reasoning benchmark based on Knights and Knaves (K&K) puzzles. We found that LLMs could interpolate the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet fail when those puzzles are slightly perturbed, suggesting that the models heavily rely on memorization to solve those training puzzles. On the other hand, we show that while fine-tuning leads to heavy memorization, it also consistently improves generalization performance. In-depth analyses with perturbation tests, cross difficulty-level transferability, probing model internals, and fine-tuning with wrong answers suggest that the LLMs learn to reason on K&K puzzles despite training data memorization. This phenomenon indicates that LLMs exhibit a complex interplay between memorization and genuine reasoning abilities. Finally, our analysis with per-sample memorization score sheds light on how LLMs switch between reasoning and memorization in solving logical puzzles. Our code and data are available at https://memkklogic.github.io.

Understanding Memorization and Reasoning in LLMs

The paper "On Memorization of LLMs in Logical Reasoning" explores the interplay between memorization and genuine reasoning in LLMs, particularly in the context of logical reasoning tasks. The research focuses on dissecting whether LLMs depend on memorization to solve reasoning benchmarks and how this proficiency influences their generalization abilities. Utilizing a dynamically generated benchmark based on Knights and Knaves (K) puzzles, the work provides nuanced insights into the balance between memorization and reasoning within LLMs.

The paper reveals that, although LLMs can interpolate training data with near-perfect accuracy after fine-tuning, their performance degrades significantly when these puzzles are perturbed. This behavior suggests a reliance on memorization for solving familiar problems. Nevertheless, the paper also observes that fine-tuning, while leading to heavy memorization, enhances the models' ability to generalize to unseen puzzles, implying that LLMs indeed acquire genuine reasoning capabilities alongside memorization.
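As an illustration of what a slight perturbation can mean (a toy example of ours, not one taken from the paper), changing a single statement can flip the unique solution while leaving the puzzle's surface form almost unchanged, so a model that merely memorized the original answer will respond incorrectly:

```python
# Toy illustration (assumed, not from the paper): a one-statement perturbation
# flips the unique solution of a two-person K&K puzzle.
from itertools import product

def solve(statements, n=2):
    return [a for a in product([True, False], repeat=n)
            if all(a[i] == statements[i](a) for i in range(n))]

original = [lambda a: not a[1], lambda a: a[0] and a[1]]  # B: "we are both knights"
perturbed = [lambda a: not a[1], lambda a: a[0] or a[1]]  # B: "at least one of us is a knight"

print(solve(original))   # [(True, False)]  -> A knight, B knave
print(solve(perturbed))  # [(False, True)]  -> A knave, B knight
```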

Key contributions of the research include a memorization score that quantifies the performance inconsistency of LLMs under local perturbations; this metric distinguishes reasoning-driven from memorization-driven problem-solving. The new K&K puzzle benchmark additionally supports automatic perturbation and reasoning-step synthesis, enabling a robust investigation into how models handle logical reasoning under controlled conditions.
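One plausible way to formalize such a score, given only the description above, is to multiply a model's accuracy on the original puzzles by the fraction of solved puzzles whose answers flip under a local perturbation. The sketch below is an assumption in that spirit, not necessarily the paper's exact formula.

```python
# Hedged sketch of a local-inconsistency memorization score: high accuracy that
# collapses under local perturbation is treated as evidence of memorization.
# The exact formula is an assumption, not taken verbatim from the paper.
def memorization_score(correct_original, correct_perturbed):
    """Both inputs are equal-length lists of booleans, one entry per puzzle."""
    assert len(correct_original) == len(correct_perturbed)
    solved = [i for i, ok in enumerate(correct_original) if ok]
    if not solved:
        return 0.0
    accuracy = len(solved) / len(correct_original)
    # Fraction of solved puzzles the model gets wrong once locally perturbed.
    inconsistency = sum(not correct_perturbed[i] for i in solved) / len(solved)
    return accuracy * inconsistency

# A model that stays consistent under perturbation scores 0.0; a model that
# solves every original puzzle but fails every perturbed version scores 1.0.
print(memorization_score([True, True, True, True], [True, True, True, True]))      # 0.0
print(memorization_score([True, True, True, True], [False, False, False, False]))  # 1.0
```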

In terms of experimental findings, the paper evaluates 11 open-source models and demonstrates that only advanced models can adequately tackle the K&K puzzles, with substantial memorization indicated by their performance under perturbations. Fine-tuning experiments with models such as Llama3-8B and GPT-4o-mini further illustrate that generalization improves as the extent of memorization increases, challenging the notion that memorization is solely a hindrance to learning. These results highlight that fine-tuned LLMs develop an intricate balance between memorization and genuine reasoning.

Theoretically, the paper underscores the need to distinguish memorization from reasoning when comparing LLM performance on reasoning benchmarks. Practically, these insights matter for applications that demand reliable reasoning, particularly in safety-critical and trust-sensitive settings.

Looking forward, this paper signals several potential directions for further exploration. Developing training methodologies that foster reasoning without excessive reliance on memorization remains a key challenge. Additionally, understanding the mechanisms by which LLMs toggle between reasoning and memorization when faced with perturbed tasks could lead to more robust AI systems. The dynamic benchmark introduced offers a promising springboard for such investigations, given its adaptability in generating varied reasoning scenarios.

In conclusion, the research elucidates the dual facets of LLM learning—memorization and reasoning—and provides a comprehensive framework to measure and improve reasoning capabilities in LLMs. By advancing our understanding of how these models work, particularly in logical reasoning tasks, the paper contributes significantly to both the academic discourse and practical advancements in artificial intelligence.

Authors (9)
  1. Chulin Xie (27 papers)
  2. Yangsibo Huang (40 papers)
  3. Chiyuan Zhang (57 papers)
  4. Da Yu (19 papers)
  5. Xinyun Chen (80 papers)
  6. Bill Yuchen Lin (72 papers)
  7. Bo Li (1107 papers)
  8. Badih Ghazi (78 papers)
  9. Ravi Kumar (146 papers)