- The paper demonstrates a strong correlation (r = 0.75, p < .001) between citation count and the factual accuracy of generated bibliographic entries.
- The study identifies threshold behaviors, with reliable memory emerging around 90 citations and near-perfect recall beyond 1,248 citations.
- The research offers actionable insights for enhancing LLM training protocols to reduce hallucinations and improve citation recommendations.
Hallucinate or Memorize? The Two Sides of Probabilistic Learning in LLMs
Introduction
The paper "Hallucinate or Memorize? The Two Sides of Probabilistic Learning in LLMs" (2511.08877) addresses the critical issue of hallucinations in LLMs, particularly concerning bibliographic references. Through empirical investigation, the study elucidates how the memorization capacity of LLMs is influenced by citation frequency, which serves as a proxy for the redundancy of data in the training corpus. The research hypothesizes a correlation between citation count and the accuracy of generated bibliographic entries, revealing insights into how frequently cited papers are more reliably recalled by LLMs.
Citation Frequency and Memorization
The core of the study is the exploration of how citation frequency affects the generation fidelity of bibliographic records. By analyzing 100 bibliographic records across twenty diverse domains, the paper establishes a strong positive correlation between a paper's citation count and the factual accuracy of its generated metadata. It also reports that bibliographic records of papers with citation counts exceeding approximately 1,000 are reproduced nearly verbatim by LLMs, suggesting a threshold beyond which LLMs transition from probabilistic generation to deterministic recall.
Figure 1: Relationship between citation frequency and generation fidelity. Each dot represents a factual record (score > 0), colored by research domain. The line shows a fitted linear regression with a 95% confidence interval (gray band). The strong correlation (r = 0.75, p < .001) indicates a log-linear scaling relationship.
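To make the log-linear relationship in Figure 1 concrete, the sketch below fits accuracy scores against log-scaled citation counts. The citation counts, accuracy values, and filtering step are illustrative placeholders, not the paper's actual data or scoring rubric.

```python
import numpy as np
from scipy import stats

# Hypothetical data: per-record citation counts and factual-accuracy scores in [0, 1].
citations = np.array([12, 85, 240, 970, 1500, 5200, 30, 410])
accuracy = np.array([0.20, 0.55, 0.70, 0.90, 0.97, 1.00, 0.35, 0.80])

# Keep only factual records (score > 0), mirroring the filter described for Figure 1.
mask = accuracy > 0
log_cites = np.log10(citations[mask])

# Fit accuracy against log10(citations): a log-linear scaling relationship.
slope, intercept, r, p, stderr = stats.linregress(log_cites, accuracy[mask])
print(f"r = {r:.2f}, p = {p:.3g}, fit: accuracy ~ {slope:.2f}*log10(citations) + {intercept:.2f}")
```

With real data, the reported r = 0.75 would fall out of the same `linregress` call; the placeholder arrays here only demonstrate the shape of the analysis.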
Experimental Design and Results
The experiment prompted GPT-4.1 to generate bibliographic records in structured JSON format across several computer science topics. The generated records were then verified for factual accuracy against sources such as Google Scholar. The evaluation showed that while high citation counts minimize hallucination risk, papers with lower citation counts exhibit variable memorization. The paper further identifies two critical citation-frequency thresholds: reliable recall begins to emerge around 90 citations, and recall saturates beyond 1,248 citations, where records are reproduced nearly perfectly.
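As a rough illustration of the verification step, the sketch below scores a model-generated JSON record against verified metadata and buckets it by the reported citation thresholds. The field list, fuzzy matching, and example record are assumptions for illustration, not the paper's actual scoring pipeline.

```python
import json
from difflib import SequenceMatcher

def field_score(generated: str, reference: str) -> float:
    """Fuzzy similarity between one generated field and its verified value."""
    return SequenceMatcher(None, generated.lower(), reference.lower()).ratio()

def record_score(generated: dict, reference: dict,
                 fields=("title", "authors", "venue", "year")) -> float:
    """Average per-field similarity; fields and equal weighting are assumptions."""
    scores = [field_score(str(generated.get(f, "")), str(reference.get(f, ""))) for f in fields]
    return sum(scores) / len(scores)

# A model-generated JSON record and a record verified against a source such as
# Google Scholar; both are made-up examples, not data from the study.
generated = json.loads('{"title": "Attention Is All You Need", "authors": "Vaswani et al.", '
                       '"venue": "NeurIPS", "year": "2017"}')
reference = {"title": "Attention Is All You Need", "authors": "Vaswani et al.",
             "venue": "NeurIPS", "year": "2017"}
score = record_score(generated, reference)

# Bucket by the citation thresholds reported in the paper (~90 and ~1,248 citations).
citation_count = 5200
if citation_count >= 1248:
    regime = "near-verbatim recall expected"
elif citation_count >= 90:
    regime = "increasingly reliable recall"
else:
    regime = "high hallucination risk"
print(f"record score = {score:.2f}; {regime}")
```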
Implications and Future Work
The paper's findings underscore a systematic relationship between training data redundancy and the hallucination and memorization tendencies of LLMs, with direct implications for how LLMs are designed and used for citation recommendation. Practically, these insights could inform LLM architectures and training protocols, improving their reliability in academic settings. Future work should examine this behavior across different models and domains, use larger samples for more nuanced analysis, and explore multilingual contexts to understand cross-linguistic dynamics in LLM memorization.
Conclusion
The research sheds light on the probabilistic nature of LLMs in generating academic references, indicating that hallucination and memorization are two facets of the same underlying probabilistic process. Highly cited papers benefit from redundant exposure and are recalled accurately, while less cited works remain prone to hallucination. Addressing this gap means leveraging the study's insights to improve LLM training methodologies and hallucination-mitigation mechanisms, enhancing the integrity and utility of LLMs in citation recommendation systems.