Large Language Models Reflect Human Citation Patterns with a Heightened Citation Bias (2405.15739v3)

Published 24 May 2024 in cs.DL, cs.AI, cs.LG, and cs.SI

Abstract: Citation practices are crucial in shaping the structure of scientific knowledge, yet they are often influenced by contemporary norms and biases. The emergence of LLMs introduces a new dynamic to these practices. Interestingly, the characteristics and potential biases of references recommended by LLMs that entirely rely on their parametric knowledge, and not on search or retrieval-augmented generation, remain unexplored. Here, we analyze these characteristics in an experiment using a dataset from AAAI, NeurIPS, ICML, and ICLR, published after GPT-4's knowledge cut-off date. In our experiment, LLMs are tasked with suggesting scholarly references for the anonymized in-text citations within these papers. Our findings reveal a remarkable similarity between human and LLM citation patterns, but with a more pronounced high citation bias, which persists even after controlling for publication year, title length, number of authors, and venue. The results hold for both GPT-4, and the more capable models GPT-4o and Claude 3.5 where the papers are part of the training data. Additionally, we observe a large consistency between the characteristics of LLM's existing and non-existent generated references, indicating the model's internalization of citation patterns. By analyzing citation graphs, we show that the references recommended are embedded in the relevant citation context, suggesting an even deeper conceptual internalization of the citation networks. While LLMs can aid in citation generation, they may also amplify existing biases, such as the Matthew effect, and introduce new ones, potentially skewing scientific knowledge dissemination.

Overview of How LLMs Reflect Human Citation Patterns

The paper "LLMs Reflect Human Citation Patterns with a Heightened Citation Bias" provides an empirical examination of how LLMs, specifically several versions of GPT-4 and Claude 3.5, replicate and potentially exaggerate human citation patterns in academic settings. The authors conducted a detailed experiment using a dataset containing papers from prominent conferences such as AAAI, NeurIPS, ICML, and ICLR, focusing on assessing the citation behavior of LLMs in suggesting scholarly references.

Key Findings

The experiments yield four main findings about how LLMs generate references:

  1. Reflection of Human Patterns: The LLMs show a notable resemblance to human citation patterns, albeit with a heightened bias towards highly cited works. This bias is robust and persists even when controlling for publication year, title length, number of authors, and venue (a regression sketch of this control analysis follows the list).
  2. Consistency Across Models: The patterns observed for GPT-4 also hold for the more capable GPT-4o and Claude 3.5, even though the evaluated papers fall within those models' training data, suggesting a systematic bias ingrained through training rather than an artifact of a single model.
  3. Citation Graph Embedding: The generated references are not randomly distributed but are contextually embedded within the citation graphs relevant to the field. This indicates a deeper conceptual internalization of citation networks by these models.
  4. Bias Towards High Citation: The most striking bias observed is an inclination of LLMs to favor references with a high citation count. This tendency is independent of other features and highlights a potential amplification of the "Matthew effect" in citation dynamics, where prominent papers continue to receive more attention.
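The control analysis referenced in findings 1 and 4 can be illustrated with a simple regression: does an "LLM-suggested" indicator still predict citation counts once publication year, title length, author count, and venue are held fixed? The sketch below uses synthetic data and assumed column names; it mirrors the kind of analysis the paper describes rather than reproducing its exact model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400

# Synthetic reference metadata; a real analysis would pair LLM-suggested
# references with the ground-truth references of the same papers.
refs = pd.DataFrame({
    "llm_suggested": rng.integers(0, 2, n),   # 1 = reference proposed by the LLM
    "pub_year": rng.integers(2012, 2023, n),
    "title_len": rng.integers(4, 16, n),      # words in the title
    "n_authors": rng.integers(1, 10, n),
    "venue": rng.choice(["NeurIPS", "ICML", "ICLR", "AAAI"], n),
})

# Toy outcome in which LLM-suggested references skew toward higher citation counts.
refs["citations"] = np.exp(
    3.0 + 0.8 * refs["llm_suggested"] + rng.normal(0.0, 1.0, n)
).astype(int)

# Log-citations regressed on the LLM indicator plus the controls from the paper.
model = smf.ols(
    "np.log1p(citations) ~ llm_suggested + pub_year + title_len"
    " + n_authors + C(venue)",
    data=refs,
).fit()

# A positive, significant coefficient on llm_suggested indicates that the
# heightened citation bias survives the controls.
print(model.summary().tables[1])
```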

Implications

The implications of these findings span both practical and theoretical realms:

  • Practical Considerations: While LLMs can accelerate academic workflows, especially in generating and recommending citations, the observed biases call for cautious deployment in scholarly contexts. Amplifying already prominent citations can skew academic discourse, favoring well-cited papers over potentially innovative but under-cited work.
  • Theoretical Insights: The work highlights the link between the properties of LLM training data and the characteristics of their outputs. It stresses the need for mitigation strategies in the training of future models to avoid perpetuating historical and systemic biases in how scientific knowledge is disseminated.

Future Directions

The paper prompts further investigation into several avenues:

  • Broader Dataset Evaluation: Extending the analysis across diverse datasets could illuminate discipline-specific citation patterns and biases. Such analysis would help in understanding how LLM biases manifest in less homogeneous datasets.
  • Optimization Techniques: Research into advanced prompt engineering and retrieval-augmented generation could address the citation bias by integrating external databases to provide more balanced reference suggestions (see the sketch after this list).
  • Bias Mitigation: Strategies such as online learning adjustments or bias-corrective algorithmic interventions may be needed to tune LLM outputs to scholarly needs without exaggerating existing citation biases.
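As a sketch of the retrieval-augmented direction, the code below fetches candidate papers from an external index (the Semantic Scholar Graph API is used here as an illustrative choice) and asks the model to choose among the retrieved candidates instead of relying on parametric memory alone. The query fields, prompt, and function names are assumptions for the example, not a method from the paper.

```python
import requests
from openai import OpenAI

client = OpenAI()


def retrieve_candidates(query: str, limit: int = 10) -> list[dict]:
    """Fetch candidate references for a text query from Semantic Scholar."""
    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={
            "query": query,
            "limit": limit,
            "fields": "title,year,citationCount,externalIds",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])


def suggest_reference_rag(passage: str) -> str:
    """Ground the suggestion in retrieved candidates rather than parametric memory."""
    candidates = retrieve_candidates(passage)
    listing = "\n".join(
        f"- {c['title']} ({c.get('year')}), {c.get('citationCount', 0)} citations"
        for c in candidates
    )
    prompt = (
        "Choose the single most appropriate reference for the citation marked "
        f"[CITATION] in this passage:\n\n{passage}\n\n"
        f"Candidate references:\n{listing}"
    )
    out = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return out.choices[0].message.content
```

Constraining the model to a retrieved candidate set is one way an external database could counteract the parametric preference for highly cited work, though the retriever itself can of course carry its own ranking biases.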

The paper underscores the potential of LLMs both to support scholarly work and to inadvertently perpetuate existing citation tendencies in academia. As the deployment of these models grows, their role in shaping academic ecosystems must be carefully understood and managed. The paper thus opens an essential dialogue on integrating artificial intelligence with traditional scholarly practices.

Authors (6)
  1. Andres Algaba (9 papers)
  2. Carmen Mazijn (3 papers)
  3. Vincent Holst (4 papers)
  4. Floriano Tori (4 papers)
  5. Sylvia Wenmackers (10 papers)
  6. Vincent Ginis (18 papers)
Citations (1)