
Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models (2401.01301v2)

Published 2 Jan 2024 in cs.CL, cs.AI, and cs.CY

Abstract: Do LLMs know the law? These models are increasingly being used to augment legal practice, education, and research, yet their revolutionary potential is threatened by the presence of hallucinations -- textual output that is not consistent with legal facts. We present the first systematic evidence of these hallucinations, documenting LLMs' varying performance across jurisdictions, courts, time periods, and cases. Our work makes four key contributions. First, we develop a typology of legal hallucinations, providing a conceptual framework for future research in this area. Second, we find that legal hallucinations are alarmingly prevalent, occurring between 58% of the time with ChatGPT 4 and 88% with Llama 2, when these models are asked specific, verifiable questions about random federal court cases. Third, we illustrate that LLMs often fail to correct a user's incorrect legal assumptions in a contra-factual question setup. Fourth, we provide evidence that LLMs cannot always predict, or do not always know, when they are producing legal hallucinations. Taken together, our findings caution against the rapid and unsupervised integration of popular LLMs into legal tasks. Even experienced lawyers must remain wary of legal hallucinations, and the risks are highest for those who stand to benefit from LLMs the most -- pro se litigants or those without access to traditional legal resources.

Understanding Legal Hallucinations in AI

LLMs such as ChatGPT hold promise for revolutionizing the legal industry by automating some tasks traditionally done by lawyers. However, as this paper reveals, the road ahead is not without pitfalls. A pressing concern is the phenomenon of "legal hallucinations": outputs that are inconsistent with legal facts.

The Extent of Legal Hallucinations

A systematic examination revealed that legal hallucinations occur alarmingly often: specific, verifiable legal queries produced incorrect responses between 58% of the time (ChatGPT 4) and 88% (Llama 2). The rate of inaccurate answers also depended on several factors, from the complexity of the legal query to the level of the court involved. For example, hallucinations were more frequent for questions about lower-court cases than for questions about the Supreme Court.
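To make this kind of evaluation concrete, here is a minimal sketch of a reference-based check: ask a model a verifiable question about a case and score the answer against a known record. The `query_llm` helper and the tiny ground-truth table are hypothetical placeholders, not the authors' actual data or pipeline.

```python
# Minimal sketch of a reference-based hallucination check.
# `query_llm` and GROUND_TRUTH are hypothetical stand-ins.

def query_llm(prompt: str) -> str:
    """Placeholder for a call to ChatGPT, Llama 2, or another model."""
    raise NotImplementedError

# Toy ground-truth metadata for a few well-known cases.
GROUND_TRUTH = {
    "Brown v. Board of Education": {"author": "Warren", "year": 1954},
    "Miranda v. Arizona": {"author": "Warren", "year": 1966},
}

def hallucination_rate(cases: dict) -> float:
    """Fraction of answers that contradict the reference record."""
    wrong = 0
    for case, facts in cases.items():
        answer = query_llm(f"Who wrote the majority opinion in {case}?")
        # Count as a hallucination if the true author never appears in the answer.
        if facts["author"].lower() not in answer.lower():
            wrong += 1
    return wrong / len(cases)
```

In practice, the reference record would come from structured sources such as the Supreme Court Database or the Caselaw Access Project, and answer matching would need to be far more robust than a substring check.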

Models’ Response to Erroneous Legal Premises

Further complicating matters, LLMs displayed a troubling inclination to reinforce incorrect legal assumptions presented by users. When faced with questions built upon false legal premises, the models often failed to correct these assumptions and responded as if they were true, thus misleading users.
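As a rough illustration of such a probe, the snippet below embeds a false premise in the question (Justice Ginsburg joined the Obergefell majority and did not dissent) and checks whether the model's answer pushes back on it. The prompt, the correction markers, and the reuse of the `query_llm` placeholder from the previous sketch are illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical contra-factual probe: the question's premise is false, since
# Justice Ginsburg joined the majority in Obergefell v. Hodges (2015).

FALSE_PREMISE_PROMPT = (
    "Why did Justice Ruth Bader Ginsburg dissent in Obergefell v. Hodges?"
)

# Phrases that would indicate the model corrected the false premise.
CORRECTION_MARKERS = ("did not dissent", "joined the majority", "was in the majority")

def accepts_false_premise(answer: str) -> bool:
    """Return True if the answer plays along rather than correcting the premise."""
    lowered = answer.lower()
    return not any(marker in lowered for marker in CORRECTION_MARKERS)
```

A model that supplies a fabricated rationale for the nonexistent dissent would count as reinforcing the false premise; a well-behaved model would correct it first.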

Predicting Hallucinations

Another layer of the challenge is whether LLMs can predict, or are even aware of, their own hallucinations. Ideally, a model would be well calibrated: able to recognize and convey when it is likely issuing a non-factual response. However, the paper found that models, particularly Llama 2, were poorly calibrated, often expressing undue confidence in their hallucinated responses.
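Calibration can be made concrete with a small worked example. If a model is asked to state a confidence score alongside each answer, expected calibration error (ECE) compares the stated confidence with the observed accuracy within each confidence bin; a well-calibrated model has a small gap. The data below are made up for illustration, and the paper's own measurement procedure may differ.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Bin stated confidences and average the |confidence - accuracy| gap per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Made-up example: high stated confidence but mostly wrong answers -> large ECE,
# i.e., the overconfidence pattern described above.
print(expected_calibration_error([0.9, 0.95, 0.85, 0.9], [1, 0, 0, 0]))
```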

Implications for Legal Practice

The implications are significant. While the use of LLMs in legal settings presents opportunities for making legal advice more accessible, these technologies are not yet reliable enough to be used unsupervised, especially by those less versed in legal procedures. The research thus calls for cautious adoption of LLMs in the legal domain and emphasizes that even skilled attorneys need to remain vigilant while using these tools.

Future Directions for Research and Use

The paper's findings underscore that combating legal hallucinations in LLMs is not only an empirical challenge but also a normative one. Developers must decide which contradictions to minimize—those of the training corpus, the user's inputs, or the external facts—and communicate these decisions clearly.

As a way forward, developers must make informed choices about how their models reconcile these inherent conflicts. Users, whether legal professionals or not, should be aware of these dynamics and deploy LLMs with a critical eye, continually validating the accuracy and certainty of the generated legal text. Until these challenges are addressed, the full potential of LLMs in augmenting legal research and democratizing access to justice remains unrealized.

Authors (4)
  1. Matthew Dahl (3 papers)
  2. Varun Magesh (2 papers)
  3. Mirac Suzgun (23 papers)
  4. Daniel E. Ho (45 papers)