
BLT: Can Large Language Models Handle Basic Legal Text? (2311.09693v3)

Published 16 Nov 2023 in cs.CL and cs.AI

Abstract: We find that the best publicly available LLMs like GPT-4 and Claude currently perform poorly on basic legal text handling. This motivates the creation of a benchmark consisting of examples that lawyers and paralegals would expect LLMs to handle zero-shot, such as looking up the text at a line of a witness deposition or at a subsection of a contract. LLMs' poor performance on this benchmark casts into doubt their reliability as-is for legal practice. However, fine-tuning on our training set brings even a small model to near-perfect performance. This benchmark will be useful for fine-tuning LLMs for downstream legal tasks, as well as for tracking LLMs' reliability as-is for basic legal tasks.

Authors (3)
  1. Andrew Blair-Stanek (8 papers)
  2. Nils Holzenberger (15 papers)
  3. Benjamin Van Durme (173 papers)
Citations (3)