BLT: Can Large Language Models Handle Basic Legal Text? (arXiv:2311.09693v3)
Published 16 Nov 2023 in cs.CL and cs.AI
Abstract: We find that the best publicly available LLMs like GPT-4 and Claude currently perform poorly on basic legal text handling. This motivates the creation of a benchmark consisting of examples that lawyers and paralegals would expect LLMs to handle zero-shot, such as looking up the text at a line of a witness deposition or at a subsection of a contract. LLMs' poor performance on this benchmark casts into doubt their reliability as-is for legal practice. However, fine-tuning on our training set brings even a small model to near-perfect performance. This benchmark will be useful for fine-tuning LLMs for downstream legal tasks, as well as for tracking LLMs' reliability as-is for basic legal tasks.
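The abstract's example task, retrieving the exact text at a numbered line of a witness deposition, can be made concrete with a short sketch. The code below is illustrative only (it is not the paper's benchmark code): the helper name, the transcript contents, and the prompt wording are all invented here to show what a zero-shot line-lookup example and its gold answer might look like.

```python
def make_line_lookup_example(transcript_lines, line_no):
    """Build a zero-shot line-lookup prompt and its gold answer.

    transcript_lines: list of deposition lines, in order.
    line_no: 1-indexed line number, as in real transcripts.
    """
    numbered = "\n".join(
        f"{i + 1}  {text}" for i, text in enumerate(transcript_lines)
    )
    prompt = (
        "Below is a deposition transcript with line numbers.\n\n"
        f"{numbered}\n\n"
        f"What is the exact text of line {line_no}? Reply with the text only."
    )
    gold = transcript_lines[line_no - 1]
    return prompt, gold


# Example usage with an invented three-line transcript:
transcript = [
    "Q. Please state your name for the record.",
    "A. Jane Doe.",
    "Q. Where were you on the night of June 4th?",
]
prompt, gold = make_line_lookup_example(transcript, 2)
```

An LLM's reply to `prompt` would then be scored by exact (or near-exact) match against `gold`, which is the kind of mechanical check the benchmark's pass/fail evaluation implies.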
- Andrew Blair-Stanek
- Nils Holzenberger
- Benjamin Van Durme