Overview of LawBench: Benchmarking Legal Knowledge of LLMs
The paper presents LawBench, a comprehensive evaluation benchmark designed to assess the legal capabilities of LLMs within the Chinese civil-law system. The benchmark addresses a crucial gap in evaluating domain-specific knowledge, particularly in the legal field, which demands a detailed understanding of specialized texts and concepts. The paper underscores the importance of such benchmarks for understanding the potential and limitations of LLMs on legal tasks.
Key Features of LawBench
LawBench evaluates LLMs across three cognitive dimensions inspired by Bloom's Taxonomy:
- Legal Knowledge Memorization: This dimension assesses the ability of LLMs to memorize and recall legal concepts, facts, and articles.
- Legal Knowledge Understanding: This dimension measures the comprehension capabilities of LLMs, focusing on their ability to understand legal texts, entities, and relationships.
- Legal Knowledge Applying: This dimension evaluates the ability to apply legal knowledge in realistic, problem-solving scenarios, necessitating reasoning and analytical skills.
The benchmark comprises 20 tasks spanning these cognitive dimensions, grouped into five task types: single-label classification, multi-label classification, regression, extraction, and generation. LawBench features tasks such as article recitation, dispute focus identification, named entity recognition, case analysis, and legal consultation, among others.
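To make this organization concrete, the following is a minimal sketch of how the task taxonomy might be represented in code. The task names follow the examples above, but the dimension assignments and output types shown are illustrative assumptions rather than the paper's exact mapping, and only a subset of the 20 tasks is listed.

```python
# Illustrative subset of LawBench's task taxonomy: each cognitive dimension
# maps to (task name, output type) pairs.  The assignments below are
# assumptions for illustration; the full benchmark defines 20 tasks.
LAWBENCH_TASKS = {
    "memorization": [
        ("article_recitation", "generation"),
    ],
    "understanding": [
        ("dispute_focus_identification", "multi-label classification"),
        ("named_entity_recognition", "extraction"),
    ],
    "applying": [
        ("case_analysis", "single-label classification"),
        ("consultation", "generation"),
    ],
}


def tasks_of_type(task_type: str) -> list[str]:
    """Return all task names whose output type matches task_type."""
    return [
        name
        for tasks in LAWBENCH_TASKS.values()
        for name, t in tasks
        if t == task_type
    ]


print(tasks_of_type("generation"))  # ['article_recitation', 'consultation']
```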
Comprehensive Evaluation of LLMs
The paper evaluates 51 LLMs, spanning multilingual, Chinese-oriented, and legal-specific models across a wide range of architectures and sizes. Notably, GPT-4 outperforms all other evaluated models, reflecting its superior capability in the legal domain. While fine-tuning LLMs on legal-specific corpora yields some improvement, the results indicate that a considerable gap remains before LLMs can reliably handle legal tasks in practice.
Significant Findings and Analysis
- Model Performance: GPT-4 leads by a significant margin, especially on complex legal tasks. The analysis also shows that larger models generally perform better, particularly in the one-shot setting.
- Fine-tuning Implications: Legal-specific fine-tuning yields measurable gains, but fine-tuned models still trail strong general-purpose models such as GPT-4.
- Retrieval-Augmentation Challenges: The models struggle to make effective use of retrieved legal articles supplied in the context, indicating that making retrieval augmentation work within LLMs remains an open research problem (a minimal prompt-construction sketch follows this list).
- Rule-based and Soft Metric Evaluation: Because answers often must be extracted from free-form model output, the paper combines rule-based answer extraction with soft metrics, such as soft-F1 for extraction tasks, granting partial credit when model generations diverge linguistically from the ground-truth labels (see the soft-F1 sketch after this list).
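To illustrate the retrieval-augmented setting referenced above, here is a minimal prompt-construction sketch showing how retrieved law articles can be prepended to a question. The template wording, the `build_rag_prompt` helper, and the article numbering are assumptions for illustration; the paper's actual retrieval-augmented prompts may be formatted differently.

```python
def build_rag_prompt(question: str, retrieved_articles: list[str]) -> str:
    """Prepend retrieved law articles to the question as reference context.

    The template below is an illustrative assumption, not LawBench's
    actual prompt format.
    """
    context = "\n".join(f"[{i + 1}] {a}" for i, a in enumerate(retrieved_articles))
    return (
        "Refer to the following law articles when answering.\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Example usage with a placeholder article.
prompt = build_rag_prompt(
    "What is the limitation period for a civil claim?",
    ["Civil Code, Article 188: The limitation period for civil claims is three years..."],
)
print(prompt)
```

The finding in the paper is that supplying such context does not reliably improve performance, which suggests the difficulty lies in how models use the injected articles rather than in constructing the prompt itself.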
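The soft-F1 idea can likewise be sketched in a few lines. The exact definition used in the paper may differ (for example, it may operate on word-level overlap or apply a fixed matching threshold); the character-level similarity and partial-credit scheme below are assumptions.

```python
from difflib import SequenceMatcher


def soft_f1(predicted: list[str], gold: list[str]) -> float:
    """Soft-F1 between predicted and gold spans.

    Each predicted span is credited with its best string similarity to any
    gold span (and vice versa), so near-matches earn partial credit instead
    of being scored as outright misses.
    """
    if not predicted or not gold:
        return 0.0

    def sim(a: str, b: str) -> float:
        return SequenceMatcher(None, a, b).ratio()

    # Partial credit: best similarity of each prediction to any gold span ...
    precision = sum(max(sim(p, g) for g in gold) for p in predicted) / len(predicted)
    # ... and of each gold span to any prediction.
    recall = sum(max(sim(g, p) for p in predicted) for g in gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: a near-match ("某某公司" vs. "某某有限公司") still earns partial credit.
print(soft_f1(["某某公司", "2019年3月"], ["某某有限公司", "2019年3月"]))
```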
Implications and Future Directions
The introduction of LawBench provides a structured methodology for evaluating LLMs in legal domains, especially those rooted in the civil-law system used in China. The paper implies a need for collaborative efforts to develop high-quality, reliable legal AI systems that are not limited to English-language, common-law settings but are adaptable to legal systems worldwide.
Furthermore, the results suggest that future work might explore enhanced pre-training strategies, better alignment of LLMs with legal data, and improved retrieval-augmented approaches to help bridge the gap between human legal expertise and machine capabilities.
In essence, this research marks an important advance in quantifying and qualifying the legal domain capabilities of current LLMs, setting a foundation for future explorations into domain-specific AI applications.