LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain
The increasing complexity of applications in the legal domain necessitates the development of more specialized AI benchmarks to evaluate the efficacy of Retrieval-Augmented Generation (RAG) systems. The paper "LegalBench-RAG: A Benchmark for Retrieval-Augmented Generation in the Legal Domain" addresses a critical gap in this space by introducing the first dedicated benchmark designed specifically to assess the retrieval phase of RAG systems applied to legal texts.
Introduction to LegalBench-RAG
The paper delineates the design and implementation of LegalBench-RAG, constructed to measure how effectively retrieval systems can extract highly relevant snippets from voluminous legal corpora. Unlike general benchmarks, which often fail to address the specialized structures, terminologies, and requirements of legal documents, LegalBench-RAG focuses on retrieving minimal, precise text segments. This precision is vital: passing redundant or irrelevant text to the generator strains context-window limits, raises processing costs and latency, and increases the risk of hallucination by LLMs.
Dataset Construction and Scope
LegalBench-RAG's construction leveraged LegalBench, a pre-existing benchmark that evaluates the generative capabilities of LLMs in legal contexts. By retracing LegalBench queries to their original source texts, the researchers created a dataset of 6,858 query-answer pairs over a corpus exceeding 79 million characters. Because each answer is an exact text span, the benchmark also supports citation generation and veracity checks, enhancing its practical utility in legal applications.
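To make the structure concrete, each query-answer pair maps a natural-language question to one or more exact character spans in a source document. Below is a minimal sketch of how such a record might be represented; the field names (`query`, `file_path`, `span`) and the example values are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class GroundTruthSnippet:
    """One exact answer span inside a source document (character offsets)."""
    file_path: str          # e.g. a contract file in one of the four corpora
    span: tuple[int, int]   # (start_char, end_char) into the raw document text

@dataclass
class BenchmarkExample:
    """A single LegalBench-RAG-style query-answer pair (illustrative schema)."""
    query: str
    snippets: list[GroundTruthSnippet]

# Hypothetical example; the real benchmark contains 6,858 such pairs.
example = BenchmarkExample(
    query="Consider the NDA; does the agreement restrict sharing confidential "
          "information with third parties?",
    snippets=[GroundTruthSnippet(file_path="contractnli/nda_042.txt",
                                 span=(1532, 1804))],
)
```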
The benchmark includes four primary datasets:
- PrivacyQA: privacy policies of consumer applications.
- ContractNLI: non-disclosure agreements (NDAs).
- CUAD: private commercial contracts.
- MAUD: merger and acquisition agreements of public companies.
Quality Control and Limitations
Quality control involved manually verifying the mappings from annotation categories to natural-language questions and from document IDs to document descriptions; irrelevant or imprecise categories were excluded to maintain quality. Certain limitations persist, however: the corpus covers only a subset of existing legal document types, and queries requiring multi-hop reasoning across multiple sources are excluded.
Benchmarking and Results
The paper conducts extensive benchmarking of RAG systems using LegalBench-RAG, focusing on the retrieval phase. Experiments employed OpenAI's text embeddings, Cohere's reranker model, and various chunking strategies, including the Recursive Character Text Splitter (RCTS). Results indicated that RCTS outperformed simpler chunking strategies, while the general-purpose Cohere reranker underperformed, suggesting the need for more domain-specific reranking approaches.
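A minimal sketch of such a retrieval pipeline is given below, assuming LangChain's RecursiveCharacterTextSplitter for chunking and the OpenAI embeddings API for dense retrieval; the chunk size, model name, and helper names are illustrative choices, not the paper's exact configuration.

```python
import numpy as np
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts; the model name is an illustrative choice."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve(document_text: str, query: str, chunk_size: int = 500, k: int = 8) -> list[str]:
    # 1. Chunk the document with a recursive character text splitter (RCTS).
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    chunks = splitter.split_text(document_text)

    # 2. Embed the chunks and the query, then rank chunks by cosine similarity.
    chunk_vecs = embed(chunks)
    query_vec = embed([query])[0]
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(-sims)[:k]

    # 3. An optional reranking step (e.g. Cohere's reranker) would reorder
    #    these top-k chunks before they are scored against the gold spans.
    return [chunks[i] for i in top]
```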
Implications and Future Research
LegalBench-RAG has significant implications for the development and benchmarking of retrieval algorithms in legal AI systems. By offering a standardized evaluation framework, it enables more rigorous comparison and iterative improvement of RAG techniques, and it underscores the importance of domain-specific tools for legal text processing. Future work may focus on fine-tuning rerankers for legal contexts and on expanding the benchmark to cover more diverse and complex legal documents.
In sum, LegalBench-RAG fills a critical need in legal AI by providing a dedicated tool to measure retrieval precision and recall, ensuring that RAG systems in the legal domain are both reliable and efficient. This benchmark will be indispensable for researchers and companies aiming to deploy AI solutions within the highly specialized field of legal technology.
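As a closing illustration of what span-level precision and recall can mean in this setting, the sketch below scores retrieved character spans against gold answer spans. It is one plausible formulation, not necessarily the paper's exact metric definition; the function names and example offsets are assumptions.

```python
def char_overlap(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Number of characters shared by two (start, end) spans of the same document."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def precision_recall(retrieved: list[tuple[int, int]],
                     gold: list[tuple[int, int]]) -> tuple[float, float]:
    """Character-level precision and recall of retrieved spans against gold spans."""
    overlap = sum(char_overlap(r, g) for r in retrieved for g in gold)
    retrieved_chars = sum(end - start for start, end in retrieved)
    gold_chars = sum(end - start for start, end in gold)
    precision = overlap / retrieved_chars if retrieved_chars else 0.0
    recall = overlap / gold_chars if gold_chars else 0.0
    return precision, recall

# Hypothetical example: one retrieved chunk fully covers the gold snippet
# but also includes extra surrounding text, so recall is 1.0 and precision drops.
print(precision_recall(retrieved=[(1400, 1900)], gold=[(1532, 1804)]))  # (0.544, 1.0)
```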