CLERC: A Dataset for Legal Case Retrieval and Retrieval-Augmented Analysis Generation (2406.17186v2)

Published 24 Jun 2024 in cs.CL and cs.CY

Abstract: Legal professionals need to write analyses that rely on citations to relevant precedents, i.e., previous case decisions. Intelligent systems assisting legal professionals in writing such documents provide great benefits but are challenging to design. Such systems need to help locate, summarize, and reason over salient precedents in order to be useful. To enable systems for such tasks, we work with legal professionals to transform a large open-source legal corpus into a dataset supporting two important backbone tasks: information retrieval (IR) and retrieval-augmented generation (RAG). This dataset, CLERC (Case Law Evaluation Retrieval Corpus), is constructed for training and evaluating models on their ability to (1) find corresponding citations for a given piece of legal analysis and to (2) compile the text of these citations (as well as previous context) into a cogent analysis that supports a reasoning goal. We benchmark state-of-the-art models on CLERC, showing that current approaches still struggle: GPT-4o generates analyses with the highest ROUGE F-scores but hallucinates the most, while zero-shot IR models only achieve 48.3% recall@1000.

Summary

  • The paper introduces a large long-context legal dataset, CLERC, for advanced legal case retrieval and analysis generation.
  • It details the transformation of 1.84M case documents and over 20.7M citations into a retrieval and generation resource on which fine-tuned legal IR models substantially outperform zero-shot baselines.
  • The study shows that supplying the text of cited cases improves retrieval-augmented analysis generation, though models still hallucinate citations at notable rates.

CLERC: A Large Long-Context Dataset for Retrieval and Reasoning in Legal Text

The paper, authored by Abe Bohan Hou et al., introduces CLERC (Case Law Evaluation Retrieval Corpus), a substantial dataset designed for the specialized tasks of legal citation retrieval and retrieval-augmented analysis generation. The dataset was developed by transforming an extensive open-source legal corpus, the Caselaw Access Project (CAP), into a resource that supports advanced information retrieval (IR) and retrieval-augmented generation (RAG) tasks. This work is particularly significant for AI applications in the legal domain, providing a rigorous evaluation framework and facilitating the creation of intelligent systems that aid legal professionals in drafting legal analyses.

Dataset Construction and Features

The dataset consists of 1.84 million case documents from CAP, encompassing over 20.7 million citations and 23.7 million passages for retrieval, along with a specialized subset (CLERC-G) for legal analysis generation. Construction involved detailed preprocessing to ensure quality and relevance, and the released resource comprises four components:

  • CLERC/doc: Full-length case documents.
  • CLERC/passage: Document chunks for passage-level retrieval.
  • CLERC/queries: Retrieval queries built from citing text, with either the central citation alone (single-removed) or all citations (all-removed) masked out.
  • CLERC-G: Passages designated for the generation of legal analysis based on retrieved citations.

By collaborating with legal professionals, the authors have created a comprehensive and high-quality dataset that fulfills the dual purpose of supporting legal IR and RAG tasks.
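
For orientation, here is a minimal sketch of how these components might be loaded with the Hugging Face datasets library. The repository id and configuration names are assumptions made for illustration; consult the official CLERC release for the actual identifiers.

```python
# Minimal sketch of loading CLERC components with Hugging Face `datasets`.
# The repo id and config names are assumptions, not confirmed by the paper;
# check the official release for the actual identifiers.
from datasets import load_dataset

REPO = "jhu-clsp/CLERC"  # hypothetical repository id

docs = load_dataset(REPO, "doc", split="train")          # full case documents
passages = load_dataset(REPO, "passage", split="train")  # retrieval chunks
queries = load_dataset(REPO, "queries", split="test")    # citation-masked queries
generation = load_dataset(REPO, "generation", split="test")  # CLERC-G analyses

print(len(passages), queries[0])
```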

Legal Citation Retrieval

The authors designed legal citation retrieval tasks in which models must retrieve the passages or documents cited by a piece of legal text. They categorized queries as direct or indirect, depending on whether the citing text quotes the cited case, and performed detailed ablation studies on the effects of different query types and lengths. Key findings include:

  • Performance Evaluation: Models like BM25, transformer-based Late Interaction models, and Bi-Encoders were evaluated. BM25 achieved the highest zero-shot Recall@1K (48.3%) on single-removed queries. Fine-tuned models significantly outperformed baseline models, with LegalBERT DPR achieving 68.5% Recall@1K.
  • Challenges: The retrieval results indicated that domain-specific nuances in legal texts, such as the presence of common legal terms acting as distractors, severely impacted retrieval quality. Fine-tuning on legal-specific data mitigated these challenges to some extent.
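
To make the retrieval setup concrete, the sketch below scores a toy passage collection with BM25 and computes Recall@K. It uses the rank_bm25 package for brevity; at CLERC's scale a retrieval baseline would run on an inverted-index engine, and the passages and query here are invented.

```python
# Minimal BM25 retrieval + Recall@K sketch (illustrative toy example,
# not the paper's pipeline).
from rank_bm25 import BM25Okapi

# Toy passage collection standing in for CLERC/passage.
passages = [
    "the court held that the statute of limitations had expired",
    "summary judgment is appropriate when no genuine dispute exists",
    "the defendant moved to dismiss for lack of jurisdiction",
]
bm25 = BM25Okapi([p.split() for p in passages])

def recall_at_k(query: str, relevant_ids: set[int], k: int) -> float:
    """Fraction of relevant passages found in the top-k BM25 results."""
    scores = bm25.get_scores(query.split())
    topk = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return len(relevant_ids & set(topk)) / len(relevant_ids)

# A 'single-removed' style query: citing text with the citation masked.
query = "whether the statute of limitations barred the claim"
print(recall_at_k(query, relevant_ids={0}, k=2))
```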

Retrieval-Augmented Analysis Generation

For the generation task, systems were asked to craft legal analyses given the cited cases and the preceding context of the document:

  • Evaluation Metrics: The authors proposed and used metrics such as Citation Recall (CR), Citation Precision (CP), and Citation False Positive rate (CFP) to measure the fidelity and relevance of citations in generated texts.
  • Model Performance: GPT-4o achieved the highest ROUGE F-scores and citation metrics but also hallucinated citations at the highest rate. Prompting with the text of cited documents substantially improved performance across models, underscoring the importance of comprehensive input context.
  • Insights: Despite these improvements, the models exhibited notable limitations in maintaining factual and analytical accuracy. The paper shows that high citation metrics do not necessarily equate to robust and reliable legal analyses, emphasizing the complexity of legal domain text generation.
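
As an illustration of how such citation metrics can be computed, the sketch below implements one plausible reading: citations are extracted from the generated and reference analyses and compared as sets, with CFP counting generated citations that appear in neither the reference nor the provided documents. The regex and the exact definitions are assumptions; the paper's formal definitions may differ.

```python
# One plausible reading of Citation Recall (CR), Citation Precision (CP),
# and Citation False Positive rate (CFP); illustrative, not the paper's
# exact definitions.
import re

# Simplistic reporter-citation pattern, e.g. "123 F.3d 456". An assumption;
# robust legal citation extraction needs a dedicated parser.
CITE_RE = re.compile(r"\b\d+\s+[A-Z][A-Za-z.0-9]*\s+\d+\b")

def extract_citations(text: str) -> set[str]:
    return set(CITE_RE.findall(text))

def citation_metrics(generated: str, reference: str, provided: set[str]):
    gen, gold = extract_citations(generated), extract_citations(reference)
    cr = len(gen & gold) / len(gold) if gold else 0.0   # recall of gold cites
    cp = len(gen & gold) / len(gen) if gen else 0.0     # precision of gen cites
    # False positives: cited cases found neither in the reference analysis
    # nor among the documents supplied to the model.
    cfp = len(gen - gold - provided) / len(gen) if gen else 0.0
    return cr, cp, cfp

gen_text = "See 123 F.3d 456 and 9 U.S. 99."
ref_text = "The controlling case is 123 F.3d 456."
print(citation_metrics(gen_text, ref_text, provided={"123 F.3d 456"}))
```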

Implications and Future Directions

The development of CLERC opens several avenues for future research and application:

  • Enhanced Retrieval Models: The findings indicate that fine-tuned retrieval models significantly improve performance. Future work could explore more sophisticated domain adaptation techniques and hybrid retrieval models.
  • Improved Generation Metrics: The current evaluation metrics, while useful, can overestimate analysis quality. Developing more nuanced metrics tailored to legal analysis generation would further advance the field.
  • Broader Legal Applications: While CLERC focuses on federal case law, extending similar methodologies to other types of legal texts (e.g., statutes, regulatory documents) could yield comprehensive AI systems applicable to a wider range of legal contexts.

In conclusion, CLERC represents a vital resource for advancing AI in legal tech, enabling the development and evaluation of models for complex tasks like long-context retrieval and legal analysis generation. The work underscores the necessity of specialized datasets and tailored evaluation metrics to address the unique challenges posed by legal texts.