LegalCiteBench: Evaluating Citation Reliability in Legal Language Models

Published 11 May 2026 in cs.CL and cs.AI | (2605.10186v1)

Abstract: LLMs are increasingly integrated into legal drafting and research workflows, where incorrect citations or fabricated precedents can cause serious professional harm. Existing legal benchmarks largely emphasize statutory reasoning, contract understanding, or general legal question answering, but they do not directly study a central common-law failure mode: when asked to provide case authorities without external grounding, models may return plausible-looking but incorrect citations or cases. We introduce LegalCiteBench, a benchmark for studying closed-book citation recovery, citation verification, and case matching in legal LLMs. LegalCiteBench contains approximately 24K evaluation instances constructed from 1,000 real U.S. judicial opinions from the Case Law Access Project. The benchmark covers five citation-centric tasks: citation retrieval, citation completion, citation error detection, case matching, and case verification and correction. Across 21 LLMs, exact citation recovery remains highly challenging in this closed-book setting: even the strongest models score below 7/100 on citation retrieval and completion. Within the evaluated models, scale and legal-domain pretraining provide limited gains and do not resolve this difficulty. Models also frequently provide concrete but incorrect or low-overlap authorities under our evaluation protocol, with Misleading Answer Rates (MAR) exceeding 94% for 20 of 21 evaluated models on retrieval-heavy tasks. A prompt-only abstention experiment shows that explicit uncertainty instructions reduce some confident fabrication but do not improve citation correctness. LegalCiteBench is intended as a diagnostic framework for studying authority generation failures, verification behavior, and abstention when external grounding is absent, incomplete, or bypassed.


Authors (3)

Summary

The paper demonstrates that both general and legal-domain LLMs struggle to retrieve judicial citations accurately in a closed-book setting, scoring below 7/100 on citation retrieval and completion and exhibiting Misleading Answer Rates above 94% for 20 of 21 models.
It introduces a benchmark of approximately 24,000 instances built from 1,000 U.S. judicial opinions to assess citation retrieval, citation completion, citation error detection, case matching, and case verification and correction.
Legal-specific pretraining helps on error detection and verification, and prompt-based abstention reduces some confident fabrication, but neither resolves the core citation retrieval failure.

LegalCiteBench: A Benchmark for Citation Reliability in Legal LLMs

Motivation and Problem Scope

LegalCiteBench addresses a critical and insufficiently explored challenge in legal language modeling: the reliability of citation generation, verification, and authority recovery in closed-book settings. Standard benchmarks in legal NLP frequently focus on statutory reasoning, contract comprehension, or generic legal QA, but they neglect a central requirement of U.S. common-law workflows—precise, reliable citation of precedent. In legal practice, models that generate plausible yet incorrect or fabricated case citations pose substantial risks, with potential professional, ethical, and practical consequences. LegalCiteBench probes the ability of legal-domain and general LLMs to recover, verify, and correctly abstain when requested to produce judicial authorities with no access to external retrieval mechanisms.

Benchmark Construction and Task Design

LegalCiteBench consists of approximately 24,000 instances from 1,000 authentic U.S. judicial opinions, sampled across jurisdictions and years via the Case Law Access Project. The benchmark spans five granular task categories:

Citation Retrieval (Cat1): Models must recover the set of cited judicial authorities from a research question reflecting the context of the original opinion.
Citation Completion (Cat2): Given a partially supplied citation set, models must identify the remaining authorities used in the source opinion.
Citation Error Detection (Cat3): Presented with a legal analysis paragraph containing citations, models must detect and correct any citation errors at the string level.
Case Matching (Cat4-1): Models must match anonymized, identifier-stripped legal scenarios to the underlying judicial decision.
Case Verification and Correction (Cat4-2): Given a legal problem and a candidate citation, models verify correctness or supply the proper source case if the reference is incorrect.

Construction is fully auditable, with LLM-driven intermediate representations (summaries, scenario transformations) and final ground truths anchored strictly in non-generated (opinion-derived) data. Human validation on a significant subset verifies construction quality, citation accuracy, and prompt diversity.

Evaluation Protocol and Metrics

Model outputs are scored on a 0-100 scale using a task- and category-specific LLM-as-judge protocol, primarily implemented via GPT-4o-mini. Evaluation emphasizes exact citation string matching, correct detection of citation errors, and accurate case identification or verification. Scoring is strict about volume, reporter, and page numbers, so a correct case name or a semantically plausible answer is not sufficient on its own.
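To make the strictness about volume, reporter, and page concrete, the following is a minimal sketch of component-level citation matching. It is not the benchmark's judge (which is LLM-based); the `Citation` type, the regular expression, and the `exact_match` helper are hypothetical and assume simple "volume reporter page" citations such as "410 U.S. 113".

```python
import re
from typing import NamedTuple, Optional


class Citation(NamedTuple):
    volume: int
    reporter: str
    page: int


# Assumed format: "<volume> <reporter> <page>", e.g. "410 U.S. 113".
# Real citations have many more forms; this pattern is illustrative only.
_CITE_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\d+)\s*$")


def parse_citation(text: str) -> Optional[Citation]:
    """Split a citation string into volume, reporter, and first page."""
    match = _CITE_RE.match(text)
    if match is None:
        return None
    volume, reporter, page = match.groups()
    # Collapse whitespace and lowercase the reporter so spacing and case
    # differences do not count as mismatches.
    reporter = re.sub(r"\s+", " ", reporter).strip().lower()
    return Citation(int(volume), reporter, int(page))


def exact_match(predicted: str, gold: str) -> bool:
    """True only if volume, reporter, and page all agree."""
    pred, ref = parse_citation(predicted), parse_citation(gold)
    return pred is not None and pred == ref
```

Under this sketch, a response that names the right case but gives no reporter citation, or one that is off by a single page number, counts as a miss.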

A key diagnostic metric introduced is the Misleading Answer Rate (MAR): the proportion of low-scoring responses that nevertheless provide concrete but incorrect citations or case answers instead of abstaining. A high MAR indicates that a model fails to calibrate its uncertainty and fabricates authorities that look confident but are incorrect.
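The MAR definition above can be sketched as a simple ratio. This is an illustration under stated assumptions, not the paper's implementation: the `ScoredResponse` record, the `abstained` flag, and the `low_score_threshold` default of 50 are all hypothetical, since this summary does not specify how "low-scoring" or "abstaining" is operationalized.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ScoredResponse:
    score: float      # judge score on the 0-100 scale
    abstained: bool   # True if the model declined to name a concrete authority


def misleading_answer_rate(
    responses: Iterable[ScoredResponse],
    low_score_threshold: float = 50.0,  # assumed cutoff, not from the source
) -> float:
    """Fraction of low-scoring responses that still give a concrete answer."""
    low = [r for r in responses if r.score < low_score_threshold]
    if not low:
        return 0.0  # no low-scoring responses, so nothing can be misleading
    misleading = sum(1 for r in low if not r.abstained)
    return misleading / len(low)
```

For example, if three responses score below the threshold and two of them still name a specific case, this sketch yields a MAR of 2/3; high-scoring responses do not enter the denominator.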

Principal Experimental Findings

Citation Recovery is Universally Challenging

All 21 evaluated LLMs, spanning proprietary and open-source models with general and legal-domain pretraining, perform very poorly on exact citation retrieval and completion. The highest Cat1 (retrieval) score is 6.80/100, and no model exceeds 7/100 on either task (Cat1 or Cat2). This holds regardless of scale or explicit legal-domain pretraining. The dominant failure mode is the production of plausible but incorrect or fabricated citations.

Verification and Error Detection Are Relatively Tractable

Verification-style tasks (Cat3, Cat4-2) are markedly easier for current LLMs. For citation error detection (Cat3), best-in-class models (e.g., SaulLM-54B) reach scores above 75/100, indicating substantially greater capacity for auditing existing references. Case verification/correction (Cat4-2) tasks are similarly tractable, with leading models approaching or exceeding 90/100.

Misleading Output is Systematic

MAR is consistently high. For 20 of 21 models, MAR exceeds 94% on retrieval-heavy tasks: models almost always provide a specific citation even when their citation scores are near zero. This behavior persists across architectures, model families, and pretraining regimes, indicating that current models default to invention over abstention when confronted with uncertainty.

Neither Model Scale nor Legal-Domain Pretraining Resolves the Core Failure

Across models ranging from ~1B to 70B parameters, scale yields only small gains in citation retrieval; for example, Llama-3.1-70B achieves 3.82 versus 1.47 for Llama-3.1-8B. SaulLM-54B, a model with legal-specific pretraining, achieves leading performance on auditing and verification tasks but remains ineffective at citation recall (Cat1: 3.77, Cat2: 4.06) and maintains a high MAR. This suggests that neither parametric memorization nor narrow domain adaptation suffices; the bottleneck is structural, relating to grounding and calibration in the absence of explicit retrieval.

Prompt-Only Abstention Reduces Fabrication, Not Correctness

Prompting models with explicit abstention instructions reduces MAR for some systems (e.g., Qwen3-14B, Llama-3.1-8B-Instruct) but does not improve citation F1, which remains ≈0. The reduction stems entirely from increased abstention, not from more accurate retrieval, indicating that calibration interventions alone do not resolve the grounding deficit.
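A small sketch shows why abstention can lower MAR without moving citation F1. It assumes a set-based F1 over normalized citation strings; the `normalize` and `citation_f1` helpers are hypothetical, and the paper's exact F1 computation may differ.

```python
import re
from typing import Iterable


def normalize(citation: str) -> str:
    """Collapse whitespace and lowercase so trivial variants compare equal."""
    return re.sub(r"\s+", " ", citation).strip().lower()


def citation_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Set-based F1 between predicted and gold citation strings."""
    pred_set = {normalize(c) for c in predicted}
    gold_set = {normalize(c) for c in gold}
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        # Covers both abstention (no predictions) and pure fabrication.
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

In this sketch an abstaining response (an empty prediction list) and a fabricated citation both receive an F1 of 0, so replacing fabrications with abstentions changes MAR while leaving F1 where it was.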

Implications and Prospective Directions

Practical Deployment in Legal Workflows

Current LLMs, even those specialized for legal NLP, cannot be trusted to generate authorities or citations in a closed-book regime. Practitioners must not rely solely on parametric knowledge for precedent retrieval; rigorous retrieval-augmented architectures and explicit citation verification must remain standard. The pervasiveness of plausible hallucinations mandates that any citation generated without external retrieval be independently verified before use in legal analysis.
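As one way to picture the verification step described above, the sketch below separates model-generated citations into those found in a trusted index and those that need manual review. The `trusted_index` set stands in for a lookup against an authoritative legal database; it and the `split_by_verification` helper are hypothetical and are not part of the benchmark.

```python
import re
from typing import Iterable, List, Set, Tuple


def normalize(citation: str) -> str:
    """Collapse whitespace and lowercase so trivial variants compare equal."""
    return re.sub(r"\s+", " ", citation).strip().lower()


def split_by_verification(
    generated: Iterable[str],
    trusted_index: Set[str],  # assumed stand-in for an authoritative database
) -> Tuple[List[str], List[str]]:
    """Return (verified, unverified) citations, preserving input order."""
    trusted = {normalize(c) for c in trusted_index}
    verified: List[str] = []
    unverified: List[str] = []
    for citation in generated:
        if normalize(citation) in trusted:
            verified.append(citation)
        else:
            unverified.append(citation)
    return verified, unverified
```

A string match of this kind only confirms that a citation exists in the index; whether the cited case actually supports the proposition still requires review by a qualified reader.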

Theoretical Insights and Model Design

The results empirically distinguish “recognition” from “generation” in legal reasoning for LLMs. While models can often audit or verify when presented with a specific candidate authority, they cannot reliably generate those authorities unaided. This highlights the parametric constraint of LLMs as pattern matchers rather than factual indexers in high-precision, reference-centric legal domains.

Future Research and Benchmark Utility

LegalCiteBench provides a robust, reproducible, and auditable framework for tracking improvements in citation reliability and abstention calibration. Priority avenues for future methodological work include:

Retrieval-Augmented Generation: Integrating fully grounded retrieval from authoritative legal databases to replace parametric recall for citations.
Post-Training Interventions: Developing approaches that explicitly penalize misleading citation generation and promote calibrated abstention.
Extension to Non-U.S. and Civil Law Jurisdictions: Adapting the methodology to structure citation benchmarking across broader legal traditions.
Human-in-the-Loop Evaluation: Scaling expert validation to assess the real-world alignment between model outputs and legal practice.

Conclusion

LegalCiteBench establishes that, in closed-book conditions, current legal-domain and general LLMs fail to reliably recover or complete judicial authorities and rarely abstain when they cannot: all models score below 7/100 on citation retrieval and completion, and MAR exceeds 94% for 20 of 21 systems. Legal-domain pretraining has marginal impact on retrieval but provides gains in error detection and verification. Prompt-based abstention increases refusals but does not close the correctness gap. These results highlight the need for retrieval-centric architectures and systematic calibration before LLMs are deployed in professional legal contexts (2605.10186).
