
Citation Grounding: Detecting and Reducing LLM Citation Hallucinations via Legal Citation Graphs

Published 30 May 2026 in cs.CL and cs.DL | (2606.00898v1)

Abstract: LLMs systematically hallucinate legal citations -- fabricating statute references, citing repealed provisions, and confusing jurisdictions -- yet no automated method exists to measure or reduce this behavior at scale. We propose citation grounding (CG), a metric that verifies LLM-generated legal citations against a ground-truth citation graph extracted from 100.8 million Ukrainian court decisions (502 million edges, 21,736 unique statute nodes). CG decomposes into three components -- citation precision (does the cited provision exist?), citation relevance (is it contextually appropriate?), and citation temporality (was it valid at the relevant date?) -- enabling differential diagnosis of hallucination types. Empirical evaluation on 100 Ukrainian legal queries across five systems -- four commercial LLMs via AWS Bedrock (Claude Haiku 4.5, Mistral Pixtral Large, Amazon Nova Pro/Lite) and one RAG-augmented production system -- reveals CG ranging from 0.791 to 0.873, with 13-21% of citations hallucinated. To reduce hallucinations without human annotation, we introduce Citation Grounding DPO (CG-DPO): a method that constructs preference pairs algorithmically by corrupting verified citations from real court decisions via four targeted strategies. On a dataset of 2,244 court decisions, a Qwen2.5-7B-Instruct model fine-tuned with LoRA achieves 98.5% mean validation accuracy in distinguishing correct from corrupted citations (rewards margin +14.9, std < 0.3 pp across 3 seeds). The citation graph, evaluation framework, and CG-DPO dataset are released as open resources.

Authors (1)

Summary

  • The paper introduces a Citation Grounding (CG) metric that decomposes citation quality into precision, relevance, and temporality components.
  • The paper evaluates five systems, with CG scores ranging from 0.791 to 0.873; the retrieval-augmented system scores highest (CG = 0.873).
  • The paper presents a scalable CG-DPO alignment method achieving 98.5% mean validation accuracy in discriminating correct citations from corrupted ones.

Problem Statement and Motivation

LLMs exhibit systematic hallucination of legal citations—fabricating statute references, citing repealed provisions, and confusing jurisdictions—posing a severe reliability risk in legal practice. Despite the prevalence and impact of such errors, no scalable, automated evaluation or mitigation methods are available. Existing legal NLP benchmarks (e.g., LEXTREME, LexGLUE, LegalBench) lack generative citation accuracy metrics and focus on classification tasks rather than factual citation grounding. The research addresses this critical gap by leveraging a citation graph extracted from 100.8 million Ukrainian court decisions to both measure and reduce LLM citation hallucinations.

Citation Graph Extraction and Structure

The citation graph $\mathcal{G} = (V, E)$ is bipartite: its nodes are court decisions ($V_d$) and statute articles ($V_n$), and each edge records a decision citing an article. With over 502 million citation edges and 21,736 unique statute nodes (type–law–article triples), it is described as the largest known legal citation graph. The graph spans three distinct legal epochs (2005–2013, 2014–2021, 2022–2026), so it captures changes in statutory validity over time, which is a key input for temporality analysis. Statute nodes are extracted with compiled regular expressions, which attain 100% precision on validation samples.
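
The bipartite structure described above can be sketched as a pair of adjacency maps. This is a minimal illustration, not the paper's implementation: the class names, the field names of the triple, and the placeholder laws ("Law A", "Law B") are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class StatuteNode(NamedTuple):
    """A statute node as a (type, law, article) triple; field names are illustrative."""
    kind: str      # hypothetical label such as "code" or "law"
    law: str       # placeholder law name
    article: str   # article identifier


class CitationGraph:
    """Bipartite graph: decision ids on one side, statute nodes on the other."""

    def __init__(self) -> None:
        self.cites = defaultdict(set)      # decision id -> statute nodes it cites
        self.cited_by = defaultdict(set)   # statute node -> decision ids citing it

    def add_edge(self, decision_id: str, statute: StatuteNode) -> None:
        # Every edge joins one decision to one statute article.
        self.cites[decision_id].add(statute)
        self.cited_by[statute].add(decision_id)

    def has_statute(self, statute: StatuteNode) -> bool:
        # Membership test only; it does not create a new key.
        return statute in self.cited_by

    def num_edges(self) -> int:
        return sum(len(statutes) for statutes in self.cites.values())


graph = CitationGraph()
art_10 = StatuteNode("code", "Law A", "10")
graph.add_edge("decision-1", art_10)
graph.add_edge("decision-2", art_10)
graph.add_edge("decision-1", StatuteNode("law", "Law B", "3"))
```

Keeping both directions of the adjacency makes the two lookups the metric needs cheap: whether an article exists at all, and which decisions cite it.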

Citation Grounding (CG) Metric and Decomposition

CG is defined as the proportion of LLM-generated citations verifiable in the ground-truth citation graph. It decomposes into three diagnostic components:

  • Citation Precision ($CP$): Verifies existence of the cited statute article in the legislation corpus.
  • Citation Relevance ($CR$): Assesses whether the citation is contextually appropriate by checking if similar court decisions cite the same article.
  • Citation Temporality ($CT$): Validates whether the statute was in force at the relevant date.

The composite citation quality score strictly penalizes non-existent citations, reflecting legal practice where fabricated citations invalidate documents. The metric enables granular diagnosis of hallucination types—fabrications, outdated norms, jurisdictional confusion, and false argumentation.
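
A minimal sketch of the three component checks follows, under several assumptions. The summary does not give the composite formula, so the gating used here (a citation scores zero if it does not exist, otherwise the mean of its relevance and temporality checks) is a hypothetical choice. Matching on legal domain stands in for "similar court decisions cite the same article", and all names and data are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Tuple

Citation = Tuple[str, str]  # (law, article); simplified stand-in for a statute node


@dataclass(frozen=True)
class StatuteRecord:
    in_force_from: date
    repealed_on: Optional[date]   # None means still in force
    citing_domains: frozenset     # domains of decisions citing this article (assumed proxy)


def exists(c: Citation, graph: Dict[Citation, StatuteRecord]) -> bool:
    return c in graph


def is_relevant(c: Citation, graph: Dict[Citation, StatuteRecord], domain: str) -> bool:
    rec = graph.get(c)
    return rec is not None and domain in rec.citing_domains


def is_in_force(c: Citation, graph: Dict[Citation, StatuteRecord], on: date) -> bool:
    rec = graph.get(c)
    if rec is None:
        return False
    return rec.in_force_from <= on and (rec.repealed_on is None or on < rec.repealed_on)


def score_citations(citations: List[Citation], graph: Dict[Citation, StatuteRecord],
                    domain: str, on: date) -> Dict[str, float]:
    if not citations:
        raise ValueError("no citations to score")
    n = len(citations)
    p = [exists(c, graph) for c in citations]
    r = [is_relevant(c, graph, domain) for c in citations]
    t = [is_in_force(c, graph, on) for c in citations]
    # Hypothetical composite: a non-existent citation contributes 0 regardless of the rest.
    composite = sum((ri + ti) / 2 if pi else 0.0 for pi, ri, ti in zip(p, r, t)) / n
    return {"CP": sum(p) / n, "CR": sum(r) / n, "CT": sum(t) / n, "composite": composite}


toy_graph = {
    ("Law A", "10"): StatuteRecord(date(2004, 1, 1), None, frozenset({"civil"})),
    ("Law B", "20"): StatuteRecord(date(2004, 1, 1), None, frozenset({"family"})),
    ("Law C", "30"): StatuteRecord(date(1996, 1, 1), date(2015, 1, 1), frozenset({"civil"})),
}
cited = [("Law A", "10"), ("Law B", "20"), ("Law C", "30"), ("Law A", "9999")]
result = score_citations(cited, toy_graph, domain="civil", on=date(2020, 6, 1))
```

In this toy run the fabricated article lowers all three components, the repealed article fails only temporality, and the off-domain article fails only relevance, which is the kind of differential diagnosis the decomposition is meant to support.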

Empirical Evaluation Across Systems

Experiments were conducted on 100 Ukrainian legal queries covering seven domains, evaluated across four commercial LLMs (Claude Haiku 4.5, Mistral Pixtral Large, Amazon Nova Pro/Lite via AWS Bedrock) and a RAG-augmented production system (LEX Chat). Citation grounding scores ranged from 0.791 to 0.873, corresponding to hallucination rates of 13–21%. Notably, the RAG system attained the highest accuracy (CG=0.873) at the lowest citation density, demonstrating the effectiveness of retrieval-augmented generation.
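
The reported hallucination rates follow from the CG scores if every citation that fails grounding is counted as hallucinated, which is the assumption behind this short check. The function name is illustrative.

```python
def hallucination_rate(cg: float) -> float:
    """Share of citations not verified in the graph, assuming rate = 1 - CG."""
    return 1.0 - cg


for cg in (0.873, 0.791):
    # Rounded to whole percentages, these match the 13-21% range quoted above.
    print(f"CG={cg:.3f} -> {hallucination_rate(cg):.1%} ungrounded")
```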

The relationship between citation density and accuracy was weak ($\rho = -0.12$), contrary to prior studies, and proved architecture-dependent. Domain-level analysis revealed perfect grounding in constitutional law across all models (CG = 1.0), high accuracy in criminal law, and significant variance in family and labor law (CG ranging from 0.46 to 0.90), highlighting domain-specific citation predictability and graph coverage gaps.
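
The summary does not say which correlation coefficient $\rho$ denotes. Assuming it is Spearman's rank correlation, a density-versus-accuracy check could be sketched as below in plain Python. The input lists are made-up toy values, not data from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy inputs only: citations per answer and a per-answer grounding score.
toy_density = [1, 2, 3, 4, 5]
toy_grounding = [5, 6, 7, 8, 7]
rho = spearman(toy_density, toy_grounding)
```

A value near zero, like the one reported, would mean that answers with more citations are neither consistently better nor consistently worse grounded.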

Qualitative case analyses show how the CG decomposition separates three outcomes: coverage limitations, relevance failures, and perfect grounding.

Citation Grounding DPO: Automated Alignment Without Human Annotation

To mitigate citation hallucinations, Citation Grounding DPO (CG-DPO) constructs preference pairs by algorithmic corruption of real court citations. Four corruption strategies (article swap, law swap, hallucination injection, anachronism) target specific CG components. Using a dataset of 2,244 court decisions, a Qwen2.5-7B-Instruct model fine-tuned with LoRA achieved 98.5% mean validation accuracy in discriminating correct from corrupted citations. Rapid convergence and a high rewards margin (+14.9) indicate that the structural salience of graph-based corruptions provides a strong gradient signal for alignment.
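
The pair construction can be sketched as follows. The summary names the four strategies but does not define them, so each function below is an interpretation inferred from the strategy name. The statute table, the list of repealed provisions, and the prompt/chosen/rejected record layout are likewise assumptions for illustration.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Citation:
    law: str
    article: int

    def render(self) -> str:
        return f"Art. {self.article} of {self.law}"


def article_swap(c: Citation, statutes: Dict[str, Set[int]], rng: random.Random) -> Citation:
    # Same law, different article that really exists.
    return Citation(c.law, rng.choice(sorted(statutes[c.law] - {c.article})))


def law_swap(c: Citation, statutes: Dict[str, Set[int]], rng: random.Random) -> Citation:
    # Same article number, attributed to a different law.
    return Citation(rng.choice(sorted(l for l in statutes if l != c.law)), c.article)


def hallucination_injection(c: Citation, statutes: Dict[str, Set[int]],
                            rng: random.Random) -> Citation:
    # Article number beyond any that exists in this law.
    return Citation(c.law, max(statutes[c.law]) + rng.randint(1, 500))


def anachronism(repealed: List[Citation], rng: random.Random) -> Citation:
    # A provision that is no longer in force.
    return rng.choice(repealed)


def build_pairs(prompt: str, verified: Citation, statutes: Dict[str, Set[int]],
                repealed: List[Citation], seed: int = 0) -> List[dict]:
    rng = random.Random(seed)
    corrupted = {
        "article_swap": article_swap(verified, statutes, rng),
        "law_swap": law_swap(verified, statutes, rng),
        "hallucination_injection": hallucination_injection(verified, statutes, rng),
        "anachronism": anachronism(repealed, rng),
    }
    return [
        {"prompt": prompt, "chosen": verified.render(),
         "rejected": bad.render(), "strategy": name}
        for name, bad in corrupted.items()
    ]


# Placeholder data; none of these laws or articles refer to real legislation.
STATUTES = {"Law A": {10, 11, 12}, "Law B": {3, 4}}
REPEALED = [Citation("Law C", 7)]
pairs = build_pairs("Which provision applies?", Citation("Law A", 10), STATUTES, REPEALED)
```

Because the verified citation comes from a real decision and the rejected one is produced mechanically, no human labeling is needed to decide which side of each pair is preferred.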

Discussion and Implications

The research demonstrates that retrieval augmentation (RAG) reliably enhances citation grounding, outperforming model scaling. The coverage-dependent nature of the citation graph means that CG is a conservative metric: it can yield false positives (real citations flagged due to graph gaps), but robustly detects fabricated citations. Expansion of the citation graph to encompass additional legal sources (e.g., ECHR, administrative registries) is expected to improve coverage and reduce false positives.

The CG-DPO methodology offers scalable, annotation-free alignment for factual citation accuracy—especially relevant for low-resource or emerging legal systems. Its focus on citations rather than argumentation ensures targeted alignment without undesirable behavior drift. Domain adaptation for common law systems would require restructuring the graph to reference case law rather than statutes.

Limitations include exclusive reliance on court decisions (limiting coverage), temporal validation restricted to statute existence (not versioning), and scope confined to Ukrainian codified law. End-to-end evaluation of hallucination reduction in open-ended LLM generation remains an avenue for further research, potentially requiring integration with a preceding supervised fine-tuning (SFT) stage or alternative alignment protocols.

Conclusion

Citation grounding leverages large-scale legal citation graphs to provide an automated, scalable solution for detecting and reducing LLM citation hallucinations. The metric enables component-wise diagnosis and achieves high accuracy in empirical evaluations. CG-DPO aligns model outputs with real judicial practice without human annotators, achieving strong discrimination between correct and corrupted citations. The approach is generalizable to any legal system with sufficient structured citation corpora and is released as an open resource, facilitating further theoretical and practical advances in trustworthy legal AI (2606.00898).
