- The paper introduces a Citation Grounding (CG) metric that decomposes citation quality into precision, relevance, and temporality components.
- The paper demonstrates that retrieval-augmented generation improves citation accuracy: CG scores range from 0.791 to 0.873 across the evaluated systems, with the RAG-augmented system scoring highest.
- The paper presents a scalable CG-DPO alignment method achieving 98.5% mean validation accuracy in discriminating correct citations from corrupted ones.
Citation Grounding: Automated Detection and Reduction of Legal Citation Hallucinations in LLMs via Legal Citation Graphs
Problem Statement and Motivation
LLMs exhibit systematic hallucination of legal citations—fabricating statute references, citing repealed provisions, and confusing jurisdictions—posing a severe reliability risk in legal practice. Despite the prevalence and impact of such errors, no scalable, automated evaluation or mitigation methods are available. Existing legal NLP benchmarks (e.g., LEXTREME, LexGLUE, LegalBench) lack generative citation accuracy metrics and focus on classification tasks rather than factual citation grounding. The research addresses this critical gap by leveraging a citation graph extracted from 100.8 million Ukrainian court decisions to both measure and reduce LLM citation hallucinations.
Citation Graph Extraction and Structure
The citation graph G=(V,E) is bipartite, with two node sets: court decisions (Vd) and statute articles (Vn). With over 502 million citation edges and 21,736 unique statute nodes (type–law–article triples), it is reported to be the largest known legal citation graph. The graph spans three distinct legal epochs (2005–2013, 2014–2021, 2022–2026), which captures changes in statutory validity over time, a key factor for temporality analysis. Statute node extraction attains 100% precision on validation samples via compiled regular expressions.
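As a rough illustration of the structure described above, the following sketch builds a tiny bipartite graph from decision text. The regular expression, the `StatuteNode` and `CitationGraph` names, and the fixed node type "code" are all assumptions made for this example; the actual extraction patterns are not given in the text and are certainly more extensive.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class StatuteNode(NamedTuple):
    """A statute node as a (type, law, article) triple."""
    kind: str     # node type; fixed to "code" in this sketch (assumption)
    law: str      # law abbreviation as it appears in the text
    article: str  # article number, kept as a string (e.g. "130-1")


# Hypothetical, simplified pattern: "ст. <number> <ABBREVIATION> України".
ARTICLE_RE = re.compile(r"ст\.\s*(\d+(?:-\d+)?)\s+([А-ЯІЇЄҐ]{2,5})\s+України")


def extract_statutes(text: str) -> set:
    """Return the set of statute nodes cited in a decision text."""
    return {StatuteNode("code", law, art) for art, law in ARTICLE_RE.findall(text)}


class CitationGraph:
    """Bipartite graph: decision ids on one side, statute nodes on the other."""

    def __init__(self):
        self.cites = defaultdict(set)     # decision id -> statute nodes
        self.cited_by = defaultdict(set)  # statute node -> decision ids

    def add_decision(self, decision_id: str, text: str) -> None:
        for node in extract_statutes(text):
            self.cites[decision_id].add(node)
            self.cited_by[node].add(decision_id)

    def edge_count(self) -> int:
        return sum(len(nodes) for nodes in self.cites.values())
```

Keeping both adjacency maps makes the two lookups used later cheap: which articles a decision cites, and which decisions cite a given article.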
Citation Grounding (CG) Metric and Decomposition
CG is defined as the proportion of LLM-generated citations verifiable in the ground-truth citation graph. It decomposes into three diagnostic components:
- Citation Precision (CP): Verifies existence of the cited statute article in the legislation corpus.
- Citation Relevance (CR): Assesses whether the citation is contextually appropriate by checking if similar court decisions cite the same article.
- Citation Temporality (CT): Validates whether the statute was in force at the relevant date.
The composite citation quality score strictly penalizes non-existent citations, reflecting legal practice where fabricated citations invalidate documents. The metric enables granular diagnosis of hallucination types—fabrications, outdated norms, jurisdictional confusion, and false argumentation.
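The decomposition can be sketched as four simple fractions over the citations in one answer. This is a minimal illustration, not the paper's implementation: the function and parameter names are hypothetical, the lookup sets stand in for the citation graph and legislation corpus, and the composite score is omitted because its formula is not stated here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Citation:
    law: str
    article: str


def fraction(flags) -> float:
    """Share of True values; 0.0 for an empty answer (an assumption)."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def cg_report(citations, graph_nodes, corpus, peer_cited, in_force, on_date):
    """Compute CG and its three diagnostic components for one answer.

    graph_nodes: citations present in the citation graph
    corpus:      citations that exist in the legislation corpus (for CP)
    peer_cited:  citations used by similar court decisions (for CR)
    in_force:    citation -> (start_date, end_date or None) (for CT)
    on_date:     the date the query refers to
    """
    def valid_on(c):
        span = in_force.get(c)
        if span is None:
            return False
        start, end = span
        return start <= on_date and (end is None or on_date <= end)

    return {
        "CG": fraction(c in graph_nodes for c in citations),
        "CP": fraction(c in corpus for c in citations),
        "CR": fraction(c in peer_cited for c in citations),
        "CT": fraction(valid_on(c) for c in citations),
    }
```

Reporting the components side by side is what allows the diagnosis described above: a low CP points to fabrication, a low CR to contextually inappropriate citations, and a low CT to outdated norms.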
Empirical Evaluation Across Systems
Experiments were conducted on 100 Ukrainian legal queries covering seven domains, evaluated across four commercial LLMs (Claude Haiku 4.5, Mistral Pixtral Large, Amazon Nova Pro/Lite via AWS Bedrock) and a RAG-augmented production system (LEX Chat). Citation grounding scores ranged from 0.791 to 0.873, corresponding to hallucination rates of 13–21%. Notably, the RAG system attained the highest accuracy (CG=0.873) at the lowest citation density, demonstrating the effectiveness of retrieval-augmented generation.
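The reported hallucination rates follow from the CG scores if the rate is taken to be the complement of CG, which is what the figures above suggest. A short check under that assumption:

```python
def hallucination_rate(cg: float) -> float:
    """Share of citations not verifiable in the graph, assuming rate = 1 - CG."""
    return 1.0 - cg


for cg in (0.791, 0.873):
    print(f"CG={cg:.3f} -> hallucination rate {hallucination_rate(cg):.1%}")
# CG=0.791 -> hallucination rate 20.9%
# CG=0.873 -> hallucination rate 12.7%
```

Rounded to whole percentages, these give the 13–21% range quoted above.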
The relationship between citation density and accuracy was weak (ρ=−0.12), contrary to prior studies, and proved architecture-dependent. Domain-level analysis revealed perfect grounding in constitutional law across all models (CG=1.0), high accuracy in criminal law, and significant variance in family and labor law (CG ranging from 0.46 to 0.90), highlighting domain-specific citation predictability and graph coverage gaps.
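For readers who want to reproduce a density-versus-accuracy correlation on their own data, here is a dependency-free sketch. It assumes ρ denotes Spearman's rank correlation, which the text does not state explicitly; the function names are illustrative.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman_rho(densities, cg_scores)` with one citation-density value and one CG score per query yields a value in [−1, 1]; values near zero, such as the −0.12 reported above, indicate little monotonic relationship.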
Qualitative analyses illustrated diagnostic insights from the CG decomposition, differentiating coverage limitations, relevance failures, and perfect grounding.
Citation Grounding DPO: Automated Alignment Without Human Annotation
To mitigate citation hallucinations, Citation Grounding DPO (CG-DPO) constructs preference pairs by algorithmic corruption of real court citations. Four corruption strategies (article swap, law swap, hallucination injection, anachronism) target specific CG components. Using a dataset of 2,244 court decisions, a Qwen2.5-7B-Instruct model fine-tuned with LoRA achieved 98.5% mean validation accuracy in discriminating correct from corrupted citations. Rapid convergence and a high reward margin indicate that the structural salience of graph-based corruptions provides a strong gradient signal for alignment.
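The pair construction can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `VALIDITY` is made-up toy data with placeholder law names, the four functions are one plausible reading of the named strategies, and the prompt/chosen/rejected record layout is the one commonly used for DPO datasets.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Citation:
    law: str
    article: int
    year: int  # year of the citing decision


# Hypothetical toy data: law -> {article: (first_year, last_year or None)}.
VALIDITY = {
    "LAW_A": {10: (2004, None), 20: (2004, None)},
    "LAW_B": {40: (1990, None), 41: (1990, 2010)},
}


def article_swap(c, rng):
    """Replace the article with a different real article of the same law."""
    others = [a for a in VALIDITY[c.law] if a != c.article]
    return replace(c, article=rng.choice(others)) if others else None


def law_swap(c, rng):
    """Keep the article number but attribute it to a different law."""
    others = [law for law in VALIDITY if law != c.law]
    return replace(c, law=rng.choice(others)) if others else None


def hallucination_injection(c, rng):
    """Invent an article number beyond any article known for this law."""
    return replace(c, article=max(VALIDITY[c.law]) + rng.randint(1000, 9000))


def anachronism(c, rng):
    """Cite a provision that was not in force in the decision's year."""
    stale = [(law, a) for law, arts in VALIDITY.items()
             for a, (start, end) in arts.items()
             if c.year < start or (end is not None and c.year > end)]
    if not stale:
        return None
    law, article = rng.choice(stale)
    return replace(c, law=law, article=article)


STRATEGIES = (article_swap, law_swap, hallucination_injection, anachronism)


def render(c):
    return f"Art. {c.article} of {c.law}"


def build_pairs(citations, seed=0):
    """Emit one preference record per applicable strategy per real citation."""
    rng = random.Random(seed)
    pairs = []
    for c in citations:
        for strategy in STRATEGIES:
            bad = strategy(c, rng)
            if bad is not None:
                pairs.append({
                    "prompt": f"Cite the applicable provision ({c.year}).",
                    "chosen": render(c),
                    "rejected": render(bad),
                    "strategy": strategy.__name__,
                })
    return pairs
```

Because every rejected sample is derived mechanically from a real citation, no human labelling is needed, and tagging each record with its strategy makes it possible to check which CG component a given pair exercises.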
Discussion and Implications
The research demonstrates that retrieval augmentation (RAG) reliably enhances citation grounding, outperforming model scaling. The coverage-dependent nature of the citation graph means that CG is a conservative metric: it can yield false positives (real citations flagged due to graph gaps), but robustly detects fabricated citations. Expansion of the citation graph to encompass additional legal sources (e.g., ECHR, administrative registries) is expected to improve coverage and reduce false positives.
The CG-DPO methodology offers scalable, annotation-free alignment for factual citation accuracy—especially relevant for low-resource or emerging legal systems. Its focus on citations rather than argumentation ensures targeted alignment without undesirable behavior drift. Domain adaptation for common law systems would require restructuring the graph to reference case law rather than statutes.
Limitations include exclusive reliance on court decisions (limiting coverage), temporal validation restricted to statute existence (not versioning), and scope confined to Ukrainian codified law. End-to-end evaluation of hallucination reduction in open-ended LLM generation remains an avenue for further research, potentially requiring integration with a preceding supervised fine-tuning (SFT) stage or alternative alignment protocols.
Conclusion
Citation grounding leverages large-scale legal citation graphs to provide an automated, scalable solution for detecting and reducing LLM citation hallucinations. The metric enables component-wise diagnosis and achieves high accuracy in empirical evaluations. CG-DPO aligns model outputs with real judicial practice without human annotators, achieving strong discrimination between correct and corrupted citations. The approach is generalizable to any legal system with sufficient structured citation corpora and is released as an open resource, facilitating further theoretical and practical advances in trustworthy legal AI (2606.00898).