CORE-KG: Legal Knowledge Graph Framework
- CORE-KG is a modular, LLM-driven system for constructing interpretable knowledge graphs from complex, lexically dense legal texts.
- It integrates sequential, type-aware coreference resolution and domain-guided entity extraction to effectively reduce node duplication and suppress legal noise.
- The framework achieves a 33.28% reduction in node duplication and a 38.37% decrease in legal noise, enhancing criminal network analysis and legal analytics.
The CORE-KG Framework is a modular, LLM-driven system for constructing interpretable knowledge graphs (KGs) from unstructured, lexically dense domains such as legal case documents. It is specifically designed to address the challenges of entity disambiguation, coreference, legal noise, and structural integrity in KG construction, with demonstrated use in modeling human smuggling networks. CORE-KG integrates a type-aware coreference resolution pipeline and a domain-guided entity-relationship extraction module within a retrieval-augmented generation (RAG) paradigm, achieving substantial reductions in node duplication and irrelevant noise compared to prior LLM-based or template-based approaches (Meher et al., 20 Jun 2025).
1. System Architecture and Design Principles
CORE-KG is structured as a multi-stage, modular pipeline:
- Type-Aware Coreference Resolution: The initial stage unifies disparate surface forms that refer to the same real-world entity. It sequentially processes specific entity classes (e.g., Person, Location, Organization), using tailored LLM prompts for each type. This resolves both abbreviated and context-dependent references common in legal documents (e.g., mapping “A.Y.” and “the defendant” to a canonical name).
- Domain-Guided Entity and Relation Extraction: Following coreference cleaning, structured LLM prompts extract entities and their relations from the factually dense “Opinion” section of court documents. Seven key entity types are identified: Person, Location, Organization, Route, Means of Transportation, Means of Communication, and Smuggled Items. Extraction is performed using an adapted GraphRAG framework, where overlapping text chunks are processed to yield candidate entity–relation triples.
- Graph Assembly and Post-Processing: The output triples are collated, merged by strict string/type matching, and assembled into a coherent KG using graph libraries such as NetworkX. Explicit post-processing instructions filter high-frequency, legally irrelevant terms and suppress boilerplate noise (see the sketch after this list).
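A minimal sketch of the assembly and noise-filtering step, assuming extracted triples arrive as (head, head_type, relation, tail, tail_type) tuples; the boilerplate stop-list and the NetworkX usage here are illustrative rather than the paper's exact implementation.

```python
import networkx as nx

# Illustrative stop-list of high-frequency legal boilerplate terms
# (an assumption for this sketch, not the paper's actual filter list).
LEGAL_BOILERPLATE = {"court", "appeal", "motion", "docket", "statute"}

def assemble_kg(triples):
    """Merge (head, head_type, relation, tail, tail_type) triples into a KG.

    Nodes are merged only on a strict (name, type) match, and boilerplate
    terms are dropped before assembly, mirroring the post-processing step.
    """
    kg = nx.MultiDiGraph()
    for head, head_type, relation, tail, tail_type in triples:
        if head.lower() in LEGAL_BOILERPLATE or tail.lower() in LEGAL_BOILERPLATE:
            continue  # suppress legally irrelevant boilerplate nodes
        # Strict string/type matching: the node key is the (name, type) pair.
        kg.add_node((head, head_type), label=head, type=head_type)
        kg.add_node((tail, tail_type), label=tail, type=tail_type)
        kg.add_edge((head, head_type), (tail, tail_type), relation=relation)
    return kg

# Toy usage with two triples about the same canonicalized person.
triples = [
    ("Adam Young", "Person", "traveled_via", "Rio Grande crossing", "Route"),
    ("Adam Young", "Person", "communicated_by", "cell phone", "Means of Communication"),
]
kg = assemble_kg(triples)
print(kg.number_of_nodes(), kg.number_of_edges())  # 3 nodes, 2 edges
```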
This architecture is underpinned by a strong emphasis on sequential processing, entity-type–specific prompts, and explicit definitions, which together improve entity typing accuracy and the overall coherence of the constructed graph.
2. Type-Aware Coreference Resolution
Legal texts frequently contain surface-level ambiguities and shifting referents, posing significant difficulties for downstream entity extraction. CORE-KG mitigates these challenges via:
- Sequential, Type-Specific Prompting: The system processes one entity type at a time, preventing attention spillover across entity classes and reducing the risk of erroneous merges.
- Structured Coreference Prompts: Prompts are engineered with (a) a high-precision persona, (b) explicit coreference rules (e.g., map all abbreviations to first-introduced full names, eliminate titles), and (c) a strict instruction not to alter the input except for referring expressions.
- Coreference Metric Definition: Node duplication is quantified as
  $$\text{Duplication Rate} = \frac{\sum_{c}\left(\lvert c \rvert - 1\right)}{N},$$
  where $c$ indexes clusters of coreferent mentions derived from intra-type fuzzy matching, $\lvert c \rvert$ is the size of cluster $c$, and $N$ is the total number of extracted nodes (see the snippet after this list).
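One way to operationalize this metric is sketched below, with Python's standard-library difflib standing in for the paper's intra-type fuzzy matcher and an illustrative similarity threshold.

```python
from difflib import SequenceMatcher

def fuzzy_clusters(names, threshold=0.8):
    """Greedy intra-type clustering of node names by string similarity.

    difflib and the 0.8 threshold are stand-ins for the paper's fuzzy
    matching setup; any similarity function could be substituted.
    """
    clusters = []
    for name in names:
        for cluster in clusters:
            if SequenceMatcher(None, name.lower(), cluster[0].lower()).ratio() >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

def duplication_rate(nodes_by_type):
    """Duplication rate = sum over clusters of (|c| - 1), divided by total nodes N."""
    total_nodes = sum(len(names) for names in nodes_by_type.values())
    duplicates = sum(
        len(cluster) - 1
        for names in nodes_by_type.values()
        for cluster in fuzzy_clusters(names)
    )
    return duplicates / total_nodes if total_nodes else 0.0

# Toy example: two surface forms of the same person inflate the rate.
rate = duplication_rate({"Person": ["Adam Young", "Adam Young Jr.", "Maria Lopez"]})
print(f"{rate:.2%}")  # 33.33%
```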
Empirically, this methodology reduces the duplication rate by 33.28% relative to a GraphRAG baseline, consolidating references across the text without loss of critical context (Meher et al., 20 Jun 2025).
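A hedged sketch of the sequential, type-specific prompting loop follows; the prompt wording, the entity-type list, and the call_llm placeholder are illustrative assumptions, not the paper's exact prompts or LLM interface.

```python
ENTITY_TYPES = ["Person", "Location", "Organization"]  # processed one type at a time

PROMPT_TEMPLATE = """You are a high-precision legal coreference resolver.
Resolve coreferent mentions of {entity_type} entities ONLY.
Rules:
- Map abbreviations and pronouns to the first-introduced full name.
- Drop titles (e.g., "Mr.", "Agent") from canonical names.
- Do NOT alter any text other than the referring expressions.
Text:
{text}
Return the text with {entity_type} mentions rewritten to canonical names."""

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM call (e.g., a chat-completion request)."""
    raise NotImplementedError

def resolve_coreferences(text: str) -> str:
    """Run type-aware coreference resolution sequentially, one entity type at a time."""
    for entity_type in ENTITY_TYPES:
        prompt = PROMPT_TEMPLATE.format(entity_type=entity_type, text=text)
        text = call_llm(prompt)  # the output of one pass feeds the next type's pass
    return text
```

Processing one type per pass keeps the model's attention on a single entity class, which is the mechanism the framework credits for avoiding cross-type merges.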
3. Entity and Relationship Extraction Using Adapted GraphRAG
After coreference resolution:
- Prompts with Domain-Specific Instructions: Each LLM prompt provides clear definitions for the seven entity types and follows a sequential extraction order. Combined with explicit negative instructions (e.g., ignore procedural terms, do not extract legal boilerplate), this increases relation-extraction fidelity.
- Chunked Text Processing: Input texts are divided into overlapping 300-token chunks to manage long documents while preserving context and coreference continuity across chunk boundaries (see the chunking sketch after this list).
- Structured Triple Output: The LLM returns results in a fixed entity–relation–entity format, facilitating reliable assembly of the KG and downstream graph analytics.
- Noise Filtering: Post-processing filters out high-frequency, legally irrelevant terms and duplicates based on stringent intra-type matching.
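The chunking step referenced above can be sketched as follows, using whitespace tokenization as a stand-in for the model tokenizer; the 50-token overlap is an assumed value, since only the 300-token chunk size is stated.

```python
def chunk_text(text: str, chunk_size: int = 300, overlap: int = 50):
    """Split text into overlapping chunks of roughly `chunk_size` tokens.

    Whitespace tokenization and the 50-token overlap are simplifying
    assumptions; the 300-token chunk size follows the description above.
    """
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the final chunk already covers the end of the document
    return chunks
```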
Using this system, CORE-KG reduces legal noise within extracted KGs by 38.37% compared to the baseline.
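For completeness, a sketch of the per-chunk extraction and triple parsing; the prompt wording, the pipe-delimited output format, and the call_llm parameter are assumptions for illustration, not the paper's exact prompt or output schema.

```python
ENTITY_SCHEMA = """Entity types: Person, Location, Organization, Route,
Means of Transportation, Means of Communication, Smuggled Items."""

EXTRACTION_PROMPT = """{schema}
Extract entity-relation-entity triples from the text below.
Ignore procedural terms and legal boilerplate.
Output one triple per line as: head | head_type | relation | tail | tail_type
Text:
{chunk}"""

def extract_triples(chunks, call_llm):
    """Run the extraction prompt over each chunk and parse pipe-delimited triples."""
    triples = []
    for chunk in chunks:
        response = call_llm(EXTRACTION_PROMPT.format(schema=ENTITY_SCHEMA, chunk=chunk))
        for line in response.splitlines():
            parts = [p.strip() for p in line.split("|")]
            if len(parts) == 5:  # head, head_type, relation, tail, tail_type
                triples.append(tuple(parts))
    return triples
```

The tuples produced here have the same shape consumed by the assembly sketch in Section 1.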
4. Empirical Evaluation and Performance Metrics
The efficacy of CORE-KG is demonstrated by two core metrics, complemented by expert-augmented validation:
- Node Duplication Rate: CORE-KG achieves a reduction from 30.38% (GraphRAG baseline) to 20.27%, as measured by the duplication rate defined above.
- Legal Noise Rate: The portion of extracted graph components containing boilerplate or procedurally irrelevant information drops from 27.41% to 16.89%.
- Expert-Augmented Validation: Both fuzzy matching and human expert review are utilized to assess the semantic clarity and relevance of the resulting graphs.
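These headline reductions are relative to the baseline rates rather than percentage-point drops; up to rounding of the reported rates, they recover the 33.28% and 38.37% figures:

$$
\frac{30.38 - 20.27}{30.38} \approx 33.3\%,
\qquad
\frac{27.41 - 16.89}{27.41} \approx 38.4\%.
$$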
These results indicate that sequential, type-specific extraction and prompt engineering directly improve the clarity and structural quality of KGs derived from unstructured, noisy source texts.
5. Analytical and Applied Implications
CORE-KG is immediately applicable to:
- Criminal Network Analysis: By generating cleaner KGs, investigators can more effectively discern underlying structures such as membership hierarchies, control nodes, and key logistics pathways in human smuggling or analogous illicit organizations (see the analysis sketch after this list).
- Legal Analytics: The framework’s ability to suppress irrelevant boilerplate and canonicalize named entities enables rapid identification of relevant legal actors, operational routes, and communication means, supporting investigative and prosecutorial tasks.
- General Unstructured KG Construction: The methodology, while tailored for smuggling networks, generalizes to other contexts requiring systematic extraction from complex narrative texts, provided that domain-specific prompt and entity schema tuning is performed.
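As an illustration of the network-analysis use case, a minimal sketch applying standard NetworkX centrality measures to the assembled KG; the specific measures and the broker/hub interpretation are generic analysis choices, not the paper's evaluation protocol.

```python
import networkx as nx

def analyze_network(kg: nx.MultiDiGraph, top_k: int = 5):
    """Surface potential control nodes and logistics hubs with standard measures."""
    # Collapse the multigraph to a simple undirected graph for centrality.
    simple = nx.Graph(kg.to_undirected())
    degree = nx.degree_centrality(simple)
    betweenness = nx.betweenness_centrality(simple)
    # High-betweenness nodes often broker between otherwise separate parts
    # of the network (candidate control nodes); high-degree nodes act as hubs.
    brokers = sorted(betweenness, key=betweenness.get, reverse=True)[:top_k]
    hubs = sorted(degree, key=degree.get, reverse=True)[:top_k]
    return {"brokers": brokers, "hubs": hubs}
```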
6. Limitations and Prospective Enhancements
- Residual Ambiguities: Some coreference ambiguities remain, such as overlapping geospatial references (“border” vs. “United States–Mexican border”) or institution versus nation (“United States government” vs. “United States”).
- Scalability: While the sequential, type-specific prompting strategy is effective in controlling entity conflation, it may introduce computational overhead for domains with extreme entity diversity.
- Adaptive Prompting: The framework relies on extensively tuned prompts; incorporating dynamic or context-adaptive prompts may further improve extraction precision, particularly in domains with rapidly evolving terminology.
- Domain Expansion: Extension to new legal or criminal analysis contexts would require creation or adaptation of entity type schemas and further engineering of prompt strategies.
7. Comparative Evaluation and Broader Context
CORE-KG's improvements over prior LLM-based methods, specifically a 33.28% reduction in node duplication and a 38.37% reduction in legal noise, demonstrate the benefits of structured, domain-guided extraction and type-aware sequential processing. By integrating robust coreference resolution and explicit post-filtering into a modular KG construction pipeline, CORE-KG advances interpretable graph construction in complex, linguistically challenging domains (Meher et al., 20 Jun 2025). Its modular design provides a foundation for adaptation in future KG projects requiring high precision, clarity, and explainability.