Chatty-KG: LLM and KG Integration
- Chatty-KG is a system that unifies conversational LLMs with structured knowledge graphs, enabling accurate multi-turn dialogue and query answering.
- It employs a modular pipeline with agents for query rewriting, entity/relation linking, and SPARQL query generation to ensure context resolution and correctness.
- The architecture supports scalable, explainable KG-driven applications in enterprises, recommender systems, and research while adapting in real time to evolving data.
A “Chatty-KG” system unifies the conversational capabilities of LLMs with the structured, compositional grounding of knowledge graphs (KGs), enabling accurate, interactive, and contextually coherent question answering and dialogue over complex graph-structured data. The paradigm encompasses architectures for conversational recommendation, multi-turn KGQA, explainable open-domain chat, and LLM-assisted knowledge graph engineering. Chatty-KG systems are characterized by modular integration of LLM-driven natural language understanding, on-demand KG retrieval or reasoning, and structured query generation, supporting both single-turn and multi-turn, coreference-resolving dialogue, with formal guarantees on correctness, grounding, and extensibility across arbitrary KGs.
1. System Architectures and Core Principles
Chatty-KG systems adopt a modular, multi-agent architecture, orchestrating specialized components to bridge natural language queries and the compositional structure of KGs. The canonical instantiation employs a hierarchical pipeline (a minimal orchestration sketch follows the list):
- Contextual Understanding: Classifies and rewrites user queries to ensure self-containedness and resolve anaphora (e.g., “When was its first movie released?” → “When was the first Harry Potter movie released?”) using dedicated LLM agents.
- Intermediate Representation Extraction: Constructs a Question Intermediate Representation (QIR), typically a set of semantic triples augmented with entities, relation phrases, and unknowns, leveraging chain-of-thought (CoT) LLM prompting.
- Entity and Relation Linking: Performs on-demand, runtime SPARQL keyword lookups for candidate nodes and predicates. Linker agents disambiguate references using zero-shot LLM ranking and BERT-based predicate similarity, producing entity and predicate mappings.
- Query Planning and Execution: Generates candidate triple patterns and associated SPARQL queries, filters predicates via LLM-aided selection, structurally validates, then executes a bounded batch of queries per turn to retrieve and aggregate answers.
- Conversational Tracking: Maintains a compressed turn-history and adapts response generation to preserve multi-turn coherence and handle context dependencies.
- Response Synthesis: Optionally reformulates KG answers into fluent natural language outputs. In dialogic recommendation or chit-chat systems, response modules inject graph-entity context vectors into a neural decoder, often using Transformer or encoder-decoder architectures.
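As a concrete picture of this composition, the minimal Python sketch below threads a shared dialogue state through a sequence of interchangeable agents; the agent interface and state fields are illustrative assumptions, not the authors' API.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class DialogueState:
    """Shared state threaded through the agent pipeline."""
    question: str
    history: list[str] = field(default_factory=list)  # compressed turn history
    qir: dict | None = None                           # question intermediate representation
    sparql: list[str] = field(default_factory=list)   # candidate SPARQL queries
    answer: str | None = None

class Agent(Protocol):
    def run(self, state: DialogueState) -> DialogueState: ...

class Pipeline:
    """Thin orchestrator: each agent can be tuned or swapped independently."""
    def __init__(self, agents: list[Agent]):
        self.agents = agents

    def answer(self, question: str, history: list[str]) -> DialogueState:
        state = DialogueState(question=question, history=history)
        for agent in self.agents:   # e.g., rewriter -> linker -> planner -> synthesizer
            state = agent.run(state)
        return state
```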
This agent composition permits modularity: each agent (e.g., entity linker, query planner, generator) can be independently tuned or replaced without retraining the entire pipeline. Chatty-KG implementations are LLM-agnostic and support interchangeability between commercial and open-weight models, with answer quality scaling as model parameter count increases (Omar et al., 26 Nov 2025).
2. Conversational Context, Dialogue Management, and Coreference
A defining element is explicit multi-turn context integration. Chatty-KG agents maintain a structured memory encoding the sequence of questions and corresponding answers. For each new turn, a Classifier Agent detects whether a question depends on prior context, triggering a Rephraser Agent to resolve ellipsis and anaphora using preceding answer snippets, truncated to bound prompt length.
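A minimal sketch of this classify-then-rephrase step, assuming an OpenAI-style chat client; the prompts, model name, and truncation length are illustrative stand-ins, not the system's actual configuration.

```python
from openai import OpenAI  # any chat-completion LLM client would work here

client = OpenAI()
MODEL = "gpt-4o"  # illustrative; the pipeline is LLM-agnostic

def needs_context(question: str, history: list[str]) -> bool:
    """Classifier Agent: does the question depend on earlier turns?"""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
            "Answer yes or no: is this question self-contained?\n"
            f"History: {history[-3:]}\nQuestion: {question}"}],
    )
    return "no" in resp.choices[0].message.content.lower()

def rephrase(question: str, history: list[str], max_chars: int = 500) -> str:
    """Rephraser Agent: resolve anaphora/ellipsis using truncated answer snippets."""
    snippet = " ".join(history)[-max_chars:]  # contextual compression
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
            f"Rewrite the question so it is self-contained.\nContext: {snippet}\n"
            f"Question: {question}"}],
    )
    return resp.choices[0].message.content.strip()
```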
In corpus and benchmark settings such as KGConv, the data is designed to reflect conversational context explicitly: each turn is grounded in a distinct KG triple, while question variants (in-context, synthetic-in-context) invoke pronominalization, ellipsis, or references to entities surfaced earlier in the conversation (Brabant et al., 2023).
The system architecture thus ensures every downstream module operates on a contextually complete, semantically unambiguous utterance, enabling robust QIR extraction, linking, and precise query construction across arbitrarily long dialog chains (Omar et al., 26 Nov 2025).
3. Natural Language to KG Reasoning: Query Generation and Execution
Chatty-KG systems translate NL questions into compositional KG queries, most commonly SPARQL (a condensed sketch of the linking steps follows the list):
- QIR Extraction: LLMs (e.g., GPT-4o, Gemini) produce interpretable semantic sketches—triples, entities, relation phrases—augmented by few-shot or CoT prompting.
- Entity Retrieval: Runtime SPARQL label queries retrieve candidate URIs for every mentioned entity, bypassing the need for large embedding indices and facilitating real-time adaptation to evolving KGs.
- Predicate Linking: For every relation phrase, candidate predicates are ranked using embedding similarity between user language and KG labels.
- Query Generation: The system enumerates all possible SPARQL queries by combining entity and predicate candidates, after which a dedicated agent or LLM prompt filters to the most likely subset, validated structurally before execution.
- Answer Aggregation: Execution results are deduplicated, post-filtered (type checking, numeric range), and formatted for the user, with possible LLM-based answer synthesis for improved fluency.
- Explainability: In explainable reasoning systems, explicit paths (e.g., sequences of KG triples navigated to arrive at the answer) are available for inspection and can be returned to users as justification (Liu et al., 2019).
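For concreteness, here is a condensed sketch of the runtime entity lookup and embedding-based predicate ranking, assuming a public DBpedia endpoint and a generic sentence-embedding model; both are stand-ins, and the paper's exact lookup queries and similarity model may differ.

```python
import requests
from sentence_transformers import SentenceTransformer, util

ENDPOINT = "https://dbpedia.org/sparql"  # illustrative; any SPARQL endpoint works
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for BERT-based similarity

def lookup_entities(label: str, limit: int = 10) -> list[str]:
    """On-demand SPARQL keyword lookup: no offline embedding index required."""
    query = f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?uri WHERE {{
          ?uri rdfs:label ?l .
          FILTER (lcase(str(?l)) = lcase("{label}") && lang(?l) = "en")
        }} LIMIT {limit}"""
    resp = requests.get(ENDPOINT, params={
        "query": query, "format": "application/sparql-results+json"})
    resp.raise_for_status()
    return [b["uri"]["value"] for b in resp.json()["results"]["bindings"]]

def rank_predicates(phrase: str, candidates: dict[str, str], k: int = 5) -> list[str]:
    """Rank candidate predicate URIs (uri -> label) by similarity to the user phrase."""
    uris, labels = zip(*candidates.items())
    sims = util.cos_sim(encoder.encode(phrase), encoder.encode(list(labels)))[0]
    order = sims.argsort(descending=True)[:k]
    return [uris[int(i)] for i in order]
```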
Some systems (e.g., KECRS) embed KG structure at the representation level, utilizing R-GCN encoders over domain-constructed KGs, context vectors via self-attention, and multiple auxiliary losses (Bag-of-Entity, Infusion Loss) to enforce entity-awareness and token-entity alignment in language generation (Zhang et al., 2021).
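A minimal sketch of such an entity-aware encoder, assuming torch_geometric; the layer sizes and attention pooling are illustrative choices, not the KECRS configuration.

```python
import torch
from torch_geometric.nn import RGCNConv

class EntityEncoder(torch.nn.Module):
    """Two R-GCN layers over a KG, followed by attention pooling of the
    entities mentioned in the dialogue context."""
    def __init__(self, num_entities: int, num_relations: int, dim: int = 64):
        super().__init__()
        self.emb = torch.nn.Embedding(num_entities, dim)
        self.conv1 = RGCNConv(dim, dim, num_relations)
        self.conv2 = RGCNConv(dim, dim, num_relations)
        self.attn = torch.nn.Linear(dim, 1)

    def forward(self, edge_index, edge_type, mentioned: torch.Tensor):
        x = self.emb.weight
        x = torch.relu(self.conv1(x, edge_index, edge_type))
        x = self.conv2(x, edge_index, edge_type)
        h = x[mentioned]                          # entities from the dialogue
        w = torch.softmax(self.attn(h), dim=0)    # self-attention style pooling
        return (w * h).sum(dim=0)                 # context vector for the decoder
```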
4. Retrieval-Augmented Generation versus Structured KG Execution
Traditional graph-RAG approaches serialize subgraph neighborhoods into text for LLM input, discarding relational structure and incurring substantial latency, embedding costs, and poor support for multi-hop or list queries. Chatty-KG departs from this as follows (a bounded-execution sketch follows the list):
- On-Demand Retrieval: All entity/relation lookups are performed at runtime, eliminating the need for offline embedding indices and allowing real-time adaptation to KG updates.
- Preserving Structure: Direct SPARQL query generation and bounded execution ensure precise, verifiable answers, full compositionality, and faithful exploitation of KG topology.
- Hybrid Reasoning: The pipeline retains RAG’s paradigm of on-demand retrieval but couples it with lightweight LLM-driven mapping and fine-tuned bottlenecks (query, candidate, and retry limits) for scalability and latency control (Omar et al., 26 Nov 2025).
- Explainable Integration: In dialogue generation or recommendation, the system fuses entity-aware graph encodings with Transformer-based language modeling, leveraging KG context at both the representation and generation stages (Zhang et al., 2021, Liu et al., 2019).
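A minimal sketch of the bounded-execution idea; the per-turn query budget, retry cap, and deduplication key below are illustrative defaults, not the paper's tuned values.

```python
import requests

def execute_bounded(queries: list[str], endpoint: str,
                    max_queries: int = 10, max_retries: int = 2) -> list[dict]:
    """Execute at most max_queries SPARQL candidates per turn, retrying
    transient failures at most max_retries times each, then deduplicate."""
    bindings = []
    for q in queries[:max_queries]:                 # per-turn query budget
        for attempt in range(max_retries + 1):
            try:
                r = requests.get(endpoint, timeout=10, params={
                    "query": q,
                    "format": "application/sparql-results+json"})
                r.raise_for_status()
                bindings.extend(r.json()["results"]["bindings"])
                break                               # success: next candidate
            except requests.RequestException:
                if attempt == max_retries:
                    break                           # give up on this candidate
    seen, unique = set(), []                        # deduplicate answer bindings
    for b in bindings:
        key = tuple(sorted((k, v["value"]) for k, v in b.items()))
        if key not in seen:
            seen.add(key)
            unique.append(b)
    return unique
```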
5. Corpora, Benchmarks, and Evaluation Metrics
To support and benchmark Chatty-KG systems:
- KGConv (Brabant et al., 2023) provides 71k multi-turn dialogs and 604k QA pairs, each grounded in a Wikidata triple and accompanied by up to 12 human/templated/neural question variants. It supports evaluation of context-sensitive question generation, rewriting, and QA, with metrics including BLEU, BERTScore, faithfulness, and naturalness.
- Single- and Multi-Turn QA Benchmarks: Standard testbeds include QALD-9, LC-QuAD 1.0, and synthetically generated multi-turn dialogues on DBpedia, Wikidata, DBLP, MAG, and YAGO. Metrics include micro-averaged Precision/Recall/F1, P@1, MRR, and Hit@5 (computed as in the sketch after this list).
- Human Evaluation: Fluency, coherence, response quality, informativeness, and dialogue management are rated on Likert scales, often with moderate inter-annotator agreement (up to 0.56) (Omar et al., 26 Nov 2025).
- Ablation Studies: Experiments reveal that disabling context resolution or generating questions independently reduces F1 by up to 15%, while error inspection attributes most residual failures to KG-specific ambiguity in labels or predicates.
- Efficiency: On large KGs (e.g., MAG), Chatty-KG achieves average runtimes of ≈3 s per question by issuing orders of magnitude fewer queries than sequential SPARQL execution baselines, at a cost below $1 per 150 queries when using GPT-4o (Omar et al., 26 Nov 2025).
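For reference, a small sketch of the micro-averaged F1 and ranking metrics listed above, with gold and predicted answers given as per-question sets; this is an illustrative implementation, not the benchmarks' official scorers.

```python
def micro_prf1(preds: list[set], golds: list[set]) -> tuple[float, float, float]:
    """Micro-averaged precision/recall/F1 over per-question answer sets."""
    tp = sum(len(p & g) for p, g in zip(preds, golds))
    fp = sum(len(p - g) for p, g in zip(preds, golds))
    fn = sum(len(g - p) for p, g in zip(preds, golds))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def mrr_and_hit(ranked: list[list[str]], golds: list[set], k: int = 5):
    """MRR and Hit@k over ranked candidate-answer lists."""
    rr = hits = 0.0
    for cand, gold in zip(ranked, golds):
        rank = next((i + 1 for i, c in enumerate(cand) if c in gold), None)
        rr += 1.0 / rank if rank else 0.0
        hits += 1.0 if rank and rank <= k else 0.0
    n = len(ranked)
    return rr / n, hits / n
```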
6. Applications, Compatibility, and Practical Impact
Chatty-KG architectures are deployable across a diverse set of scenarios:
- Enterprise Conversational Agents: Secure integration with private, evolving KGs for compliance, product support, or research without retraining or offline indexing.
- Conversational Recommender Systems: Entity-aware recommendation and dialogue over structured item graphs, incorporating explicit supervision at both token and entity-level for diversity and informativeness (Zhang et al., 2021).
- KG Engineering and Authoring: LLM partners (e.g., ChatGPT) support SPARQL authoring, ontology visualization, instance data modeling, and iterative debugging within human-in-the-loop pipelines. Code outputs are validated through automated syntactic and semantic checks before user acceptance (see the validation sketch after this list) (Meyer et al., 2023).
- Dialogue Generation and Evaluation: Corpora such as KGConv support large-scale training and evaluation of context-dependent question/answer models for KG-grounded dialogue, QA, and question rewriting (Brabant et al., 2023).
- Explainable Open-Domain Chat: Multi-hop, reasoning-based knowledge selection with explicit path explanations, leveraging both triple-structured graphs and textual augmentation for open-domain conversations (Liu et al., 2019).
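A minimal sketch of the syntactic-check step for LLM-generated artifacts, using rdflib to parse Turtle and SPARQL; the semantic checks in human-in-the-loop pipelines go beyond what is shown here.

```python
from rdflib import Graph
from rdflib.plugins.sparql import prepareQuery

def turtle_is_valid(ttl: str) -> bool:
    """Syntactic check: does the LLM-generated Turtle parse?"""
    try:
        Graph().parse(data=ttl, format="turtle")
        return True
    except Exception:
        return False

def sparql_is_valid(query: str) -> bool:
    """Syntactic check: does the LLM-generated SPARQL parse?"""
    try:
        prepareQuery(query)
        return True
    except Exception:
        return False
```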
Chatty-KG implementations are model-agnostic (compatible with commercial and open-weight LLMs), multilingual (with translation modules enabling comparable F1 across 11 languages), and robust to KG evolution (hot-swappable to new graphs without preprocessing).
7. Limitations, Open Directions, and Best Practices
Identified limitations and research frontiers include:
- QIR Extraction Fragility: Failures arise when KG predicate surface forms diverge from natural language expressions, leading to mislinked queries or retrieval errors.
- Ambiguous Entity Names and Linking: Disambiguation can be error-prone, though multiple retrieval and LLM validation steps mitigate this.
- Predicate Filtering in Large KGs: Reliance on single-step LLM predicate selection can drop relevant predicates in KGs with tens of thousands of candidates.
- Context Limits: LLM prompt windows constrain the number of history turns or schema tokens, necessitating contextual compression.
- Human Oversight: System outputs (SPARQL, Turtle, JSON-LD) require parser-based and manual validation, especially in KG engineering scenarios.
- Recommendations: Best practices for prompt engineering include explicit vocabulary anchoring, use of concrete examples, iterative refinement, and automated validation loops (sketched below) (Meyer et al., 2023).
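These practices compose into a generate-validate-retry loop. In the hedged sketch below, the generate and validate callables are hypothetical helpers standing in for any LLM client and checker; they are not named in the source.

```python
from typing import Callable, Optional

def refine_until_valid(prompt: str,
                       generate: Callable[[str], str],
                       validate: Callable[[str], Optional[str]],
                       max_rounds: int = 3) -> Optional[str]:
    """Iterative refinement: feed concrete validation errors back into the prompt."""
    current = prompt
    for _ in range(max_rounds):
        output = generate(current)
        error = validate(output)          # None means the output passed all checks
        if error is None:
            return output
        current = f"{prompt}\nPrevious attempt failed: {error}\nFix and retry."
    return None                           # budget exhausted: escalate to a human
```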
Future research directions include incremental learning for entity linking, batch and pattern-based query planning, retrieval-augmented prompting for schema injection, and end-to-end metric development for context-aware QA and generation (Omar et al., 2023, Omar et al., 26 Nov 2025).
References: (Omar et al., 26 Nov 2025, Omar et al., 2023, Zhang et al., 2021, Meyer et al., 2023, Brabant et al., 2023, Liu et al., 2019)