KG-MASD: Graph-Guided Multi-Agent QA Distillation

Updated 4 July 2026

The paper introduces KG-MASD, a framework that couples multi-agent reasoning with knowledge graph grounding to distill reliable industrial QA models.
It formulates distillation as a Markov Decision Process, using a verifier-driven refinement loop to ensure accuracy and traceability of generated claims.
Empirical results show significant improvements in BLEU, ROUGE, and human evaluations over traditional approaches in safety-critical industrial settings.

Knowledge Graph-guided Multi-Agent System Distillation (KG-MASD) is a framework for industrial question answering that combines large-model collaborative reasoning, knowledge-graph grounding, and distillation into a compact student model. It is designed for settings in which answer quality must be not only accurate but also traceable and safe, including equipment fault diagnosis, chemical-process guidance, and emergency response. The framework is introduced in "Knowledge Graph-Guided Multi-Agent Distillation for Reliable Industrial Question Answering with Datasets" (Pan et al., 3 Oct 2025), where distillation is formulated as a Markov Decision Process (MDP), the knowledge graph is treated as a verifiable structured prior, and the resulting student model is trained to condition on verified graph state as well as textual context.

1. Problem setting and design rationale

Industrial QA systems impose a stricter reliability requirement than general-purpose dialogue systems because errors in high-risk scenarios can have severe consequences. The motivating claim behind KG-MASD is that standard LLMs may reason fluently yet hallucinate, while models large enough to reason well are expensive and difficult to deploy on edge devices. Conventional knowledge distillation can compress models, but it often transfers surface-level outputs rather than the reasoning process itself. Standard multi-agent LLM systems, including debate and self-reflection schemes, can improve reasoning depth, but they also introduce uncontrolled iteration and outputs that remain difficult to verify externally (Pan et al., 3 Oct 2025).

KG-MASD is proposed as a response to that combination of constraints. Its central premise is that collaborative reasoning should be constrained by a structured, verifiable prior rather than allowed to proceed as unconstrained free-form debate. In the paper’s framing, the knowledge graph is not merely a retrieval store; it is an explicit relational evidence structure that grounds intermediate claims, checks consistency against domain knowledge, and supports convergence of the multi-agent process.

A common misconception is to treat KG-MASD as a generic multi-agent prompting recipe. The paper instead presents it as a distillation framework in which the multi-agent system is a teacher-side mechanism for generating high-confidence instruction-tuning data and for transferring both reasoning depth and verifiability into a compact student model. This suggests that the framework’s primary novelty lies in the coupling of grounded collaborative reasoning with deployable model compression, rather than in multi-agent orchestration alone.

2. End-to-end architecture and agent topology

The full pipeline begins from raw industrial fragments and a global knowledge graph. Let the raw industrial corpus be

$\mathcal{C}=\{C_1,C_2,\ldots,C_n\}.$

The method first applies GraphRAG to extract a global knowledge graph

$\mathcal{G}_g=\{(H,R,T)\},$

where $H$ and $T$ are head and tail entities and $R$ is the relation set. This global graph supplies the structured prior used to guide reasoning and verify generated claims (Pan et al., 3 Oct 2025).

The multi-agent system has five roles.

Role	Function
KG Master	Decomposes the query and expands relevant paths using the global KG
Entity Extractor	Hypothesizes entities from the query and retrieved context
Relation Extractor	Hypothesizes relations from the query and retrieved context
KR Distiller	Aggregates outputs into local triples
Verifier	Checks correctness and sends invalid triples back for refinement

The interaction pattern is structured. The KG Master first expands query-related semantic paths from the global graph. The Entity Extractor and Relation Extractor identify candidate entities and relations. The Knowledge Relation (KR) Distiller merges them into local triples $(h_i,r_i,t_i)$. The Verifier then checks those triples and returns invalid ones for refinement until they are judged reliable. The resulting verified triples define a local knowledge graph that is used both for data augmentation and for conditioning the student during distillation.

The paper emphasizes that this is not free-form agent debate. Each proposed reasoning step is expected to correspond to a graph-grounded claim, and the verifier enforces a refinement loop around invalid triples. This yields instruction–input–output triplets aligned with verified local graph structure rather than merely plausible answer text. A plausible implication is that the framework treats verifiability as a generative constraint, not as a post hoc evaluation criterion.
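The propose, verify, and refine pattern described above can be sketched as a bounded loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Triple`, `refine_until_verified`, `verify`, `repair`, and `max_rounds` are hypothetical, the toy facts are invented for the example, and the verifier is reduced to a membership check against a small global KG in place of an LLM-based Verifier agent.

```python
from typing import Callable, Iterable, NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def refine_until_verified(
    candidates: Iterable[Triple],
    verify: Callable[[Triple], bool],
    repair: Callable[[Triple], Triple],
    max_rounds: int = 3,
) -> tuple[set[Triple], set[Triple]]:
    """Return (verified, rejected) after at most `max_rounds` repair passes."""
    verified: set[Triple] = set()
    pending = set(candidates)
    for _ in range(max_rounds):
        verified |= {t for t in pending if verify(t)}
        invalid = {t for t in pending if not verify(t)}
        if not invalid:
            return verified, set()
        # Send invalid triples back for refinement, as the Verifier role does.
        pending = {repair(t) for t in invalid}
    # The last repaired batch gets one final check; the round bound
    # prevents uncontrolled iteration.
    verified |= {t for t in pending if verify(t)}
    return verified, {t for t in pending if not verify(t)}


# Toy global KG (hypothetical industrial facts, for illustration only).
GLOBAL_KG = {
    Triple("pump P-101", "has_fault", "cavitation"),
    Triple("cavitation", "caused_by", "low suction pressure"),
}
KNOWN_TAILS = {(t.head, t.relation): t.tail for t in GLOBAL_KG}


def verify(triple: Triple) -> bool:
    # ASSUMPTION: verification is plain membership in the global KG.
    return triple in GLOBAL_KG


def repair(triple: Triple) -> Triple:
    # Stand-in for re-prompting the extractors: snap the tail to the KG if possible.
    tail = KNOWN_TAILS.get((triple.head, triple.relation), triple.tail)
    return Triple(triple.head, triple.relation, tail)


candidates = [
    Triple("pump P-101", "has_fault", "cavitation"),
    Triple("cavitation", "caused_by", "high suction pressure"),  # wrong tail
    Triple("pump P-101", "located_in", "unit 7"),                # not in the KG
]
verified, rejected = refine_until_verified(candidates, verify, repair)
print(sorted(verified))
print(sorted(rejected))
```

In this toy run the triple with the wrong tail is repaired and accepted on the second pass, while the triple with no support in the graph is rejected once the round budget is exhausted. Only the verified set would feed the local knowledge graph.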

3. MDP formulation and the prior-quality index

KG-MASD explicitly formulates distillation as an MDP:

$\mathcal{M}_\gamma=(\mathcal{S},\mathcal{A},P_\gamma,R_\gamma),$

where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P_\gamma$ is the transition kernel, and $R_\gamma$ is the reward function (Pan et al., 3 Oct 2025).

A state is defined as

$s=(q,c,\mathcal{K}),$

with $q$ the query, $c$ the retrieved context, and

$\mathcal{K}=\{(h_i,r_i,t_i)\}_{i=1}^{m}$

the multiset of self-generated triples. At time $t$, this becomes

$s_t=(q,c,\mathcal{K}_t).$

Actions $a_t\in\mathcal{A}$ correspond to choosing or configuring an agent pipeline to propose or validate new triples. The transition kernel maps the current state and chosen action to a distribution over next states. The reward is defined as the likelihood of producing a correct and domain-consistent answer, with a concrete example of the form

$R_\gamma(s_t,a_t)=\log p(Y\mid X,O_t),$

where $Y$ is the latent correct answer, $X$ is the encoding of the query, and $O_t$ denotes the state observation.
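The state, action, and reward described above can be made concrete with a small sketch. Everything here is an assumption introduced for illustration: the names `State`, `step`, `reward`, and `propose_fault_triple` are hypothetical, the transition is deterministic whereas the paper's kernel is a distribution over next states, and `toy_answer_probability` is an invented stand-in for the probability of producing the correct answer.

```python
import math
from dataclasses import dataclass
from typing import Callable

Triple = tuple[str, str, str]


@dataclass(frozen=True)
class State:
    query: str
    context: str
    triples: tuple[Triple, ...] = ()  # multiset: duplicates are kept


Action = Callable[[State], list[Triple]]


def step(state: State, action: Action) -> State:
    """Deterministic stand-in for the transition kernel: append proposed triples."""
    return State(state.query, state.context, state.triples + tuple(action(state)))


def toy_answer_probability(state: State) -> float:
    # ASSUMPTION: each distinct triple halves the remaining chance of a wrong
    # answer. This is an invented toy model, not a quantity from the paper.
    return 1.0 - 0.5 ** (1 + len(set(state.triples)))


def reward(state: State) -> float:
    """Log-likelihood of the correct answer under the toy model."""
    return math.log(toy_answer_probability(state))


def propose_fault_triple(state: State) -> list[Triple]:
    # Hypothetical action: one agent pipeline proposing a single triple.
    return [("pump P-101", "has_fault", "cavitation")]


s0 = State(query="Why does pump P-101 vibrate?", context="maintenance log excerpt")
s1 = step(s0, propose_fault_triple)
print(reward(s0), reward(s1))
```

The frozen dataclass keeps each state immutable, so a trajectory is a sequence of distinct state values rather than one object mutated in place.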

A central theoretical quantity is the prior-quality index $\gamma$, defined as normalized conditional mutual information between the answer $Y$ and the triples $\mathcal{K}$, given the query $q$ and context $c$:

$\gamma=\frac{I(Y;\mathcal{K}\mid q,c)}{H(Y\mid q,c)}\in[0,1].$

The conditional mutual information is

$I(Y;\mathcal{K}\mid q,c)=H(Y\mid q,c)-H(Y\mid q,c,\mathcal{K}),$

where $H(\cdot\mid\cdot)$ denotes conditional Shannon entropy. The interpretation given in the paper is that $\gamma=0$ means the triples contain no information about the answer beyond the query and context, while $\gamma=1$ means the triples fully determine the answer given the query and context.

The paper connects informativeness and decision quality using Blackwell dominance and proper scoring rules. If one observation $O$ Blackwell-dominates another $O'$, then the more informative observation cannot be worse for any decision problem, leading to

$V^\star(O)\;\ge\;V^\star(O'),$

where $V^\star(\cdot)$ is the best expected reward attainable by a decision rule acting on that observation. For log-loss, higher informativeness lowers conditional entropy $H(Y\mid X,O)$ and improves expected reward. In the paper’s argument, this formalizes why verified KG triples function as a beneficial structured prior.
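A worked example may help make the prior-quality index tangible. The sketch below computes normalized conditional mutual information for one fixed query and context, from a toy joint distribution over the answer and a discrete identifier for the triple set. The function names and all three distributions are assumptions chosen for illustration. The paper does not specify an estimator.

```python
import math
from collections import defaultdict


def entropy(probs) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def prior_quality_index(joint: dict[tuple[str, str], float]) -> float:
    """joint maps (answer, triple_set_id) -> probability for a fixed (query, context)."""
    p_answer = defaultdict(float)
    p_triples = defaultdict(float)
    for (answer, triple_id), p in joint.items():
        p_answer[answer] += p
        p_triples[triple_id] += p
    h_answer = entropy(p_answer.values())
    if h_answer == 0:
        raise ValueError("answer is already determined; the index is undefined")
    # H(Y | K) = sum_k p(k) * H(Y | K = k)
    h_answer_given_triples = sum(
        p_k * entropy([p / p_k for (_, k2), p in joint.items() if k2 == k])
        for k, p_k in p_triples.items()
    )
    return (h_answer - h_answer_given_triples) / h_answer


# Triples independent of the answer: they add nothing.
uninformative = {("A", "k1"): 0.25, ("A", "k2"): 0.25,
                 ("B", "k1"): 0.25, ("B", "k2"): 0.25}
# Triples that pin down the answer exactly.
decisive = {("A", "k1"): 0.5, ("B", "k2"): 0.5}
# Triples that help but leave residual uncertainty.
partial = {("A", "k1"): 0.4, ("B", "k1"): 0.1,
           ("A", "k2"): 0.1, ("B", "k2"): 0.4}

for name, joint in [("uninformative", uninformative),
                    ("decisive", decisive),
                    ("partial", partial)]:
    print(name, round(prior_quality_index(joint), 4))
```

The three cases give an index of 0, 1, and roughly 0.28 respectively, matching the two endpoint interpretations above and showing an intermediate value between them.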

4. Distillation objective, learning dynamics, and data construction

The distillation objective is written as a knowledge-grounded negative log-likelihood:

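$\mathcal{L}_{\mathrm{KD}}(\theta)=-\mathbb{E}\big[\log p_\theta(y\mid q,c,\mathcal{K})\big].$

The student model $p_\theta$ is trained, using LoRA updates, to condition on the query $q$, context $c$, and verified triples $\mathcal{K}$ when predicting the answer $y$ (Pan et al., 3 Oct 2025).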

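The paper then presents a variance-reduction argument under standard smoothness and Polyak-Łojasiewicz assumptions. The stochastic gradient variance is assumed to decrease with the prior-quality index $\gamma$:

$\mathbb{E}\,\big\|g_k-\nabla\mathcal{L}_{\mathrm{KD}}(\theta_k)\big\|^2\le\sigma^2(\gamma),\qquad \sigma^2(\gamma)\ \text{non-increasing in}\ \gamma,$

and SGD obeys a bound of the standard form

$\mathbb{E}\big[\mathcal{L}_{\mathrm{KD}}(\theta_k)-\mathcal{L}^\star\big]\le(1-\eta\mu)^k\big(\mathcal{L}_{\mathrm{KD}}(\theta_0)-\mathcal{L}^\star\big)+\frac{\eta L\,\sigma^2(\gamma)}{2\mu},$

where $\mathcal{L}_{\mathrm{KD}}$ is the distillation loss, $g_k$ is the stochastic gradient at step $k$, $\eta\le 1/L$ is the step size, $L$ is the smoothness constant, $\mu$ is the Polyak-Łojasiewicz constant, and $\mathcal{L}^\star$ is the minimum loss.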

The intended conclusion is that better-verified KG priors reduce gradient noise and tighten the final error floor, implying faster and more stable learning.
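The effect of gradient noise on the error floor can be illustrated numerically with the textbook bound for constant-step SGD under $L$-smoothness and the Polyak-Łojasiewicz condition: the expected gap after $k$ steps is at most $(1-\eta\mu)^k$ times the initial gap plus $\eta L\sigma^2/(2\mu)$. The sketch below evaluates that bound. The linear noise model `noise_level` and all constants are assumptions made for illustration. The paper's exact variance model and constants are not reproduced here.

```python
def pl_sgd_bound(k: int, gap0: float, eta: float, mu: float,
                 smoothness: float, sigma2: float) -> float:
    """Upper bound on the expected loss gap after k constant-step SGD updates."""
    if not 0 < eta <= 1 / smoothness:
        raise ValueError("this bound assumes 0 < eta <= 1 / smoothness")
    contraction = (1 - eta * mu) ** k * gap0      # vanishes as k grows
    floor = eta * smoothness * sigma2 / (2 * mu)  # persists for any k
    return contraction + floor


def noise_level(gamma: float, sigma0_sq: float = 1.0) -> float:
    # ASSUMPTION: variance shrinks linearly with the prior-quality index.
    # Any non-increasing function of gamma would show the same qualitative effect.
    return sigma0_sq * (1.0 - gamma)


settings = dict(gap0=10.0, eta=0.1, mu=1.0, smoothness=2.0)
for gamma in (0.2, 0.8):
    bound = pl_sgd_bound(k=1000, sigma2=noise_level(gamma), **settings)
    print(f"gamma={gamma}: bound after 1000 steps = {bound:.4f}")
```

With these toy constants the contraction term is negligible after 1000 steps, so the bound is dominated by the noise floor: about 0.08 for $\gamma=0.2$ and about 0.02 for $\gamma=0.8$.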

The dataset construction process is a substantial component of the framework. The authors curate an industrial QA dataset with vertical annotations across eight categories: Transportation, Health, Environment, Equipment, Production, Electricity, Disaster Prevention, and General. The reported distribution is Transportation 6.5%, Health 2.63%, General 39.68%, Environment 2.41%, Equipment 18.42%, Production 5.31%, Electricity 20.17%, and Disaster Prevention 4%.

They curate 37,426 human QA pairs and 15,424 GPT-generated items. The human set is split into 22,510 train, 7,381 test, and 7,535 validation examples. The GPT set is split into 9,366 train, 2,980 test, and 3,078 validation examples. For the unsupervised corpus, they segment text using sentence-BERT embeddings, a cosine similarity threshold of 0.5, a sliding window over adjacent sentences, and segment length constraints of at least 2 sentences and at most 512 tokens. The paper states that the encoder can be a model such as all-MiniLM-L6-v2, which yields 384-dimensional sentence embeddings. These segments are then used in few-shot prompting to generate instruction–input–output triples.
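The segmentation step can be sketched as a greedy pass over adjacent sentences. This is a minimal sketch under several assumptions: embeddings are passed in as precomputed vectors (in the described setup they would come from a sentence-BERT encoder), the window size of 2 is a hypothetical default, tokens are approximated by whitespace splitting, and the function names are invented. Only the 0.5 threshold, the 2-sentence minimum, and the 512-token cap come from the text above.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment(sentences: list[str], embeddings: list[list[float]],
            threshold: float = 0.5, window: int = 2,
            min_sentences: int = 2, max_tokens: int = 512) -> list[str]:
    """Group adjacent sentences while they stay similar and within the token cap."""
    groups: list[list[int]] = []
    current: list[int] = []
    for i, sentence in enumerate(sentences):
        if current:
            # Compare against the last `window` sentences of the open segment.
            similarity = max(cosine(embeddings[i], embeddings[j])
                             for j in current[-window:])
            tokens = (sum(len(sentences[j].split()) for j in current)
                      + len(sentence.split()))
            if similarity < threshold or tokens > max_tokens:
                groups.append(current)
                current = []
        current.append(i)
    groups.append(current)
    # Drop segments that are too short to carry context.
    return [" ".join(sentences[i] for i in g)
            for g in groups if len(g) >= min_sentences]


sentences = [
    "The pump vibrates.",
    "Vibration indicates cavitation.",
    "Check suction pressure.",
    "Chlorine is toxic.",
    "Wear a respirator.",
]
# Toy 2-D vectors standing in for 384-dimensional sentence embeddings.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9]]
for text in segment(sentences, embeddings):
    print(text)
```

On the toy input the topic shift between the third and fourth sentences falls below the threshold, so the pass yields two segments, one about the pump and one about chlorine handling.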

The appendix shows prompt templates for tasks including material property extraction, chemical process analysis, and reaction equation parsing. This indicates that the generated supervision is intended to encode both domain-specific reasoning patterns and graph-grounded factual support.

5. Empirical results, reliability analyses, and ablations

The experimental backbone LLM for the multi-agent system is DeepSeek-V2. The student models are Qwen2-7B and Llama3.1-8B. LoRA fine-tuning uses rank 16, alpha 64, batch size 64, and 5 epochs, with the learning rate set as reported in the paper. Multi-agent generation uses temperature 0.8 and top-p 0.85. All experiments run on two NVIDIA RTX 3090 GPUs. Evaluation uses BLEU-4, ROUGE-1, ROUGE-2, ROUGE-L, human evaluation, and LLM-as-a-Judge. Baselines include MAD, Self-Reflect, MAPS, Self-Consistency, Vanilla Fine-tuning, In-Context Learning, Zero-shot Reasoning, and Step-by-Step Distillation (Pan et al., 3 Oct 2025).

In the MAS-assisted setting, KG-MASD outperforms all other methods. On Llama3.1-8B, it achieves BLEU-4 66.812, ROUGE-1 65.539, ROUGE-2 51.573, and ROUGE-L 49.524. On Qwen2-7B, the corresponding scores are 68.148, 66.855, 52.605, and 50.474. Compared with MAD, KG-MASD improves on all metrics; on Qwen2-7B it gains roughly +1.71 BLEU-4 over MAD and improves ROUGE-2 by about +2.08.

In the single-model setting, KG-MASD also exceeds Vanilla Fine-tuning, In-Context Learning, Zero-shot Reasoning, and Step-by-Step Distillation. On Qwen2-7B, Step-by-Step Distillation reaches 62.353 BLEU-4 and 42.191 ROUGE-L, whereas KG-MASD reaches 68.148 and 50.474. The paper summarizes the aggregate improvement range as 2.4–20.1% over baselines, depending on the comparison and metric.
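The comparison with Step-by-Step Distillation can be checked against the quoted improvement range with a short calculation. The sketch assumes the percentages are relative gains over the baseline score, which the summary above does not state explicitly. The helper name `relative_gain` is hypothetical, and the scores are the Qwen2-7B figures quoted above.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Percentage improvement of `improved` over `baseline`."""
    return 100.0 * (improved - baseline) / baseline


# (Step-by-Step Distillation, KG-MASD) on Qwen2-7B, as quoted in the text.
scores = {
    "BLEU-4": (62.353, 68.148),
    "ROUGE-L": (42.191, 50.474),
}
gains = {metric: relative_gain(base, ours) for metric, (base, ours) in scores.items()}
for metric, gain in gains.items():
    print(f"{metric}: +{gain:.2f}%")
```

Under that assumption the gains are about 9.3% on BLEU-4 and about 19.6% on ROUGE-L, both inside the reported 2.4–20.1% range.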

Reliability is evaluated directly rather than inferred from text-overlap metrics alone. In a credibility analysis, low-credibility data gives BLEU-4 58.2 and ROUGE-L 41.5, medium credibility gives 61.7 and 45.3, and high credibility gives 66.8 and 49.5. Human evaluation rises from 63.4 to 74.8, and LLM-Judge rises from 64.1 to 76.5. A second analysis on knowledge-graph completeness reports that sparse KG coverage yields BLEU-4 57.4, while comprehensive KG coverage yields 68.1; ROUGE-L rises from 40.2 to 50.5 and human scores from 62.7 to 74.8. The paper uses these results to argue that verification and graph completeness materially affect downstream trustworthiness.

The ablation study on Qwen2-7B separates GlobalKG, LocalKG, and the full pipeline. GlobalKG gives BLEU-4 62.858, ROUGE-1 61.596, ROUGE-2 48.544, ROUGE-L 46.628. LocalKG improves to 64.129 / 62.001 / 49.462 / 47.583. Full KG-MASD reaches 64.607 / 63.436 / 51.707 / 47.857. This supports the claim that both global and local graph components contribute, with the combined configuration performing best.

The paper also reports knowledge-graph enrichment outputs. KG-MASD-KGC produces 2,468 relation triples, 1,160 unique relations, and 3,686 unique entities; KG-MASD-RTE produces 2,454 triples, 1,148 unique relations, and 3,694 unique entities. This indicates that the framework is positioned not only as a QA distillation pipeline but also as a mechanism for graph growth.

6. Relation to adjacent KG distillation work, caveats, and significance

KG-MASD belongs to a broader line of work that uses LLM-generated structured supervision to compress complex pipelines into smaller models. A closely related example is "Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency" (Choubey et al., 2024), which distills a multi-step document-level ontology-free KG synthesis workflow into a one-shot smaller model. The relation is methodological rather than identical: Distill-SynthKG focuses on document-to-KG synthesis and graph-based retrieval for RAG, whereas KG-MASD couples multi-agent reasoning, verified local triple extraction, and student distillation for industrial QA. This suggests a shared research pattern in which expensive teacher workflows are converted into deployable student models through synthetic structured supervision.

Several caveats are stated directly in the paper. Some equations in the loss expression are typographically garbled in the PDF source, and the cleanest formal objective is the MDP-section negative log-likelihood

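$\mathcal{L}_{\mathrm{KD}}(\theta)=-\mathbb{E}\big[\log p_\theta(y\mid q,c,\mathcal{K})\big],$

where $p_\theta$ is the student model, $y$ the answer, $q$ the query, $c$ the retrieved context, and $\mathcal{K}$ the verified triples.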

The paper also presents a control-theoretic structural controllability discussion for the five-agent system, stating that the system matrix $A$ (with input matrix $B$) and controllability matrix

$\mathcal{Q}=\big[\,B\;\;AB\;\;A^2B\;\;A^3B\;\;A^4B\,\big]$

satisfy full-rank controllability under their topology. That discussion is described as more of a stability justification than an experimentally tested component, although simulations reportedly show trajectories converging over time, with the Verifier stabilizing especially quickly (Pan et al., 3 Oct 2025).
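The rank condition behind such a controllability claim can be checked numerically with the standard Kalman test, in which a linear system with $n$ states is controllable when $[B, AB, \ldots, A^{n-1}B]$ has rank $n$. The sketch below applies it to a hypothetical five-agent topology. The edge weights, the choice of the KG Master as the only input node, and the function names are all assumptions made for illustration. They are not the paper's matrices.

```python
import numpy as np

# Agent order: 0 KG Master, 1 Entity Extractor, 2 Relation Extractor,
# 3 KR Distiller, 4 Verifier. a[i, j] is the influence of agent j on agent i.


def agent_matrix(feedback_to_relation: float) -> np.ndarray:
    a = np.zeros((5, 5))
    a[1, 0] = 1.0  # Master -> Entity Extractor
    a[2, 0] = 1.0  # Master -> Relation Extractor
    a[3, 1] = 1.0  # Entity Extractor -> KR Distiller
    a[3, 2] = 1.0  # Relation Extractor -> KR Distiller
    a[4, 3] = 1.0  # KR Distiller -> Verifier
    a[1, 4] = 1.0  # Verifier -> Entity Extractor (refinement feedback)
    a[2, 4] = feedback_to_relation  # Verifier -> Relation Extractor
    return a


def controllability_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    n = a.shape[0]
    return np.hstack([np.linalg.matrix_power(a, k) @ b for k in range(n)])


def is_controllable(a: np.ndarray, b: np.ndarray) -> bool:
    rank = np.linalg.matrix_rank(controllability_matrix(a, b))
    return bool(rank == a.shape[0])


# ASSUMPTION: the external input (the query) enters only at the KG Master.
B = np.array([[1.0], [0.0], [0.0], [0.0], [0.0]])

asymmetric = agent_matrix(feedback_to_relation=0.5)
symmetric = agent_matrix(feedback_to_relation=1.0)
print(is_controllable(asymmetric, B), is_controllable(symmetric, B))
```

In this toy topology the rank is full when the Verifier's feedback treats the two extractors differently, and drops to 4 when the two extractor branches are perfectly symmetric, since the input then cannot drive them apart. The result illustrates that full-rank controllability depends on the specific topology and weights rather than on the number of agents alone.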

The broader significance of KG-MASD lies in its attempt to jointly transfer reasoning depth and external factual consistency into compact models suitable for edge deployment. The framework’s practical implication, as stated in the paper, is that safety-critical industrial QA should not rely on raw model generation alone; it should incorporate a structured verification layer, and a knowledge graph is presented as a natural way to provide that layer. A plausible implication is that the framework treats compactness, grounding, and reliability as interdependent design constraints rather than sequential optimization targets.


References (2)

Knowledge Graph-Guided Multi-Agent Distillation for Reliable Industrial Question Answering with Datasets (2025)

Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency (2024)
